Open source business intelligence and OLAP for Healthcare

I’ve been working on implementing a data warehouse based on open source tools. The actual information system that produces the data runs on Microsoft SQL Server and other MS technologies. I have been asked to implement a minimum cost solution, zero cost if possible. It appears that some parts of the open source business intelligence domain are more mature than others, and zero cost might not be possible after all. The Kettle project from Pentaho looks good enough for ETL. I’m using Mondrian (again from Pentaho) as an OLAP server, and even though it does not implement MDX completely, it looks promising for the moment. However, the front end is not promising. JRubik is no longer developed, and JPivot is in need of a face lift. Recent trends in web development have raised the bar for UI requirements, and without even talking to the clients I can see that current solutions won’t cut it.

The overall process has been a very good experience, and it is still a work in progress. I have high hopes for machine learning applied to the output of the system, but that’ll be another post. The thing is: this project is not in the healthcare domain, and I’ve been thinking about how things would go if it were actually in healthcare.

Looking at the work I’ve done so far, and the trends in healthcare, I can see a big problem emerging. Traditional persistence methods based on relational databases have a set of mature tools for interfacing them with other domains like business intelligence, analysis etc. Emerging trends in healthcare seem to go towards new implementations for persistence of data, and these trends do not usually employ traditional relational database design even if they use databases for persistence. So we might be heading towards a situation in which we have unusual persistence implementations, and connecting these implementations to analysis tools will be hard. ETL tools focus on well known practices, and assume that you’ll be dealing with tables, foreign keys, relations etc. A recent discussion on the OpenEHR mailing list gave me a very different impression of the reference implementations of OpenEHR. So we’d better start thinking about custom ETL tools or interfacing mechanisms to connect the upcoming implementations to information processing tools. I suspect that EHR related standards might have a hard time proving their value as sources of useful information if they cannot be connected to existing knowledge engineering tools.

This gives me the idea of a healthcare ETL (extract, transform and load) tool / initiative. If managers can’t see, for EHR implementations, the same set of knowledge engineering opportunities that exist for traditional database applications, current initiatives might face a lot of resistance and struggle.
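To make the idea a bit more concrete, here is a minimal sketch of what the "extract" side of such a tool could look like: it walks a nested, document-style record and flattens it into path/value rows that a conventional ETL tool could load into an ordinary table. The record layout, the field names and the to_fact_rows helper are all made up for illustration; they are not OpenEHR structures or part of any existing tool.

```python
from typing import Any, Dict, Iterator, List, Tuple


def flatten(node: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) pairs for every leaf of a nested dict/list structure."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{path}/{key}" if path else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from flatten(child, f"{path}[{index}]")
    else:
        # A leaf value: emit it together with the path that leads to it.
        yield path, node


def to_fact_rows(record_id: str, record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one hierarchical record into flat rows (record_id, path, value)."""
    return [
        {"record_id": record_id, "path": path, "value": value}
        for path, value in flatten(record)
    ]


# Hypothetical record, only here to show the shape of the output.
record = {
    "patient": {"id": "p-001", "birth_year": 1970},
    "observations": [
        {"name": "systolic", "value": 120, "unit": "mmHg"},
        {"name": "diastolic", "value": 80, "unit": "mmHg"},
    ],
}

rows = to_fact_rows("rec-1", record)
for row in rows:
    print(row)
```

Rows like these can be bulk loaded into a staging table with three columns, and from there the usual relational ETL steps apply. A real tool would of course need to understand the semantics of the paths, which is exactly the hard part.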

Google Web Toolkit, one step away from revolution

Ok, let’s just accept it, the web is not what it used to be anymore. I will not go into details, but web user interfaces have accepted and embraced JavaScript. About five years ago, I was busy trying to build a user interface based on JavaScript tricks for an electronic claim processing system, and it was a nightmare. Firefox was not a major consideration back then, but Internet Explorer had three versions that ran on three different versions of Windows. I suffered a lot from inconsistent APIs, and I had to invent a lot of tricks similar to the basics of Ajax today.

Now Google has given us GWT (Google Web Toolkit), and once you get how it works, you can see that it rocks. In a project of mine, I have to build some kind of web component which has to have strong user interface features, fancy effects etc, and it has to be available to a couple of backends: JSF, PHP, ASP.NET, Rails, you name it. Now if I had to generate that component for each backend technology, I’d be in trouble, not only in creating it, but also in maintaining it. The problem with Ajax and highly interactive user interfaces is that generating them from the backend technology is trouble. Debugging and cross browser issues are just nightmares.

GWT isolates you from all of it. It gives you the approach of Swing or Windows Forms, in a form that can be integrated with any back end, and JSON is the key to that. However, the GWT docs and tutorials are somehow weak when it comes to integrating it with other backends. ASP.NET web services in .NET 3.5 have some very neat features, like making a web service use JSON instead of SOAP with just one attribute, but so far I have not completed the integration. Python, Java, .NET, Rails and PHP all have some form of JSON support, which gives me hope. I might just be able to separate web development from the backend, much like a platform independent MVC for web development, where the view is GWT. I’ll be working on this, more to come…
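To show what I mean by JSON being the key, here is a minimal sketch of the backend half of such a contract, written in Python only because it is short. The front end sends a JSON request and gets a JSON response, and it does not care which technology produced it. The CLAIMS data, the claim_id field and the handle_request function are hypothetical names invented for this example; nothing here is GWT or ASP.NET API.

```python
import json

# Hypothetical in-memory data standing in for whatever the backend really stores.
CLAIMS = {"c-1": {"patient": "p-001", "amount": 150.0, "status": "pending"}}


def handle_request(body: str) -> str:
    """Take a JSON request string and return a JSON response string."""
    try:
        request = json.loads(body)
    except json.JSONDecodeError:
        return json.dumps({"ok": False, "error": "invalid JSON"})

    # Only accept an object carrying a string claim_id.
    claim_id = request.get("claim_id") if isinstance(request, dict) else None
    if not isinstance(claim_id, str):
        return json.dumps({"ok": False, "error": "claim_id is required"})

    claim = CLAIMS.get(claim_id)
    if claim is None:
        return json.dumps({"ok": False, "error": "unknown claim"})
    return json.dumps({"ok": True, "claim": claim})


print(handle_request('{"claim_id": "c-1"}'))
print(handle_request('{"claim_id": "nope"}'))
```

As long as every backend answers with the same JSON shape, the user interface can be written once and the server side can be swapped without touching it.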

Pentaho, does it work?

I have a project in which I have to come up with a basic data warehouse implementation. I have to deal with all the basics, ETL, cube design, etc, and on top of that I intend to build a naive Bayesian classifier generator for decision support. (I might consider ID3 or C4.5, but I’m not sure they are free for this kind of use.) Developing all of these from scratch is out of the question; after all, why should I do it if I do not have to? Having a decent UI at least for some tasks would be nice though, and Pentaho might be the answer. I have been following Pentaho for quite some time now, and finally I need exactly what they provide, for a consultancy job. I guess we’ll see if they live up to the claims they make. Most parts of their product portfolio are based on well known tools like Weka or Mondrian, but they have been building solutions that use Eclipse RCP to wrap these tools, and I might be able to do a lot with their existing solutions. I’ll write a detailed summary of my experience, but for the moment Pentaho seems to be the only vendor that offers a free, open source solution. If I can reuse their work, that would be a very important base for my future plans, because I’ve always believed that business intelligence and/or analysis tools require knowledge in various areas like data mining, machine learning etc, to provide a real benefit. So money paid for any of these tools should actually be paid for the expert, not the tool, since I can hardly imagine an off the shelf tool providing the real benefit of the mentioned concepts. Well, I guess we’ll see about that.
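For readers who have not met it, this is roughly what a naive Bayesian classifier boils down to. The sketch below handles categorical features only and uses Laplace (add-one) smoothing; the class name, the alpha parameter and the toy rows are my own illustrative choices, and it is not code from Weka or from the project.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List

Row = Dict[str, str]


class NaiveBayes:
    """Categorical naive Bayes with Laplace smoothing (illustrative sketch)."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha
        self.class_counts: Counter = Counter()
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.feature_values = defaultdict(set)    # feature -> values seen in training

    def fit(self, rows: List[Row], labels: List[str]) -> "NaiveBayes":
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for feature, value in row.items():
                self.value_counts[(label, feature)][value] += 1
                self.feature_values[feature].add(value)
        return self

    def log_scores(self, row: Row) -> Dict[str, float]:
        """Return log P(label) + sum of log P(value | label) for each label."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for feature, value in row.items():
                seen = self.value_counts[(label, feature)][value]
                n_values = len(self.feature_values[feature]) or 1
                score += math.log((seen + self.alpha) / (count + self.alpha * n_values))
            scores[label] = score
        return scores

    def predict(self, row: Row) -> str:
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Toy training data, invented purely to exercise the code.
rows = [
    {"fever": "yes", "cough": "yes"},
    {"fever": "yes", "cough": "no"},
    {"fever": "no", "cough": "yes"},
    {"fever": "no", "cough": "no"},
]
labels = ["flu", "flu", "cold", "healthy"]

model = NaiveBayes().fit(rows, labels)
print(model.predict({"fever": "yes", "cough": "yes"}))
```

Working with log probabilities avoids numeric underflow when there are many features, and the smoothing keeps a single unseen value from zeroing out a whole class.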

Looking for trouble? Try Bayesian Artificial Intelligence…

Ok, I’ll be honest, I’ve always been into probabilistic methods, for they somehow “fit” into my way of thinking. There is something about probabilistic methods, and probability theory; you are either suited to working with them or not; you either love the field, or hate it.

I’m the kind of guy who has a love-hate relationship with it. I certainly like the field, but the overall concept is so deep and abstract that I can get lost very easily. Something that makes perfect sense seems like Chinese the next day, but I still can’t let go.

On top of that, I’ve been working on integrating probability based methods into my work in data mining and decision support, and finally I found myself working on Bayesian AI. Trust me; it “is” hard. It requires you to cover a vast number of subjects, and even then there is always something missing. Still, I have not given up, and I’m about to reach a point where I can build simple but practical applications for medical informatics. Bayesian AI is basically probabilistic modeling for building (semi)autonomous systems. After Judea Pearl wrote the book Probabilistic Reasoning in Intelligent Systems, an army of researchers rushed to the field, but the field still seems much less crowded than better known areas of AI, neural networks etc. If you’d like to have an idea of what I’m talking about, MIT has a very good web page in OpenCourseWare which you can find here. I have been looking around for frameworks I can use, and the work in the field has few complete, well polished outcomes. Most of the projects seem to be dead or incomplete. There are a few worth mentioning, but I’ll do that later.
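To give a flavour of the basic building block, here is Bayes’ rule applied to the classic diagnostic test question: given a positive test, how likely is the disease? This is only a sketch of the single update step that Bayesian networks chain together; the function name is mine and the numbers are invented for illustration, not taken from any real test.

```python
def posterior(prior: float, sensitivity: float, specificity: float) -> float:
    """P(disease | positive test) via Bayes' rule.

    prior       -- P(disease) before seeing the test result
    sensitivity -- P(positive | disease)
    specificity -- P(negative | no disease)
    """
    for name, p in (("prior", prior), ("sensitivity", sensitivity),
                    ("specificity", specificity)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability, got {p}")

    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    evidence = true_positive + false_positive  # P(positive)
    if evidence == 0.0:
        raise ValueError("a positive result is impossible with these inputs")
    return true_positive / evidence


# Made-up numbers: a rare condition and a fairly good test.
p = posterior(prior=0.01, sensitivity=0.90, specificity=0.95)
print(f"P(disease | positive) = {p:.3f}")
```

With inputs like these the posterior comes out far lower than most people guess, because false positives from the large healthy population swamp the true positives. That gap between intuition and the actual numbers is a good part of why the field is both hard and worth the trouble.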

Bayesian AI provides a set of very strong tools when you have a heap of raw data and chaos, which is an acceptable definition of health informatics. I’m very clear about one thing; I’m tired of building things that somehow collect and save data. Most of the time what we call information is nothing more than a set of fancy reports, and we are quite far away from using existing data for decision making. I really believe that there is a need for a new generation of tools based on modeling of the healthcare domain, so that we can forecast the outcomes of our choices, at least at a primitive level. Even the simplest of such tools would make a huge difference. I should say that it is very, very hard to build them, but it seems like a more justified effort than building another version of an already existing EHR system or HIS. There are a lot of bright people working in these fields; why so few of them choose to work on modeling and forecasting is a mystery to me.
