I attended another Data Science London meeting last week. As usual, it was a good one. Speakers talked about their experience with Twitter feeds that include Foursquare check-ins, and with scraping data from web sites. Scraping is basically extracting information from web pages: a program simulates a human’s use of the web site in order to use the information that site provides.
Both talks were interesting, and both had something in common: the people trying to access the data had no programmatic, well-defined method of doing so, so they resorted to other methods. The case of Lyst was especially interesting. They’ve gone to a lot of trouble to set up a system that can collect data from lots and lots of online fashion retailers. They have an infrastructure that extracts information about tens (hundreds?) of thousands of products from lots of web sites, and surprising as it may be, they are actually keeping things under control, presenting a single site that lets people access the data as if it came from a single source. Someone in the audience asked: “do you have any programmatic access to these sites?” As in, do they give you web services? The answer was something along the lines of “very few”. It is usually a crawler extracting information from the web site that does the job (though they are working with the consent of the web sites they’re parsing). I think it was also someone from Lyst (or maybe from the audience, I’m not so sure) who said this is pretty much the reality of the web we have today, despite all the hype about the semantic web.
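To make the idea of scraping concrete, here is a minimal sketch in Python using only the standard library. It is not how Lyst does it; the HTML fragment, the class names (`product`, `name`, `price`) and the `scrape_products` helper are all made up for illustration, and a real retailer page would look different.

```python
from html.parser import HTMLParser

# Hypothetical fragment of a product listing page (an assumption for this
# sketch; real retailer markup will differ).
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Leather boots</span><span class="price">120.00</span></li>
  <li class="product"><span class="name">Wool scarf</span><span class="price">35.50</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collects name/price pairs from <li class="product"> elements."""

    def __init__(self):
        super().__init__()
        self.products = []   # one dict per product found so far
        self._field = None   # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.products.append({})
        elif tag == "span" and css_class in ("name", "price") and self.products:
            self._field = css_class

    def handle_data(self, data):
        # Text inside a tracked <span> is stored on the latest product.
        if self._field is not None:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_products(html):
    """Return a list of {"name": str, "price": float} dicts found in html."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return [{"name": p["name"], "price": float(p["price"])} for p in parser.products]


if __name__ == "__main__":
    for product in scrape_products(SAMPLE_HTML):
        print(product["name"], product["price"])
```

Note how the parser depends entirely on the page’s presentation markup. If the site renames a class or rearranges its layout, the scraper silently stops finding products, which is exactly the kind of fragility a proper web service would avoid.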
Let me be honest: I never thought the semantic web would take over. Whenever I saw someone giving a talk about the future of the web, about how web sites would be talking to each other, about how RDF and OWL would let the web become a gigantic computable knowledge base, I thought: “sorry, you’ll never get there”. That is because I’ve spent a serious amount of time developing web applications. I started around ’97, and did it seriously until 2004 or so. I got to learn the way things work and got to see the way trends were going, and after 2004 I still did a lot of development using web technologies, though I did not really have to deliver anything that needed to be production quality. When you see both the business and the technology side of the web, you begin to develop a sense of what will take off in this domain and what will not.
I was sure the semantic web would never get to where it was expected to go, because (drumroll) it simply offered no immediate, tangible business value to owners of web sites. It is the good old semantic interoperability problem we have in healthcare. Someone must justify the extra effort of making data computable by others, or that effort will not be funded. It is an investment in a scenario that becomes reality only if everybody commits to it. In healthcare, the need to integrate different sources of data has always been significant, so we had HL7, DICOM, openEHR, 13606 etc. For the web, the need is not really there. In fact, if you’re not selling an actual product but making money off information, it is better that you do not let anyone else process your information programmatically. It is your clicks and your ad revenue that people will take away from you, so why let them do it?
RSS and web services may help islands of information connect to each other, but they are mostly standards for the wire: they help communicate information, but they don’t offer content models or support semantic interoperability. Just look at all those companies building businesses on top of extracting information from Twitter.
I have put enormous effort into building a framework for better decision support in healthcare, and looking at where we are today, I think there is a future for computable health, so no need (yet) for me to cry over my lost years. The practice of medicine needs this. As hard as it may be, as long as the need is there, there will be effort to deliver it. We should not be too relaxed though, because if the reward is good enough, someone will always build solutions on the interim approach, and if that is perceived as good enough, better may actually kill best.
There are a massive number of tools, technologies and people out there making the web work smarter, even if the components of the web are not necessarily helping. I think healthcare will do better than the web in terms of becoming computable, but it will get there much faster if we can offer system developers and builders an economic benefit, a reason to build computable health systems.