Archive for October, 2013

I attended another Data Science London meeting last week. As usual, it was a good one. Speakers talked about their experiences with Twitter feeds that include Foursquare check-ins, and with scraping data from web sites. Scraping is, in essence, extracting information from web pages: a program simulates a human's use of a web site in order to get at the information that site provides.
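To make that concrete, here is a toy scraper, using nothing but Python's standard-library HTML parser. This has nothing to do with Lyst's actual code; the markup and the `product-name` class are invented for the example:

```python
from html.parser import HTMLParser

class ProductScraper(HTMLParser):
    """Collect the text of elements marked with class="product-name".

    A toy example: real scrapers also have to cope with messy markup,
    pagination, rate limits and frequent site-layout changes.
    """
    def __init__(self):
        super().__init__()
        self._capture = False
        self.products = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if ("class", "product-name") in attrs:
            self._capture = True

    def handle_endtag(self, tag):
        self._capture = False

    def handle_data(self, data):
        if self._capture and data.strip():
            self.products.append(data.strip())

# A stand-in for a fetched page; a real crawler would download this HTML.
page = (
    '<div><span class="product-name">Wool Coat</span>'
    '<span class="price">£120</span>'
    '<span class="product-name">Silk Scarf</span></div>'
)

scraper = ProductScraper()
scraper.feed(page)
print(scraper.products)  # ['Wool Coat', 'Silk Scarf']
```

The fragility is obvious: rename one CSS class on the retailer's side and the scraper silently breaks, which is exactly the cost a programmatic interface would avoid.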

Both talks were interesting, and both had something in common: the people trying to access data had no programmatic, well-defined method of doing so, so they resorted to other means. The case of Lyst was especially interesting. They have gone to a lot of trouble to set up a system that can collect data from a great many online fashion retailers. They have an infrastructure that extracts information about tens (hundreds?) of thousands of products from lots of web sites, and, as surprising as it may be, they are actually keeping things under control, presenting a single site that lets people access the data as if it came from a single source. A question from the audience was: "do you have any programmatic access to these sites?" As in, do they give you web services? The answer was something along the lines of "very few". It is usually a crawler extracting information from the web site that does the job (though they work with the consent of the sites they parse). I think it was also someone from Lyst (or maybe from the audience, I am not sure) who said that this is pretty much the reality of the web we have today, despite all the hype about the semantic web.

Let me be honest: I never thought the semantic web would take over. Whenever I saw someone giving a talk about the future of the web, about how web sites would talk to each other, about how RDF and OWL would let the web become a gigantic computable knowledge base, I thought: "sorry, you'll never get there". That is because I have spent a serious amount of time developing web applications. I started around '97 and did it seriously until 2004 or so. I got to learn the way things work and to see where trends were going, and after 2004 I still did a lot of development using web technologies, though I did not have to deliver anything that needed to be production quality. When you see both the business and the technology sides of the web, you begin to develop a sense of what will take off in this domain and what will not.
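For contrast, the semantic-web promise was that data would already be published in machine-readable form, so consumers could query it rather than scrape it. A back-of-the-envelope sketch of that idea in plain Python (not real RDF tooling, and the triples are made up for illustration):

```python
# The semantic-web idea in miniature: data as subject-predicate-object
# triples that any program can query directly, no scraping required.
# Real deployments would use RDF serialisations, SPARQL endpoints, etc.

triples = [
    ("coat42", "type", "Product"),
    ("coat42", "name", "Wool Coat"),
    ("coat42", "priceGBP", "120"),
    ("scarf7", "type", "Product"),
    ("scarf7", "name", "Silk Scarf"),
]

def query(s=None, p=None, o=None):
    """Return every triple matching the pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Which subjects are products?
products = [s for s, _, _ in query(p="type", o="Product")]
print(products)  # ['coat42', 'scarf7']

# What does coat42 cost?
print(query(s="coat42", p="priceGBP"))  # [('coat42', 'priceGBP', '120')]
```

Technically elegant, but note what the sketch hides: someone has to publish and maintain those triples, and that is precisely the effort nobody on the web has a business reason to fund.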

I was sure the semantic web would never get to where it was expected to go, because (drumroll) it simply offered no immediate, tangible business value to the owners of web sites. It is the good old semantic interoperability problem we have in healthcare. Someone must justify the extra effort of making data computable by others, or that effort will not be funded. It is an investment in a scenario that would become reality only if everybody committed to it. In healthcare, the need to integrate different sources of data has always been significant, so we got HL7, DICOM, openEHR, 13606, etc. For the web, the need is not really there. In fact, if you are not selling an actual product but making money off information, it is better not to let anyone else process your information programmatically. It is your clicks and your ad revenue that people would be taking away from you, so why let them?

On top of the business-model issues, add the technology problems. The UI on the web is becoming more and more interactive, and JavaScript is ruling everything. Relational DBs and other well-established back-end technologies and architectures are being phased out, and you expect people to keep up with all of that and support OWL or some other semantic web technology on top of it? Not going to happen.

RSS and web services may help islands of information connect to each other, but they are mostly standards for the wire: they help communicate information, but they do not offer content models or support semantic interoperability. Just look at all those companies building businesses on top of extracting information from Twitter.

I have put enormous effort into building a framework for better decision support in healthcare, and looking at where we are today, I think there is a future for computable health, so there is no need (yet) for me to cry over my lost years. The practice of medicine needs this. As hard as it may be, as long as the need is there, there will be effort to deliver. We should not get too relaxed though, because if the reward is good enough, someone will always build solutions on the interim approach, and if those are perceived as good enough, better may actually kill best.

There are a massive number of tools, technologies and people out there making the web work smarter, even if the components of the web are not necessarily helping. I think healthcare will do better than the web in terms of becoming computable, but it will get there much faster if we can offer an economic benefit to system developers and builders: a reason to build computable health systems.



I am a big admirer of Eclipse. It is an incredibly ambitious piece of work. It tackles the problem of creating a platform for software tooling: a platform that can generalize the features of most IDEs, reporting tools, scientific software and even regular desktop applications.

Not everybody agrees with me, of course, when it comes to calling Eclipse an impressive piece of work. I won't waste pages trying to convince those who disagree. Due to its generic infrastructure, Eclipse may not feel like a tool specific to Java development, Python development, etc. Even if you get over the slightly unintuitive feeling it gives you, it is hard to ignore the effort required to make Eclipse your home.

Home in the sense that your particular Eclipse installation supports Java, XML, Python, R, EMF or whatever else you are interested in using (Haskell? Sure, why not?). You configure it, you find the links to update sites and add them to your Eclipse config, you change workbench settings to match your preferences. Then someone else wants to work on your code, or you move to another computer. Or you find yourself in front of a computer that is not yours, but which you need to use for an hour or so to demonstrate something to someone.

Being able to manage your Eclipse installation through the cloud helps in these cases. Imagine being able to share your IDE with your friends, colleagues, or simply with people who want to use your code through the exact same Eclipse setup you have, one that is known to work. Pulse was a product that enabled this for free. Was, because even though it is still available for a few more weeks, it is now being replaced by SDC Cloud Connect from Genuitec.

Genuitec is a company that understands how people use Eclipse, what kinds of problems they have and, more importantly, how those problems can be solved. Pulse was my favourite tool because of this. I have a new computer at UCL? No problem. I install Pulse, pull the installations I want from my profile, and get to work. SDC Cloud Connect replaces Pulse with an Eclipse plugin coupled with a clever web-based interface that does the same job. It is still free until you hit a certain limit on the number of Eclipse instances you host in the cloud. If you pay and go private, you get a lot more: a custom server behind your firewall that lets you deploy your company's version of Eclipse, and the other nice things that people who pay for software get.

For me, Cloud Connect is a way of pushing my well-polished configurations to colleagues and friends who keep saying "I can't spend ages configuring Eclipse". Well, I have spent all the time required, and here is a link for you: go get my statistics benchmark installation. In the future, we may seriously consider this mechanism for distributing openEHR-based tools; it certainly beats explaining the plugin mechanism and so on to first-time Eclipse users.

So if you’re curious about the experience, go visit http://www.genuitec.com/sdc/cloud/ and play with the technology.
