Sunday, January 20, 2008

Picks and shovels of the semantic Web want to be free

A lot of people have been asking me over the past week or so, during the beta launch of Tinfinger, what it is about and why we are doing it at all. At the same time, I have been watching a few strands of conversation across the blogosphere which have crystallised my answer to that question. So here it is.

The first hint came the day before we launched, with the news that MetaWeb had raised US$42 million in Series B funding, bringing its total investment to US$57 million and causing some industry incredulity. MetaWeb's Freebase is doing something that I hope to do with Tinfinger: creating a freely available semantic Web database of all the world's information (although with Tinfinger we're sticking to the people vertical).

MetaWeb's business model for their flagship product Freebase, stated as something of a vague afterthought in their FAQ, is to charge large corporate users for access to that database, which is licensed under CC-BY, meaning that it's free to use with attribution back to the source. Using CC-BY is sensible for some uses - indeed, Tinfinger will do the same for its data and 150-word profile articles - but to me it seems strange for MetaWeb because the economics are all wrong. It makes perfect sense for Wikipedia to require attribution in the same way, because although they don't allow money to change hands for the production of any of their content, the currency they operate in is PageRank, and an attribution license is arguably the finest PageRank-building mechanism known to man. If you are wondering why Wikipedia is in the top 10 results for just about every search term in Google, look no further than that attribution requirement, because they get links back from every page on the Web which reprints Wikipedia content, of which there are legion. But what does PageRank mean for MetaWeb and Freebase? Freebase is not a destination site. They have not shown the slightest inclination to build landing pages. They display no knowledge of SEO techniques. CC-BY is useless to them.

It is an old cliche that the people who make money out of gold rushes are those selling the picks and shovels. MetaWeb is endeavouring to be the goldpanning equipment vendor for the semantic Web, which is a respectable goal. But how can you turn a dime if there is a place next door which is giving away dynamite for free? Let us be honest about the origins of Freebase, Tinfinger, Google Base, Twine, Spock et al. All such attempts to build the semantic Web have used, as the core of their proprietary/licensed database, freely available (or at least freely scrapeable) databases such as dbpedia, ISBNdb, IMDb, IBDb, ITDb, BASE, Cricinfo, all the way to Project Gutenberg. It is my opinion that the economics of the database industry are such that, eventually, most of the important databases will be made available for free online. After a somewhat moribund period in the 90s, storage hardware has been undergoing some very rapid Moore's-Law-style advancements this decade, and it will not be long before we have highly affordable solid state drives which are Internet-ready. Cost will not be an issue. It's probably not really an issue right now anyway; it's just a matter of the politics of shoveling huge data silos like SEC filings out from behind corporate paywalls.

To my mind, on one side you have MetaWeb, LexisNexis, EDGAR Online and the rest of the cabal, who are relying on siphoning micropaid profits from the licensing of data when the semantic Web takes off. On the other side, you have... the entire Internet. How can the semantic Web take off when Big Companies are standing in its way? The Internet finds a way around. In this case, it finds a way to create its own semantic database, which might not be perfectly crafted or 100% reliable, but in the words of Rich Skrenta, "the cheap rickety thing wins in the end".

You may still ask, "But isn't that what Wikipedia is for too?" Wikipedia is a fabulous resource for prose text, but MediaWiki was not built from the ground up to handle tagging to the extent that a fully fledged semantic Web application would require. Wikipedia's tag system is ad hoc, bootstrapped and too prone to user error.

This is where I hope Tinfinger can help. I didn't talk up our tagging features very much at launch because the tag data in our system now is mostly adapted from Wikipedia and thus not of the greatest quality - as was rightly pointed out already - but I think that's where the eventual power of Tinfinger will lie, once we implement the full system.

I think it's only fitting that a site such as Tinfinger, which builds on top of public domain data sources, contributes back to the public domain to the extent that economics allow. We will try to publish as much of our structured data as possible in ways that can help you with your own projects, the same way as Wikipedia allows you to add instant content to your Web pages. With attribution, of course. ;)
