Monday, January 08, 2007

News 2.0 alpha algo mojo au-go-go

Today I had one of those moments of conceptual breakthrough that make this caper worthwhile, at least in this pre-launch alpha larval limbo stage we're in at Tinfinger. After reading about the widespread pooh-poohing of Daylife a few days ago, I had been worrying about several of the criticisms of Daylife and whether Tinfinger would itself stand accused of same when it officially launches. On one of those points, I think we're safe: the criticism, characterised best by Scott Karp, that Daylife's editorial algorithm offers nothing new or trustworthy.

Current providers in what has been called the News 2.0 sector have traditionally used one of three methods to sort their news items and figure out which ones should be given most prominence. The simplest is to use humans, whether your users (Digg), paid editors (Netscape) or a combination of the two (Yahoo!). The second method, made most successful by Google News, is a fully automated set of algorithms giving rankings to various arbitrary concepts like relevance, recency, frequency of certain search terms, article length and so on. This ranking method can only be employed with an extremely imposing amount of processing power since the algorithms are very complex in their execution: last time I heard, Google updated their front pages only every 15 minutes. I don't know how much RAM Topix.net, Newsvine, Inform.com and Gather.com throw at their algos, but it must be in a similar range to GN.

The third method, popularised by Memeorandum and Techmeme and the other sites in the Gabe Rivera family, is based on hyperlinks between articles, creating clusters of linked stories. This cluster method has the advantage of being relatively simple in database terms so that it can be operated on one-man-budget startups, such as those of Gabe, Kevin Burton of Tailrank and Matthew Chen of Megite.
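To make the cluster idea concrete, here is a minimal Python sketch of grouping stories by the hyperlinks between them. It is not how Memeorandum, Tailrank or Megite actually work (none of them publish their code); it simply treats any two crawled articles joined by a link, directly or through a chain of links, as one cluster. The function name `cluster_by_links` and the input shape (a dict mapping each article URL to the URLs it links to) are my own assumptions for illustration.

```python
from collections import defaultdict


def cluster_by_links(links):
    """Group article URLs into clusters of stories connected by hyperlinks.

    `links` maps each crawled article URL to the list of URLs it links to
    (a hypothetical input shape). Uses a small union-find structure.
    """
    parent = {}

    def find(url):
        # Return the representative URL of the cluster containing `url`.
        parent.setdefault(url, url)
        while parent[url] != url:
            parent[url] = parent[parent[url]]  # path halving
            url = parent[url]
        return url

    for source, targets in links.items():
        find(source)  # make sure unlinked articles still get a cluster
        for target in targets:
            if target in links:  # ignore links to pages we never crawled
                parent[find(source)] = find(target)

    clusters = defaultdict(set)
    for url in links:
        clusters[find(url)].add(url)

    # Biggest clusters first, which is roughly "most discussed story first".
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


example = {"a": ["b"], "b": [], "c": ["a", "x"], "d": []}
print(cluster_by_links(example))
```

The whole thing is one pass over the link table plus a dictionary, which is the sense in which this approach is cheap in database terms.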

The problem with both of these automated methods is that they tend to produce samey results, which makes it hard for the market followers to differentiate themselves. The ranking methodologists have the big G sitting there with all its toys giving it enough scale to comfortably keep its lead, and the Memeoclones have to suffer through forever being compared to Gabe. My guess is that Daylife, despite the VC dollars, doesn't have enough money to spend on iron big enough to handle GN-scale processing either, but they look like they are using that method along with Newsvine, Gather.com, Inform.com and the other wannabes.

Thus with the market winners already decided, so it seems, the smart operators have decided to take a different tack. Topix has already chucked in its lot with local newspapers, and is doing quite well from all reports. Matthew Chen is doing some interesting things with personalisation and has a link up on Megite for licensing. I'm sure Kevin Burton is dreaming up something in that humungous brain of his, though from the looks of it Tailrank might be moving away slightly from the Memeo link model and towards ranking algos.

So, where does Tinfinger fit in with all this? That was the subject of my revelation today. The Tinfinger algorithm, which goes by the name of tinscore, has been getting its first real workout over the past fortnight or so as we added a few more blogs, and it will get even more meat to chew on once our site discovery script is unleashed. For the first time, I can see how the equations resolve into reality with a non-pre-alpha-sized database to work with. And the results, I am happy to say, are different. They identify many of the same topic clusters as the other News 2.0 sites, but the stories which are chosen to get top billing are often not the primary or original breaking news story. They are the more thoughtful, longer, in-depth feature articles. That's the way that the algo works, unlike the status quo: timeliness is not much of a factor, as long as the story appeared in the last 24 hours; links mean nothing; all sites are ranked equally. What matters is the prominence given to the names, their frequency in the story, the number of other people mentioned in the story, and the length of the article.
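For the curious, the ingredients listed above can be sketched in a few lines of Python. This is emphatically not the real tinscore: the weights, the log scaling, the way prominence is measured (headline mention, or how early the name first appears) and even the assumption that more co-mentioned people pushes a score up are all placeholders of mine. Only the list of inputs comes from the description above: a 24-hour cutoff instead of a recency score, no links, no per-site weighting, and name prominence, name frequency, other people mentioned and article length as the things that count.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
import math


@dataclass
class Article:
    title: str
    body: str
    published: datetime
    people: list[str]  # names found in the article (extraction not shown)


def tinscore_sketch(article, name, now, weights=(2.0, 1.0, 0.5, 1.0)):
    """Hypothetical score of `article` for the person `name`.

    `weights` are made-up values for (prominence, frequency, other people,
    length). No link counts and no per-site reputation are used.
    """
    w_prominence, w_frequency, w_others, w_length = weights

    # Timeliness is only a gate: anything older than 24 hours is out.
    if now - article.published > timedelta(hours=24):
        return 0.0

    title = article.title.lower()
    text = article.body.lower()
    needle = name.lower()
    frequency = text.count(needle)
    if frequency == 0 and needle not in title:
        return 0.0  # the article is not about this person at all

    # Assumed prominence measure: a headline mention counts fully,
    # otherwise earlier first mentions count for more.
    if needle in title:
        prominence = 1.0
    else:
        prominence = 1.0 - text.find(needle) / max(len(text), 1)

    others = len({p.lower() for p in article.people} - {needle})
    words = len(article.body.split())

    return (w_prominence * prominence
            + w_frequency * math.log1p(frequency)
            + w_others * math.log1p(others)   # sign of this term is a guess
            + w_length * math.log1p(words))


now = datetime(2007, 1, 8, 12, 0)
brief = Article("Phone unveiled", "Steve Jobs spoke today.",
                now - timedelta(hours=1), ["Steve Jobs"])
feature = Article("Inside Apple",
                  "Steve Jobs " * 5 + "word " * 500 + "Jonathan Ive",
                  now - timedelta(hours=20), ["Steve Jobs", "Jonathan Ive"])
print(tinscore_sketch(feature, "Steve Jobs", now)
      > tinscore_sketch(brief, "Steve Jobs", now))
```

Even with invented weights, the behaviour matches the observation above: the 20-hour-old long feature outranks the one-hour-old brief, because being newer inside the 24-hour window earns nothing.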

In some ways, Tinfinger will be different because it's not News 2.0, it's Feature 2.0. Substantial MSM feature articles, expansively-argued opinion pieces and passionately wordy blog rants which would on other aggregators get shifted to the bottom of the pile (or ignored entirely) because they would be considered as secondary sources are instead pushed in front of the footlights in our system, and used as the basis for their own clusters with names as the connecting factor.

Because of this thought, despite having previously architected the system to harvest external links from all articles and waiting until now to incorporate them into the Tinfinger algorithm, I am seriously considering not including links in our algo at all. Including them would defeat the purpose of having something distinctive. The algo's simplicity, its dumbness if you will, is what makes it unique. Hopefully it will prevent people throwing the same darts at Tinfinger on our launch day as were aimed at Daylife.
