Saturday, October 29, 2005

The untapped archive of non-RSS feeds

One of the most valid criticisms of Web 2.0 goes like this: yes, it's a good thing in general that entry costs are much lower these days, with businesses able to reach profitability on only a six-figure investment. But that means that even if you have a good idea, it's highly likely that at least 20 other spotty-faced entrepreneurs with at least as much technical nous as you have had the same idea, so you're probably screwed anyway. RSS/Atom aggregators are perhaps the prime example of this (sorry Ben, Feedtagger qualifies!).

Tinfinger (you remember, the "human search engine") is going to aggregate RSS feeds amongst other functions, so we are as much of a target of this criticism as anyone. However, the original inspiration for Tinfinger came from something which had no RSS in it at all. The site's "proof of concept" was a feature on our AFL fantasy football site FanFooty called Media Street, which indexes stories from 35 news sites.

However, unlike those two services, none of the sites indexed on Media Street have RSS feeds. You might ask why we went ahead and built an indexing service. What a pain in the arse that was! Sure, we could have waited until at least some of those sites produced RSS feeds, but I guess we'd be waiting a long time and footy fans want their news NOW. To build that service, Tai and I had to code up an indexing script that would look through each site's story archive page for new stories and then index those new pages when they appeared. And it updates every five minutes via a script running on my home PC. Craziness!
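For the curious, here is a rough Python sketch of the "look through the archive page for new stories" step. It is not the actual Media Street script: the class and function names, the idea of matching on a substring of the URL, and the in-memory set of seen URLs are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (absolute URL, link text) pairs for every <a href> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the archive page's URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_new_stories(archive_html, base_url, link_identifier, seen_urls):
    """Return story links not seen before, and record them in seen_urls.

    link_identifier is a hypothetical per-site substring (for example
    "/news/") that separates story links from navigation links.
    """
    parser = LinkCollector(base_url)
    parser.feed(archive_html)
    parser.close()
    new_stories = []
    for url, title in parser.links:
        if link_identifier in url and url not in seen_urls:
            seen_urls.add(url)
            new_stories.append((url, title))
    return new_stories
```

Run something like this against each site's archive page on a timer and you have the skeleton of the service. At a five-minute interval, 35 sites works out to 35 × 288 = 10,080 page fetches a day, whether or not anything new has been published.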

I don't have any actual figures on how many news sites have RSS feeds at the moment, but I'm guessing it's a minuscule portion of the whole, say 5% (© PNOOMA Research). What about the other 95% of sites? Are their contributions to the Internet to be cast aside? I certainly couldn't build an automated news aggregator for specialist Australian rules football content using only RSS sources, because the ratio of non-RSS publishers to RSS converts in that sector runs at exactly 35:0.

This gets me back to my starting point of how all businesses in the RSS aggregator industry are the same. Why? Because they all rely on the same stream of RSS/Atom feeds from Weblogs.com and the Feedmesh, and the vast majority of content in those two services comes from blogs, not news or other types of feeds. So, a business opportunity exists for companies to go out to the un-RSSified Web and find these feeds, right?

Maybe. The reason RSS is so good is that it is a far more efficient method of delivering new content, so that bandwidth is only used when there is new content, rather than the old way (and the Media Street way) of scraping the index page every X minutes. Why go through all that pain in the short term when Information Age predicts 100 billion RSS feeds within six years? Because users want that content NOW. But can you make enough money to offset the increased bandwidth and hardware costs? And how to find all these lost proto-feeds? How to get non-geek users interested in entering details of index page URLs, link identifiers, header and footer strings, site metadata, and other such technical bits and pieces?
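To give a feel for what those technical bits and pieces might look like, here is a hypothetical per-site record in Python. The field names are invented for illustration, and the reading of "header and footer strings" as markers that fence off the story list on the index page is an assumption, not a description of how Tinfinger or Media Street stores them.

```python
from dataclasses import dataclass, field


@dataclass
class SiteConfig:
    """Hypothetical description of one non-RSS source; names are illustrative."""
    name: str
    index_url: str          # page that lists the latest stories
    link_identifier: str    # substring that marks a link as a story
    header: str             # text assumed to appear just before the story list
    footer: str             # text assumed to appear just after the story list
    metadata: dict = field(default_factory=dict)


def story_region(page_html, config):
    """Return the text between the header and footer strings.

    Returns an empty string if either marker is missing, which is a
    hint that the site has changed its layout.
    """
    start = page_html.find(config.header)
    if start == -1:
        return ""
    start += len(config.header)
    end = page_html.find(config.footer, start)
    if end == -1:
        return ""
    return page_html[start:end]
```

Even this stripped-down version asks a user for five or six values per site, which is exactly why getting non-geeks to fill it in is the hard part.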

These are the issues I'm mulling over. When Tinfinger debuts, hopefully you'll see what I'm talking about.

1 Comment:

Anonymous Gustavo said...

That's a great idea. Is there a way you can share this script? I'd like to bring feeds into my PC from sources that don't have RSS feeds, such as the small local newspapers in my area.

Thanks!

1:54 am, September 06, 2007  
