Tinscore and other ways to clone Memeorandum
Following on from my Decisions, decisions post last week, here is what I've worked out. (Yes, I'm spamming Memeorandum's name, so sue me! :P)
On the size of the reading lists, I've decided to go for quality over quantity. I understand Topix has an automated web crawler to discover new content, but we're not at that level yet, so we'll have to plod along tortoise-style, entering details for each site by hand through our ugly forms. Hopefully that will mean better-quality whitelists of sites for each category. Nevertheless, Tai's next job is working on an automated crawler, or at least something that will help humans figure out each HTML site's details more speedily.
On tinscore, the working name for our ranking algorithm: I'm going with simple arithmetic for now and we'll see how that goes. The line in our PHP code that works out the score for each person mentioned in each story currently looks like this:
$tinscore = ($intitle + $freqscore + $prominence) * $share * $storysize;
intitle is either 0 or 1, depending on whether the person is mentioned by name in the title. freqscore is the number of times the person's surname is mentioned in the body text, divided by 10, to a maximum of 2 (i.e. 20 mentions). prominence is a number from 0 to 2 representing how early the person's full name first appears in the body text: right at the start scores 2, halfway through scores 1, at the very end scores 0, with non-integer values in between. share is this person's freqscore as a fraction of the total freqscore of all people mentioned in the story, multiplied by 100, so that if a person is mentioned 5 times but other people are mentioned 10 times between them, that person's share is 33. storysize is the length of the body text as a fraction of 2500 characters, capped at 1 for articles longer than that. The upper limit of tinscore is therefore 500, i.e. (1 + 2 + 2) * 100 * 1. The highest score I've seen from our test data is 462.
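To make that concrete, here is a minimal sketch of the arithmetic in PHP. It assumes naive case-insensitive substring matching, and the function name and signature are illustrative rather than our actual code:

<?php
// A minimal sketch of the tinscore arithmetic, assuming naive
// case-insensitive substring matching. $totalFreqscore is the summed
// freqscore of every person mentioned in the story, this one included.
function tinscore($title, $body, $fullName, $surname, $totalFreqscore)
{
    // intitle: 1 if the person is named in the title, else 0.
    $intitle = (stripos($title, $fullName) !== false) ? 1 : 0;

    // freqscore: surname mentions in the body / 10, capped at 2.
    $mentions = substr_count(strtolower($body), strtolower($surname));
    $freqscore = min($mentions / 10, 2);

    // prominence: 2 at the very start of the body, 0 at the end,
    // scaled linearly in between.
    $pos = stripos($body, $fullName);
    $prominence = ($pos === false) ? 0 : 2 * (1 - $pos / max(strlen($body), 1));

    // share: this person's freqscore as a fraction of everyone's, times 100.
    $share = ($totalFreqscore > 0) ? ($freqscore / $totalFreqscore) * 100 : 0;

    // storysize: body length as a fraction of 2500 characters, capped at 1.
    $storysize = min(strlen($body) / 2500, 1);

    return ($intitle + $freqscore + $prominence) * $share * $storysize;
}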
So that's the tinscore for each "snippet", i.e. each mention of a person in a story. For the purpose of ranking the level of buzz around each person on display pages, the collective tinscores are modified by recency, meaning they are marked down over time. And I haven't even gotten to adding blogs in there yet. That's my next job, with Tai away in Adelaide for a few days. It should be much easier than scraping HTML pages (O, the pain, the pain!).
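As a sketch of one possible markdown (the exact curve isn't settled, and the 24-hour half-life below is just a placeholder), a simple half-life decay would look like this:

<?php
// Halve a snippet's tinscore every $halfLifeHours. Timestamps are
// Unix seconds; the 24-hour half-life is a placeholder, not a decision.
function decayedTinscore($tinscore, $publishedAt, $now, $halfLifeHours = 24)
{
    $ageHours = max($now - $publishedAt, 0) / 3600;
    return $tinscore * pow(0.5, $ageHours / $halfLifeHours);
}

// e.g. that 462-point snippet would count 231 after one day, 115.5 after two.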
Back to the decisions. On the size of the result set: everyone in our database gets indexed if their name is found in a relevant story. There is still some code to write to streamline the process, of course. At the moment, our display page for AFL players makes a minimum of 16 database calls (usually 25+), with sorts on a table of more than 45,000 rows (four months of Media Street archives), and even on our test box on our local LAN I have time to go and get a cup of tea while the results trickle onto the page. As with Media Street, I always knew that pre-populating Tinfinger results was inevitable. I'm sure that's what Gabe Rivera does at Memeorandum too; that's why he "publishes" every five minutes and only shows that published data. Not that I'm criticising at all: that's the correct thing to do if you haven't got a server farm the size of Google's. As Tinfinger's range of categories grows, even that prepopulated table structure might prove problematic if we stick results for all categories in the same table, as I will do at first. We'll cross that bridge when we come to it.
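As a sketch of what that pre-population might look like, here is a hypothetical cron job run every five minutes that writes aggregated, decayed tinscores into a small summary table. The snippet and buzz_rank table names are made up for illustration:

<?php
// Aggregate decayed tinscores per person into a summary table so that
// display pages read one cheap row per person instead of sorting
// 45,000+ rows on the fly. Table and column names are invented.
$db = new PDO('mysql:host=localhost;dbname=tinfinger', 'user', 'pass');

$db->exec('TRUNCATE TABLE buzz_rank');
$db->exec(
    'INSERT INTO buzz_rank (person_id, buzz)
     SELECT person_id,
            SUM(tinscore * POW(0.5, TIMESTAMPDIFF(HOUR, published_at, NOW()) / 24))
     FROM snippet
     GROUP BY person_id'
);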
On the structure of human metadata: I guess it has to be keyword-based tags, not plain English with spaces. Tags are undeniably useful for a range of applications, although we sacrifice some usability for people not familiar with the way tags work (but even that can be solved with judicious use of plain English metadata attached to the tags).
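For illustration, pairing each keyword tag with plain-English metadata might look something like this (tag names and fields are invented):

<?php
// Keyword tags with plain-English metadata attached, so the system can
// key on the tag while the UI shows friendly labels.
$tags = array(
    'afl' => array(
        'label' => 'AFL players',
        'blurb' => 'People who play Australian rules football at the top level',
    ),
    'fed-pol' => array(
        'label' => 'Federal politicians',
        'blurb' => 'Members of the Australian federal parliament',
    ),
);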
Finally, on disambiguation... well, it's still a bitch. Having said that, it will be worth the effort: I would hate for Tinfinger to have as many false positives as ZoomInfo seems to have, judging from my (admittedly limited) sampling. For a human search engine to have any credibility, it should strive for as close to zero false positives as possible. I'd rather leave out 99 accurate items than include one wrong one. Perfection may be forever elusive, as with Wikipedia, but if you set up the site's processes to concentrate your best energies on the quest for accuracy, that's the best anyone can do.
2 Comments:
I'm sure you've noted the Alexa announcement, Paul. It would perhaps be a way to do all the hard work of collecting data?
Sure, there are plenty of ways, even without paying for it as Alexa wants you to. It's possible to scrape results from any search engine if you put in enough work to clean up the raw HTML. I don't think large-scale crawling for discovery is the answer for Tinfinger, though: it's more focused on smaller, OPML-sized personal reading lists that people will upload themselves.