Wikipedia Categories
June 30, 2008 – 6:50 pm
As RSS content flows through mSpoke’s data center, we tag most items with the Wikipedia categories to which they most closely relate. This was originally developed to help the personalization in our FeedHub app, but, not surprisingly, API access to these category assignments has proven valuable to other applications (for example, Jeff Nolan did a great blog post on how NewsGator uses this data).
One of the perks of doing the job we do is that we get to work with this data every day. Since it’s not private user-data and it’s all neatly arranged in a database, we can take a look at it whenever we want! It’s pretty interesting to see the pulse of the blogosphere fly by every day. If you’re into that kind of thing, here’s a quick sample:
This table shows the number of items that were put into each of the top-level Wikipedia categories during a several-hour snapshot:
| Top-level category | Items |
|---|---|
| (non-English) | 55847 |
| (unclassified) | 22580 |
| Agriculture | 1201 |
| Applied_sciences | 4022 |
| Archaeology | 39 |
| Architecture | 2484 |
| Arts | 521 |
| Biology | 2589 |
| Chemistry | 529 |
| Computing | 25246 |
| Crafts | 524 |
| Culture | 2821 |
| Earth | 1041 |
| Economics | 6569 |
| Education | 2102 |
| Entertainment | 32778 |
| Events | 2231 |
| Film | 1067 |
| Geography | 951 |
| Health | 519 |
| History | 2264 |
| Humans | 5118 |
| Language | 773 |
| Law | 1508 |
| Literature | 5404 |
| Mathematics | 363 |
| Medicine | 4913 |
| Military | 398 |
| Music | 3511 |
| Nature | 383 |
| People | 10774 |
| Philosophy | 561 |
| Physics | 825 |
| Psychology | 681 |
| Radio | 1040 |
| Religion | 1313 |
| Science | 1201 |
| Society | 42207 |
| Technology | 8439 |
| Visual_arts | 1779 |
The items used here all come from feeds that were uploaded by FeedHub users, so this is a pretty good sample of the material flowing through popular feeds. One caveat is that only English items are actually classified right now (everything else is lumped into the “(non-English)” row), but we’re working on being able to classify other languages as well. If you feel strongly about support for a specific language, please let us know.
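If you’re curious how a tally like the top-level table could be put together, here’s a minimal sketch in Python. To be clear, this is not our production code: the record layout, the field names (`language`, `top_level`), and the sample items are all made up for illustration. It just shows the general idea of bucketing each item by its top-level category, with separate buckets for non-English and unclassified items.

```python
from collections import Counter

# Hypothetical records: field names and values are illustrative only,
# not the actual mSpoke schema.
items = [
    {"title": "New laptop review", "language": "en", "top_level": "Computing"},
    {"title": "Election roundup", "language": "en", "top_level": "Society"},
    {"title": "Nouvelles du jour", "language": "fr", "top_level": None},
    {"title": "Untitled", "language": "en", "top_level": None},
    {"title": "Kernel patch notes", "language": "en", "top_level": "Computing"},
]


def bucket(item):
    """Return the table row an item would be counted under."""
    if item["language"] != "en":
        return "(non-English)"
    # English items with no assigned category fall into "(unclassified)".
    return item["top_level"] or "(unclassified)"


def tally(items):
    """Count how many items land in each bucket."""
    return Counter(bucket(item) for item in items)


counts = tally(items)
for category, n in sorted(counts.items()):
    print(f"{category}\t{n}")
```

A `Counter` keeps this to a few lines. In practice you would more likely run the equivalent `GROUP BY` query directly against the database.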
The top-level assignments are interesting, but the actual item-by-item assignments are even better (and they change more frequently). This table shows the 20 most common category assignments across FeedHub personalized feeds for the same time period.
| Category | Items |
|---|---|
| Web 2.0 | 4820 |
| Online social networking | 2026 |
| Photo sharing | 1691 |
| Politics | 1462 |
| Video games | 1425 |
| American culture | 1257 |
| Murder | 1136 |
| System administration | 1071 |
| Stock market | 935 |
| Occupations | 900 |
| Marketing | 891 |
| Management | 851 |
| Economics | 751 |
| Music | 624 |
| Laptops | 585 |
| Mobile phones | 553 |
| Personal development | 492 |
| Software | 485 |
| Urban issues | 458 |
| Futurology | 435 |
Note: Flickr feed content goes into the “Photo sharing” category.
This is far from scientific, since our FeedHub users are probably atypical: they are primarily people who read several hundred RSS feeds and tend to be bleeding-edge, high-tech, totally cool, good-looking people with lots of friends.
Even if our sample set isn’t exactly McKinsey-level dependable data, it’s still really interesting to look at. Hopefully we’ll spill more of this type of anonymous data out in the future if other people find it as interesting as I do.
- Sean C


My colleague, Paul Ogilve, wrote a post going into great detail on the relevancy improvements in this release. Personally, I’m most excited about leveraging Wikipedia for the taxonomy of our category memes, and a new meme that recommends posts with significantly more comments than are typical for other posts from that source. Obviously, you’re the judge – but, in my internal testing, I have found both of these changes to dramatically improve the quality of the items being recommended to me.