What makes a blog post popular? Part II: subjectivity and polarity

November 24, 2008 – 4:06 pm

This post continues our series on investigating properties of popular feed items.  In our last exploration, we failed to discover any correlation between reading difficulty and the NewsGator attention score.  Now, we want to see how popularity is affected by subjectivity and polarity.  Subjectivity measures the degree to which the statements in the text are subjective (as opposed to objectively written text).  Polarity considers whether the subjective portions of the text express a positive or negative sentiment.

In the analysis below, I used the same feed items as in the previous post.  To compute subjectivity and polarity, I used a slightly modified version of the hierarchical polarity classifier in the LingPipe Sentiment Analysis Tutorial.  This tutorial demonstrates how to extract subjective sentences from text and estimate the polarity of the subjective portions of text.

Since my feed items are not in the same format as that of the training data, I had to make some modifications.  I used LingPipe’s IndoEuropeanSentenceModel to segment the documents into sentences.  To make the text of the items more comparable to that of the training data, I converted all of the text to lowercase and added a space around leading and trailing punctuation.  While these may seem like small details, it is important that the feed items be as similar in form to the training examples as possible.
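
For anyone who wants to try something similar, here is a minimal Python sketch of the lowercasing and punctuation-spacing step.  It is not the code I ran: the function name `normalize_sentence` and the regular expressions are illustrative assumptions, and it starts from text that has already been split into sentences (I used LingPipe for that part, which this sketch does not reproduce).

```python
import re

# Punctuation runs at the very start or very end of a whitespace-delimited token.
# These patterns are an assumption about what "leading and trailing punctuation" means.
_LEADING = re.compile(r"^([^\w\s]+)(?=\w)")
_TRAILING = re.compile(r"(?<=\w)([^\w\s]+)$")


def normalize_sentence(sentence: str) -> str:
    """Lowercase a sentence and split leading/trailing punctuation off each token."""
    tokens = []
    for token in sentence.lower().split():
        token = _LEADING.sub(r"\1 ", token)   # '"great' becomes '" great'
        token = _TRAILING.sub(r" \1", token)  # 'thought.' becomes 'thought .'
        tokens.append(token)
    return " ".join(tokens)


print(normalize_sentence('The plot was "great," I thought.'))
```

Punctuation inside a word, such as the apostrophe in "don't", is left alone because only the edges of each token are touched.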

To visualize the relationship between predicted polarity and NewsGator attention score, I created a kernel density plot for the NewsGator scores of items predicted positive or negative:

Popularity vs polarity

These lines are practically on top of each other, so we can conclude that the polarity predictions are not predictive of the NewsGator attention score.  I also performed a similar analysis on the strength of the polarity estimate and came to the same conclusion.  Those data were harder to visualize because most of the predicted values were near the extremes, so I haven’t included a graph for that analysis.
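
If you want to run a similar comparison on your own data, here is a minimal sketch of a Gaussian kernel density estimate in plain Python.  The scores and the bandwidth of 0.5 are made up for illustration; this is not the code or the data behind the plot above, and a plotting library would normally pick the bandwidth for you.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` with a Gaussian kernel."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical attention scores for items predicted positive and negative.
positive_scores = [0.5, 1.0, 1.2, 2.0, 3.5, 5.0]
negative_scores = [0.4, 1.1, 1.3, 2.1, 3.4, 5.2]

pos_density = gaussian_kde(positive_scores, bandwidth=0.5)
neg_density = gaussian_kde(negative_scores, bandwidth=0.5)

# Evaluate both curves on a grid over the 0 to 10 score range and compare them.
grid = [i / 10 for i in range(0, 101)]
max_gap = max(abs(pos_density(x) - neg_density(x)) for x in grid)
print(f"largest gap between the two density curves: {max_gap:.3f}")
```

When the two curves nearly coincide, as they did in my plot, the largest gap stays small relative to the height of the curves.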

A more basic question is whether or not the presence of subjectivity, regardless of polarity, is correlated with NewsGator attention scores.  The following plot shows the fraction of sentences identified as subjective by the LingPipe classifier vs the NewsGator attention score:

Popularity vs subjectivity

The items with a high NewsGator attention score tend to have a greater percentage of sentences identified as subjective.  There’s still pretty wide variance, so subjectivity is a weak predictor at best.  This trend also only applies to items with a NewsGator attention score of at least 5, corresponding to the top 3.7% of items in this particular dataset.  Nevertheless, we can conclude that there is a tendency for posts that receive a lot of attention to have more subjective sentences than those receiving less attention.

What makes a blog post popular? Part I: Comparing popularity and reading difficulty

November 5, 2008 – 4:16 pm

One of the beautiful things about the Internet is the ease with which anyone can become an author and publisher.  Unfortunately, the sheer volume of information out there makes it challenging to get your voice heard.  This post is the first of a series trying to tease apart what aspects of feed items correlate with their popularity.  I’m a fairly sporadic author, so don’t be surprised if there’s a substantial gap between posts.  The series will be fairly exploratory data analysis.  In this first post, I want to examine whether well-written feed items are more likely to receive attention than poorly written ones.

To start, we’ll need a bunch of feed items and measures of popularity and writing quality.  For the popularity measure, I will use the feed item’s NewsGator attention score.  While NewsGator has done some additional work to renormalize the scores to range from 0 to 10 since this post, it should give you a good idea about what factors into an item’s score.  Put simply, the larger the NewsGator attention score, the more popular the feed item.

I sampled 1,000 feeds collected by FeedHub between Thursday, October 16, 2008 and Thursday, October 23, 2008.  I discarded feeds that did not have an item with a non-zero NewsGator attention score.  I also filtered out feeds where fewer than 75% of the items in the date range were in English and had at least 1,000 bytes of unformatted text.  I did this to focus the feed selection on full-text feeds written in English.

Measuring whether a feed item is well-written is difficult.  As a crude proxy, I use the php-text-statistics package to compute the Flesch-Kincaid Grade Level.  This measure has been around for years and looks at the number of words per sentence and the number of syllables per word to estimate the number of years of education expected for a reader to understand the text.  I also look at the length in bytes of the posts because that is easy to compute.
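
For reference, the Flesch-Kincaid Grade Level is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59.  Here is a minimal Python sketch of that formula.  It is not the php-text-statistics implementation: the sentence splitting and the vowel-group syllable counter below are crude assumptions of my own, so the numbers will differ somewhat from what that package reports.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Estimate the Flesch-Kincaid Grade Level of a block of English text."""
    # Naive sentence split on terminal punctuation (an assumption, not a real segmenter).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


print(round(flesch_kincaid_grade("The cat sat on the mat."), 2))
```

Short sentences made of one-syllable words can score below zero, which is a quirk of the formula and not a bug in the sketch.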

I use box-and-whisker plots to compare the Flesch-Kincaid Grade Level and feed item length to the NewsGator attention score.  Before we look at those plots, we should understand the distribution of NewsGator attention scores in my dataset.  A simple histogram shows the distribution:

Histogram of NewsGator Attention Scores

The small counts of items with NewsGator attention scores above 7 suggest that we might not want to trust the box-and-whisker plots in that data range.  Comparing the NewsGator attention score to the Flesch-Kincaid Grade Level reveals no correlation between these two measures:

Grade Level Distribution Plotted by NewsGator Attention Score

In the box-and-whisker plot, the box covers the middle 50% of the feed items in the bucket, the bold horizontal line shows the median value, and the circles show outliers.  While there is no correlation between the NewsGator attention score and Flesch-Kincaid Grade Level, it is interesting that the middle 50% of feed items have a grade level ranging from 6.7 to 10.8 with a median grade level of 8.7.
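
To make the pieces of a box-and-whisker plot concrete, here is a small Python sketch that computes the box, the median, and the outliers for one bucket of values.  The 1.5 × IQR rule for flagging outliers is a common plotting convention that I am assuming here, and `box_stats` and the sample values are made up for illustration.

```python
import statistics


def box_stats(values, whisker=1.5):
    """Return the quartiles and the outliers that a box-and-whisker plot would show."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # the box spans q1..q3, i.e. the middle 50% of the values
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical grade levels for the items in a single attention-score bucket.
grade_levels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
print(box_stats(grade_levels))
```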

We find a similar lack of correlation when comparing length to the NewsGator attention score:

Item Length Plotted by NewsGator Attention Score

I used a log-scale on the vertical axis of this plot due to the skewed distribution of feed item length.  I won’t dwell on the length patterns shown here; I almost certainly introduced some bias in these numbers during feed selection.

It is also interesting to ask whether we see different patterns of popularity relative to a particular feed.  For example, are the more difficult posts more or less likely to be popular than other posts from that same feed?  To examine this question, I normalized the three measures into per-feed percentiles.  If an item has a NewsGator attention score percentile of 0.8, then we expect that the item has a score at least as large as 80% of the items in the feed.  This normalization process is a little noisy; many of our feeds only had a small number of items in the one-week period I used to collect the data.  A histogram of the normalized NewsGator attention scores confirms this:

Histogram of Normalized NewsGator Attention Scores
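
The per-feed percentile normalization can be sketched in a few lines of Python.  This is an illustration under my own assumptions, not the script I used: `percentile_ranks` is a made-up name, the input is a list of (feed id, value) pairs, and ties are counted as "at least as large".

```python
from collections import defaultdict


def percentile_ranks(items):
    """For each (feed_id, value) pair, return the fraction of values in the
    same feed that the value equals or exceeds."""
    by_feed = defaultdict(list)
    for feed_id, value in items:
        by_feed[feed_id].append(value)

    ranks = []
    for feed_id, value in items:
        values = by_feed[feed_id]
        at_or_below = sum(1 for v in values if v <= value)
        ranks.append(at_or_below / len(values))
    return ranks


# Hypothetical attention scores: feed "a" has four items, feed "b" has one.
items = [("a", 1.0), ("a", 2.0), ("a", 3.0), ("a", 4.0), ("b", 5.0)]
print(percentile_ranks(items))
```

The single-item feed shows where the noise comes from: with only one item, that item lands at the top percentile no matter what its score is.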

If we had larger samples from each of the feeds, we’d expect this histogram to be a little more uniformly distributed.  When we compare the NewsGator attention score percentiles to the percentiles for Flesch-Kincaid Grade Level or feed item length, these data look remarkably uncorrelated.  There is little to suggest that feed item length or reading difficulty is predictive of item popularity within a feed.

Normalized Grade Level Distribution Plotted by Normalized NewsGator Attention Score

Normalized Item Length Plotted by Normalized NewsGator Attention Score

When I set out to write this post, I hoped to find some interesting correlations between feed item popularity and other features of the feed items.  I wasn’t naive enough to believe I’d find strong correlations, but I was hoping to confirm some common sense wisdom.  This post looked into some crude surface features of reading difficulty and post length in an attempt to understand whether a “well-written” post is more likely to be popular than poorly written posts.  I failed to find any correlations between these features and popularity.  Does that mean that I personally believe that the writing of the post doesn’t matter?  Absolutely not.

-Paul

Celebrating Our Best Quarter!

October 9, 2008 – 9:39 am

Last week marked the end of another quarter at mSpoke.  Big deal – it was the end of the quarter for most companies.  But, since this is the mSpoke corporate blog, we’re gonna tell you about our exciting end of the quarter!

Anyone who’s ever worked at a software company knows that the end of the quarter usually means a sprint to the finish.  Last week was no exception. The good news is that, when the dust settled at 12:01 AM on October 1, it settled on our best quarter ever by many metrics!

It took a herculean effort by the team, and I was really proud of the effort they made.  So we decided it was time for a little celebrating!
Celebration Toast
Step 1: Toast our success, with champagne compliments of one of our board members, Ed Engler.

Step 2: Tech talk by Paul Ogilvie, our principal scientist, on information retrieval techniques.  (Okay, we’re geeky – we thought it was fun.)

Step 3: Pizza and beer, before heading out to Arsenal Bowling Lanes for a little 10 pin action.  We all thought we were bringing similar skill levels to this outing (namely, zero).  But we discovered a ringer in our midst!  Turns out that Sean Colombo is a pretty amazing bowler, in addition to being a solid programmer.  Here’s an action shot of Sean racking up another strike on his way to burying the rest of us.

Action Shot

At the end of the day, I’m thinking the rest of the team is with me on keeping the scores confidential.  Even Sean C. is no “Deadeye” on his way to the PBA.  But we had a great time and got to blow off some steam after a crazy quarter.

And now we’re back at it, on our way to an even better quarter!

FeedHub Down for Maintenance

August 27, 2008 – 4:52 pm

5:25 pm EDT - The site is back up.  Please let us know if you have any problems accessing FeedHub.

4:50 pm EDT - We had to take FeedHub down quickly due to some database issues.  We will update this post as we make progress and know more.

Sorry for any inconvenience…

- the mSpoke Team

A New Product Launched with NewsGator: Related Content Widget

August 15, 2008 – 11:11 am

At mSpoke, we’ve been collaborating lately on a number of initiatives with NewsGator.  It has been fun to work with the talented team at NewsGator, but now it gets really exciting as we start rolling these offerings out.

The first was announced on the NewsGator Widget Blog yesterday - a Related Content Widget.  As the name implies, this recommends related content from a publisher-defined set of sources.  While this is not a new use case, the post highlights why we think our approach will recommend better content.

Obviously, if you’re interested in learning more, please don’t hesitate to contact us.

Wikipedia Categories

June 30, 2008 – 6:50 pm

As RSS content flows through mSpoke’s data center, we tag most items with the Wikipedia categories to which they most closely relate. This was originally developed to help the personalization in our FeedHub app, but — not surprisingly — API access to these category assignments has proven itself to be valuable to other applications (for example, Jeff Nolan did a great blog post on how NewsGator uses this data).

One of the perks of doing the job we do is that we get to work with this data every day. Since it’s not private user-data and it’s all neatly arranged in a database, we can take a look at it whenever we want! It’s pretty interesting to see the pulse of the blogosphere fly by every day. If you’re into that kind of thing, here’s a quick sample:

This table shows the number of items that were put into each of the top-level Wikipedia categories during a several-hour snapshot:

Top-level category  Items
(non-English)       55847
(unclassified)      22580
Agriculture         1201
Applied_sciences    4022
Archaeology         39
Architecture        2484
Arts                521
Biology             2589
Chemistry           529
Computing           25246
Crafts              524
Culture             2821
Earth               1041
Economics           6569
Education           2102
Entertainment       32778
Events              2231
Film                1067
Geography           951
Health              519
History             2264
Humans              5118
Language            773
Law                 1508
Literature          5404
Mathematics         363
Medicine            4913
Military            398
Music               3511
Nature              383
People              10774
Philosophy          561
Physics             825
Psychology          681
Radio               1040
Religion            1313
Science             1201
Society             42207
Technology          8439
Visual_arts         1779

The items used here all come from feeds that were uploaded by FeedHub users, so this is a pretty good sample of the material flowing through popular feeds. One caveat is that this sample only includes English items right now, but we’re working on being able to classify other languages as well. If you feel strongly about support for a specific language, please let us know.

The top-level assignments are interesting, but the actual item-by-item assignments are even better (and they change more frequently). This table shows the 20 most common category assignments across FeedHub personalized feeds for the same time period.

Category                  Items
Web 2.0                   4820
Online social networking  2026
Photo sharing             1691
Politics                  1462
Video games               1425
American culture          1257
Murder                    1136
System administration     1071
Stock market              935
Occupations               900
Marketing                 891
Management                851
Economics                 751
Music                     624
Laptops                   585
Mobile phones             553
Personal development      492
Software                  485
Urban issues              458
Futurology                435

Note: Flickr feed content goes into the “Photo sharing” category.
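
If you have your own list of item-to-category assignments, a tally like the ones above takes only a few lines of Python.  This is a sketch, not our production query: the (item id, category) pair format and the name `top_categories` are assumptions made for illustration.

```python
from collections import Counter


def top_categories(assignments, n=20):
    """Return the n most frequent categories from (item_id, category) pairs."""
    counts = Counter(category for _item_id, category in assignments)
    return counts.most_common(n)


# Hypothetical assignments; real data would come from the category database.
sample = [(1, "Web 2.0"), (2, "Web 2.0"), (3, "Politics"), (4, "Photo sharing")]
print(top_categories(sample, n=2))
```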

This is far from scientific, since our FeedHub users are probably atypical - they are primarily people who read several hundred RSS feeds and tend to be bleeding-edge, high-tech, totally cool, good-looking people with lots of friends. ;)

Even if our sample set isn’t exactly McKinsey-level dependable data, it’s still really interesting to look at. Hopefully we’ll spill more of this type of anonymous data out in the future if other people find it as interesting as I do.
- Sean C

FeedHub New Version Updates

June 3, 2008 – 6:48 am

Today, 6/3/08, we will be releasing a new version of FeedHub, which includes relevancy and performance improvements.  Check back here for further updates.

6/3/08 2:30pm Update

The latest release of FeedHub is up-and-running.

Extension updated & ready for FireFox 3

May 19, 2008 – 12:07 pm

Over the weekend, FireFox 3 RC1 was released.  That means that the final version of FireFox 3 is right around the corner.

We pushed out an auto-update of the FeedHub Feedback extension which has been tested with FF3 and appears to be working just fine.

When we sent out the new version for FF3RC1 compatibility, we also included an update which added the commonly-requested feature of keyboard-shortcuts for Google Reader. Now, if you want to send feedback for the currently-selected item, just type “,” for a thumbs-up or “.” for thumbs-down. These keys were chosen to work well with j/k navigation.  If you forget these shortcuts and don’t want to dig up this blog entry, you can check out all of Google Reader’s keyboard shortcuts by typing “?” at any time.

In addition, if you want to be more specific about why you liked or disliked an item, there is now a “tell us more” link that shows up after you click the thumbs which allows you to tell us how you felt about the memes that helped select that item for you.

If you’re excited (or at least intrigued) by these changes and don’t want to wait for FireFox to do its auto-update checks, go to the “Tools” menu, select “Add Ons” and then click the “Find Updates” button to get the newest version of the extension.

As always, if something is on your mind - we love to hear your feedback!
- Sean C

TextMate Freemarker Bundle

March 18, 2008 – 11:01 am

Being both a Mac user and a web developer, I’ve become a big fan of TextMate for just about everything except straight Java (it’s pretty tough to beat IntelliJ IDEA for Java!). A while back we made the decision to move from JSP to Freemarker. Unfortunately, there doesn’t seem to be as much support for Freemarker as there is for JSP in TextMate.

So, I started a TextMate Freemarker bundle. It’s fairly basic at the moment, but does have a decent Language syntax definition that plays well with HTML, and a few snippets for Freemarker tags. It even has a few snippets for Spring macros.

If you use TextMate and Freemarker, head over to Google Code and check it out. It carries an Apache License, so no worries. Comments and suggestions are most certainly welcome!

- Brian

FeedHub Feedback extension updated

March 11, 2008 – 1:06 pm

We’ve made some updates to the “FeedHub Feedback” FireFox extension.

I first alluded to its release a couple of months ago and shortly thereafter we announced the extension as part of a significant release.  Since that time, FireFox development has plunged forward and we’ve been releasing updates alongside the new versions of the FireFox betas.

With the release of FireFox 3 beta 4, we’ve released yet another update.  We missed the boat by a little less than a day with beta 4 (purely my fault), and InformationWeek called us on it.

You can rest assured that by the time FireFox 3 is officially released your transition should be completely seamless.

If you ever experience any problems or have any suggestions, please contact us at support@feedhub.com and we’ll make sure your comments find their way to the right people.  We love hearing from you!

Thanks for your time,
- Sean C