November 5, 2008 – 4:16 pm
One of the beautiful things about the Internet is the ease with which anyone can become an author and publisher. Unfortunately, the sheer volume of information out there makes it challenging to get your voice heard. This post is the first in a series that tries to tease apart which aspects of feed items correlate with their popularity. I’m a fairly sporadic author, so don’t be surprised if there’s a substantial gap between posts. The series will be fairly exploratory data analysis. In this first post, I want to examine whether well-written feed items are more likely to receive attention than poorly written ones.
To start, we’ll need a bunch of feed items and measures of popularity and writing quality. For the popularity measure, I will use the feed item’s NewsGator attention score. While NewsGator has done some additional work since this post to renormalize the scores to range from 0 to 10, the post should still give you a good idea of what factors into an item’s score. Put simply, the larger the NewsGator attention score, the more popular the feed item.
I sampled 1,000 feeds collected by FeedHub between Thursday, October 16, 2008 and Thursday, October 23, 2008. I discarded feeds that had no item with a non-zero NewsGator attention score. I also filtered out feeds where fewer than 75% of the items in the date range were written in English and had at least 1,000 bytes of unformatted text. I did this to focus the feed selection on full-text feeds written in English.
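The selection rules above can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not the code I ran: the `Item` record, its `is_english` flag (assumed to come from some language detector), and the constant names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """Hypothetical record for one feed item."""
    text: str          # unformatted text of the item
    is_english: bool   # assumed output of some language detector
    attention: float   # NewsGator attention score


MIN_BYTES = 1000       # minimum length of unformatted text
MIN_FRACTION = 0.75    # share of items that must qualify


def is_full_text_english(item: Item) -> bool:
    """True if the item is English and has at least MIN_BYTES of text."""
    return item.is_english and len(item.text.encode("utf-8")) >= MIN_BYTES


def keep_feed(items: List[Item]) -> bool:
    """Apply both feed-level filters to the items in the date range."""
    if not items:
        return False
    # Filter 1: at least one item with a non-zero attention score.
    if not any(item.attention > 0 for item in items):
        return False
    # Filter 2: at least 75% of items are long enough and in English.
    qualifying = sum(1 for item in items if is_full_text_english(item))
    return qualifying / len(items) >= MIN_FRACTION
```

A feed with four items, three of them long English posts and at least one with a non-zero score, would pass; the same feed with every score at zero would not.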
Measuring whether a feed item is well-written is difficult. As a crude proxy, I use the php-text-statistics package to compute the Flesch-Kincaid Grade Level. This measure has been around for years and looks at the number of words per sentence and the number of syllables per word to estimate the number of years of education expected for a reader to understand the text. I also look at the length in bytes of the posts because that is easy to compute.
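For readers who want to see what goes into the grade level, here is a minimal Python sketch of the standard Flesch-Kincaid formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The syllable counter is a deliberately crude assumption (it counts runs of vowels); it is not how php-text-statistics counts syllables, so the numbers will not match that package exactly.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: one syllable per run of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(len(groups), 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


if __name__ == "__main__":
    print(flesch_kincaid_grade("The cat sat on the mat."))
```

Short sentences built from one-syllable words score very low (even below zero), while long sentences full of polysyllabic words push the grade up.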
I use box-and-whisker plots to compare the Flesch-Kincaid Grade Level and feed item length to the NewsGator attention score. Before we look at those plots, we should understand the distribution of NewsGator attention scores in my dataset. A simple histogram shows the distribution:

The small counts of items with NewsGator attention scores above 7 suggest that we might not want to trust the box-and-whisker plots in that data range. Comparing the NewsGator attention score to the Flesch-Kincaid Grade Level reveals no correlation between these two measures:

In the box-and-whisker plot, the box covers the middle 50% of the feed items in the bucket, the bold horizontal line shows the median value, and the circles show outliers. While there is no correlation between the NewsGator attention score and Flesch-Kincaid Grade Level, it is interesting that the middle 50% of feed items have a grade level ranging from 6.7 to 10.8 with a median grade level of 8.7.
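The quantities a box-and-whisker plot draws can be computed directly. The sketch below uses Python's standard `statistics` module and assumes the common convention that points more than 1.5 times the interquartile range beyond the box are drawn as outliers; the plotting software may use slightly different quartile and outlier rules, and the `box_summary` name is just for illustration.

```python
import statistics
from typing import Dict, List


def box_summary(values: List[float]) -> Dict[str, object]:
    """Quartiles and outliers for one bucket of a box-and-whisker plot."""
    # The three cut points split the data into four equal-sized groups.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: anything beyond 1.5 * IQR from the box is an outlier.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


if __name__ == "__main__":
    grade_levels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up values
    print(box_summary(grade_levels))
```

The box spans `q1` to `q3`, so it covers the middle 50% of the items in the bucket.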
We find a similar lack of correlation when comparing length to the NewsGator attention score:

I used a log scale on the vertical axis of this plot because the distribution of feed item length is heavily skewed. I won’t dwell on the length patterns shown here; I almost certainly introduced some bias in these numbers during feed selection.
It is also interesting to ask whether we see different patterns of popularity relative to a particular feed. For example, are the more difficult posts more or less likely to be popular than other posts from that same feed? To examine this question, I normalized the three measures into percentiles within each feed. If an item has a NewsGator attention score percentile of 0.8, then its score is at least as large as that of 80% of the items in the feed. This normalization process is a little noisy; many of our feeds had only a small number of items in the one-week period I used to collect the data. A histogram of the normalized NewsGator attention scores confirms this:

If we had larger samples from each of the feeds, we’d expect this histogram to be a little more uniformly distributed. When we compare the NewsGator attention score percentiles to the percentiles for Flesch-Kincaid Grade Level or feed item length, these data look remarkably uncorrelated. There is little to suggest that feed item length or reading difficulty is predictive of item popularity within a feed.
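To make the per-feed normalization concrete, here is a small Python sketch. It assumes one plausible definition of the percentile (the fraction of items in the feed whose value is less than or equal to the item's own value, counting the item itself); the function name is hypothetical and ties may have been handled differently in the actual analysis.

```python
from typing import List


def percentile_within_feed(values: List[float]) -> List[float]:
    """Map each value to the fraction of the feed's values that are <= it."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


if __name__ == "__main__":
    # Attention scores for the items of one made-up feed.
    print(percentile_within_feed([0, 2, 5, 5, 9]))
    # A feed with a single item can only ever produce one percentile value.
    print(percentile_within_feed([3.0]))
```

With only a handful of items per feed, the percentiles can take just a few distinct values, which is one reason the histogram is lumpier than a uniform distribution.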


When I set out to write this post, I hoped to find some interesting correlations between feed item popularity and other features of the feed items. I wasn’t naive enough to believe I’d find strong correlations, but I was hoping to confirm some common-sense wisdom. This post looked at two crude surface features, reading difficulty and post length, in an attempt to understand whether a “well-written” post is more likely to be popular than a poorly written one. I failed to find any correlation between these features and popularity. Does that mean that I personally believe that the writing of the post doesn’t matter? Absolutely not.
-Paul