If a new article posted to Digg gets 1000 diggs in its first hour, is it going to be the next Big Thing? Or will it be another dud, cast onto the vast pile of forgotten memes?
G. Szabo and B. Huberman of HP’s Social Computing lab sought the answer to this question in “Predicting the popularity of online content”. For their analysis, Szabo and Huberman tracked the popularity of 7,146 videos posted to YouTube and 1.3 million links submitted to Digg over the course of 30 days (for YouTube) and six months (for Digg).
Based on their analysis, Szabo and Huberman note several interesting trends. The core finding of the paper confirms, relatively unsurprisingly, a positive correlation between the number of votes a post receives within its first few hours (Digg) or days (YouTube) and the number of votes it ultimately receives during its lifetime. What is slightly less intuitive is that the (linear) correlation is considerably stronger when the votes are log-transformed, that is, when the log of the number of votes a video or link receives in its first hours is compared to the log of the number of votes it ultimately receives. (We guessed that this might be explained by the same causes that drive the Zipf power-law effects seen in web site visitation patterns.) This observation leads the authors to build three linear estimators on log-transformed votes that extrapolate an item’s final popularity from its initial popularity, and to analyze the average-case performance of each on a held-out test set.
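To make the idea of a linear estimator on log-transformed votes concrete, here is a minimal sketch. It is not one of the paper’s three models: it is a plain least-squares fit of ln(final votes) against ln(early votes), and the function names and the synthetic data generator are hypothetical, invented purely for illustration.

```python
import math
import random


def fit_log_linear(early, final):
    """Least-squares fit of ln(final) = a * ln(early) + b."""
    xs = [math.log(v) for v in early]
    ys = [math.log(v) for v in final]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict_final(early_votes, a, b):
    """Extrapolate a final vote count from an early vote count."""
    return math.exp(a * math.log(early_votes) + b)


def mean_relative_error(early, final, a, b):
    """Average of |predicted - actual| / actual over a set of items."""
    errors = [abs(predict_final(e, a, b) - f) / f for e, f in zip(early, final)]
    return sum(errors) / len(errors)


def make_synthetic_items(n, seed=0):
    """Made-up data (an assumption, not the paper's): early votes are
    log-normal and final votes are early votes times log-normal noise."""
    rng = random.Random(seed)
    early = [math.exp(rng.gauss(3.0, 1.0)) for _ in range(n)]
    final = [e * math.exp(rng.gauss(1.5, 0.4)) for e in early]
    return early, final


if __name__ == "__main__":
    early, final = make_synthetic_items(2000)
    # Fit on the first 1500 items, evaluate on the held-out remainder.
    a, b = fit_log_linear(early[:1500], final[:1500])
    print("slope, intercept:", a, b)
    print("held-out mean relative error:",
          mean_relative_error(early[1500:], final[1500:], a, b))
```

Because the fit happens in log space, a fixed amount of error there turns into a multiplicative error on the raw vote counts, which is worth keeping in mind when reading the paper’s error measures.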
Although the majority of the paper explains itself clearly, the results section becomes dense with statistical detail and is difficult to interpret. By my reading, one of the models outperforms the others, but depending on what you’re trying to do, its performance may not be good enough. If your goal is to determine whether a link or video is going to be hot (within a certain confidence interval) you might be in luck; using these models might help. But if you want to compare the relative likelihood that one article will ultimately surpass another (e.g., rankings), the loose error bounds (and “multiplicative error”) suggest you’d better wait a little longer.
Alongside its main point, the paper also outlines a number of simpler but similarly interesting statistics. Digg articles typically “saturate” in popularity very quickly, usually within 5-10 hours (up to a day), while YouTube videos tend to continue to grow in popularity over weeks. Another observation is that the amount of participation on these sites fluctuates cyclically by hour of the day and day of the week, with activity decreasing in evenings and on weekends. The authors come up with a clever technique to cancel out these cycles from their predictions by re-defining time as “the interval of time required for all articles collectively to accrue a certain number of votes”.
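The re-defined clock described above can be sketched roughly as follows. This is only an illustration under stated assumptions: site-wide votes are available in hourly buckets, votes are spread evenly within each hour, and the names `build_activity_clock` and `votes_per_unit` are hypothetical rather than taken from the paper.

```python
from itertools import accumulate


def build_activity_clock(hourly_site_votes, votes_per_unit):
    """Return a function that converts wall-clock hours into activity-based
    time, where one unit is the time the whole site needs to collect
    `votes_per_unit` votes."""
    if votes_per_unit <= 0:
        raise ValueError("votes_per_unit must be positive")
    # cumulative[h] = site-wide votes cast before the start of hour h
    cumulative = [0.0] + list(accumulate(float(v) for v in hourly_site_votes))

    def to_activity_time(hours):
        if hours <= 0:
            return 0.0
        if hours >= len(hourly_site_votes):
            return cumulative[-1] / votes_per_unit
        whole = int(hours)
        fraction = hours - whole
        # Assume votes arrive evenly within a single hourly bucket.
        votes = cumulative[whole] + fraction * hourly_site_votes[whole]
        return votes / votes_per_unit

    return to_activity_time


if __name__ == "__main__":
    # Two quiet hours followed by two busy hours (made-up numbers).
    clock = build_activity_clock([100, 100, 400, 400], votes_per_unit=200)
    for h in (1, 2, 3, 4):
        print(h, "wall-clock hours ->", clock(h), "activity units")
```

With these made-up numbers, the two quiet hours together count as one unit of activity time, while the two busy hours that follow count as four, so an item’s age is measured by how much voting the site has seen rather than by how many hours have passed.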
During our tea discussion, we arrived at a number of questions the authors outline as future work: why does the popularity of articles exhibit this behavior? How much of an item’s final popularity is determined by its early popularity, and how much by the item itself? Asked another way, if one (hypothetically) artificially inflates the number of initial votes on a particular item chosen at random, how much will its ultimate popularity be affected by this initial inflation? Finally, if we added content features to the predictors, would they fare better?
Since it’s hard to predict the answers to these questions, I guess we’ll just have to wait and see….