PHP Content Rating / Confidence

by Clay vanSchalkwijk on September 1, 2009

For webmasters who collect user feedback and want to weight content by it, finding the right algorithm can be challenging. From experience, there is no out-of-the-box solution, since each site and its requirements are unique. The first step is to put a solid foundation in place, then refine it over time until the recipe is right. The following uses a binomial proportion confidence interval: it is a PHP implementation that weights the feedback by the lower bound of the Wilson score interval.
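
For reference, the score computed by the code below can be written out as a formula. This is transcribed from the implementation itself, and the symbols are introduced here only for readability: $\hat{p}$ is the positive votes divided by the total votes, $n$ is the total number of votes, and $z$ is the standard normal quantile at $1 - \text{power}/2$ (roughly 1.96 for the default power of 0.05).

```latex
s = \frac{\hat{p} + \dfrac{z^2}{2n} - z\sqrt{\dfrac{\hat{p}(1-\hat{p}) + \dfrac{z^2}{4n}}{n}}}{1 + \dfrac{z^2}{n}}
```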

class Rating
{
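  // Returns the lower bound of the Wilson score interval for $positive out of $total votes.
  public static function ratingAverage($positive, $total, $power = 0.05)
  {
    if ($total == 0)
      return 0;
 
    // z-score for the requested confidence level (about 1.96 when $power is 0.05)
    $z = self::pnormaldist(1 - $power / 2);
    // observed fraction of positive votes
    $p = 1.0 * $positive / $total;
    $s = ($p + $z*$z/(2*$total) - $z * sqrt(($p*(1-$p)+$z*$z/(4*$total))/$total))/(1+$z*$z/$total);
    return $s;
  }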
 
  // Approximates the inverse of the standard normal CDF: returns z such that P(Z <= z) = $qn.
  public static function pnormaldist($qn)
  {
    $b = array(
      1.570796288, 0.03706987906, -0.8364353589e-3,
      -0.2250947176e-3, 0.6841218299e-5, 0.5824238515e-5,
      -0.104527497e-5, 0.8360937017e-7, -0.3231081277e-8,
      0.3657763036e-10, 0.6936233982e-12);
 
    if ($qn < 0.0 || 1.0 < $qn)
      return 0.0;
 
    if ($qn == 0.5)
      return 0.0;
 
    $w1 = $qn;
 
    if ($qn > 0.5)
      $w1 = 1.0 - $w1;
 
    $w3 = - log(4.0 * $w1 * (1.0 - $w1));
    $w1 = $b[0];
 
    for ($i = 1;$i <= 10; $i++)
      $w1 += $b[$i] * pow($w3,$i);
 
    if ($qn > 0.5)
      return sqrt($w1 * $w3);
 
    return - sqrt($w1 * $w3);
  }
}
 

The function takes three parameters: the number of positive votes, the total number of votes, and the power. Adjust the power to set the confidence: 0.10 gives a 95% chance that your lower bound is correct, 0.05 gives a 97.5% chance, and so on. Sample usage:

sample(1,0);
sample(100,50);
sample(250,100);
sample(1000,500);
 
function sample($p,$n)
{
  // $p = positive votes, $n = negative votes; prints one row of the output below
  echo $p." ".$n." ".Rating::ratingAverage($p,$p+$n)."\n";
}
 

Output:

Positive  Negative  Score
1         0         0.20654931654388
100       50        0.58789756740385
250       100       0.6648317184611
1000      500       0.6424116916199
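
To get a feel for what the power parameter does, the sketch below recomputes the score for 100 positive and 50 negative votes at two confidence levels. It is a standalone illustration rather than part of the Rating class: wilsonLowerBound() is a hypothetical helper, and the z values are hard-coded standard normal quantiles (about 1.644854 for a power of 0.10 and 1.959964 for 0.05) instead of being computed by pnormaldist().

```php
<?php
// Hypothetical helper: same formula as Rating::ratingAverage(), but takes z directly.
function wilsonLowerBound($positive, $total, $z)
{
    if ($total == 0) {
        return 0.0;
    }
    $p = $positive / $total;   // observed fraction of positive votes
    $z2 = $z * $z;
    $margin = $z * sqrt(($p * (1 - $p) + $z2 / (4 * $total)) / $total);
    return ($p + $z2 / (2 * $total) - $margin) / (1 + $z2 / $total);
}

// Assumed z values: standard normal quantiles at 1 - power/2.
$levels = array('0.10' => 1.644854, '0.05' => 1.959964);

foreach ($levels as $power => $z) {
    echo "power " . $power . ": " . wilsonLowerBound(100, 150, $z) . "\n";
}
```

A smaller power demands more confidence, so the lower bound drops and items with few votes are penalised more heavily.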

Sites like Reddit and Digg have a certain "freshness" element. The solution above might work as a model for the site as a whole, but for a front page you will need to implement some form of "gravity". This can be done by taking the raw score and decaying it over time, like so:

 
class Rating
{
  ...
  public static function gravityRating($positive, $total, $time, $power = 0.05)
  {
    if ($total == 0)
      return 0;
    return (Rating::ratingAverage($positive, $total, $power) / pow($time,0.5));
  }
  ...
}
 
sample(100,50,0.5);
sample(100,50,1);
sample(100,50,4);
sample(100,50,8);
sample(100,50,24);
 
function sample($p,$n,$time)
{
  echo Rating::gravityRating($p,$p+$n,$time)."\n";
}
 

In the example above, $time represents the age (in hours) and you can see the decay in the output:

0.83141271310867
0.58789756740385
0.29394878370192
0.20785317827717
0.12000408843024

My recommendation is to "cap" the time so that decay stops after a fixed period, such as 12 or 24 hours; this ends the initial boost for fresh content and lets the score normalize quickly. The rate of decay can, of course, be as fast or as slow as you want, and the weighting you apply will vary from site to site. Depending on the volatility of your content, a front page whose "freshness" spans a week merits a week-long decay rather than a 12-hour one. Hopefully the code above is enough to get started with content rating, make better use of user feedback, and help webmasters calculate the value of their content more intelligently than the traditional "5-Star Rating".
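
One way the cap could look in code is sketched below. It is only an illustration under stated assumptions: cappedGravity() and its $capHours and $minHours parameters are hypothetical names, not part of the Rating class above, and the lower floor on the age is an addition made here so that a brand-new item (age 0) does not cause a division by zero.

```php
<?php
// Hypothetical helper: decays a precomputed score by the square root of its age,
// with the age clamped to the range [$minHours, $capHours].
function cappedGravity($score, $ageHours, $capHours = 24.0, $minHours = 0.5)
{
    // The floor is an assumption added for this sketch; the cap stops the
    // decay once the item is older than $capHours.
    $effectiveAge = min(max($ageHours, $minHours), $capHours);
    return $score / pow($effectiveAge, 0.5);
}

// Score for 100 positive / 50 negative votes, taken from the output above.
$score = 0.58789756740385;

foreach (array(0, 1, 4, 24, 72) as $ageHours) {
    echo $ageHours . "h: " . cappedGravity($score, $ageHours) . "\n";
}
```

With the defaults, an item that is 72 hours old scores the same as one that is 24 hours old. For a front page whose freshness window is a week, the same sketch could be called with $capHours set to 168.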


7 comments

Just a quick note: if you have the stats extension installed you can replace the pnormaldist() function with a call to the built-in function stats_cdf_normal($power, 0, 1, 2).

In my tests it has proven to be slightly more accurate than the pnormaldist() implementation.

by Alix Axel on April 15, 2011 at 10:50 am.

What is the significance of the numbers that you populate into the $b array? I don’t understand where they come from.

by i-g on October 24, 2009 at 4:31 am.

Ok, that answers my question. Thanks for all the information! I have another question though:

What if I just want to add the gravity part to the MySQL query? For example, if I already have my score variable as $score, and I have the date as add_date, how do I weight newer content to show up higher in the results and then slowly fall back into place over time?

by Daniel Errante on October 3, 2009 at 8:45 pm.

You could port the formula to a stored procedure but if you’re concerned about the time it takes to process the result set into a cache, doing it on the fly on every record each time will be far worse.

My recommendation would be to just have a hook into your voting to recalculate just the row being voted on and process any record within your decay time through cron.

After the decay time, the score should not change until a vote is issued so there is no point to re-calculate this on the fly and the newer records will represent a smaller subset of your overall data which should reduce the number of records you need to process.

If you do want to port over the code above, all the math functions are available within SQL:

http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html

http://dev.mysql.com/doc/refman/5.0/en/create-procedure.html

by Clay vanSchalkwijk on September 18, 2009 at 12:23 pm.

Is there a way that the formula could be written to support real time calculating? I need a one-liner solution that favors more popular items but also throws newer results in the top of the mix and decays those results over time, like you mentioned above. I could solve this problem by caching these values, but I would have to update a database of tens of thousands of records multiple times a day, correct? I am using the expression sorting method as mentioned here in my search engine: http://www.sphinxsearch.com/docs/current.html#sorting-modes

by danoph on September 18, 2009 at 1:57 am.

You do not want to do this calculation on the fly with MySQL. It makes more sense to cache the score and let MySQL handle the sorting.

Otherwise you are calculating scoring for each row on each page load. This will seriously hurt your application performance.

by Clay vanSchalkwijk on September 12, 2009 at 11:07 pm.

How would I go about implementing a MySQL version of the wilson score interval?

by Daniel Errante on September 9, 2009 at 5:40 am.
