Early on in the genesis of BeerGraphs, we knew that Beers Above Replacement (BAR) would be a cornerstone of the statistical analysis we wanted to perform. Much like Wins Above Replacement (WAR) in baseball, we wanted BAR to factor in as much meaningful information as possible about a beer and output a single value for its total “worth” to beer drinkers.
While the concept, relating beers to a statistically derived baseline, was clear from day one, there were many potential directions for the execution. As in baseball, one of the challenges in creating BAR was determining what a “replacement level” beer looked like.
One key to replacement level was to take style into account. Ratings vary widely from one style to another, and we wanted BAR to reflect how a beer performed relative to its style, whether it be an Adjunct Lager or an Imperial Stout. The readily available, non-upsetting beer is different for every style.
Conceptually, it made sense to base replacement level on the best beer of a given style that was readily available across the country. However, the limited number of nationally distributed beers made this a challenge, and the fact that some highly rated beers (such as Sierra Nevada Pale Ale, with an average rating of 3.30 out of 5) have national distribution meant that the bar would be set too high for some styles.
For inspiration, we once again turned to baseball, where a hypothetical team of replacement-level players would win 29.4% of its games. Therefore, for each style, we looked at every beer with at least five check-ins and found the one that was better than 29.4% of the other beers. For American Pale Ales, the replacement-level beer was Sam Adams Cascade Single Hops Pale Ale, with an average rating of 2.69. Conceptually, we wanted replacement level to be a beer that wasn’t horrible, but also wasn’t very good, one you could usually find a better alternative to, so this felt about right. Do 70% of the beers you drink add to your enjoyment? Do 30% upset you?
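The percentile idea above can be sketched in a few lines of Python. This is an illustration of the approach, not BeerGraphs’ actual code; the function name, the toy data, and the tie-breaking (taking the beer at the 29.4th-percentile index of the sorted ratings) are all assumptions.

```python
def replacement_level(ratings, min_checkins=5, pct=0.294):
    """Return the rating that is better than `pct` of qualifying beers.

    `ratings` is a list of (average_rating, checkin_count) tuples.
    Beers with fewer than `min_checkins` check-ins are excluded.
    """
    qualified = sorted(r for r, n in ratings if n >= min_checkins)
    if not qualified:
        raise ValueError("no qualifying beers in this style")
    # Index of the beer that outrates 29.4% of the field.
    idx = int(pct * (len(qualified) - 1))
    return qualified[idx]

# Toy data: ten hypothetical American Pale Ales as (rating, check-ins).
apas = [(3.9, 120), (2.4, 8), (2.69, 40), (3.3, 500), (4.1, 60),
        (2.9, 15), (3.6, 200), (2.2, 3), (3.1, 75), (2.5, 12)]
print(replacement_level(apas))  # → 2.69
```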
We can discuss that level going forward, but for now we'll use this to calculate “weighted Overall Beer Average Rating” (wOBAR), which is the average rating relative to the replacement level for the style:
wOBAR = Average Rating - Replacement Level
That gives you an IPA's rating set against the backdrop of other IPAs.
The next topic to tackle was the volume of check-ins. At first, we considered simply scaling the rating to replacement level for BAR, but we felt we could do more.
In baseball, WAR is a counting stat; the more you play, the greater opportunity you have to accumulate (or lose) it. Therefore, a league-average player who plays every day might be worth more than an excellent player who for some reason doesn’t see the field as often. We wanted this to be reflected in BAR, where a solid above-average beer with excellent distribution and a large number of check-ins might be worth more than a beer with a higher rating but a limited number of check-ins.
With beer ratings, the differences in check-in volume are much greater than in baseball. A beer with national distribution could have tens of thousands of check-ins, while a limited release or draft-only beer might only have a dozen. In order to allow beers with a high volume of check-ins to accumulate BAR without making those with a few check-ins meaningless, we decided to apply a logarithmic scale to the volume-based portion of the equation. This means that the first 10 check-ins are essentially worth the same as the next 90, and the first 100 are worth the same as the next 9,900.
Volume = Log10(# of check-ins)
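You can see the “first 10 equals the next 90” behavior directly, since each power of ten adds exactly one to the volume term. A minimal sketch (the function name is ours, not part of the formula):

```python
import math

def volume(checkins):
    """The volume term from the formula above: log10 of the check-in count."""
    return math.log10(checkins)

# Going from 1 to 10 check-ins adds as much volume as going from 10 to 100,
# or from 100 to 1,000:
print(volume(10))      # → 1.0
print(volume(100))     # → 2.0
print(volume(10_000))  # → 4.0
```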
Finally, we wanted to scale BAR to a number that made sense. The highest single-season WAR of all time belongs to Babe Ruth in 1923, where he was worth 15.0 wins. Therefore, we scaled a “perfect season” with a 5.0 average rating and 1,000 check-ins to 15.0 BAR. To do this, we use a scaling factor which is unique for each style:
Scale = (15.0 / Log10(1,000)) / (5 - Replacement Level)
This gives us the final equation for BAR:
BAR = wOBAR * Volume * Scale
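Putting the three pieces together, the whole calculation fits in one short function. This is a sketch of the formulas above, with illustrative names; the “perfect season” parameters (a 5.0 rating and 1,000 check-ins pinned to 15.0 BAR) are the ones stated in the post.

```python
import math

def bar(avg_rating, checkins, replacement, max_rating=5.0,
        peak_bar=15.0, peak_checkins=1000):
    """BAR = wOBAR * Volume * Scale, as defined above.

    `replacement` is the style's replacement level. The scaling factor
    pins a "perfect season" (max rating, `peak_checkins` check-ins)
    to exactly `peak_bar`.
    """
    wobar = avg_rating - replacement                 # wOBAR
    vol = math.log10(checkins)                       # Volume
    scale = (peak_bar / math.log10(peak_checkins)) / (max_rating - replacement)
    return wobar * vol * scale

# A perfect APA season (replacement level 2.69) lands exactly on 15.0 BAR:
print(round(bar(5.0, 1000, 2.69), 1))  # → 15.0
```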
Seems simple enough, right? But what does this look like in practice? Here is a graph showing how the check-in volume would affect accumulation of BAR for hypothetical American Pale Ales (APA) with different average ratings:
As you can see, a top-tier APA with a 4.5 rating will accumulate BAR very rapidly, reaching 6.65 BAR by 50 check-ins, and crossing the 10-BAR threshold by 500. An excellent APA (4.0) would accumulate BAR at a slower rate, but if it has great distribution can match the BAR of the 4.5 by amassing a greater number of check-ins, reaching 6.79 BAR after 250 check-ins. Meanwhile, an above-average-but-not-spectacular APA with a 3.5 rating would need over 5000 check-ins to match the marks of the 4.5 with 50 check-ins and the 4.0 with 250.
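The crossover points in that comparison can be found by inverting the formula and solving for the check-in count. A quick sketch, assuming the APA replacement level of 2.69 used throughout this post (the function name and constants are ours):

```python
import math

# APA replacement level and the resulting style scale, as derived above.
REPLACEMENT = 2.69
SCALE = (15.0 / math.log10(1000)) / (5.0 - REPLACEMENT)

def checkins_needed(avg_rating, target_bar):
    """Invert BAR = wOBAR * log10(N) * Scale to solve for N check-ins."""
    wobar = avg_rating - REPLACEMENT
    return 10 ** (target_bar / (wobar * SCALE))

# A 3.5-rated APA needs thousands of check-ins to match the 6.65 BAR
# a 4.5-rated APA reaches in just 50:
print(round(checkins_needed(3.5, 6.65)))
```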
On the other hand, beers that are below replacement level can accumulate negative BAR. An APA with a 2.5 that is just slightly below replacement level would reach -1.0 BAR by 250 check-ins, but would have to be wildly popular (somehow) to ever reach -2.0 BAR (over 50,000 check-ins). Meanwhile, a really terrible APA with an average rating of 2.0 would rapidly accumulate negative BAR, reaching -3.0 by 100 check-ins. I’m not sure why people would keep on drinking this beer if it’s so terrible, but then again I also don’t understand why a manager would give Jeff Francoeur 600 plate appearances. Some things simply defy all logic.
Now, let's look at a few real-world examples. Here are the top American Pale Ales by BAR.
| Brewery | Beer | wOBAR | BAR |
|---|---|---|---|
| Three Floyds Brewing Company | Zombie Dust | 1.09 | 10.75 |
| Hill Farmstead Brewery | Edward | 0.96 | 6.82 |
| Carton Brewing Company | Boat Beer | 0.70 | 4.57 |
| Fat Head's Brewery | Simply Simcoe | 0.76 | 4.56 |
| Lagunitas Brewing Company | Fillmore Fusion | 0.63 | 4.46 |
| Three Floyds Brewing Company | Alpha King | 0.48 | 4.39 |
| Russian River Brewing Company | Row 2, Hill 56 | 0.42 | 3.22 |
| Microbrasserie Jukebox | Jukebox Blonde | 0.65 | 3.15 |
| BrewBoys | Hoppapotamus | 0.64 | 3.12 |
Zombie Dust has it all. The high rating and excellent distribution make it an MVP-caliber beer with the ability to accumulate over 10 BAR in under two months. Hill Farmstead Edward is harder to find than the next four beers, but it still reigns supreme.
There is still work to be done with BAR.
One question that remains involves classification of beers. Should each unique style have its own replacement level, or should there be umbrella categories? Should IPAs and Double IPAs be compared to the same standard?
Another issue relates to tracking BAR over time. Because of the logarithmic nature of the volume portion of the equation, it would not be constructive to compare BAR from one month to the next. The same number of check-ins with the same rating would produce a lower BAR in the second month, because it gets more difficult to accumulate BAR over time. One idea would be to measure BAR on a yearly basis. When the calendar turns to January 1, the clock resets, the previous year is in the books, and everything starts from scratch, similar to seasons in baseball. This would allow people to compare how trends in rating and consumption affected BAR from one year to the next. If you want to look at trends on a smaller scale, you would still have to use wOBAR.
As BeerGraphs grows, we will continue to tinker with BAR to produce a better measure of “beer value,” but for now, we hope you’ll have fun with it in its current form. What do you think about BAR? We’d be happy to hear any ideas you have in the comments section, and we look forward to refining our measures with your help.