About the Calculations
August 19, 2020
Alright, so I've noted that a lot of the calculations here are 'sketchy', 'made-up', or even 'witchcraft'. And well... none of those descriptions are wrong. So with this article I'm hoping to explain the heuristics I created to make this website more useful to book readers. I'm hoping that laying it all out helps people 1) understand the metrics and 2) understand why I created them.
Issues with Ratings
So coming into this exercise, I just wanted to find the best books in a given genre. Sounds easy, right? You just sort by average rating and - ta-da - the best books to read.
...but that's not how it worked out... For the sake of brevity, we'll skip the entire process of collecting data and just go to the analysis.
Now that we've magically collected all the data, we're going to take just the highest rated books:
First off, you'll notice there's a ton of books from one series - a deeper dive into that shows that we're going to want to get rid of anything that's a sequel or a prequel, since they tend to be rated much higher than the first book in a series. That logically follows, since the only people that typically continue reading a series are people who liked the first book. Or masochists. Anyways, the takeaway is to remove sequels and prequels. Let's see where we are after that adjustment:
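To make that cleanup step concrete, here's a minimal sketch of the sequel/prequel filter. The record shape and field names (`series`, `series_position`) are assumptions for illustration, not the site's actual schema:

```python
# Hypothetical book records - the field names here are assumptions.
books = [
    {"title": "Book A", "series": "Saga", "series_position": 1},
    {"title": "Book B", "series": "Saga", "series_position": 2},   # sequel
    {"title": "Book C", "series": None, "series_position": None},  # standalone
]

# Keep standalones and the first book of each series; drop everything else.
first_books = [
    b for b in books
    if b["series"] is None or b["series_position"] == 1
]

print([b["title"] for b in first_books])  # ['Book A', 'Book C']
```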
Marginally better, but now you can see that all the books here have only about 1,000 reviews... That's not necessarily a bad thing, but the issue you run into is that small numbers of reviews can definitely be biased upwards for various reasons. Maybe most of the original reviewers were fans of a TV show (Fleabag: Scriptures) or band (Take Me Home Yearbook). But even apart from that, as a book goes more mainstream, its average rating tends to go down.
So what do we do about that problem?
Math! Or dumb math, at least.
The goal for me was to create a measure that found two types of books:
- Books that are very popular, but also have great ratings
- Books that have very high ratings, but more than the bare minimum number of reviews
The two metrics I came up with were:
- Adjusted Rating (Popular):
e^(Avg Rating) * ln(Rating Count)
- Adjusted Rating:
e^(Avg Rating) * ln(ln(Rating Count))
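As a rough sketch, both metrics translate directly into a couple of lines of Python (`math.log` is the natural log):

```python
import math

def adjusted_rating_popular(avg_rating: float, rating_count: int) -> float:
    """e^(avg rating) * ln(rating count) - rewards well-rated AND widely read."""
    return math.exp(avg_rating) * math.log(rating_count)

def adjusted_rating(avg_rating: float, rating_count: int) -> float:
    """e^(avg rating) * ln(ln(rating count)) - damps the popularity bonus harder."""
    return math.exp(avg_rating) * math.log(math.log(rating_count))
```

Sorting the (sequel-free) book list by either function, descending, produces the two rankings.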
What do both do?
Adjusted Rating (Popular)
e^(Avg Rating) * ln(Rating Count)
e^(Avg Rating) - The higher rated a book is, the more likely I am to want to read it, and in theory that relationship should be exponential. On this scale, a 4.1-rated book is ~10.5% better than a 4.0-rated book; a 4.2-rated book is likewise ~10.5% better than a 4.1-rated book.
ln(Rating Count) - A more popular book should receive a bonus, but that bonus should grow more slowly as the rating count climbs. In this model, 10k ratings is worth 33% more than 1k ratings, but 100k ratings is only worth 25% more than 10k ratings.
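Those percentages fall straight out of the formula; a quick sanity check:

```python
import math

# Rating term: each +0.1 of average rating multiplies the score by e^0.1.
print(math.exp(4.1) / math.exp(4.0))  # ~1.105 -> ~10.5% higher

# Popularity term: each 10x jump in rating count is worth less than the last.
print(math.log(10_000) / math.log(1_000))    # ~1.33 -> 33% bonus
print(math.log(100_000) / math.log(10_000))  # ~1.25 -> 25% bonus
```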
Alright, looks a bit better but now it skews towards super popular books. Given that's one of the things I was hoping to find (good, popular books), I'd say this was pretty successful... The next goal is to find some less well known books.
Adjusted Rating
e^(Avg Rating) * ln(ln(Rating Count))
ln(ln(Rating Count)) - Why the double ln() function? It's just a way to control for popularity here. Nothing special, just a rough algorithm to get me to the place I want to be.
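To see what that extra ln() buys, compare how the two popularity terms grow over a few round rating counts (the counts are just illustrative inputs):

```python
import math

for n in (1_000, 10_000, 100_000):
    single = math.log(n)            # bonus in Adjusted Rating (Popular)
    double = math.log(math.log(n))  # bonus in Adjusted Rating
    print(f"{n:>7}: ln = {single:.2f}, ln(ln) = {double:.2f}")

# ln() grows from ~6.9 to ~11.5 over this range, while ln(ln()) only
# creeps from ~1.9 to ~2.4 - so a highly rated book with relatively
# few ratings can still land near the top of the second ranking.
```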
And there we go, also looks like a pretty solid list of results. Round two of ratings is complete-ish.
I realize all of these algorithms are rough and partially just... weird. But these aren't meant to do anything other than help you find a book. That's also the reason why I left the ability to sort by average rating & rating count on the site - those are much easier to understand than the witchcraft above. The goal of those odd metrics is to give you a simple heuristic that's useful - I think they achieve that, but who knows...