The Awfulness of Numeric Rating Systems : Minor Thoughts

May 10, 2017 11:33 AM

The Awfulness of Numeric Rating Systems

Caroline O'Donovan wrote at BuzzFeed News about why existing rating systems are awful.

“The rating system works like this: You start off as a five-star driver,” Don, a San Francisco Lyft driver, told BuzzFeed News. “If you drop below a 4.6, then your career becomes a question. Uber or Lyft will reach out to you and let you know that you are on review probation. And if you continue to drop, then you're going to lose your job. They'll deactivate you.”

…

But ratings are nonetheless a stressor for some drivers. Julian, who drives for both Uber and Lyft in San Francisco, said maintaining a good rating can be difficult because customers don’t really understand them. “They think that 3 is okay, and a 4 is like a B, and 5 is exceptional,” he told BuzzFeed News. “Well, if you got a 4 every time, you’d be terminated. You have to maintain a 4.7, so anything less than a 5 is not okay.”

…

This sort of rating anxiety extends well beyond Uber and Lyft. “The rating system is terrible,” said Ken Davis, a former Postmates courier, who noted that under the company's five-star rating system couriers who fall below 4.7 for more than 30 days are suspended. Said Joshua, another Postmates courier, “I really don’t think customers understand the impact their ratings have on us.”

…

Wendy and her son Brian, visiting San Francisco from Indiana and using Uber for their first time, were surprised to hear that most drivers consider four stars to be a bad rating. “I would have thought 5 is excellent, and 4 is good,” Wendy said. That revelation was equally shocking to Elnaz, a longtime Uber user visiting San Francisco from LA. “Four stars sucks,” she said, incredulous. “Really?”

“Customers don't understand the impact ratings have on couriers at all,” said a former Postmates community manager, who requested anonymity while discussing her previous employer. “A customer might rate a delivery three stars, assuming that three stars is fine. Several three-star ratings could bring a courier’s rating down significantly, especially if they’re new. It could even get the courier fired.”
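
To put rough numbers on those quotes, here is a minimal Python sketch. It assumes a rating is just the plain average of every score a driver or courier has ever received, which may not be how any of these companies actually computes it, and it borrows the 4.7 cutoff mentioned in the quotes above. The names `average`, `fives_needed`, and `CUTOFF` are made up for illustration.

```python
from fractions import Fraction

# Assumption: the 4.7 cutoff mentioned in the quotes above.
CUTOFF = Fraction(47, 10)


def average(ratings):
    """Plain mean of a list of 1-5 star ratings (an assumed scoring model)."""
    return Fraction(sum(ratings), len(ratings))


def fives_needed(ratings, cutoff=CUTOFF):
    """Count the extra five-star ratings needed to reach the cutoff."""
    if cutoff >= 5:
        raise ValueError("cutoff must be below 5")
    total, count, extra = sum(ratings), len(ratings), 0
    # Exact fractions avoid floating-point surprises right at the cutoff.
    while Fraction(total + 5 * extra, count + extra) < cutoff:
        extra += 1
    return extra


# A driver who only ever gets "good" four-star ratings sits below the cutoff.
print(average([4] * 50) < CUTOFF)          # True

# One four-star rating takes three perfect rides to offset; one three-star takes six.
print(fives_needed([4]))                   # 3
print(fives_needed([3]))                   # 6

# Three three-star ratings hurt a newcomer far more than a veteran.
newcomer = [5] * 10 + [3] * 3
veteran = [5] * 500 + [3] * 3
print(round(float(average(newcomer)), 2))  # 4.54
print(round(float(average(veteran)), 2))   # 4.99
```

Under those assumptions, a single "good" four-star rating costs three flawless rides to undo, and three "fine" three-star ratings are enough to push a courier with only ten prior deliveries below the cutoff.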

The biggest problem is that no two people have the same definition of what each of the ratings means.

Lyft says that five stars means “awesome,” four means “Ok, could be better,” and three means “below average.” But for Uber, five stars is “excellent,” four is “good,” and three is “OK.”

To that point, Goodreads has the following rating system:

★: did not like it
★★: it was ok
★★★: liked it
★★★★: really liked it
★★★★★: it was amazing

But few people actually use that scale to rate their books. In fact, many people start or end their Goodreads reviews with a discussion of their own personal rating system. I'm guilty of this myself.

It's even worse than that. People give ratings differently from how they actually use ratings. When it comes to giving ratings, people are nuanced critics. Take hotel stays. We'll knock off a star for a room that's a little dingy or a shower that doesn't have the right water pressure. We'll give it back for friendly staff and a hot breakfast. The result is a 3.7 rating that we think accurately represents our "mostly good with a few minor downsides" experience at the hotel.

Given our own nuanced ratings, how many of us even bother to read star ratings with a similarly nuanced eye? We only want to stay at five-star establishments. We'll consider a four-star hotel, but anything lower than that makes us inherently suspicious. We know how we rate businesses, and we know that an average rating of 5 should be impossible to achieve if everyone rates the way we do. But we read ratings with a highly critical eye anyway, and hotels are reduced to begging for high ratings because anything less is the kiss of death.

O'Donovan recounts an anecdote that I find telling.

John Gruber, publisher of Daring Fireball, is among those who believe that five-star rating systems don’t produce particularly useful data, and that generally speaking, binary systems are better. “There’s no universal agreement as to what the different stars mean,” Gruber told BuzzFeed News. “But everybody knows what thumbs-up, thumbs-down means.”

A few years ago, during a trip to Orlando, Gruber had an experience that made him realize how this confusion over what the stars mean can impact individuals in ways customers don’t realize. After taking a ride in an Uber that had an overpoweringly strong smell of air freshener, Gruber gave the driver a four-star rating. The next day, he got a call from an Uber employee asking him to explain what the driver had done wrong.

“I was like, Holy shit!” Gruber said. “The guy was nice, I wish I hadn’t done this.”

When I read this, everything suddenly became clear. Exact, specific, nuanced ratings aren't useful to consumers. I only care about one thing: would you stay here again or would you avoid it? When I'm thinking back on my own stays, maybe the water pressure was too low, but if the overall experience was good then I'll book another room at the same hotel on my next trip. The crucial question boils down to this: would I stay here again, and would you recommend it to me? Everything else is just details.

Jason Snell came to the same conclusion, writing “One for the thumbs.”

Say you’re Netflix, which has allowed its users to apply five-star ratings to movies since its inception. Netflix offered user ratings because it’s always been focused on improving its own recommendation engine, so that it can look at your tastes and suggest other movies you might like—and use your ratings to feed the recommendation engine of viewers who share your tastes, too.

At some point, Netflix must have looked at its data and realized that their five-star rating system wasn’t really improving its recommendations. It was just adding noise. Does knowing that one user gave a movie four stars while another one gave it five stars really provide more information? The answer is clearly no, because Netflix eliminated star ratings and now only seeks a thumbs up or a thumbs down, just like YouTube did in 2009. In the end, you can obsess over whether a movie deserves three or four of your precious personal stars, but Netflix doesn’t care. It just wants to know if you liked the movie or not, because that’s all that really matters.

Take it from Gene Siskel, via that same Roger Ebert piece:

Gene Siskel boiled it down: “What’s the first thing people ask you? Should I see this movie? They don’t want a speech on the director’s career. Thumbs up—yes. Thumbs down—no.”

Or as John Gruber succinctly put it, star ratings are garbage—“thumbs-up/thumbs-down is the way to go—everyone agrees what those mean.”

I think a numeric rating system only makes sense for purely personal use, such as family meal planning. My family uses a four-star system for rating meals. After trying each recipe, we ask our daughters to rate it on this four-point scale.

★ Never make this again.
★★ I didn't care for it but I'll eat it without a tantrum if you do make it again.
★★★ Make this a part of our standard list of meals.
★★★★ Make this every week.

It helps that it's a simple system. But the main reason it works is that everyone in the family knows the definitions and uses them in a consistent way. When my wife plans the meals, we include some recipes rated 3 or 4. The 2-star recipes may get used sparingly, if one family member happens to love them, since the rest are willing to tolerate them. The 1-star recipes are kept around purely as a reminder of what not to make. It works, but it's a system that would break down entirely if we tried to share our recipe database with another family.

I would be happy to see numeric rating systems disappear entirely from public websites and apps. Let's stick to a simple recommended / not recommended binary choice for everything that we're not curating for our own personal use.

This entry was tagged. Review News

