Friday, January 14, 2005

Scaling Beauty

What does it mean to rate something on a scale of 1 to 10? Say, for instance, that a man rates a woman’s attractiveness at 9. What does that really mean? How attractive is she relative to other women?

One interpretation is that each one-unit interval represents a decile of the population. E.g., the range from 9 to 10 represents the most attractive 10% of all women. For this approach to make sense, it must be possible to score between 0 and 1; otherwise, we’d have nine intervals instead of ten. The problem with this interpretation is that it implies far too low a standard for the high numbers, given the way most people use them. Would you say that 9’s are just as common as 5’s? Would you give a score of 9 (or above) to one out of every ten people you see? I think not. 9’s and 10’s should be rare occurrences.

Another interpretation is that each one-unit interval represents a standard deviation of a normal distribution centered on 5. Because of the distribution’s concentration in the center, we should observe many more 4’s, 5’s, and 6’s than 1’s and 10’s. To be more specific, about 68% of the population should score between 4 and 6, and about 95% should score between 3 and 7. However, this interpretation seems to make the high scores too rare, again given the way most people use them. To get an 8, you’d have to be more attractive than about 99.9% of the population; to get a 9, you’d have to be more attractive than about 99.997% of the population. To put it another way, you’d see only about one person in every 30,000 with a score of 9 or higher. Aside from being rather harsh, this scale is probably not very useful, because the ends of the scale would almost never be used. What’s the point in having a score of 9 if you never use it?
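The percentages for this interpretation can be checked with a few lines of Python. This is only a sketch under an assumed model: scores follow a normal distribution with mean 5 and one scale unit per standard deviation. The helper names (`share_below`, `share_between`, `one_in`) are made up for illustration.

```python
from statistics import NormalDist

# Assumption: scores ~ Normal(mean 5, sd 1), i.e. one scale unit = one standard deviation.
scores = NormalDist(mu=5, sigma=1)

def share_below(score: float) -> float:
    """Fraction of the population scoring below `score`."""
    return scores.cdf(score)

def share_between(low: float, high: float) -> float:
    """Fraction of the population scoring between `low` and `high`."""
    return scores.cdf(high) - scores.cdf(low)

def one_in(score: float) -> float:
    """Roughly how many people you would see per person at or above `score`."""
    return 1 / (1 - scores.cdf(score))

print(f"between 4 and 6: {share_between(4, 6):.1%}")
print(f"between 3 and 7: {share_between(3, 7):.1%}")
print(f"below an 8:      {share_below(8):.3%}")
print(f"below a 9:       {share_below(9):.4%}")
print(f"9 or higher:     about one person in {one_in(9):,.0f}")
```

Under these assumptions the model puts roughly 68.3% of people between 4 and 6, about 99.87% below an 8, and about 99.997% below a 9.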

Another problem with the normal distribution interpretation is that the scale should not be limited to the 0 – 10 range, since a normal curve stretches indefinitely far in both directions. It should be possible for someone to score an 11 or a –1. Then again, given the sheer rarity of such individuals, maybe that’s not a serious objection.

The normal distribution interpretation could be tweaked to correspond more closely to actual use of the 1-10 scale. Perhaps each unit on the scale corresponds to one-half a standard deviation. In that case, 68% of the public would score between 3 and 7, and 95% would score between 1 and 9. A person with a 9 would be more attractive than almost 98% of the population, someone with a 10 more attractive than 99%. That sounds about right. But in this case, those outside-the-boundaries scores might be needed after all. Someone with an 11 would be more attractive than 99.9% of the population – i.e., a one-in-a-thousand looker.
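The half-standard-deviation version can be checked the same way. The sketch below assumes a normal distribution with mean 5; `attractiveness_scale` and its `sd_per_unit` parameter are hypothetical names introduced here, not anything from an existing library.

```python
from statistics import NormalDist

def attractiveness_scale(sd_per_unit: float, mean_score: float = 5.0) -> NormalDist:
    """Normal model in which one scale unit spans `sd_per_unit` standard deviations."""
    # Measured in score units, the standard deviation is 1 / sd_per_unit.
    return NormalDist(mu=mean_score, sigma=1 / sd_per_unit)

# Assumption: half a standard deviation per unit, so sigma = 2 score units.
half_sd = attractiveness_scale(sd_per_unit=0.5)

for score in (9, 10, 11):
    print(f"a {score} beats {half_sd.cdf(score):.2%} of the population")

# Share of people whose score would land outside the 0 to 10 range.
off_scale = half_sd.cdf(0) + (1 - half_sd.cdf(10))
print(f"share scoring below 0 or above 10: {off_scale:.2%}")
```

With these assumptions a little over 1% of the population lands outside the 0 to 10 range, which is why the off-scale scores start to matter once the units shrink.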

There are as many other interpretations as there are varieties of frequency distribution. But I think most will have at least one of the defects above. In addition, any asymmetric distribution would have a mean that differs from the median and mode, which means the interpretation of a 5 score (which most people take to be the average, middle, and most common score) would become problematic. With a positively-skewed distribution, for instance, if 5 were the average, there would be more 4’s than 5’s in the population, and the median individual would score between a 4 and a 5.

My best guess is that people have in mind something like the normal distribution, but with each unit worth something less than a standard deviation. Extreme values are indeed less common than middle values (contra the decile interpretation), but not so uncommon that you can’t expect to see the occasional 9 or 10 (or 0 or 1). To deal with the problem of off-the-scale scores, the 0 and 10 scores act as “reservoirs” for the tails of the distribution, which means that not all 10’s are created equal.
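The "reservoir" idea can be sketched as rounding followed by clamping. Everything specific here is an assumption made for illustration: the mean of 5, the choice of half a standard deviation per unit, rounding to whole scores, and the names `reported_score` and `reservoir_fraction`.

```python
from statistics import NormalDist

# Assumption: mean 5, half a standard deviation per unit (sigma = 2 score units).
raw_scores = NormalDist(mu=5, sigma=2)

def reported_score(raw: float) -> int:
    """Round a raw score to a whole number, then clamp it into the 0..10 scale."""
    return min(10, max(0, round(raw)))

# Everyone whose raw score is 9.5 or more ends up reported as a 10 ...
share_reported_10 = 1 - raw_scores.cdf(9.5)
# ... including those whose raw score would otherwise round to 11 or more.
share_off_scale = 1 - raw_scores.cdf(10.5)
# Fraction of reported 10's that are really off-the-scale scores.
reservoir_fraction = share_off_scale / share_reported_10

print(reported_score(11.3), reported_score(7.4), reported_score(-2.2))
print(f"reported 10's that are really off the scale: {reservoir_fraction:.0%}")
```

Under these particular assumptions roughly a quarter of the reported 10's would have scored higher on an unbounded scale, which is one way to read "not all 10's are created equal."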


Anonymous said...

It's an empirical question. Just collect 100 datapoints from and look at the distribution. It might actually make a good exercise for freshman statistics.


Anonymous said...

The other problem is that each individual has his own scale. Beauty is subjective. I have a friend who has a negatively skewed chart (and I'm not talking in statistical terms). In other words, from my point of view, he gives out 9's and 10's when I'd give out 4's and 5's.

Glen Whitman said...

JB -- I agree, but that leaves open what the distribution over those categories should be. What percentage of people are "Gorgeous"?

Anon re: subjectivity -- yes, you're correct. I was mainly getting at the question of what people *mean* when they say "she's a 9." I interpret that statement to mean, "I consider her more attractive than some large percentage of the public." But what percentage is that?

Gabriel -- interesting idea, and I may try it. But the problem is that the scores on represent the average scores given by many readers. The distribution of average scores will tend to be more concentrated in the center than the distribution of scores given by a single reader. For example, say I give a 9 to 1 person in 10. You also give a 9 to 1 person in 10. Then if you average the scores given by the two of us, how many people will get a 9? It will be *less* than 1 in 10, unless you and I have near-identical tastes. Also, there may be a selection bias on -- people who are more attractive may be more likely to submit their photos.
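The averaging point can be illustrated with a small simulation. This is a deliberately crude sketch, not a model of any real rating site: each rater awards a 9 to 1 person in 10 and a 5 to everyone else, and the hypothetical `agreement` knob is the probability that the second rater simply shares the first rater's verdict.

```python
import random

def share_of_nines(agreement: float, n_people: int = 100_000, seed: int = 0) -> float:
    """Share of people whose two-rater average is 9.

    Each rater, taken alone, gives a 9 to 1 person in 10 (and a 5 otherwise).
    `agreement` is the chance that rater B shares rater A's verdict outright.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_people):
        a = 9 if rng.random() < 0.1 else 5
        if rng.random() < agreement:
            b = a  # identical tastes for this person
        else:
            b = 9 if rng.random() < 0.1 else 5  # independent verdict
        if (a + b) / 2 >= 9:
            hits += 1
    return hits / n_people

for agreement in (0.0, 0.5, 1.0):
    print(f"agreement {agreement:.1f}: {share_of_nines(agreement):.3f} of people average a 9")
```

With fully independent tastes only about 1 person in 100 averages a 9; the share climbs back to 1 in 10 only when the two raters agree on everyone.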

Anonymous said...

I've noticed a bias in the ratings: girls who post pictures baring more skin (i.e., in a bikini) tend to get higher scores, provided the submitter already has an attractive face. It doesn't seem to work if the submitter is unattractive in the face but showing lots of skin.
But maybe it's not a bias, b/c face and body are both included in rating beauty.
Oh, and you'd be shocked how many of these narcissistic people submit their photos to these, attractive or not.


Anonymous said...

As everyone knows, the only scale that matters is binary. It's 0 if you wouldn't, 1 if you would. All the fine gradations are irrelevant.

Anonymous said...

In re: attractiveness. Subjective rating systems run into problems all the time, and they are difficult to rectify. I ran into this in determining airplane handling qualities. It's easy to measure how many pounds of force it takes to push a control, but determining if that is heavy or light depends on a number of exogenous variables. For instance, in level flight, pilots like nice light controls, but on landing, or in emergencies, most usually prefer much higher control forces. And this preference is mostly unconscious, that is, they think the controls got really light when in fact it took the same force to move them in the different circumstances.

So, as to attractiveness, I'd venture there is enough variability over time to make any statistical reference point invalid (i.e., a 1% change, 1 std. dev., or whatever you take as a unit), especially if you're surveying a large group.

One of my college roomies developed his own attractiveness rating, based on Helen of Troy. Since she had a face that launched a thousand ships, he rated the (very sparse) local population of available female companions in milliHelens, with one milliHelen being the beauty required to launch one ship. One thing we noticed with his milliHelen system was that there seemed to be mostly 200-300's and 700-800's, with a relative paucity of very low, very high, and average ratings. This could possibly be explained by the variability. If I thought of a woman as sometimes somewhat attractive and sometimes average, then if asked I would probably rate her a 7 rather than a 5, even if most of the time I thought of her as a 5. However, if I knew a very attractive woman who for some reason did not strike me as attractive even once, I think that would also have shown up as a downgrade in the ratings. (To tie that back to the airplane handling example, pilots generally rate an airplane that has unattractive control characteristics only a small percentage of the time as badly as an airplane that generally handles pretty badly but is flyable.)