Wednesday, October 24, 2007

How to Score

This post will seem frivolous at first. It’s not. Wait for the punchline.

Suppose John Doe cares about two things in a woman: looks and personality. Moreover, he says these characteristics are equally important to him (that is, he places 50% weight on looks, 50% weight on personality).

The World Mating Association (WMA) would like to create a ranking of women according to John’s preferences. So the WMA assembles a group of women and marches them in front of John, and he scores each woman’s looks on a scale from 0 to 10. Then he interacts with each woman (from behind a screen, if you insist) and scores each woman’s personality on a scale from 0 to 10.

The scores John gives for looks range all over the map, from 0 to 10, while the scores he gives for personality are bunched together in the 6 to 8 range. (Maybe John is a relatively tolerant guy when it comes to personality, though he doesn’t think anyone is super-fantastic.)

To calculate composite scores, the WMA decides to rescale the personality scores. It calculates each woman’s personality score as follows: personality = 10 x (raw score – 6) / (8 – 6). In other words, it measures a woman’s score as the percentage of the distance between the lowest-scoring and highest-scoring women. A woman John gave a 7 would be rescaled to a 5, because she’s halfway between 6 and 8. A woman he gave a 6 would now be a 0, and a woman he gave an 8 would now be a 10.

Once the personality scores have been rescaled, the WMA computes each woman’s composite score by computing the weighted average of the scores, using the weights John provided (50% and 50%).

Does this method make sense? Well, let’s see. Take two women, Alma and Betsy. Alma got a 9 on looks and a 6 (now rescaled to 0) on personality. Betsy got a 6 on looks and a 7 (now rescaled to 5) on personality. So Alma’s and Betsy’s composite scores are 4.5 and 5.5 respectively. Betsy wins!

“But wait a minute,” John objects. “I said looks and personality were equally important to me. Alma’s three whole points better looking than Betsy. And while her personality is not quite as nice, it’s not that different. I thought they were both nice enough. All things considered, I’d give Alma a 7.5 and Betsy a 6.5. What I’m trying to say is, I like Alma better!”

The problem, obviously, is the rescaling. John said personality matters just as much as looks to him – but fortunately for him, he likes most women’s personalities. The WMA’s approach exaggerated the significance of personality to John by treating women whose personalities he liked somewhat (6’s) as women he didn’t like at all, and women whose personalities he liked a lot (8’s) as women he thought were flawless.
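The arithmetic above can be checked with a minimal sketch. The function and variable names here are illustrative only, not anything the WMA (or the WHO) published; the scores are the ones given for Alma and Betsy.

```python
def rescale_min_max(raw, lowest, highest, top=10.0):
    """Express a raw score as a fraction of the lowest-to-highest distance, scaled to 0..top."""
    return top * (raw - lowest) / (highest - lowest)


def composite(looks, personality, w_looks=0.5, w_personality=0.5):
    """Weighted average of the two component scores."""
    return w_looks * looks + w_personality * personality


# (looks, raw personality) as scored by John in the example
women = {"Alma": (9, 6), "Betsy": (6, 7)}

wma_scores = {}  # composite after rescaling personality from the 6..8 range
raw_scores = {}  # composite using John's raw personality scores

for name, (looks, personality) in women.items():
    rescaled = rescale_min_max(personality, lowest=6, highest=8)
    wma_scores[name] = composite(looks, rescaled)
    raw_scores[name] = composite(looks, personality)

print(wma_scores)  # {'Alma': 4.5, 'Betsy': 5.5}
print(raw_scores)  # {'Alma': 7.5, 'Betsy': 6.5}
```

With the rescaling, Betsy comes out ahead; with the raw scores and the same 50-50 weights, Alma does, which matches the numbers John gives in his objection.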

Okay, so the WMA’s approach is clearly flawed. The punchline is that the method I’ve just described is the method the World Health Organization (WHO) used to construct its composite scores of healthcare system performance. These are the scores used to create the widely cited rankings of nations’ healthcare systems.

As I’ve noted in previous posts, the WHO method calculates performance as an index of five different factors (such as health level, health responsiveness, and financial fairness). Some of these factors arguably shouldn’t be included at all, but set that aside. In order to assign weights to these five factors, they conducted a survey of “health experts” about their relative importance. This is equivalent to asking John the importance he attaches to looks and personality. Assume, for the sake of argument, that the resulting weights are sensible.

Nevertheless, the resulting composite scores are not sensible. Because the five component factors were measured on different scales to begin with, the WHO researchers had no choice but to scale them to make them comparable. When they scaled them, they used the approach described above: they measured a country’s factor score as the percentage of the distance between the lowest-scoring and highest-scoring countries for that factor.

As a result, a factor could have an exaggerated effect on the composite health performance scores if the raw scores for that factor were bunched more tightly than were other factors. If, for instance, financial fairness ranged from 0.5 to 10.0 on the fairness scale, countries with fairness of 0.5 would be treated as having a fairness of 0. Essentially, a country that is somewhat fair would get treated as not fair at all. (This is assuming the raw fairness measure is meaningful to begin with, which I suspect it is not.)

How would fixing this problem affect the WHO rankings? Honestly, I don’t know. In fact, there may be no objective answer to that question. Since the five factors are on different scales, some rescaling is unavoidable. But as soon as you rescale, the meaning of factor weights is questionable at best. What if John had said he cared equally about two things in women – body mass index (BMI) and intelligence (IQ)? What would it even mean to give equal weight to BMI points and IQ points? Any rescaling of BMI and IQ to make them “comparable,” e.g., by using the range or standard deviation, would unavoidably be affected by the relative dispersion of women on these two scales. Unless John could tell us which BMIs were equivalent to which IQs, John’s 50-50 weighting could be swamped by differences in dispersion that John may know nothing about. The same is true of the factor weights used to construct the WHO healthcare rankings.
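One way to see how dispersion can swamp the stated weights is to compute what a single raw point is worth in the composite once a factor has been rescaled by its range. The sketch below is only an illustration: the helper name is made up, and it assumes the lowest-to-highest rescaling described earlier, with John's looks scores spanning 0 to 10 and his personality scores spanning 6 to 8.

```python
def value_per_raw_point(weight, lowest, highest, top=10.0):
    """Composite points gained per one raw point, under lowest-to-highest rescaling."""
    return weight * top / (highest - lowest)


# John's stated weights are 50-50, but the observed ranges differ.
looks_rate = value_per_raw_point(0.5, lowest=0, highest=10)
personality_rate = value_per_raw_point(0.5, lowest=6, highest=8)

# How many raw looks points one raw personality point is worth.
exchange = personality_rate / looks_rate

print(looks_rate)        # 0.5
print(personality_rate)  # 2.5
print(exchange)          # 5.0
```

Under these assumptions, one raw personality point counts as much as five raw looks points, even though the nominal weights are equal. The implied exchange rate comes entirely from the ratio of the two ranges, which is exactly the kind of dispersion difference the person supplying the weights may know nothing about.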

6 comments:

Ari said...

This is an interesting topic. I would suggest that, in fact, John does not place "equal weighting" on personality and looks, because the vast majority of the variability in the final score is due to looks rather than personality.

Rather than assigning weights for the raw scores, we should be looking at how much the variability in these components contributes to the total. This is, essentially, the usual "analysis of variance" (ANOVA) approach in statistics, which eliminates these sorts of rescaling paradoxes.

Glen Whitman said...

Ari -- I would agree with you, except for one thing: there is no objective "final score" or "total" to be explained. The whole point is that the final scores or totals are constructed by somebody (WHO in this case); they have no independent reality. A different methodology would lead to different final scores.

Now, there may be some independent entity out there to be explained -- say, John's actual "all things considered" judgments on women, or a health policy analyst's "all things considered" assessments of healthcare systems. But the performance index created by WHO is intended to be an objective measure that observers can then use to form their opinions about healthcare systems.

Anonymous said...

I'm not even sure that it's necessarily correct to calculate the total score as the sum of the feature scores, no matter what normalization technique is used. Linear combinations are used because they're simple and convenient, not because they're somehow more correct.

Suppose you prefer a mate with a balance of features. (You don't want someone who is gorgeous but completely obnoxious, nor someone who you find very entertaining but repulsive.) You'd want to compute the total score as the sum of the roots of their partial scores. I can certainly see this as being desirable in a health system, such that a system is penalized for having any one aspect be inadequate.

Then again, you might prefer a mate who is a real standout at one thing or the other, and then the logical choice is to calculate something like the sum of the squares of the features. I'm not sure this is something you'd want out of a medical system, but perhaps if you were ranking individual hospitals, or maybe universities.

I believe both of these procedures further complicate the normalization problems mentioned above, but my point is that there are important underlying assumptions even when you're doing something as simple as adding up the sub-scores.
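The difference between these aggregation rules can be sketched as follows. The two score profiles are hypothetical, chosen only so that their plain sums are equal, and the function names are illustrative.

```python
import math


def linear_sum(scores):
    """Plain sum of the partial scores."""
    return sum(scores)


def sum_of_roots(scores):
    """Sum of square roots: rewards balance across features."""
    return sum(math.sqrt(s) for s in scores)


def sum_of_squares(scores):
    """Sum of squares: rewards being a standout on one feature."""
    return sum(s ** 2 for s in scores)


# Hypothetical profiles with the same plain sum
balanced = (5, 5)
lopsided = (10, 0)

for rule in (linear_sum, sum_of_roots, sum_of_squares):
    print(rule.__name__, rule(balanced), rule(lopsided))
```

The plain sum cannot tell the two profiles apart, the sum of roots ranks the balanced profile higher, and the sum of squares ranks the lopsided one higher.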

Benoit Essiambre said...

I think the problem here is in the statement "these characteristics are equally important to him (that is, he places 50% weight on looks, 50% weight on personality)."

This is underspecified because we don't know the scales being weighted.

Assuming that the underlying scale is a percentile ranking or a normalized score of a global population is not unreasonable as there are not that many other objective scales to work with.

The fact that Mr. Doe did not give the same variance to the different characteristics signifies that, on a percentile scale, either he doesn't actually give these characteristics the same weight (so the real weight is implicit in the variances he gave), or he happens to live in a region where the population is unusually homogeneous personality-wise.

A third option is that the 50-50 value he gave was actually based on some other scale only he knew.

There are other scales that could seem intuitively more natural than normalized scores, and that's what brings confusion to this situation. Maybe a percentile in a different population (e.g., a wider or narrower age group?) or a scale that is based on the normal rate of development of a human being. For example, we could define one personality point as the average personality gain in one year of a developing teenager (not unlike how the IQ scale was originally defined). It could be anything, but it should be specified.

Using a normalized score or a percentile ranking on a global population as a default is not that unreasonable but we should make sure the people who are asked for the weights understand this. The question that remains is whether humans are able to give objective and sincere answers to these questions when some other scale that is not standardized is more intuitive.

If we specify that normalized scales (on a global population) are to be weighted and then get answers like Mr. Doe's from the weighters, either they didn't follow the standardized-scale assumption or they were inconsistent in their answers.

Anonymous said...

Is there even a metric on the space under consideration? That is to say, is there a mechanism for deciding the 'closeness' of two coordinates in the result set? If not then mathematics dictates that there is no ordering criterion that can be brought to bear to decide on the desirability of any particular location within the problem space. It's intrinsically an ill-formed problem.

Unknown said...

It's a qualitative problem that was evaluated with quantitative tools. To correlate BMI with IQ, John should assign a 0-10 value to a range of values for both factors. For example,
0 for BMI <15 or >35, IQ <20
1 for BMI <16 or >34, IQ <50 or IQ > 150
.
.
10 for BMI 21 to 22, IQ 110 to 125