This post will seem frivolous at first. It’s not. Wait for the punchline.
Suppose John Doe cares about two things in a woman: looks and personality. Moreover, he says these characteristics are equally important to him (that is, he places 50% weight on looks, 50% weight on personality).
The World Mating Association (WMA) would like to create a ranking of women according to John’s preferences. So the WMA assembles a group of women and marches them in front of John, and he scores each woman’s looks on a scale from 0 to 10. Then he interacts with each woman (from behind a screen, if you insist) and scores each woman’s personality on a scale from 0 to 10.
The scores John gives for looks range all over the map, from 0 to 10, while the scores he gives for personality are bunched together in the 6 to 8 range. (Maybe John is a relatively tolerant guy when it comes to personality, though he doesn’t think anyone is super-fantastic.)
To calculate composite scores, the WMA decides to rescale the personality scores. It calculates each woman’s personality score as follows: personality = 10 × (raw score – 6) / (8 – 6). In other words, it measures a woman’s score as the fraction of the distance between the lowest-scoring and highest-scoring women, expressed on a 0-to-10 scale. A woman John gave a 7 would be rescaled to a 5, because she’s halfway between 6 and 8. A woman he gave a 6 would now be a 0, and a woman he gave an 8 would now be a 10.
Once the personality scores have been rescaled, the WMA computes each woman’s composite score by computing the weighted average of the scores, using the weights John provided (50% and 50%).
Does this method make sense? Well, let’s see. Take two women, Alma and Betsy. Alma got a 9 on looks and a 6 (now rescaled to 0) on personality. Betsy got a 6 on looks and a 7 (now rescaled to 5) on personality. So Alma’s and Betsy’s composite scores are 4.5 and 5.5 respectively. Betsy wins!
“But wait a minute,” John objects. “I said looks and personality were equally important to me. Alma’s three whole points better looking than Betsy. And while her personality is not quite as nice, it’s not that different. I thought they were both nice enough. All things considered, I’d give Alma a 7.5 and Betsy a 6.5. What I’m trying to say is, I like Alma better!”
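For anyone who wants to check the arithmetic, here is a minimal Python sketch of the two calculations: the WMA’s rescaled composite and the plain average John has in mind. The function names (rescale, composite) are my own illustrative choices, not anything the WMA published.

```python
def rescale(raw, lowest, highest, top=10.0):
    """Min-max rescaling: where raw sits between lowest and highest, mapped onto 0..top."""
    return top * (raw - lowest) / (highest - lowest)


def composite(looks, personality, w_looks=0.5, w_personality=0.5):
    """Weighted average of the two scores, using John's stated 50/50 weights."""
    return w_looks * looks + w_personality * personality


# Raw scores from the example: (looks, personality)
scores = {"Alma": (9, 6), "Betsy": (6, 7)}

# WMA method: personality is rescaled using the observed range of 6 to 8.
wma = {name: composite(looks, rescale(pers, 6, 8))
       for name, (looks, pers) in scores.items()}

# John's own reckoning: average the raw scores as he gave them.
john = {name: composite(looks, pers)
        for name, (looks, pers) in scores.items()}

print(wma)   # {'Alma': 4.5, 'Betsy': 5.5}
print(john)  # {'Alma': 7.5, 'Betsy': 6.5}
```

Same weights, same raw scores, opposite ranking. The only thing that changed is the rescaling step.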
The problem, obviously, is the rescaling. John said personality matters just as much as looks to him – but fortunately for him, he likes most women’s personalities. The WMA’s approach exaggerated the significance of personality to John by treating women whose personalities he liked somewhat (6’s) as women he didn’t like at all, and women whose personalities he liked a lot (8’s) as women he thought were flawless.
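One way to see the size of the distortion is to ask how many composite points a single raw point is worth under each factor. The sketch below is my own back-of-the-envelope framing (the function name is hypothetical), and it assumes the looks scores really do span the full 0 to 10, as described above, so that rescaling them would change nothing.

```python
def composite_points_per_raw_point(weight, lowest, highest, top=10.0):
    """How much the composite moves when the raw score moves by one point,
    after min-max rescaling the factor onto 0..top."""
    return weight * top / (highest - lowest)


# Looks: raw scores span 0 to 10 (assumed), weight 50%.
looks_effect = composite_points_per_raw_point(0.5, 0, 10)

# Personality: raw scores bunched between 6 and 8, weight 50%.
personality_effect = composite_points_per_raw_point(0.5, 6, 8)

print(looks_effect)                       # 0.5
print(personality_effect)                 # 2.5
print(personality_effect / looks_effect)  # 5.0
```

So although John said 50/50, a raw point of personality ends up counting five times as much as a raw point of looks, purely because the personality scores were bunched into a range one-fifth as wide.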
Okay, so the WMA’s approach is clearly flawed. The punchline is that the method I’ve just described is the method the World Health Organization (WHO) used to make its composite scores of healthcare system performance. These are the scores used to create the widely cited rankings of nations’ healthcare systems.
As I’ve noted in previous posts, the WHO method calculates performance as an index of five different factors (such as health level, health responsiveness, and financial fairness). Some of these factors arguably shouldn’t be included at all, but set that aside. In order to assign weights to these five factors, the WHO researchers conducted a survey of “health experts” about their relative importance. This is equivalent to asking John the importance he attaches to looks and personality. Assume, for the sake of argument, that the resulting weights are sensible.
Nevertheless, the resulting composite scores are not sensible. Because the five component factors were measured on different scales to begin with, the WHO researchers had no choice but to scale them to make them comparable. When they scaled them, they used the approach described above: they measured a country’s factor score as the percentage of the distance between the lowest-scoring and highest-scoring countries for that factor.
As a result, a factor could have an exaggerated effect on the composite health performance scores if the raw scores for that factor were bunched more tightly than the raw scores for other factors. If, for instance, financial fairness ranged from 0.5 to 10.0 on the fairness scale, countries with fairness of 0.5 would be treated as having a fairness of 0. Essentially, a country that is somewhat fair would get treated as not fair at all. (This is assuming the raw fairness measure is meaningful to begin with, which I suspect it is not.)
How would fixing this problem affect the WHO rankings? Honestly, I don’t know. In fact, there may be no objective answer to that question. Since the five factors are on different scales, some rescaling is unavoidable. But as soon as you rescale, the meaning of factor weights is questionable at best. What if John had said he cared equally about two things in women – body mass index (BMI) and intelligence (IQ)? What would it even mean to give equal weight to BMI points and IQ points? Any rescaling of BMI and IQ to make them “comparable,” e.g., by using the range or standard deviation, would unavoidably be affected by the relative dispersion of women on these two scales. Unless John could tell us which BMIs were equivalent to which IQs, John’s 50-50 weighting could be swamped by differences in dispersion that John may know nothing about. The same is true of the factor weights used to construct the WHO healthcare rankings.