Wednesday, April 04, 2007

On Creating Bogus Data Sets

For a class I teach on economic statistics, I occasionally need to cook up a bogus data set for my students to analyze using regression analysis and other tools. This turns out to be a much more difficult task than you might expect.

The main problem is that almost any data set I create that actually represents a true underlying relationship (e.g., output is some function of labor and capital) comes out with insanely high levels of statistical significance (p-values in the hundred-thousandths or millionths). And no, I’m not making the mistake of failing to include an error term; these high levels of statistical significance persist even when I include random errors with fairly high variances.

The significance levels do decrease with higher variance in the error term, but they’re still ridiculously high. By the time I’ve inserted enough variance to get more believable p-values, the y-values have become untenable (for instance, producing negative values for output). Those with econometrics experience will say this is just a problem of truncated variables – which is true, but the econometric methods needed to deal with truncated variables are far beyond the level of the class.
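The trade-off can be seen in a small simulation. What follows is only a sketch in Python using the standard library. The linear relationship (y = 20 + 2x), the x-range, the sample size of 50, and the sigma values are all made-up assumptions, not the numbers I used for the class. It reports the t-statistic on the slope and how many y-values come out negative as the error standard deviation grows.

```python
import math
import random

def slope_t_stat(x, y):
    """Return (slope, t-statistic) for a simple OLS fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope / std_err

def noise_tradeoff(sigma, n=50, seed=0):
    """Hypothetical model y = 20 + 2x + e, with e ~ Normal(0, sigma)."""
    rng = random.Random(seed)
    x = [rng.uniform(10, 100) for _ in range(n)]
    y = [20 + 2 * xi + rng.gauss(0, sigma) for xi in x]
    _, t = slope_t_stat(x, y)
    negatives = sum(1 for yi in y if yi < 0)
    return t, negatives

for sigma in (10, 50, 200):
    t, negatives = noise_tradeoff(sigma)
    print(f"sigma={sigma:>3}  t={t:6.2f}  negative y-values={negatives}")
```

Because the same seed is reused, each run draws the same underlying shocks and only their scale changes, so the t-statistic falls as sigma rises while the count of negative y-values can only stay the same or grow.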

I’ve tried other methods of solving the problem, such as including additional explanatory variables when creating the y-values and then omitting those variables from the regression. This has been only moderately successful, especially since it’s hard to control how much effect your omitted variables will have on the regression results.

I have found one method, however, that allows me to create realistic-looking data sets and exercise a modicum of control over how large or small the significance levels are. It is this: I create the data set using a true underlying relationship as well as an error term. Then I replace some fraction q of the y-values with random numbers, in the same approximate range as the other y-values, but completely unaffected by the corresponding x-values. By changing q, I can approximately target the levels of significance that will pop out of the regression. I’ve found that a q between 20% and 40% of the observations will usually generate significance levels in the “interesting” range (p-values ranging from 0.001 to 0.30 or so).
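Here is a minimal sketch of the replacement method in Python (standard library only). The underlying relationship, x-range, sample size, and noise level are hypothetical placeholders, and the p-value uses a normal approximation to the t distribution, so it is rough for small samples.

```python
import math
import random

def slope_t_stat(x, y):
    """Return (slope, t-statistic) for a simple OLS fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, slope / std_err

def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value of a t-statistic."""
    return math.erfc(abs(t) / math.sqrt(2))

def make_bogus_data(n=50, q=0.3, sigma=10.0, seed=0):
    """Build y = 20 + 2x + e (assumed model), then overwrite a fraction q of y."""
    rng = random.Random(seed)
    x = [rng.uniform(10, 100) for _ in range(n)]
    y = [20 + 2 * xi + rng.gauss(0, sigma) for xi in x]
    lo, hi = min(y), max(y)  # keep replacements in the same range as the real y
    for i in rng.sample(range(n), round(q * n)):
        y[i] = rng.uniform(lo, hi)  # unrelated to x[i]
    return x, y

for q in (0.0, 0.2, 0.3, 0.4):
    x, y = make_bogus_data(q=q)
    slope, t = slope_t_stat(x, y)
    print(f"q={q:.1f}  slope={slope:.3f}  t={t:.2f}  approx p={approx_two_sided_p(t):.2g}")
```

Whether a given q lands in the "interesting" range depends on the sample size and the noise level, so expect to try a few values of q (and a few seeds) before settling on a data set.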

As for what we can conclude about the real world and the statistical tools we use to analyze it, make of this what you will.

8 comments:

Jonathan said...

Are you generating normally distributed errors? If you are, then I'm not sure I see how you could have a problem. If I've understood what you've done, the resulting model should have substantial bias, since the errors on the largest observations of x will be predominantly negative (presuming a positive beta) and the errors on the smallest observations will be predominantly positive. Aren't your coefficients sharply biased towards zero by this method? This of course would explain why your p values increase.
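The bias toward zero described here can be checked by simulation. The sketch below is in Python (standard library only) and rests on assumptions not stated in the post: a hypothetical model y = 20 + 2x + e, 50 points per data set, and 200 repetitions. Since the replaced y-values are unrelated to x, the average fitted slope should come out at roughly (1 - q) times the true slope.

```python
import random

def ols_slope(x, y):
    """Slope of a simple OLS fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def average_slope(q, beta=2.0, n=50, reps=200, seed=0):
    """Average fitted slope over many data sets with a fraction q of y replaced."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [rng.uniform(10, 100) for _ in range(n)]
        y = [20 + beta * xi + rng.gauss(0, 10) for xi in x]
        lo, hi = min(y), max(y)
        for i in rng.sample(range(n), round(q * n)):
            y[i] = rng.uniform(lo, hi)  # replacement ignores x[i]
        total += ols_slope(x, y)
    return total / reps

for q in (0.0, 0.2, 0.4):
    print(f"q={q:.1f}  average slope={average_slope(q):.3f}  (1-q)*beta={(1 - q) * 2.0:.3f}")
```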

Benoit said...

My guess is that you are using too many data points. Lowering the number of points would have lowered the significance.

This is a perfect opportunity for me to push the ways of the Bayesians. We believe the problem you have stumbled upon is due to the nonsensical use of null hypothesis testing. You have made the same mistake that most scientists of the 20th century (and beyond) make.

The Bayesian school of thought says that null hypothesis testing (t-tests, ANOVAs, p-values) makes no sense, especially when your hypothesis space is a continuous variable. It makes no sense because you are rejecting a hypothesis that has zero width. You are rejecting nothing.

With these tests, as long as you have enough data, you will _ALWAYS_ be able to find significant results. (Unless the effect is exactly null, which pretty much never happens in real life. Even in the best cases, if you have enough data you'll find significance in the small bias of your measuring tools.)

Your results aren't significant? No problem! Just get more individuals in your data! That's how easy they are to manipulate.

Bayesians tell us that we should ALWAYS use confidence intervals. The problem with that is that confidence intervals require the use of a prior distribution, which some think is not objective and biases your result.

Don't listen to them, though: Bayesians define maximum entropy priors that exist simply to be the least biased and most objective possible.

I urge you to read E.T. Jaynes's fascinating book.
http://omega.albany.edu:8008/JaynesBook.html
(you can also buy it on amazon)

There's also a neat intro here:
http://yudkowsky.net/bayes/technical.html

Glen Whitman said...

"Are you generating normally distributed errors? If you are, then I'm not sure I see how you could have a problem."

Yes, I used normally distributed errors, and I still had the problem. By the time the variance was large enough to generate non-ridiculous p-values, the range of y-values was too wide (including negative numbers, for instance).

"If I've understood what you've done, the resulting model should have substantial bias, since the errors on the largest observations of x will be predominantly negative (presuming a positive beta) and the errors on the smallest observations will be predominantly positive."

Good point. I hadn't thought of it that way, but it makes sense. And that does explain why this method works to increase my p-values. I was thinking the effect was from pure randomness; I didn't consider bias. Regardless, the method works fine for generating a bogus data set for my students to play with.

"My guess is that you are using too many data points. Lowering the number of points would have lowered the significance."

Good point. I usually was working with at least 30 and often 50 data points.

"The Bayesian school of thought says that null hypothesis testing (t-tests, ANOVAs, p-values) makes no sense, especially when your hypothesis space is a continuous variable. It makes no sense because you are rejecting a hypothesis that has zero width. You are rejecting nothing."

Interesting point. I'd pondered that before, but I didn't know the Bayesians had something to say about it. However, the objection seems limited to two-tail tests. One-tail tests have a "less than or equal" (or "greater than or equal") null hypothesis, which does not have zero width. And any two-tail test can be reinterpreted as a one-tail test with a different significance level.

"Bayesians tell us that we should ALWAYS use confidence intervals."

The problem with CIs is that they require you to use a specific confidence level (say, 90%), as opposed to looking at a p-value, which allows you to know the maximum confidence level you can get away with. Also, a CI is isomorphic to a two-tail significance test, in that you'll reject the null if the hypothesized value (e.g., zero) falls outside your CI.

Ben said...

I think one-tailed tests suffer from the same problem, since you have to know beforehand that one half of the space is not possible. The test only disproves that additional but infinitely small slice where the value is null. I don't know if this makes sense to you. One-tailed tests make even less sense to me than the standard null hypothesis tests because of all these "beforehand knowledge" requirements.

"CI is isomorphic to a two-tail significance test"

Not really, because the CI also gives you an idea of the size of the effect. Consider an experiment where you are looking for some temperature difference due to a new green insulation product for houses. The results show that the CI is probably bounded between 0.01 and 0.03 degrees of increase in warmth kept in. Both of these values are essentially zero, but a null hypothesis test would still be significant!

The problem is that in real life effects are almost never exactly zero and null hypothesis tests disregard the size of the effect.

In this case there might be a small effect from the use of different thermometers or other variables that are not 100% controlled. When looking at the CI, this shows immediately: the new insulation does not work. With a null hypothesis test, the company could boast that their insulation is better (with "p=whatever" confidence). However, by looking at the CI you immediately know that it's not better by any useful amount.
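To put rough numbers on this example: assuming the 0.01 to 0.03 range is a symmetric 95% interval built from a normal approximation (an assumption; the comment does not say), the implied test statistic and p-value can be backed out with a few lines of Python. The function name and the 1.96 critical value are illustrative choices.

```python
import math

def z_and_p_from_ci(lower, upper, z_crit=1.96):
    """Back out the estimate, z-statistic and two-sided p-value for H0: effect = 0.

    Assumes a symmetric normal-theory interval: estimate +/- z_crit * std_err.
    """
    estimate = (lower + upper) / 2
    std_err = (upper - lower) / (2 * z_crit)
    z = estimate / std_err
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return estimate, z, p

estimate, z, p = z_and_p_from_ci(0.01, 0.03)
print(f"estimate={estimate:.3f} degrees  z={z:.2f}  p={p:.2g}")
```

Under these assumptions the p-value comes out well below 0.001, so the test is "highly significant" even though the whole interval sits within a few hundredths of a degree of zero.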

Ben said...

Oh, and the reason Bayesians have something to say about this is that calculating confidence intervals forces you to use a prior. Usually an implicit flat prior is assumed, but situations arise when using certain units and dimensions where the flat prior is not the most unbiased...

Ryan said...

Don't you just have non-stationary data? You'd want to detrend it first.

niru said...

You can use the data from my lab and get no problem with ridiculously high p values.

Will said...

Handy. And you can keep those faked data sets in the drawer for when you teach robust statistics, since that's the data generation assumption they fit.

Btw Bayesians don't suggest always using CIs. We only claim that all the relevant information is in the posterior distribution of your unknowns. If you feel that some of that information is best squished into a point estimate and CI, then that's fine, but it's only one way to show it.