For a class I teach on economic statistics, I occasionally need to cook up a bogus data set for my students to analyze using regression analysis and other tools. This turns out to be a much more difficult task than you might expect.
The main problem is that almost any data set I create that actually represents a true underlying relationship (e.g., output is some function of labor and capital) comes out with insanely high levels of statistical significance (p-values in the hundred-thousandths or millionths). And no, I’m not making the mistake of failing to include an error term; these absurd levels of statistical significance persist even when I include random errors with fairly high variances.
The p-values do rise as the variance of the error term increases, but they remain ridiculously small. By the time I’ve inserted enough variance to get more believable p-values, the y-values have become untenable (for instance, producing negative values for output). Those with econometrics experience will say this is just a problem of truncated variables – which is true, but the econometric methods needed to deal with truncated variables are far beyond the level of the class.
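To see the problem concretely, here is a quick Python sketch (the numbers – 50 observations, a true slope of 2, an error with standard deviation 20 – are invented for illustration, not from any actual class data set):

```python
import math
import random

def slope_and_t(x, y):
    """Slope and t-statistic from a simple one-variable OLS regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                        # slope estimate
    a = my - b * mx                      # intercept estimate
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in resid) / (n - 2)   # residual variance
    return b, b / math.sqrt(s2 / sxx)    # slope, t-statistic

random.seed(1)
x = list(range(1, 51))                                 # e.g., units of labor
y = [10 + 2 * xi + random.gauss(0, 20) for xi in x]    # output, plus a noisy error
b, t = slope_and_t(x, y)
print(f"slope = {b:.2f}, t = {t:.1f}")
```

Even with that much noise, the t-statistic on the slope comes out enormous, and t-statistics of that size translate into p-values many orders of magnitude below any conventional threshold.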
I’ve tried other methods of solving the problem, such as including additional explanatory variables when creating the y-values and then omitting those variables from the regression. This has been only moderately successful, especially since it’s hard to control how much effect your omitted variables will have on the regression results.
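For what it’s worth, the omitted-variable trick looks something like this sketch (Python again, with invented numbers; x2 is the variable used to generate y but withheld from the regression):

```python
import math
import random

def slope_and_t(x, y):
    """Slope and t-statistic from a simple one-variable OLS regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in resid) / (n - 2)
    return b, b / math.sqrt(s2 / sxx)

random.seed(2)
x1 = list(range(1, 41))
x2 = [random.uniform(0, 50) for _ in x1]   # generated here, then omitted below
# y depends on both variables plus a small error term
y = [5 + 2 * u + 3 * v + random.gauss(0, 10) for u, v in zip(x1, x2)]
# the regression sees only x1, so x2's contribution behaves like extra noise
b, t = slope_and_t(x1, y)
print(f"slope = {b:.2f}, t = {t:.1f}")
```

How much this tames the t-statistic depends on the omitted coefficient and on how x2 happens to correlate with x1 in a given draw – which is exactly the control problem just described.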
I have found one method, however, that allows me to create realistic-looking data sets and exercise a modicum of control over how large or small the significance levels are. It is this: I create the data set using a true underlying relationship as well as an error term. Then I replace some fraction q of the y-values with random numbers, in the same approximate range as the other y-values but completely unaffected by the corresponding x-values. By changing q, I can approximately target the levels of significance that will pop out of the regression. I’ve found that a q between 20% and 40% will usually generate significance levels in the “interesting” range (p-values from about 0.001 to 0.30).
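One way that recipe could be coded up (illustrative numbers again; with q = 0.3, roughly a third of the y-values get overwritten with uniform noise drawn from the observed range):

```python
import math
import random

def slope_and_t(x, y):
    """Slope and t-statistic from a simple one-variable OLS regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in resid) / (n - 2)
    return b, b / math.sqrt(s2 / sxx)

random.seed(3)
x = list(range(1, 51))
y = [10 + 2 * xi + random.gauss(0, 20) for xi in x]    # true relationship + error

q = 0.3                                    # fraction of y-values to scramble
lo, hi = min(y), max(y)
# replace roughly a fraction q of the y-values with random numbers in the
# same approximate range, unrelated to the corresponding x-values
y_mixed = [random.uniform(lo, hi) if random.random() < q else yi for yi in y]

b0, t0 = slope_and_t(x, y)        # regression on the clean data
b1, t1 = slope_and_t(x, y_mixed)  # regression on the partly scrambled data
print(f"clean t = {t0:.1f}, scrambled t = {t1:.1f}")
```

Turning q up pushes the t-statistic down toward the “interesting” range; the exact p-value still bounces around from draw to draw, which is why the targeting is only approximate.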
As for what we can conclude about the real world and the statistical tools we use to analyze it, make of this what you will.