Comments on Agoraphilia: On Creating Bogus Data Sets

Handy. And you can keep those faked data sets in the drawer for when you teach robust statistics, since that's the data-generation assumption they fit.

Btw, Bayesians don't suggest always using CIs. We only claim that all the relevant information is in the posterior distribution of your unknowns. If you feel that some of that information is best squished into a point estimate and CI, that's fine, but it's only one way to show it.

— Anonymous, 2007-08-23 09:05

You can use the data from my lab and have no problem getting ridiculously high p-values.

— Anonymous, 2007-04-12 22:43

Don't you just have non-stationary data? You'd want to detrend it first.

— Anonymous, 2007-04-06 19:57

"Are you generating normally distributed errors? If you are, then I'm not sure I see how you could have a problem."

Yes, I used normally distributed errors, and I still had the problem.
By the time the variance was large enough to generate non-ridiculous p-values, the range of y-values was too wide (including negative numbers, for instance).

"If I've understood what you've done, the resulting model should have substantial bias, since the errors on the largest observations of x will be predominantly negative (presuming a positive beta) and the errors on the smallest observations will be predominantly positive."

Good point. I hadn't thought of it that way, but it makes sense. And it does explain why this method works to increase my p-values: I was thinking the effect came from pure randomness, and I didn't consider bias. Regardless, the method works fine for generating a bogus data set for my students to play with.

"My guess is that you are using too many data points. Lowering the number of points would have lowered the significance."

Good point. I was usually working with at least 30, and often 50, data points.

"The Bayesian school of thought says that null hypothesis testing (t-tests, ANOVAs, p-values) makes no sense. Especially when your hypothesis space is a continuous variable. It makes no sense because you are rejecting a hypothesis that has zero width. You are rejecting nothing."

Interesting point. I'd pondered that before, but I didn't know the Bayesians had something to say about it. However, the objection seems limited to two-tail tests. One-tail tests have a "less than or equal" (or "greater than or equal") null hypothesis, which does not have zero width. And any two-tail test can be reinterpreted as a one-tail test with a different significance level.

"Bayesians tell us that we should ALWAYS use confidence intervals."

The problem with CIs is that they require you to commit to a specific confidence level (say, 90%), whereas a p-value tells you the maximum confidence level you can get away with.
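Glen's remark that a p-value gives "the maximum confidence level you can get away with" can be sketched numerically: for a two-sided test, the interval built at confidence level 1 − p puts the hypothesized value exactly on its boundary. A minimal sketch with made-up numbers (not from the thread), using a normal approximation in place of the t distribution:

```python
# Illustrative only: made-up data-generating process, normal approximation
# to the t statistic (reasonable for n >= 30).
import math
import random

random.seed(1)
n = 50
x = [random.uniform(0, 10) for _ in range(n)]
# hypothetical DGP: y = 2 + 0.5 x + N(0, 4) noise
y = [2 + 0.5 * xi + random.gauss(0, 4) for xi in x]

# OLS slope, intercept, and standard error of the slope
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
bhat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
ahat = ybar - bhat * xbar
rss = sum((yi - ahat - bhat * xi) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(rss / (n - 2) / sxx)

z = bhat / se                                       # test stat for H0: beta = 0
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided p

# Interval at confidence level (1 - p): bhat +/- |z| * se.
# Its half-width |z| * se equals |bhat|, so the boundary nearer zero
# lands on zero (up to floating-point rounding).
boundary = bhat - math.copysign(abs(z) * se, bhat)
print(f"slope = {bhat:.3f}, two-sided p = {p:.4f}, boundary = {boundary}")
```

Any narrower (lower-confidence) interval excludes zero, which is exactly the "maximum confidence level" reading of the p-value.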
Also, a CI is isomorphic to a two-tail significance test, in that you'll reject the null if the hypothesized value (e.g., zero) falls outside your CI.

— Glen Whitman, 2007-04-05 12:13

My guess is that you are using too many data points. Lowering the number of points would have lowered the significance.

This is a perfect opportunity for me to push the ways of the Bayesians. We believe the problem you have stumbled upon is due to the nonsensical use of null hypothesis testing. You have made the same mistake that most scientists of the 20th century (and beyond) always make.

The Bayesian school of thought says that null hypothesis testing (t-tests, ANOVAs, p-values) makes no sense, especially when your hypothesis space is a continuous variable. It makes no sense because you are rejecting a hypothesis that has zero width. You are rejecting nothing.

With these tests, as long as you have enough data, you will ALWAYS be able to find significant results. (Unless the effect is exactly null, which pretty much never happens in real life. Even in the best cases, given enough data you'll find significance in the small bias of your measuring tools.)

Your results aren't significant? No problem! Just get more individuals into your data set. That's how easy these tests are to manipulate.

Bayesians tell us that we should ALWAYS use confidence intervals. The problem with that is that confidence intervals require the use of a prior distribution, which some think is not objective and biases your results.

Don't listen to them, though: Bayesians define maximum-entropy priors that exist precisely to be as unbiased and as objective as possible.

I urge you to read E.T.
Jaynes's fascinating book:
http://omega.albany.edu:8008/JaynesBook.html
(you can also buy it on Amazon)

There's also a neat intro here:
http://yudkowsky.net/bayes/technical.html

— Benoit Essiambre, 2007-04-04 17:40

Are you generating normally distributed errors? If you are, then I'm not sure I see how you could have a problem. If I've understood what you've done, the resulting model should have substantial bias, since the errors on the largest observations of x will be predominantly negative (presuming a positive beta) and the errors on the smallest observations will be predominantly positive. Aren't your coefficients sharply biased towards zero by this method? That, of course, would explain why your p-values increase.

— Anonymous, 2007-04-04 12:44
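The bias this last comment describes is easy to see by simulation: hold x and the true slope fixed, redraw normal errors many times, and keep only the draws whose p-value comes out "non-ridiculous", mimicking the bogus-data procedure. The kept fits have slopes pulled toward zero relative to the full set of fits. A sketch with made-up parameters (not the original poster's code; normal approximation for the p-value):

```python
# Illustrative simulation of selection bias from keeping only high-p draws.
import math
import random

random.seed(2)
n = 40
true_beta = 0.5
x = [random.uniform(0, 10) for _ in range(n)]
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

def fit_slope_and_p(y):
    """OLS slope and two-sided p for H0: beta = 0 (normal approximation)."""
    ybar = sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    zstat = b / se
    return b, 2 * (1 - 0.5 * (1 + math.erf(abs(zstat) / math.sqrt(2))))

all_slopes, kept_slopes = [], []
for _ in range(2000):
    # redraw normal errors on the same x and true slope
    y = [2 + true_beta * xi + random.gauss(0, 5) for xi in x]
    b, p = fit_slope_and_p(y)
    all_slopes.append(b)
    if p > 0.10:  # keep only data sets with a "non-ridiculous" p-value
        kept_slopes.append(b)

mean_all = sum(all_slopes) / len(all_slopes)
mean_kept = sum(kept_slopes) / len(kept_slopes)
print(f"mean slope, all draws: {mean_all:.3f}; kept draws only: {mean_kept:.3f}")
```

The unconditional fits average near the true slope, while the retained ones sit well below it: a high p-value here mostly means the error draw happened to cancel the signal, which is the bias the commenter is pointing at.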