A statistical hypothesis test does not do what people want it to do. People want to run some numbers and have some complicated formula tell them whether they are right or wrong.

When one carries out a statistical hypothesis test, one specifies a null hypothesis about a statistic. In the context of a parametric test, the null hypothesis specifies the entire distribution of the statistic. Given a value for the statistic measured from data, we can calculate the probability that we would have observed a value at least as extreme, assuming that the null hypothesis is true. This is called the p-value. It is a function of your null hypothesis and data.

Before conducting a statistical test, one specifies a significance level which is a threshold. If the probability of observing what we observed assuming the null is true is smaller than the significance level we decided before carrying out the test, then we reject the null hypothesis. Otherwise, we fail to reject it.

We never say anything like "the null hypothesis is true", "the null hypothesis was accepted", etc., because, no matter how many white swans you see, you never know when a black (or, for that matter, pink) swan will decide to make an appearance.

Too many people are prone to interpreting the p-value as the probability that the null hypothesis is true. Scientifically speaking, such interpretations give me the heebie jeebies.

A statement about a population parameter is either true or false. We just do not know whether it is true or not.

And here is the key thing to keep in mind, so one can hold on to the requisite humility that is the basis of scientific inquiry: in most situations of interest, we will never find out **the truth** on the basis of some statistical test.

All we can do is decide whether what we saw was improbable enough assuming the null is true, and, if so, reject the null hypothesis—in this particular instance.

Similarly, a statistical test cannot tell you if a given sequence of numbers was randomly generated.

That is a question about process: Were all the balls in the cage identical? Were there systematic influences acting on the dice? Etc, etc, …

If the process satisfies appropriate criteria, then every sequence you get out of that process is randomly generated.

If you reject certain sequences coming out of a process as being "not random enough", you change the characteristics of the generator so that it may no longer satisfy the criteria you set at the outset.

Why am I going on about this?

Because, Ovid wrote this:

Let's say I roll a six-sided die 60 times and I get the following results for the numbers 1 through 6, respectively:

16, 5, 9, 7, 6, 17

Is that really random? It doesn't look like it.

**You can't ask that question!**

You can only ask: Assuming the die we are using is fair, what is the probability of generating a frequency distribution that is as far away from the expected frequencies implied by rolling a fair die as the frequency distribution we observed?

Both the output from Statistics::ChiSquare and Ovid's discussion are further disheartening:

Unfortunately, that prints:

There's a >1% chance, and a <5% chance, that this data is random.

As it turns out, the chi-square test says there's only a 1.8% chance of those numbers being "fair".

No, it says no such thing! The die is either fair or not. In this extremely simple case, you can physically inspect the die to ensure that it satisfies whatever criteria for fairness you set out. There is a reason gaming commissions list physical requirements for dice rather than having some machine roll them 60 or even 600 times.

No, a p-value of 1.8% gives the probability of observing a value of the χ^{2} statistic at least as large as this one, assuming the frequencies were generated by rolling a fair die.

Is that improbable enough to reject the null? Well, that's up to you. If, on the basis of that p-value, you declare that the die is not "fair", you will commit a Type-I error in approximately 1 out of every 50 samples of rolls from a fair die. If dice are cheap, and if the integrity of the process is of paramount importance, you might want to have an even higher tolerance for erroneously throwing away dice.

This frequency distribution gives a χ^{2} = 13.6. The rest of the test is as easy as looking up a critical value in a table.
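For reference, here is the arithmetic behind that number written out. With 60 rolls of a six-sided die, the expected frequency under the null is 60/6 = 10 for every face. The symbols O_i (observed count for face i) and E_i (expected count) are my notation, not Ovid's:

```latex
\chi^{2}
  = \sum_{i=1}^{6} \frac{(O_i - E_i)^{2}}{E_i}
  = \frac{6^{2} + (-5)^{2} + (-1)^{2} + (-3)^{2} + (-4)^{2} + 7^{2}}{10}
  = \frac{136}{10}
  = 13.6
```

With six categories there are 6 − 1 = 5 degrees of freedom, so that is the row of the table to look in.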

This is a common problem with Stats and Econ. The apparent simplicity of the arithmetic involved in intro classes leads smart people to assume the subjects themselves are as simplistic as their thinking.

You cannot test whether a given set of observations is "random". The null hypothesis for a goodness of fit test is simple: The observed distribution of values matches a distribution implied by a specific random variable.

Because of the underlying randomness, the observed frequencies will rarely match the expected frequencies exactly.

If the probability of seeing differences at least this large by chance alone, assuming the null hypothesis is true, is small enough, that provides evidence against the null hypothesis that the observations were produced by the hypothesized process. Statements such as "this data is random" or "the chi-square test says there's only a 1.8% chance of those numbers being 'fair'" are simply *nonsensical*.

The author of the page Ovid linked to actually did make an effort to correctly phrase the conclusion:

We then compare the value calculated in the formula above to a standard set of tables. The value returned from the table is 1.8%. We interpret this as meaning that if the die was fair (or not loaded), then the chance of getting a χ^{2} statistic as large or larger than the one calculated above is only 1.8%. In other words, there's only a very slim chance that these rolls came from a fair die.

Please read and re-read Ovid's paraphrasing and the original as many times as necessary until the difference sinks in. The difference is subtle and might seem unimportant to you, but it is life & death in most circumstances (as an aside, please do not talk about government control of the health care and health insurance industries until you understand what Dallas Buyers Club really is about).

The original text, correctly, states P(χ^{2} > 13.6 | die is fair) = 1.8%.

Based on that information, you cannot conclude P(die is fair) = 1.8%.
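To see why, it can help to write down what the second quantity would require. Even if one is willing to treat "the die is fair" as something that has a probability at all, Bayes' rule needs ingredients the test never supplies. In the sketch below, D is my shorthand for the event χ^{2} ≥ 13.6:

```latex
P(\text{fair} \mid D)
  = \frac{P(D \mid \text{fair})\, P(\text{fair})}
         {P(D \mid \text{fair})\, P(\text{fair})
          + P(D \mid \text{not fair})\, P(\text{not fair})}
```

The test gives only P(D | fair) = 1.8%. Without a prior P(fair) and some model of how unfair dice behave, the left-hand side is simply not determined.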

You have a responsibility to understand the relatively easier statistical concepts if you are going to dabble in them because your ability to make sense of what appears to be gibberish to other people makes them trust you with everything they mistake for gibberish.

Now, suppose we had obtained the frequencies `[9, 11, 10, 8, 12, 10]`. These frequencies would have yielded a χ^{2} value of **1** with a corresponding p-value of about 96%.
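Both statistics are easy to check with a few lines of Perl. This is a minimal sketch: `chisq_uniform` is a helper name I made up for illustration, and it assumes every category is equally likely under the null, as with a fair die.

```perl
use strict;
use warnings;
use List::Util qw( sum );

# Pearson's chi-square statistic for observed frequencies, assuming
# (hypothetically) that all categories are equally likely under the null.
sub chisq_uniform {
    my @freq = @_;
    my $expected = sum(@freq) / scalar @freq;
    return sum( map { ($_ - $expected)**2 / $expected } @freq );
}

printf "%.1f\n", chisq_uniform(16, 5, 9, 7, 6, 17);    # Ovid's rolls
printf "%.1f\n", chisq_uniform(9, 11, 10, 8, 12, 10);  # the flatter table
```

This only reproduces the statistic. Turning it into a p-value still needs the χ^{2} distribution with 5 degrees of freedom, whether from a table or a library.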

If 100 people each roll a fair die 60 times, about 96 of them can be expected to generate χ^{2} statistic values of at least **1**. Therefore, rejecting the null on the basis of χ^{2} ≥ 1 would result in a Type-I error about 96 times out of 100.

That's all that probability means.

If you do roll a die 60 times and get the frequencies `[9, 11, 10, 8, 12, 10]`, can you be sure the die rolls were random?

Of course not!

A process that *always* yields the sequence `1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 2, 2, 3, 5, 5, 6, 1, 2, 3, 5, 5, 6, 1, 2, 3, 4, 5, 6, … 1, 2, 3, 4, 5, 6` will never fail this test.

That would not be a very random process, would it?

What if the process always generated the frequencies `[n, n, n, n, n, n]` whenever you generated `6n` rolls? The result would lead you to never reject the null by always yielding a χ^{2} statistic of 0.

Pearson's goodness-of-fit test is useful. A small p-value does tell me that the frequencies I observed are not very likely if the null hypothesis is true. Failure to reject, however, cannot tell me that the null hypothesis is true. All it can tell me is that the frequencies I got are not inconsistent with the null.

If I have a data generating process, I can check if the sampling distribution of a specific statistic obtained from samples generated by this data generating process matches its theoretical sampling distribution. One can always use K-S, but I am going to be lazy and just check if the empirical distribution of χ^{2} values coming from my data generating process matches the theoretical one at select critical points:

```perl
#!/usr/bin/env perl

use 5.016;
use warnings;

use Carp;
use List::Util qw( sum );
use Statistics::Descriptive::Weighted;
use Text::Table::Tiny;

run(10_000);

sub run {
    my $samples = shift;
    my $stat = Statistics::Descriptive::Weighted::Full->new();

    # One chi-square statistic per simulated sample of 60 rolls
    for (1 .. $samples) {
        $stat->add_data( [ chisq_fairdie(60, 6) ] );
    }

    # https://people.richland.edu/james/lecture/m170/tbl-chi.html
    # right tail probabilities
    # df = 5
    my @x = (
        0.412, 0.554, 0.831, 1.145, 1.610,
        9.236, 11.070, 12.833, 15.086, 16.750,
    );
    my @p = (
        0.995, 0.990, 0.975, 0.950, 0.900,
        0.100, 0.050, 0.025, 0.010, 0.005,
    );

    say Text::Table::Tiny::table(
        header_row => 1,
        rows => [
            [ qw(critical p ecdf) ],
            map [
                sprintf('%.3f', $x[$_]),
                sprintf('%.3f', $p[$_]),
                # empirical right-tail probability at this critical value
                sprintf('%.3f', 1 - $stat->cdf($x[$_])),
            ], 0 .. $#x,
        ],
    );
}

sub chisq_fairdie {
    my ($sample_size, $faces) = @_;
    my @freq = (0) x $faces;

    # rand($faces) is truncated to an integer index in 0 .. $faces - 1
    for (1 .. $sample_size) {
        $freq[ rand($faces) ] += 1;
    }

    if ($faces != @freq) {
        croak sprintf(
            'Expected faces=%d, observed=%d',
            $faces, scalar @freq
        );
    }

    # Expected frequency per face under the null of a fair die
    my $e = $sample_size / $faces;
    return sum( map +((($_ - $e)**2) / $e), @freq );
}
```

Sample output:

```
+----------+-------+-------+
| critical | p     | ecdf  |
+----------+-------+-------+
| 0.412    | 0.995 | 0.992 |
| 0.554    | 0.990 | 0.992 |
| 0.831    | 0.975 | 0.970 |
| 1.145    | 0.950 | 0.953 |
| 1.610    | 0.900 | 0.885 |
| 9.236    | 0.100 | 0.094 |
| 11.070   | 0.050 | 0.048 |
| 12.833   | 0.025 | 0.023 |
| 15.086   | 0.010 | 0.008 |
| 16.750   | 0.005 | 0.005 |
+----------+-------+-------+
```

This is using ActiveState Perl 5.16.3 on my Windows XP SP3 machine.

You wrote "You have a responsibility to understand the relatively easier statistical concepts if you are going to dabble in them"

With this in mind, I would be glad if you could have a look at Curtis `Ovid' Poe's presentation "A/B testing: what your mother never told you" at FOSDEM 2014 last weekend ( link ).

I think the take-home message of that presentation was that A/B testing enables you to conclude (e.g.) "Web page with blue background generates significantly more click-thrus than web page with pink background. The difference was significant at the nn% level".

My statistics is a bit rusty but this would seem to be at variance with your piece, as well as with what others say about significance tests - e.g. "Why hypothesis and significance tests ask the wrong questions" by Rob Herbert ( link )

Richard H

I haven't seen the presentation, but the sentences you quoted are not necessarily problematic, so long as one strictly interprets the word "significantly" as "statistically significantly". The two are really different beasts (and almost orthogonal concepts).

If the chosen significance level is 1%, that means they are OK with incorrectly saying a blue background generates more clicks than a pink background in 1 out of 100 experiments.

I have never done A/B testing. I once asked another person who was writing about it how they decided on their sample size. I was told that they ran the experiment until their statistical tests declared a significant difference between the two treatments.

I hope that was an aberration, because THAT is a serious problem.

Agreed that this is a serious issue. I touched on it briefly in the presentation and referenced http://blog.booking.com/is-your-ab-testing-effort-just-chasing-statistical-ghosts.html, a great blog entry describing the problem. Unfortunately, it's a problem endemic in A/B testing and most people get it wrong.

Thanks for the clarifications, I appreciate it.
