
How to control your long run error rates

This page provides a few useful resources and web links for understanding and using standard (i.e. Neyman Pearson) statistics. You should read Chapter Three of Understanding psychology as a science first as an introduction to the issues. This webpage introduces some further technical details so you can apply the ideas to research more easily.

To test your intuitions concerning Bayesian versus Orthodox statistics, try this quiz.

For further explication: Dienes, Z. (2011). Bayesian versus Orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6(3), 274-290.

The aim in traditional (Neyman Pearson) statistics is to use decision procedures with known, controlled long run error rates for accepting and rejecting hypotheses. Two hypotheses are formulated to pit against each other: the null and the alternative. The outcome of the decision procedure is to accept one hypothesis and reject the other. You can be in error by rejecting the null when it is actually true (Type I error) or by accepting the null when it is actually false (Type II error). The error rates for both types of error must be controlled.

1. Controlling Type I errors.

Students are in general taught about controlling Type I error rates: Given the normal training of psychologists, it may seem the whole objective of any statistical procedure is to tell you whether p < .05 or not. It is this sentiment that has led to some of the frequent criticisms of significance testing. (See also Rozeboom's classic "The fallacy of the null-hypothesis significance test".) To some extent these criticisms reflect the practice of only trying to control Type I errors and not employing the rest of the Neyman Pearson logic. Of course, many criticisms are not about the misuse of Neyman Pearson, but its very conceptual basis, as we discuss in the book.

When more than one significance test is conducted the question arises as to the long term error rate of the group ('family') of tests as a whole. The familywise error rate is the probability of falsely rejecting at least one null. Bonferroni is a generic way of controlling familywise error rate: If your family consists of k tests, use .05/k as the significance level for each individual test. For example, if you were conducting three t-tests, you could reject any individual null only if its p < .05/3 = .017. With such a decision procedure, you would reject one or more of three true nulls no more than 5% of the time in the long run. Bonferroni controls familywise error rate to be no more than .05, but it does so more severely than needed: Type II error rates are higher than necessary. A decision procedure that controls familywise error just as well without increasing Type II error as much as standard Bonferroni is this sequential Bonferroni procedure, which can be used for a family of independent significance tests (whether t-tests, correlations, chi-squared tests, etc.). Take your k p-values and order them from smallest to largest. For example, if your p-values were .024, .001, .12, and .022 (so k = 4 in this example), you would order them:

p(1) = .001
p(2) = .022
p(3) = .024
p(4) = .12

Next, construct a threshold value for each p-value: p(1)'s threshold is .05/k, p(2)'s is .05/(k-1), p(3)'s is .05/(k-2), and so on, so that p(k) is compared with .05. So for our example:

Actual            Threshold
p(1) = .001       .05/4 = .0125
p(2) = .022       .05/3 = .017
p(3) = .024       .05/2 = .025
p(4) = .12        .05/1 = .05

Next, start at the bottom of the table and check whether the last value there is smaller than its threshold. It is not in this case: p(4) = .12 is greater than .05, so this test is non-significant. Move up to the next level and check. Here p(3) = .024 is less than .025, so it is significant, and all p-values above it in the table are automatically significant too, whether they exceed their threshold or not. For example, p(2) is significant even though it is higher than its threshold (p(2) = .022 > .017). p(2) would not have been significant if the other tests below it had not been. (Of course, it is a general property of Neyman Pearson testing that the rejection or acceptance of a hypothesis depends on other tests that have no evidential relation to the hypothesis under consideration.) This decision procedure is guaranteed to control familywise error rate at .05 for a set of independent tests, but as you can see it can result in more tests being declared significant than Bonferroni, so it has a lower Type II error rate. With Bonferroni testing, only p(1) would be declared significant. Note it follows from the sequential procedure that if you have a set of k tests and the largest p is still significant at the .05 level, then you can reject all nulls and still control familywise error rate at 5%.
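
To make the bookkeeping concrete, here is a minimal Python sketch of the sequential procedure just described, run on the four p-values from the worked example. The function name and the pure-Python style are purely illustrative; packages such as statsmodels also provide ready-made multiple-comparison corrections if you prefer not to code it yourself.

    def sequential_bonferroni(p_values, alpha=0.05):
        """Step-up sequential Bonferroni for a family of independent tests (illustrative sketch)."""
        k = len(p_values)
        # Indices of the p-values sorted from smallest to largest.
        order = sorted(range(k), key=lambda i: p_values[i])
        significant = [False] * k
        # Start with the largest p-value (compared against alpha) and step up;
        # the smallest p-value is compared against alpha/k.
        for rank in range(k - 1, -1, -1):
            threshold = alpha / (k - rank)
            if p_values[order[rank]] <= threshold:
                # This test and every test with a smaller p-value are significant.
                for r in range(rank + 1):
                    significant[order[r]] = True
                break
        return significant

    # Worked example: p(3) = .024 < .025, so the first three nulls are rejected.
    print(sequential_bonferroni([.024, .001, .12, .022]))   # [True, True, False, True]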

With a group of tests, there is no logical reason why it has to be the familywise error rate one controls. Why control specifically the probability of at least one false rejection of a null? Benjamini and Hochberg (1995) recommended instead controlling the expected proportion of erroneous rejections amongst all rejections, which they called the false discovery rate (FDR). (See Chapter Three p. 63 for why uncorrected individual significance tests do not control this rate!) FDR can be controlled by ranking one's p-values as before from i = 1 . . . k. Then test each one against i*.05/k, and, as for the sequential Bonferroni test described above, find the highest i such that p(i) is below its threshold. That p-value is significant, as are all those smaller than it. So the procedure is the same as for the sequential Bonferroni above except the thresholds are different. For example, consider the four p-values: .001, .022, .031, .12. The thresholds are 1) .05/4 = .0125, as for both standard and sequential Bonferroni procedures; 2) 2*.05/4 = .025; 3) 3*.05/4 = .0375; and 4) 4*.05/4 = .05, as for the sequential Bonferroni. Notice that the thresholds for p(2) and p(3) are higher than with the sequential Bonferroni procedure. That is, controlling FDR produces fewer Type II errors than controlling familywise error. The difference in sensitivity becomes even greater as the number of tests increases, which is why in situations where very large numbers of tests are employed, like brain imaging with fMRI where many brain regions are considered at once, FDR is often used to control long run Type I error rates for groups of tests. With the current example p(3) is below its threshold, so p(1) to p(3) are significant, whereas with the sequential Bonferroni test only p(1) would be significant. You can decide to control FDR rather than familywise error rate in all multiple testing situations.
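
The Benjamini and Hochberg procedure differs from the sketch above only in its thresholds, as this small illustrative Python sketch shows; the p-values are those from the FDR example in the text.

    def benjamini_hochberg(p_values, q=0.05):
        """Benjamini-Hochberg step-up procedure controlling FDR at level q (illustrative sketch)."""
        k = len(p_values)
        order = sorted(range(k), key=lambda i: p_values[i])
        significant = [False] * k
        # Find the largest 1-based rank i with p(i) <= i*q/k;
        # that p-value and all smaller ones are declared significant.
        for rank in range(k - 1, -1, -1):
            if p_values[order[rank]] <= (rank + 1) * q / k:
                for r in range(rank + 1):
                    significant[order[r]] = True
                break
        return significant

    # Worked example: p(3) = .031 < .0375, so the first three nulls are rejected.
    print(benjamini_hochberg([.001, .022, .031, .12]))   # [True, True, True, False]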

In Neyman Pearson statistics you can change the testing procedure according to whether you strongly predicted a direction of an effect in advance. With two-tailed tests one rejects the null if the outcome is extreme in either direction. For example, for a normally distributed z-score, in the long run a true null would produce a z-score greater than 1.96 2.5% of the time and also less than -1.96 2.5% of the time. Thus, one can reject the null if the obtained z-score is greater than 1.96 in absolute value, and hence control alpha at 5%. That is a two-tailed test because you consider both tails of the distribution in setting up your rejection region. If one only considered it plausible there could be a difference greater than zero, one could declare that one will reject the null if z > 1.64 and accept the null otherwise; 5% of the area of a normal lies beyond a z-score of 1.64. This is a classic one-tailed test because only one tail is considered, in this case the positive tail. Deciding to perform a one-tailed test amounts to declaring one would not reject the null no matter how strongly the results came out in the other direction. Would you really just accept the null if you obtained a z of, e.g., -5.6? One response is to define an asymmetric rejection region, but not one as extreme as 0% of the area in one tail and 5% in the other. For example, to allow a sufficiently extreme outcome in the wrong direction to legitimate the conclusion that there is an effect in that direction, one could use 0.1% of the area in the negative tail and 4.9% in the positive tail; then one rejects the null if z is less than -3.09 or greater than 1.65. Or one could use 1% of the area in the negative tail and 4% in the positive tail, and still control overall alpha at 5%. In the latter case, one rejects the null if z is less than -2.33 or greater than 1.75. If you were looking at output for, e.g., t-tests or correlations which gave you two-tailed p-values, halve the given p-value to get the one-tailed area. Thus a displayed two-tailed p-value less than .08 for an effect in the right direction would be significant with the latter rejection rule.
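
If you want to work out the cut-offs for an asymmetric rejection region yourself, the inverse normal function in scipy (or any statistics package) does the job. The sketch below assumes the 1%/4% split discussed above; the values in the comments are approximate, and the z-score fed to the decision function is whatever your test produces.

    from scipy.stats import norm

    # Split a total alpha of 5% unevenly across the two tails: 1% negative, 4% positive.
    neg_tail, pos_tail = 0.01, 0.04
    lower_cut = norm.ppf(neg_tail)        # about -2.33
    upper_cut = norm.ppf(1 - pos_tail)    # about  1.75

    def decision(z):
        return "reject null" if (z < lower_cut or z > upper_cut) else "accept null"

    # Converting a two-tailed p-value from standard output into a one-tailed area:
    two_tailed_p = 0.07
    one_tailed_p = two_tailed_p / 2       # compare with .04 if the effect is in the predicted direction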

2. Controlling Type II errors.

As we discuss in the book, the most common conceptual error in applying classic statistics is ignoring sensitivity. Sensitivity can be determined by power and confidence intervals. To calculate power, it is easy to download and use G*Power. A tutorial on its use is here. Most statistical packages (e.g. SPSS) will report confidence intervals; a site for calculating confidence intervals is here.

Power calculations involve determining the minimally interesting effect size in standard units. For a t-test, effect size can be measured by Cohen's d. For an unrelated t-test, Cohen's d is the difference between the means of the groups divided by their pooled standard deviation, SDp. If the standard deviation in group one is SD1 and in group two is SD2, then (for groups of equal size) SDp is defined by SDp² = 0.5*(SD1² + SD2²). For a related t-test, an equivalent measure (called dz in the related case) is the mean difference between conditions divided by the standard deviation of the difference scores. (For each subject find their difference in scores between the conditions. Make a column of these difference scores. Cohen's dz is the mean of this column divided by its standard deviation.) Note there are other measures of effect size, like correlation coefficients.
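
As a concrete illustration, here is how the two effect size measures could be computed from raw scores in Python. The data are made up purely for the example, and the pooled standard deviation uses the equal-n formula given above.

    import statistics as stats

    # Between-subjects d: difference in group means over the pooled standard deviation.
    group1 = [5.1, 6.2, 4.8, 5.9, 6.4]
    group2 = [4.2, 5.0, 4.6, 4.1, 5.3]
    sd_pooled = (0.5 * (stats.stdev(group1) ** 2 + stats.stdev(group2) ** 2)) ** 0.5
    d = (stats.mean(group1) - stats.mean(group2)) / sd_pooled

    # Within-subjects dz: mean of the difference scores over their standard deviation.
    condition_a = [12.0, 14.5, 11.2, 13.8, 12.9]
    condition_b = [10.9, 13.1, 11.0, 12.2, 12.5]
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    dz = stats.mean(diffs) / stats.stdev(diffs)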

Based on his experience with psychology research at the time, Cohen (1988) gave rough arbitrary criteria for effect sizes, calling a d of 0.2 small, 0.5 medium, and 0.8 large for the between-subjects case. Some papers report effect sizes as a matter of course for each test. If a paper has a between-subjects design, it will generally report means and standard deviations so you can calculate effect size. (If it reports means and standard errors for each group, remember standard deviation = standard error * square root of the number of subjects.) A within-subjects design can be more tricky, because you want the standard deviation of the differences, not the standard deviation of the scores in each condition, and papers generally report only the latter. But if the authors of a paper report a t-test, t = dz * square root of the number of subjects, so you can get dz. From dz you can also recover the standard deviation of the differences, given you know the mean difference.
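
For example, using made-up numbers for illustration, recovering dz and the standard deviation of the differences from a reported paired t-test looks like this:

    import math

    # Hypothetical report: paired t(19) = 2.5 with N = 20 subjects and a mean difference of 3.0.
    t, n, mean_diff = 2.5, 20, 3.0
    dz = t / math.sqrt(n)        # because t = dz * sqrt(N)
    sd_diff = mean_diff / dz     # standard deviation of the difference scores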

The main point in looking at other papers and their effect sizes is to help give you an idea of the typical standard deviations - subject variability - in a certain domain of investigation and the sort of mean differences various manipulations produce. These facts should feed your intuitions concerning what size of differences can be expected for other manipulations, and what size of differences can be expected on different theories. There is a danger that a measure of effect size, like Cohen's d, will be used to bypass deep thinking. You could follow a mechanical procedure: Set up power to detect a medium effect size, and then follow this rule for whatever experiment you run. While such a rule is a step up from the current mechanical procedure used by many (in which power is ignored altogether), it is a short cut that should be used only in the genuine absence of good prior information. What you really should be doing is getting to know your literature so that you can make an informed estimate of the sort of mean difference and variability you could expect given a theory under test and a type of manipulation.

Power is something you decide on before running an experiment in order to determine subject number. Some statistical programs give automatic power calculations along with significance tests. These power calculations are worthless. They determine a power to detect the effect size actually measured, and this is a straightforward function of the p value. You get no information from these calculations. Power should be calculated for the effect size you are interested in detecting. Further, it is a property of your decision procedure, which is decided in advance. Baguley (2004) is a good summary of power and its misuse.
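
If you would rather script the calculation than use G*Power, the statsmodels package (assuming you have it installed) can solve for the number of subjects needed to detect a given minimally interesting effect size; the d and dz of 0.5 below are just placeholders for whatever value your reading of the literature justifies.

    from statsmodels.stats.power import TTestIndPower, TTestPower

    # Subjects per group for a two-tailed independent-groups t-test to detect d = 0.5
    # with alpha = .05 and power = .80 (i.e. beta = .20).
    n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                              alternative='two-sided')

    # Number of subjects for a within-subjects (paired) design to detect dz = 0.5.
    n_paired = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                        alternative='two-sided')

    print(round(n_per_group), round(n_paired))   # roughly 64 per group, 34 in total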

The best way to determine the sensitivity of your experiment after you have collected data is with confidence intervals. They tell you the set of values you reject as possible population means and the set you still hold under consideration. In fact, Chapter Three (page 73) describes ways you can use the confidence interval to decide when to stop running subjects while controlling error rates. Confidence intervals naturally allow you to assess if the effects allowed by the data are larger than some minimal interesting amount.

Rather than contrasting two specific values against each other as if they were the only possible values under consideration (the logic of hypothesis testing, which requires at least one specific value), confidence intervals reject and accept whole intervals of possible values, which in general makes more sense. Geoff Cumming provides free software for calculating confidence intervals.
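
A confidence interval for a mean (or mean difference) is also easy to compute in a few lines of code. The sketch below uses scipy's t distribution and made-up difference scores purely for illustration.

    from scipy import stats

    # 95% confidence interval for the mean of a set of difference scores.
    diffs = [2.1, -0.4, 3.5, 1.2, 0.8, 2.9, -1.1, 1.7]
    n = len(diffs)
    mean = sum(diffs) / n
    se = stats.tstd(diffs) / n ** 0.5          # standard error of the mean
    t_crit = stats.t.ppf(0.975, df=n - 1)      # two-tailed 95% critical value
    ci_lower, ci_upper = mean - t_crit * se, mean + t_crit * se

    # If the whole interval lies below the minimally interesting effect size,
    # the data are sensitive enough to count against the theory that predicted it.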

 

See also a collection of quotes by Ronald Fisher, and Fisher's 1925 book online.

For tutors: a lecture on Neyman Pearson concepts I gave our undergraduates this year, as an introduction to Chapter Three of the book. Please feel free to adapt it for your own purposes. I ask students to discuss each question with the person sitting next to them. The lecture lasts about an hour.

An essay I set students is the following: " For a paper published this year, consider the extent to which the authors strictly followed the demands of Neyman-Pearson hypothesis testing. Discuss whether or not substantial conclusions drawn from the data were compromised by either not adhering to the Neyman-Pearson approach, or adhering to it too strictly."

Guidance: "Did the authors set out their alpha and beta error rates in advance? Authors in general can be taken to be using an alpha rate of 5% by default, but authors rarely state their acceptable beta rate. Did they indicate what size of effect they would expect if their theory were true - and then use it to determine power? Is it clear what stopping rule was used? Widely varying sample sizes for effects just reaching significance may indicate numbers were topped up in some experiments to get the results significant. Was power (or confidence intervals) taken into account in interpreting null results? If a set of tests was addressing a single theme, was there a correction for multiple testing? It is somewhat arbitrary whether a set of tests counts as a "family" for which one should control familywise error rate, but at least when a set of tests is conducted where any one of them being significant means a "yes" answer to a single question (e.g. "Does the drug work at any time point?"), then the set is a family. Other mistakes include using p values to measure or compare sizes of effects, or interpreting p values to mean the probability of hypotheses."

See also this assessment of several topics from the book.