This page provides a few useful resources and web links for
understanding and using standard (i.e. Neyman Pearson) statistics.
You should read Chapter Three of *Understanding psychology as a
science* first as an introduction to the issues. This webpage
introduces some further technical details so you can apply the ideas
to research more easily.

To test your intuitions concerning Bayesian versus Orthodox statistics try this quiz.

For more explication: Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? *Perspectives on Psychological Science, 6*(3), 274-290.

The aim in traditional (Neyman Pearson) statistics is to use decision procedures with known controlled long term error rates for accepting and rejecting hypotheses. Two hypotheses are formulated to pit against each other: the null and the alternative. The outcome of the decision procedure is to accept one hypothesis and reject the other. You can be in error by rejecting the null when it is actually true (Type I error) or accepting the null when it is actually false (Type II error). The error rates for both types of error must be controlled.

Students are in general taught about controlling Type I error rates: Given the normal training of psychologists, it may seem the whole objective of any statistical procedure is to tell you whether p < .05 or not. It is this sentiment that has led to some of the frequent criticisms of significance testing. (See also Rozeboom's classic "The fallacy of the null-hypothesis significance test".) To some extent these criticisms reflect the practice of *only* trying to control Type I errors and not employing the rest of the Neyman-Pearson logic. Of course, many criticisms are not about the misuse of Neyman-Pearson, but its very conceptual basis, as we discuss in the book.

When more than one significance test is conducted, the question arises as to the long-term error rate of the group ('family') of tests as a whole. The familywise error rate is the probability of falsely rejecting at least one null. Bonferroni is a generic way of controlling familywise error rate: If your family consists of k tests, use .05/k as the significance level for each individual test. For example, if you were conducting three t-tests, you could reject any individual null only if its p < .05/3 = .017. With such a decision procedure, you would reject one or more of three true nulls no more than 5% of the time in the long run.

Bonferroni controls familywise error rate to be no more than .05, but it does so more severely than needed: Type II error rates are higher than necessary. A decision procedure that controls familywise error just as well without increasing Type II error as much as standard Bonferroni is the following sequential Bonferroni procedure, which can be used for a family of independent significance tests (whether t-tests, correlations, chi-squares, etc.). Take your k p-values and order them from smallest to largest. For example, if your p-values were .024, .001, .12, and .022 (so k = 4 in this example), you would order them:

p(1) = .001

p(2) = .022

p(3) = .024

p(4) = .12

Next, construct a threshold for each p-value: p(1)'s threshold is .05/k, p(2)'s is .05/(k-1), p(3)'s is .05/(k-2), and so on, with p(k) compared against .05. So for our example:

Actual           Threshold

p(1) = .001      .05/4 = .0125

p(2) = .022      .05/3 = .017

p(3) = .024      .05/2 = .025

p(4) = .12       .05/1 = .05

Next, start at the bottom of the table and check whether the last value there is smaller than its threshold. Here it is not: p(4) = .12 is greater than .05, so this test is non-significant. Move up to the next level and check. Here p(3) = .024 is less than .025, so it is significant, and *all p-values above it in the table are automatically significant too*, whether they exceed their threshold or not. For example, p(2) is significant even though it is higher than its threshold (p(2) = .022 > .017). p(2) would not have been significant if the other tests below it had not been. (Of course, it is a general property of Neyman-Pearson testing that the rejection or acceptance of a hypothesis depends on other tests of no evidential relation to the hypothesis under consideration.) This decision procedure is guaranteed to control familywise error rate at .05 for a set of independent tests, but as you can see it can result in more tests being declared significant than Bonferroni, so it has a lower Type II error rate. With Bonferroni testing, only p(1) would be declared significant. Note it follows from the sequential procedure that if you have a set of k tests and the largest p is significant at the .05 level, then you can reject all the nulls and still control familywise error rate at 5%.
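As a sketch, the sequential procedure just described can be coded directly (the function name and return format are my own; it reports, for each p-value in the order given, whether its null is rejected):

```python
def sequential_bonferroni(p_values, alpha=0.05):
    """Sequential Bonferroni as described above: order the p-values from
    smallest to largest, then, starting at the bottom of the table, find
    the largest p(i) below its threshold alpha/(k - i + 1); that p-value
    and all smaller ones are significant."""
    k = len(p_values)
    ordered = sorted(p_values)
    largest_sig = None
    for i in range(k, 0, -1):              # start at the bottom of the table
        if ordered[i - 1] <= alpha / (k - i + 1):
            largest_sig = ordered[i - 1]
            break
    return [largest_sig is not None and p <= largest_sig for p in p_values]
```

With the worked example, `sequential_bonferroni([.024, .001, .12, .022])` rejects every null except the one with p = .12.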

With a group of tests, there is no logical reason why it has to be familywise error rate one controls. Why control specifically the *probability of at least one false rejection of a null*? Benjamini and Hochberg (1995) recommended instead controlling the *expected proportion of erroneous rejections amongst all rejections*, which they called the false discovery rate (FDR). (See Chapter Three p. 63 for why uncorrected individual significance tests do not control this rate!) FDR can be controlled by ranking one's p-values as before from i = 1 . . . k. Then test each one against i*.05/k, and, as for the sequential Bonferroni test described above, find the highest i such that p(i) is below its threshold. That p-value is significant, as are all those smaller than it. So the procedure is the same as for the sequential Bonferroni above except the thresholds are different. For example, consider the four p-values: .001, .022, .031, .12. The thresholds are 1) .05/4, as for both the standard and sequential Bonferroni procedures; 2) 2*.05/4 = .025; 3) 3*.05/4 = .0375; and 4) 4*.05/4 = .05, as for the sequential Bonferroni. Notice that the thresholds for p(2) and p(3) are higher than with the sequential Bonferroni procedure. That is, controlling FDR produces fewer Type II errors than controlling familywise error. The difference in sensitivity becomes even greater as the number of tests increases, which is why in situations where very large numbers of tests are employed, like brain imaging with fMRI where many brain regions are considered at once, FDR is often used to control long-run Type I error rates for groups of tests. With the current example, p(3) is below its threshold, so p(1) to p(3) are significant, whereas with the sequential Bonferroni test only p(1) would be significant. You can decide to control FDR rather than familywise error rate in all multiple testing situations.
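The same step-up logic with the Benjamini and Hochberg thresholds i*alpha/k might be sketched as follows (the function name is again illustrative):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery
    rate: rank the p-values, compare p(i) against i*alpha/k, and declare
    significant the largest p(i) below its threshold together with all
    smaller p-values."""
    k = len(p_values)
    ordered = sorted(p_values)
    largest_sig = None
    for i in range(k, 0, -1):              # start from the largest p-value
        if ordered[i - 1] <= i * alpha / k:
            largest_sig = ordered[i - 1]
            break
    return [largest_sig is not None and p <= largest_sig for p in p_values]
```

With the example in the text, `benjamini_hochberg([.001, .022, .031, .12])` declares the first three p-values significant and the last not.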

In Neyman-Pearson statistics you can change the testing procedure according to whether you strongly predicted the direction of an effect in advance. With two-tailed tests one rejects the null if the outcome is extreme in either direction. For example, for a normally distributed z-score, a true null would in the long run produce a z-score greater than 1.96 2.5% of the time and less than -1.96 2.5% of the time. Thus, one can reject the null if the obtained z-score is greater than 1.96 in magnitude, whatever its sign, and hence control alpha at 5%. That is a two-tailed test because you consider both tails of the distribution in setting up your rejection region. If one only considered it plausible that there could be a difference greater than zero, one could declare that the null will be rejected if z > 1.64 and accepted otherwise; 5% of the area of a normal lies beyond a z-score of 1.64. This is a classic one-tailed test because only one tail is considered, in this case the positive tail. Deciding to perform a one-tailed test amounts to declaring that one would not reject the null no matter how strongly the results came out in the other direction. Would you really just accept the null if you obtained a z of, e.g., -5.6? One response is to define an asymmetric rejection region, but not one as extreme as 0% of the area in one tail and 5% in the other. For example, to allow a sufficiently extreme outcome in the wrong direction to legitimate the conclusion that there is an effect in that direction, one could put 0.1% of the area in the negative tail and 4.9% in the positive tail; then one rejects the null if z is less than -3.09 or greater than 1.65. Or one could put 1% of the area in the negative tail and 4% in the positive tail, and still control overall alpha at 5%. In the latter case, one rejects the null if z is less than -2.33 or greater than 1.75. If you are looking at output for, e.g., t-tests or correlations which gives you two-tailed p-values, halve the given p-value to get the one-tailed area. Thus a displayed two-tailed p-value less than .08 for an effect in the right direction would be significant with the latter rejection rule.
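One way to sketch such an asymmetric rejection region, using Python's standard-library normal distribution (the function name and its default tail areas, 1% negative and 4% positive, are just for illustration):

```python
from statistics import NormalDist

def asymmetric_rejection(z, alpha_neg=0.01, alpha_pos=0.04):
    """Reject the null if z falls in an asymmetric rejection region that
    puts alpha_neg of the area in the negative tail and alpha_pos in the
    positive tail (overall alpha = alpha_neg + alpha_pos)."""
    nd = NormalDist()
    lower = nd.inv_cdf(alpha_neg)       # about -2.33 for 1% in the negative tail
    upper = nd.inv_cdf(1 - alpha_pos)   # about  1.75 for 4% in the positive tail
    return z < lower or z > upper
```

So a z of -5.6 would now be declared significant (in the unpredicted direction), while a z of -2.0 would not.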

As we discuss in the book, the most common conceptual error in applying classic statistics is ignoring sensitivity. Sensitivity can be determined by power and confidence intervals. To calculate power, it is easy to download and use G*Power. A tutorial on its use is here. Most statistical packages (e.g. SPSS) will report confidence intervals; a site for calculating confidence intervals is here.

Power calculations involve determining the minimally interesting effect size in standard units. For a t-test, effect size can be measured by *Cohen's d*. For an unrelated t-test, Cohen's d is the difference between the means of the groups divided by their pooled standard deviation, SDp. If the standard deviation in group one is SD1 and in group two is SD2, then SDp is defined by SDp squared = 0.5*(SD1 squared + SD2 squared). For a related t-test, an equivalent measure (called dz in the related case) is the mean difference between conditions divided by the standard deviation of the difference scores. (For each subject find their difference in scores between the conditions. Make a column of these difference scores. Cohen's dz is the mean of this column divided by its standard deviation.) Note there are other measures of effect size, like correlation coefficients.
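As a sketch, the two effect-size measures just defined (using the equal-weight pooled standard deviation given above) could be computed as:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(group1, group2):
    """Cohen's d for an unrelated (between-subjects) t-test: difference in
    group means divided by the pooled standard deviation, where
    SDp squared = 0.5*(SD1 squared + SD2 squared), as in the text."""
    sd_pooled = sqrt(0.5 * (stdev(group1) ** 2 + stdev(group2) ** 2))
    return (mean(group1) - mean(group2)) / sd_pooled

def cohens_dz(cond1, cond2):
    """Cohen's dz for a related (within-subjects) t-test: mean of the
    per-subject difference scores divided by their standard deviation."""
    diffs = [a - b for a, b in zip(cond1, cond2)]
    return mean(diffs) / stdev(diffs)
```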

Based on his experience with psychology research at the time, Cohen (1988) gave rough arbitrary criteria for effect sizes, calling a d of 0.2 small, 0.5 medium, and 0.8 large for the between-subjects case. Some papers report effect sizes as a matter of course for each test. If a paper has a between-subjects design, it will generally report means and standard deviations so you can calculate effect size. (If it reports means and standard errors for each group, remember standard deviation = standard error * square root of the number of subjects.) A within-subjects design can be more tricky, because you want the standard deviation of the differences, not the standard deviation of the scores in each condition, and papers generally report only the latter. But if the authors of a paper report a t-test, t = dz * square root of the number of subjects, so you can get dz. From dz you can also recover the standard deviation of the differences, given you know the mean difference.
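The recovery of dz (and of the standard deviation of the differences) from reported statistics described above amounts to two one-line formulas (function names are my own):

```python
from math import sqrt

def dz_from_t(t, n):
    """Recover Cohen's dz from a reported related-samples t and the number
    of subjects n: since t = dz * sqrt(n), dz = t / sqrt(n)."""
    return t / sqrt(n)

def sd_diff_from_dz(mean_diff, dz):
    """Recover the standard deviation of the difference scores from the
    mean difference and dz: since dz = mean_diff / SD_diff,
    SD_diff = mean_diff / dz."""
    return mean_diff / dz
```

For example, a reported t of 3.0 with 9 subjects implies dz = 1.0.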

The main point in looking at other papers and their effect size is help give you an idea of the typical standard deviations - subject variability - in a certain domain of investigation and the sort of mean differences various manipulations produce. These facts should feed your intuitions concerning what size of differences can be expected for other manipulations, and what size of differences can be expected on different theories. There is a danger that a measure of effect size, like Cohen's d, will be used to bypass deep thinking. You could follow a mechanical procedure: Set up power to detect a medium effect size, and then follow this rule for whatever experiment you run. While such a rule is a step up from current mechanical procedure used by many (in which power is ignored altogether), it is a short cut that should be used only in the genuine absence of good prior information. What you really should be doing is getting to know your literature so that you can make an informed estimate of the sort of mean difference and variability you could expect given a theory under test and a type of manipulation.

Power is something you decide on before running an experiment in
order to determine subject number. Some statistical programs give
automatic power calculations along with significance tests. These
power calculations are worthless. They determine a power to detect
the effect size actually measured, and this is a straightforward
function of the p value. You get no information from these
calculations. Power should be calculated for the effect size you are
*interested* in detecting. Further, it is a property of your
decision procedure, which is decided in advance. Baguley
(2004) is a good summary of power and its misuse.
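To illustrate the right kind of calculation, here is a priori power for the simplest case, a one-tailed one-sample z-test, computed for the effect size you are *interested* in detecting. This is a simplified normal-theory sketch; dedicated tools like G*Power handle t-tests and other designs properly.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sample_z(d, n, alpha=0.05):
    """A priori power of a one-tailed one-sample z-test to detect a
    standardized effect size d with n subjects: the probability that z
    exceeds the critical value when the true effect is d."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha)          # about 1.64 for alpha = .05
    return 1 - nd.cdf(z_crit - d * sqrt(n))
```

For example, with d = 0.5 and n = 35, power comes out at roughly .9; such calculations, done in advance, tell you how many subjects you need.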

The best way to determine the sensitivity of your experiment *after*
you have collected data is with confidence intervals. They tell you
the set of values you reject as possible population means and the set
you still hold under consideration. In fact, Chapter Three (page 73)
describes ways you can use the confidence interval to decide when to
stop running subjects while controlling error rates. Confidence
intervals naturally allow you to assess if the effects allowed by the
data are larger than some minimal interesting amount.

Rather than contrasting two specific values against each other as if they were the only possible values under consideration (the logic of hypothesis testing, which requires at least one specific value), confidence intervals reject and accept whole intervals of possible values, which in general makes more sense. Geoff Cumming provides free software for calculating confidence intervals.
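As a sketch of the underlying arithmetic, here is a normal-approximation confidence interval for a mean; it is adequate for large samples, while for small samples a t-based interval (as dedicated software provides) is preferable.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(scores, level=0.95):
    """Normal-approximation confidence interval for a mean: sample mean
    plus or minus the normal quantile times the standard error."""
    n = len(scores)
    se = stdev(scores) / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    m = mean(scores)
    return (m - z * se, m + z * se)
```

Population means outside the interval are rejected; those inside remain under consideration.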

See also a collection of quotes by Ronald Fisher, and Fisher's 1925 book online.

**For tutors**: A lecture on Neyman-Pearson concepts I gave our undergraduates this year, as an introduction to chapter three of the book. Please feel free to adapt it for your own purposes. I ask students to discuss each question with the person sitting next to them. The lecture lasts about an hour.

An **essay** I set students is the following: "
For a paper published this year, consider the extent to which the
authors strictly followed the demands of Neyman-Pearson hypothesis
testing. Discuss whether or not substantial conclusions drawn from
the data were compromised by either not adhering to the
Neyman-Pearson approach, or adhering to it too strictly."

Guidance: "Did the authors set out their alpha and beta error rates in advance? Authors in general can be taken to be using an alpha rate of 5% by default, but authors rarely state their acceptable beta rate. Did they indicate what size of effect they would expect if their theory were true, and then use it to determine power? Is it clear what stopping rule was used? Widely varying sample sizes for effects just reaching significance may indicate numbers were topped up in some experiments to get the results significant. Was power (or were confidence intervals) taken into account in interpreting null results? If a set of tests addressed a single theme, was there a correction for multiple testing? It is somewhat arbitrary whether a set of tests counts as a "family" for which one should control familywise error rate, but at least when a set of tests is conducted where any one of them being significant means a "yes" answer to a single question (e.g. "Does the drug work at any time point?"), then the set is a family. Other mistakes include using p values to measure or compare sizes of effects, or interpreting p values as the probability of hypotheses."

See also this assessment of several topics from the book.