FCS8: Statistical analysis of experiments

David Young, October 1997

This teach file gives an introduction to some techniques relevant to the analysis of the results of experiments on non-deterministic systems.

Contents

    Introduction
    Descriptive statistics
        Graphs
        Simple numerical statistics and correlation
        The histogram
    Hypothesis testing
        Basic framework
        A simple example
        General methodology
        Another example
        Combining significance levels
        Problems with hypothesis testing

Introduction

In traditional AI, it has been common for researchers to make their points by building systems that illustrate particular techniques or demonstrate particular competences. In some ways, this is rather like the approach of an engineer, in that the production of an object that performs a given task within given resources is sufficient to show an advance in the state of his or her art.

Increasingly, though, there is a kind of investigation that demands a different approach, more like that of a behavioural scientist. This occurs particularly when systems cease to be transparent; it is not enough to build such a system: its properties must also be explored. In addition, simulations in artificial life and evolutionary systems use random numbers to mimic environmental variability, whilst robotic systems that interact with the real world are subject to the genuine thing. Characterising systems that are subject to variability calls for the use of statistical methods.

The use of statistics applies the theory of probability (see FCS7) to the description of processes which are subject to random variation. Various kinds of descriptive statistics are useful in general exploration of a system, and are the main method of trying to obtain some degree of understanding of it. More formal methods of statistical inference are used to draw quantitative conclusions or to attempt to determine specific properties of a model of the process.

Here I mention a few techniques of descriptive statistics, and discuss one particular approach to statistical inference, known as hypothesis testing.

Books aimed at psychologists are probably the most useful for an initial understanding of this material, and for practical help in applying it. Two that are widely used are "Learning to use statistical tests in psychology", by J. Greene and M. D'Oliveira (Open U.P., 1982, in the library at QZ 210 Gre), and "Nonparametric statistics for the behavioural sciences", 2nd edition, by S. Siegel and N.J. Castellan (McGraw Hill, 1988, in the library at QD 8320 Sie).

Descriptive statistics

Graphs

When looking at results from an experiment, the first set of tools to turn to are graphical ones. Tools such as Matlab, Maple and AVS provide a wide variety of ways to display data graphically. The area is too large and complex to discuss here, and methods such as 2-D and 3-D graphs and bar charts are probably familiar already. The main point is that time spent producing graphical output is usually well spent, but that when data have multiple dimensions, it can be difficult to find the appropriate combinations to display. It is essential to spend time finding the right way to display data in order to reveal relationships which may be present.

Simple numerical statistics and correlation

The mean and standard deviation of a set of data have been discussed in FCS7. Calculating these statistics for results obtained when an experiment is repeated is often the first step in gaining a clear view of what is going on. The mean gives a measure of the location of the centre of some numerical data; the standard deviation gives a measure of its spread.

An additional descriptive statistic, not introduced in FCS7, is the correlation coefficient between two sets of data, which can be used when the individual data values can be paired off between the two sets. This gives a measure of whether the two random variables being sampled vary together or are independent. For instance, in an experiment involving a simulated visual system, it may be interesting to look at whether the time to pick out some target varies with the number of distracting objects in the field of view.

If two random variables X and Y are being sampled, the correlation coefficient is defined as

    r  =  < (X - <X>) * (Y - <Y>) > / sqrt (Var(X) * Var(Y))

That is, it is the average of the products of the deviations of the variables from their means, normalised using the variances. It lies between -1 and +1, and either -1 or +1 means that there is a perfect linear relationship between the two variables, whilst 0 corresponds to no linear relationship between the two.
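
As a concrete illustration, the calculation can be sketched in a few lines of Python; the data values below are invented purely to show the mechanics:

    import math

    def correlation(xs, ys):
        # Sample correlation coefficient r between two paired data sets,
        # following the definition above.
        n = len(xs)
        mx = sum(xs) / n                      # <X>
        my = sum(ys) / n                      # <Y>
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
        var_x = sum((x - mx) ** 2 for x in xs) / n
        var_y = sum((y - my) ** 2 for y in ys) / n
        return cov / math.sqrt(var_x * var_y)

    # Invented data: search time (s) against number of distracting objects
    distractors = [1, 2, 4, 8, 16, 32]
    search_time = [0.21, 0.25, 0.24, 0.35, 0.41, 0.60]
    print(correlation(distractors, search_time))   # close to +1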

The histogram

The mean, standard deviation and other similar measures provide some indication of the distribution of a variable (such as the fitness of a population) which is being measured. A graphical way of looking at the distribution generally is to use the histogram of the values found.

The simplest way to produce a histogram is to create, in effect, a set of bins covering the range of values of the variable. Each bin initially contains the value zero. After each trial, the value of the variable being measured is used to pick out a bin, and the value held in the bin is incremented. For instance, in a simple case, a measure might range from 0 to 99. We create 10 bins, covering the ranges 0-9, 10-19, 20-29 and so on. If a trial yields the value 63, we increment the 60-69 bin, and so on. After a large enough number of trials, the values in the bins will be an approximation to the underlying probability distribution of the variable.
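
In Python, the binning procedure just described might look like this (the trial results are invented for illustration):

    def histogram(values, n_bins=10, low=0, high=100):
        # Count values into equal-width bins covering the range [low, high).
        bins = [0] * n_bins
        width = (high - low) / n_bins
        for v in values:
            index = int((v - low) / width)    # e.g. 63 falls in the 60-69 bin
            if 0 <= index < n_bins:
                bins[index] += 1
        return bins

    # Invented results of 20 trials, each measured on a 0 to 99 scale
    results = [63, 12, 45, 88, 63, 70, 5, 34, 66, 59,
               71, 23, 64, 90, 58, 61, 39, 77, 68, 49]
    counts = histogram(results)
    estimates = [c / len(results) for c in counts]   # rough probability estimates
    print(counts)
    print(estimates)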

There is a trade-off between the number of bins and the accuracy of the probability estimate each one holds. A lot of bins gives a narrow range of values for each, giving a higher accuracy on the position of any feature of the distribution, but lower accuracy on the probability estimates because fewer votes will be cast for each bin.

There are more sophisticated ways of generating histograms which do not involve discrete bins. (Treating the data as a set of delta functions and convolving this with a smoothing kernel is one such method.) All of them, however, involve essentially the same trade-off.
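
A minimal sketch of the kernel idea, assuming a Gaussian kernel whose width h plays the same role as the bin width:

    import math

    def smoothed_density(data, h, points):
        # Estimate the density at each point by placing a Gaussian kernel of
        # width h on every data value and summing the contributions.
        def kernel(u):
            return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
        n = len(data)
        return [sum(kernel((p - x) / h) for x in data) / (n * h) for p in points]

    # A larger h gives a smoother curve but hides detail - the same trade-off
    # as choosing wider bins.
    data = [63, 12, 45, 88, 63, 70, 5, 34, 66, 59]
    print(smoothed_density(data, h=10, points=[0, 25, 50, 75, 100]))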

Looking at graphs, descriptive numerical statistics, and histograms are all important ways of understanding a system. More formal methods are also sometimes called for, particularly in the context of statistical variability.

Hypothesis testing

Basic framework

Suppose you run a simulation and measure some outcome - say the average level of fitness in a population after a certain number of generations, or the number of times a robot succeeds in reaching its goal. You then make some adjustment, perhaps by varying a parameter of the simulation such as the mutation rate of a genetic algorithm or the rate of learning of a neural network, and repeat the simulation. If the outcome changes, how can you say whether this was a result of the adjustment you made, or simply a random fluctuation which might have been expected to occur regardless?

This kind of question is at the heart of the dominant statistical methodology of the behavioural, social and medical sciences. The question of whether a new drug has an effect on the outcome of a particular disease is, for example, a crucial one in medicine.

The method generally used is called hypothesis testing. The approach is to ask whether it is reasonable to attribute any differences observed to random fluctuations, assuming that the manipulation (application of the drug, change of the mutation rate, or whatever) has no effect. If the changes are too big for this to be reasonable, then the experiment is taken as evidence for a real effect. The reason for doing it this way round is that if there is no effect, then it is possible to calculate the probabilities associated with the measurements, and see how unlikely they are.

Some terminology is needed to set this up formally.

The null hypothesis, denoted by H0 (the 0 should be a subscript) is the hypothesis that the differences in conditions between the two runs of an experiment have no effect. The alternative hypothesis, H1, is that H0 is false, i.e. there is an effect of the manipulation. If we decide that the experiment shows an effect when in fact there is none, we have made a Type I error. Conversely, if we decide that there is no evidence for an effect, when in fact one exists, we have made a Type II error.

Usually, the differences between the experimental results are summarised in a single statistic. This might be something like the change in the success rate of the robot. We then calculate the probability, assuming the null hypothesis, of getting either the observed value of the statistic, or a more extreme value. This probability is always given the symbol P, and is known as a significance level. If P is low, then the result we have is unlikely under the null hypothesis.

A simple example

Suppose you conduct an experiment in which you run a simulation of a system, setting your pseudo-random number generator to a particular seed before you start, and using value a for some parameter you are interested in. You then change the parameter to b, reset the random number generator to the same seed as before, and rerun the experiment. You then look to see whether the performance is better or worse than it was before. You then repeat the pair of tests some number of times - say 10 - recording for each pair whether performance increased or decreased when the parameter was changed from a to b. Different random numbers are used in each pair of tests.

Suppose the performance gets better on 8 trials out of 10, and worse on 2 trials. How likely is this under the null hypothesis that changing the parameter from a to b produces no improvement in the performance? Is the overall improvement attributable to the change in the parameter?

The null hypothesis says that changing the parameter has no effect, so the performance is equally likely to get better or worse; each trial is like tossing a coin. In this case, there are 2^10 = 1024 different equally likely ways the experiment can turn out (see FCS7). In one of these, performance will improve on all 10 trials, in 10 of them performance will improve on 9 trials, and in 45 performance will improve on 8 trials (you can check this by enumerating the different cases, or by using the binomial expression if you happen to know it). In other words, there are 1 + 10 + 45 = 56 cases that give the observed result or a better one in the sense of more improvements. If better is interpreted to be "more extreme", then it follows that P = 56/1024 or about 0.055. That is, in about 55 out of 1000 repetitions of the whole sequence, you would expect to get 8 or more improvements, just by random fluctuations in the total.

You may ask whether a result of 2 or fewer improvements out of 10 would not be just as "extreme" as a result of 8 improvements. This depends on whether you simply want to test that the change had an effect of some sort, or whether you want to test that it produced an improvement. If the former, then these other cases would also have to count as "extreme", and the P value would double to 0.11. This is called a two-tailed test. If, however, the alternative hypothesis is that the change specifically produces an improvement, then exceptionally poor results are lumped in with the run-of-the-mill ones, only success rates greater than that observed are counted as more extreme, and the test is called one-tailed.
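
Both the one-tailed and the two-tailed calculations are easy to check by program; a short Python sketch:

    from math import comb

    def sign_test_p(improvements, trials, two_tailed=False):
        # Probability, under the null hypothesis that each trial is a fair
        # coin toss, of a result at least as extreme as the one observed.
        total = 2 ** trials
        upper = sum(comb(trials, k) for k in range(improvements, trials + 1))
        if not two_tailed:
            return upper / total
        # Also count results at least as extreme in the opposite direction.
        lower = sum(comb(trials, k) for k in range(0, trials - improvements + 1))
        return (upper + lower) / total

    print(sign_test_p(8, 10))                   # 56/1024, about 0.055
    print(sign_test_p(8, 10, two_tailed=True))  # about 0.11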

This kind of experimental design, incidentally, is called a related samples design. Within each pair of tests, everything is the same except for the parameter of interest. Thus every single run of the simulation with parameter a has its own control with parameter b. The tests come in matched pairs. An alternative way to do it would be to use new random numbers for every single trial. This is called an independent samples design; there is no natural pairing. One could still apply the test described above to arbitrary pairs, but an effect would be much more likely to be masked by random fluctuations in the results (that is, the test would not be very powerful). In an independent samples design, you would be more likely to adopt a statistical test in which you put the results into different classes before doing any comparisons between the two conditions.

General methodology

Does the result P = 0.055, obtained in the imaginary experiment above, mean that changing the parameter produces an improvement, or not?

The received version of how to answer this is as follows. Before doing the experiment, you decide on a critical value of P. This is called alpha. If P turns out to be less than alpha, you reject the null hypothesis (you accept the existence of an effect). Otherwise, you accept the null hypothesis.

How do you choose alpha? The alpha value is in fact the probability that you will make a Type I error - that you will think you have seen an effect when there isn't one. That is, if you decide to use alpha = 0.05, and you do lots of independent experiments, then if all the null hypotheses are true, on one experiment in 20 you will get P < alpha and you will decide that there is an effect that is not there. This follows directly from the definition of P. So if you do not mind making this kind of error one time in 20, you choose alpha = 0.05; if you want a stricter criterion, you might choose alpha = 0.01.

This approach means that one never has to calculate the exact value of P for a given experiment. What you do is to look up the value of the statistic you are using that would give P = alpha. Then if, when you do the experiment, the statistic is more extreme than this critical value, you reject H0; otherwise you accept H0. More extreme values are said to lie in the critical region or rejection region for the null hypothesis; the rest lie in the acceptance region. For the experiment described above the statistic is the number of times an improvement occurred. In the one-tailed test, values of 9 and 10 lie in the rejection region for alpha = 0.05, but 8 does not quite make it. Eight out of 10 improvements would not allow us to reject the null hypothesis at the 0.05 significance level.
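
For the sign test, the critical value can be computed directly rather than looked up; a sketch for the one-tailed case:

    from math import comb

    def critical_value(trials, alpha=0.05):
        # Smallest number of improvements whose one-tailed P value under the
        # null hypothesis falls below alpha.
        total = 2 ** trials
        for k in range(trials + 1):
            tail = sum(comb(trials, j) for j in range(k, trials + 1))
            if tail / total < alpha:
                return k
        return None

    # For 10 trials, only 9 or 10 improvements lie in the rejection region.
    print(critical_value(10))   # prints 9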

You can picture these regions by drawing a graph of the distribution of a statistic. Suppose for this purpose that the statistic has continuous values. For commonly used statistics, the distribution will have a hump in the middle for the likely values and tail off for extreme values. For a one-tailed test, split the area under the curve into two parts: a part under one tail occupying 5% of the area and a part under the rest occupying 95% of the area. The 5% region represents the rejection region for alpha = 0.05. For a two-tailed test, the 5% has to be split across the two tails of the distribution.
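
If numpy and matplotlib are to hand, such a picture can be drawn by assuming, purely for illustration, that the statistic has a standard Gaussian distribution under H0, in which case the upper 5% tail starts at about 1.645:

    import numpy as np
    import matplotlib.pyplot as plt

    # Standard Gaussian density, with the one-tailed 5% rejection region shaded.
    x = np.linspace(-4, 4, 401)
    pdf = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
    plt.plot(x, pdf)
    plt.fill_between(x, pdf, where=x >= 1.645, alpha=0.5)
    plt.xlabel("value of the statistic")
    plt.ylabel("probability density")
    plt.title("One-tailed rejection region for alpha = 0.05")
    plt.show()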

Knowing the probability of a Type I error is useful, but of course the probability of a Type II error (not seeing an effect that is really there) is useful too. However, estimating this probability is generally quite messy and difficult. Intuitively, the more data you have, the lower the chance of a Type II error ought to be (for a given alpha), and indeed this is the case for any reasonable test. Different tests are compared on their power, which is 1 - P(Type II error), but working this out often involves making more detailed assumptions about the distribution of the data than does the null hypothesis. The reason for this is that to say something about Type II errors means that you have to say something about the distribution of the data when the null hypothesis is false - and that might be much harder to specify precisely.

Essentially, a significance test gives a good measure of the probability that an observed effect has occurred by chance. A low value for P can thus be a reliable indicator that something real is going on. On the other hand, if the null hypothesis is accepted because P is large, there is nothing simple to say about what the chance of a Type II error is. There might be a real effect which is not shown up, either because there is not enough data, or because the statistic used is not a good one for detecting the particular kind of difference that has occurred.

Another example

There are significance tests to cover many different situations. The books mentioned at the head of this file describe many different tests and give guidance on making an appropriate choice. In the event of your needing a test for an experiment, you will need to spend time analysing the nature of the measurements and the experimental design to find the correct one.

The test used above on the experiment with binary outcomes (improvement or non-improvement) is called the sign test. Here, I give one further example of a test to illustrate the general idea. This test is of quite wide applicability; it is used when you want to know if two independent (not matched) sets of data obtained under different conditions differ significantly. For each condition some outcome is observed in a number of trials; the outcome must be measured with a number (strictly speaking, it must be an ordinal measure). We want to know whether the outcome is significantly different in the two conditions. Since we have a number of trials in each condition, we have an indication of what the spread of likely values of the outcome measure is, so it seems reasonable to suppose that information about whether the conditions differ significantly is available without making further assumptions.

One test that will handle this situation is the Kolmogorov-Smirnov two sample test. The statistic that this uses is the maximum difference in the cumulative distributions of the two outcome measures. This is easiest to explain with an example.

Suppose we conduct a series of trials - say 10 - in one condition - say using one kind of crossover operator in a genetic algorithm. In each trial we measure, say, the number of generations to reach a particular state of the population, and get the following results:

    630 890 700 270 500 480 320 950 836 585

We then do the same thing in an independent set of trials (no pairing with the first set) using a different operator. Suppose this gives

    784 456 893 555 678 699 350 821 921 772

Is there a significant difference between these two sets of numbers? To find the statistic for the K-S test, imagine making a cumulative frequency graph for the first data set by counting the number of values that are less than any given value. It looks something like this:

  1.0 |                                                             ----
      |                                                             |
  0.9 |                                                        ------
      |                                                        |
  0.8 |                                                    -----
      |                                                    |
  0.7 |                                         ------------
      |                                         |
  0.6 |                                   -------
      |                                   |
  0.5 |                                ----
      |                                |
  0.4 |                         --------
      |                         |
  0.3 |                        --
      |                        |
  0.2 |          ---------------
      |          |
  0.1 |       ----
      |       |
    0 ------------------------------------------------------------------
       200     300     400     500     600     700     800     900

What this graph means is that, for example, 0.2 of the values are less than 400 (in fact the values 270 and 320), 0.4 of the values are less than 550, 0.7 of the values are less than 750, and so on. Now we superimpose the graph for the other dataset. That gives something like:

  1.0 |                                                           ------
      |                                                           | |
  0.9 |                                                        ------
      |                                                        |
  0.8 |                                                   ------
      |                                                   ||
  0.7 |                                         ------------
      |                                         |      |
  0.6 |                                   -------     --
      |                                   |           |
  0.5 |                                ----     -------
      |                                |    *   |
  0.4 |                         --------      ---
      |                         |             |
  0.3 |                        --   -----------
      |                        |    |
  0.2 |          --------------------
      |          |          |
  0.1 |       ----  ---------
      |       |     |
    0 ------------------------------------------------------------------
       200     300     400     500     600     700     800     900

The statistic needed for the K-S test is the largest vertical difference between the two graphs, which we will call K. You can see by inspecting them that the largest such difference is K = 0.3, near the asterisk, between values of the outcome from 630 to 678. In practice, one would calculate this statistic by ordering the two data sets independently, then comparing the ordering between them thus:

  270 320       480 500   585 630       700           836 890       950
         350 456       555       678 699   772 784 821       893 921

and finding the point in the sequence with the biggest difference in contributions from the two datasets to its left. It is straightforward to write a program to do this, but many packages will do it for you. If there are different numbers of trials in the two conditions, division by the number of trials has to be carried out in counting the fraction of trials to the left of any point in the sequence.
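
A Python sketch of such a program, applied to the two data sets above:

    def ks_statistic(sample1, sample2):
        # Largest vertical difference K between the two cumulative graphs.
        n1, n2 = len(sample1), len(sample2)
        # Merge the ordered values, remembering which data set each came from.
        merged = sorted([(v, 1) for v in sample1] + [(v, 2) for v in sample2])
        count1 = count2 = 0
        k = 0.0
        for value, which in merged:
            if which == 1:
                count1 += 1
            else:
                count2 += 1
            k = max(k, abs(count1 / n1 - count2 / n2))
        return k

    condition_1 = [630, 890, 700, 270, 500, 480, 320, 950, 836, 585]
    condition_2 = [784, 456, 893, 555, 678, 699, 350, 821, 921, 772]
    print(ks_statistic(condition_1, condition_2))   # 0.3

Statistical packages provide the same calculation, together with its significance level; in Python, for instance, scipy supplies it as scipy.stats.ks_2samp.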

The statistic K = 0.3 can then be looked up in tables for the test. Not surprisingly, this turns out to be not significant for 10 trials in each dataset even at alpha = 0.05 - these data do not seem to show any non-random differences using this test. In fact, a difference of K = 0.6 would be needed (for 10 trials in each condition) before the results were significant at the 0.05 level.

The clever thing about this is that the distribution of the statistic K under the null hypothesis (both sets of data come from the same underlying distribution) is known independently of what the distribution of the data actually is. There is no assumption that the data come from a Gaussian distribution, or indeed any other distribution. A test with this property is known as a non-parametric test.

Tests that make assumptions about the distributions are known as parametric tests; typically they assume the distribution is Gaussian. A parametric test that could be applied to these data, if you were willing to make the necessary assumption, is the unrelated t-test, which uses as its statistic the difference in the means of the two data sets, normalised by an estimate of the standard deviation of the data. In general, parametric tests are more powerful and involve simpler calculations; but if the assumption of a Gaussian distribution of the data is incorrect, they can give misleading results.
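
If you were prepared to make the Gaussian assumption, the two sets of trial results above could be compared with an unrelated t-test in a couple of lines using scipy:

    from scipy import stats

    condition_1 = [630, 890, 700, 270, 500, 480, 320, 950, 836, 585]
    condition_2 = [784, 456, 893, 555, 678, 699, 350, 821, 921, 772]

    # Difference in means, normalised by a pooled estimate of the standard
    # deviation; the P value returned is two-tailed.
    t, p = stats.ttest_ind(condition_1, condition_2)
    print(t, p)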

There are numerous other significance tests that can be useful. One important one is the chi-square test, which is useful when some data need to be compared with expected frequencies.
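
For example, scipy's chisquare function compares observed category counts with the frequencies expected under the null hypothesis; the counts below are invented for illustration:

    from scipy import stats

    # Invented example: how often a robot ends up in each of four regions,
    # compared with the equal counts expected if it wanders at random.
    observed = [30, 14, 34, 22]
    expected = [25, 25, 25, 25]
    chi2, p = stats.chisquare(observed, f_exp=expected)
    print(chi2, p)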

Combining significance levels

Sometimes a hypothesis is tested in two experiments which yield independent P values. The best way to combine the results is to find a way of treating the two experiments as one, and finding an overall statistic that can be used in a test of significance. When this is not possible, it can be useful to know how to combine more than one significance level in a sensible way.

In particular, the correct way to combine them is not to take their product, or their maximum or minimum (though all of these are sometimes suggested). If the two significance levels are P1 and P2, the significance level of the two experiments taken together is in fact

    P  =  P1 * P2 * (1 - log(P1 * P2))

where the logarithm is to base e (a natural logarithm). Thus two experiments each significant at 0.05 yield a combined significance level of 0.017. The argument to reach this conclusion depends on a particular definition of "more extreme", to mean combinations of results that would have lower probability under the null hypothesis than the results actually obtained.

The generalisation of this formula to N experiments is

               N-1           r
    P  =  g * SIGMA (-log(g)) / r!
               r=0

where g is the product of the N separate significance levels, and r! is the factorial function of r.
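
The formula is straightforward to implement and to check against the figure of 0.017 quoted above; a Python sketch:

    import math

    def combined_p(p_values):
        # Combine N independent significance levels using the formula above.
        g = math.prod(p_values)
        log_g = math.log(g)                  # natural logarithm
        return g * sum((-log_g) ** r / math.factorial(r)
                       for r in range(len(p_values)))

    print(combined_p([0.05, 0.05]))          # about 0.017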

Problems with hypothesis testing

Hypothesis testing is a respectable and sometimes valuable way to assess the results of experiments. However, it has difficulties.

An important one of these is that if the methodology were taken literally, hypotheses about, say, the effectiveness of a new drug would be accepted or rejected when it was known that there was a definite probability that an error was being made. This problem is exacerbated by the asymmetry in the treatment of the null and alternative hypotheses, which means that probabilities of Type I errors are accurately controlled but probabilities of Type II errors have to be largely guessed.

In practice, the approach is not followed literally: common sense prevails. Rather than setting an alpha in advance and then acting accordingly, most researchers tend to treat the P value obtained for their data as a kind of standardised descriptive statistic. They report these P values, then let others draw their own conclusions; such conclusions will often be that further experiments are needed. The problem then is that there is no standard approach to arriving at a final conclusion: everything remains tentative. Perhaps this is how it should be; but it means that statistical tests are used as a component in a slightly ill-defined mechanism for accumulating evidence, rather than in the tidy cut-and-dried way that their inventors were trying to establish.

The rejection/acceptance paradigm also leads to the problem of biased reporting. Usually, positive results are much more exciting than negative ones, and so it is tempting to use low P values as a criterion for publication of results. By definition, though, a P value below 0.05 will be found in roughly 1 experiment in 20 even when no real effects are present. If this experiment is reported and the others are not, it is clear that the publication will be misleading. This is a serious worry in the medical and psychological literature.

There are alternatives to significance tests. Bayesian techniques can be used to place confidence limits on the values of particular parameters, and an approach using likelihood reasoning has been proposed by A.W.F. Edwards, whose book "Likelihood" (Cambridge U.P., 1972, expanded edition Johns Hopkins U.P. 1992, in library at QD 8000 Edw) contains a trenchant attack on significance testing.

Despite these difficulties, those who seek rigorous analysis of experimental results will often want to see P values, and provided its limitations are borne in mind, the hypothesis testing methodology can be applied in useful and effective ways.

Copyright University of Sussex 1997. All rights reserved.