Research Methods 1: Statistics Problem Sheet 1: (Means, S.D.'s and the Normal Distribution):

 

[These problems can be answered once you have covered the material on means, standard deviations and z-scores].

           

            1. Here are twenty scores on a statistics aptitude test:

 

            15, 7, 23, 19, 4, 10, 13, 17, 2, 30, 14, 17, 22, 15, 12, 18, 4, 18, 13, 21.

 

            (a) Calculate the mean and median for these scores. [Answers: 14.7, 15.0].

 

            (b) Calculate the standard deviation of these scores, in two ways:

                        (i) as a description of this particular sample; [answer: 6.83].

                        (ii) using this sample in order to estimate the standard deviation of the population of scores from which it is presumed to have been taken. [Answer: 7.00].

 

            (c) If the scores were normally distributed, what proportion of the scores would you EXPECT to be within one standard deviation of the mean? [Answer: 68%].

 

            (d) What proportion of scores are ACTUALLY within one standard deviation of the mean (taking the s.d. as being 7.00)? [Answer: 65%].

 

 

            2. An experimenter times how long it takes normal adults to fall asleep during a Disney film.  Here are the times for sixteen adults (measured in seconds):

 

            400, 450, 504, 534, 486, 600, 615, 490, 558, 449, 998, 500, 408, 590, 586, 998.

 

            (a) Draw a graph of the grouped frequency distribution of times, using a class interval width of 50 (starting at 351).

 

            (b) Calculate the mean, median, mode and s.d. (using the n-1 formula for the s.d.) of the scores. [Answers: mean = 572.9, s.d. = 178.5, median = 519.0, mode = 998].

 

            (c) Redo (b), but omitting the two scores of  998. What do you notice? [Answers: mean = 512.1, s.d. = 70.7, median = 502.0, mode = incalculable].

           

           

            3. On a standard test of reading ability,  a group of 173 normal patients have a mean score of 86, with a standard deviation of 6.5. Following a head injury, patient X now scores 75 on the same test.

            (a) Convert 75 into a z-score. What would you conclude about patient X's performance, relative to the normal patients? [Answer: -1.69].

            (b) What proportion of the normal patients would be expected to score 75 or less? [Answer: 0.0455].

            (c) What proportion of the normal patients would be expected to score 86 or more? [Answer: .50]

            (d) What proportion of the normal population would be expected to score between 84 and 92? [Answer: 0.4429]. 

            (e) How MANY patients would be expected to score between 84 and 92? [Answer: 77 patients].

            (f) How MANY patients would be expected to score higher than patient X, i.e. 75 or more? [Answer: 165 patients].


 Research Methods 1: Worked Solutions to Statistics Problem Sheet 1: (Means, S.D.'s and the Normal Distribution):

 

            QUESTION 1:

            Part (a)

            To calculate the mean, add together all of the scores and divide by the number of scores: 294/20 = 14.7. To calculate the median, arrange the scores in numerical order. If there is an odd number of scores, the median is the middle score - i.e., the score for which there are as many scores above it as below it. If there is an even number of scores, the median is the average of the middle two scores. In this case, the two middle scores are 15 and 15, so the median is (15 + 15)/2 = 15.
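
            As a quick check on the hand calculation, here is a minimal Python sketch using the standard library's statistics module. The variable names (scores, q1_mean, q1_median) are chosen for illustration and are not part of the problem sheet.

```python
from statistics import mean, median

# The twenty aptitude-test scores from Question 1.
scores = [15, 7, 23, 19, 4, 10, 13, 17, 2, 30,
          14, 17, 22, 15, 12, 18, 4, 18, 13, 21]

q1_mean = mean(scores)      # sum of the scores divided by the number of scores
q1_median = median(scores)  # with an even n, the average of the two middle scores

print(sum(scores), len(scores))  # the total and n used in the mean
print(q1_mean, q1_median)
```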

 

            Part (b)

            The standard deviation is a measure of the variability of our data: the bigger the s.d., the more the scores are spread out. The s.d. can be used in two ways. It can be used purely as a description of the sample from which it is obtained. However, we often wish to go beyond this: we hope to make extrapolations from our handful of subjects to the whole of humanity. It is almost never possible to work out the actual mean and s.d. of our population, and so we have to make do with estimates of these population measurements; these estimates are made on the basis of our sample.

            The mean of a sample is a good estimate of the mean of the population from which the sample is derived, and so we can use the same formula in both cases. However, this is not the case when it comes to standard deviations: it can be shown that the sample standard deviation tends to underestimate the size of the true population standard deviation if it is used as an estimate of the latter. Therefore, we need to add a small "correction" to the sample s.d. formula in order to make it a better estimate of the population s.d. This correction consists of dividing by the number of scores minus one, rather than dividing by the number of scores. The effect of this bodge is to make the s.d. larger than it would otherwise have been. This is why the sample s.d. (using the "n" formula) is 6.83, but the estimated population s.d. (using the "n-1" formula) is 7.00.
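
            The claim that the "n" formula tends to underestimate the population s.d. can be illustrated with a small simulation. This is only a sketch under assumed settings: the population (normal, mean 50, s.d. 10), the sample size of 5 and the number of samples are all arbitrary choices made for illustration, not values from the problem sheet.

```python
import random
from statistics import pstdev, stdev

random.seed(1)      # fixed seed so that the run is repeatable

TRUE_SD = 10.0      # assumed population s.d. (illustrative)
SAMPLE_SIZE = 5     # small samples make the bias easy to see
N_SAMPLES = 5000

sd_n_values = []    # s.d. using the "n" formula
sd_n1_values = []   # s.d. using the "n-1" formula
for _ in range(N_SAMPLES):
    sample = [random.gauss(50, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    sd_n_values.append(pstdev(sample))   # divides by n
    sd_n1_values.append(stdev(sample))   # divides by n - 1

avg_sd_n = sum(sd_n_values) / N_SAMPLES
avg_sd_n1 = sum(sd_n1_values) / N_SAMPLES

print(f"true s.d.: {TRUE_SD}")
print(f"average 'n' s.d.:   {avg_sd_n:.2f}")
print(f"average 'n-1' s.d.: {avg_sd_n1:.2f}")
```

            On average the "n" version falls further below the true value than the "n-1" version does; with samples this small, even the "n-1" version comes out a little low.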

            To calculate the sample s.d. by hand:

            (i) calculate the mean of the scores (14.7);

            (ii) subtract the mean from each of the scores, to get a set of 20 "remainders";

            (iii) square each of these remainders, to get rid of the minus signs;

            (iv) add up these numbers to get a grand total;

            (v) divide the total obtained in (iv) by either the number of scores or the number of scores minus one (depending on whether you want the sample s.d. or want to use the sample s.d. as an estimate of the population s.d.);

            (vi) you now have the variance; take the square root of this to get the standard deviation.

 

            In detail....

            step (ii)        step (iii)
               0.3              0.09
              -7.7             59.29
               8.3             68.89
               4.3             18.49
             -10.7            114.49
              -4.7             22.09
              -1.7              2.89
               2.3              5.29
             -12.7            161.29
              15.3            234.09
              -0.7              0.49
               2.3              5.29
               7.3             53.29
               0.3              0.09
              -2.7              7.29
               3.3             10.89
             -10.7            114.49
               3.3             10.89
              -1.7              2.89
               6.3             39.69

            step (iv): total of the squared values = 932.20

            step (v): variance = 932.20/20 = 46.61 (using n), or 932.20/19 = 49.06 (using n-1)

            step (vi): s.d. = 6.83 (using n), or 7.00 (using n-1)

            In practice, this would be a very laborious way of calculating an s.d.: instead, use one of the "computational" short-cut methods described in most textbooks, or use Excel, SPSS, or the standard deviation function on your calculator!
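
            The six steps can also be followed literally in a few lines of Python. This is a sketch for checking the hand calculation, with variable names chosen for illustration; the last two lines cross-check the result against the standard library's pstdev ("n" formula) and stdev ("n-1" formula).

```python
import math
from statistics import pstdev, stdev

scores = [15, 7, 23, 19, 4, 10, 13, 17, 2, 30,
          14, 17, 22, 15, 12, 18, 4, 18, 13, 21]

n = len(scores)
mean = sum(scores) / n                    # step (i)
deviations = [x - mean for x in scores]   # step (ii)
squared = [d ** 2 for d in deviations]    # step (iii)
total = sum(squared)                      # step (iv)
var_n = total / n                         # step (v), "n" formula
var_n1 = total / (n - 1)                  # step (v), "n-1" formula
sd_n = math.sqrt(var_n)                   # step (vi)
sd_n1 = math.sqrt(var_n1)

print(f"total = {total:.2f}")
print(f"variance: {var_n:.2f} (n), {var_n1:.2f} (n-1)")
print(f"s.d.:     {sd_n:.2f} (n), {sd_n1:.2f} (n-1)")

# Cross-check against the library functions.
assert math.isclose(sd_n, pstdev(scores))
assert math.isclose(sd_n1, stdev(scores))
```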

 

            Parts (c) and (d)

            You should know that 68% of scores in a normal distribution fall within one standard deviation either side of the mean; 95% fall within two s.d.'s; and 99.7% fall within three s.d.'s. This is a theoretical expectation, however: in practice, the values obtained will usually be slightly different. Working out what proportion of our scores actually fell within one s.d. of the mean is quite easy. Our mean is 14.7, and our s.d. is 7. We want to know how many scores fell within the range of 14.7 - 7 to 14.7 + 7: i.e., how many scores were between 7.7 and 21.7.

            13 of our 20 scores fell within these limits. To express this as a percentage: (13/20)*100 = 65%. This actual figure compares quite favourably with the expected figure of 68%.
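
            The comparison between the expected and actual proportions can be sketched in Python as follows. statistics.NormalDist (Python 3.8 or later) stands in for the table of areas under the normal curve; the variable names are illustrative.

```python
from statistics import NormalDist

scores = [15, 7, 23, 19, 4, 10, 13, 17, 2, 30,
          14, 17, 22, 15, 12, 18, 4, 18, 13, 21]
mean, sd = 14.7, 7.0   # values from parts (a) and (b)

# Theoretical expectation: area of a normal curve within one s.d. of the mean.
standard = NormalDist()   # mean 0, s.d. 1
expected = standard.cdf(1) - standard.cdf(-1)

# Actual proportion: count the scores between mean - sd and mean + sd.
lower, upper = mean - sd, mean + sd
within = [x for x in scores if lower <= x <= upper]
actual = len(within) / len(scores)

print(f"expected: {expected:.1%}")
print(f"actual:   {len(within)} of {len(scores)} = {actual:.1%}")
```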

 

             QUESTION 2.

             Part (a)

            Your grouped frequency distribution should have looked like this:

            score:          frequency:
            351-400:            1
            401-450:            3
            451-500:            3
            501-550:            2
            551-600:            4
            601-650:            1
            651-700:            0
            701-750:            0
            751-800:            0
            801-850:            0
            851-900:            0
            901-950:            0
            951-1000:           2
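
            The grouping can be checked with a short Python sketch. It assumes, as in the table, that each class includes both of its end points (351-400, 401-450 and so on); the names times and counts are illustrative.

```python
# The sixteen falling-asleep times (in seconds) from Question 2.
times = [400, 450, 504, 534, 486, 600, 615, 490,
         558, 449, 998, 500, 408, 590, 586, 998]

WIDTH, START, END = 50, 351, 1000

counts = {}
for lower in range(START, END + 1, WIDTH):
    upper = lower + WIDTH - 1   # e.g. 351-400: both ends are included
    counts[(lower, upper)] = sum(lower <= t <= upper for t in times)

for (lower, upper), frequency in counts.items():
    print(f"{lower}-{upper}: {frequency}")
```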

 

            Part (b)

            To calculate the mean, add up all the scores and divide by the total number of scores:

            9166/16 = 572.875 (572.9, to one decimal place).

            To calculate the standard deviation (using the n-1 formula):

            $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$


            In English, this means:

            (i) subtract the mean of the scores (572.875) from every score, to get a set of difference scores;

            (ii) square each of the difference scores thus obtained;     

            (iii) add together all of these squared difference scores;

            (iv) divide the result of (iii) by the number of scores minus one (i.e., 15 in this case), to get what's called the variance;

            (v) take the square root of the result of (iv), to get the standard deviation. The answer you should have is 178.5.

           

            To calculate the median, arrange all the scores in order of size and take the middle score (if there is an odd number of scores), or the average of the middle two scores (if there is an even number of scores). Here, we have an even number of scores, so we do the latter:

 

             400, 408, 449, 450, 486, 490, 500, 504, 534, 558, 586, 590, 600, 615, 998, 998.

            The median is (504+534)/2 = 519.

            The mode is the most popular score: here the only score which occurs more than once is 998, so that's the mode.
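
            All four statistics for part (b) can be checked with Python's statistics module. This is a sketch with illustrative variable names; stdev uses the n-1 formula, and multimode (Python 3.8 or later) returns a list of the most frequent scores.

```python
from statistics import mean, median, multimode, stdev

times = [400, 450, 504, 534, 486, 600, 615, 490,
         558, 449, 998, 500, 408, 590, 586, 998]

q2_mean = mean(times)        # 9166 / 16
q2_median = median(times)    # average of the 8th and 9th ordered scores
q2_modes = multimode(times)  # list of the most frequent score(s)
q2_sd = stdev(times)         # n-1 formula

print(f"mean = {q2_mean:.1f}, s.d. = {q2_sd:.1f}")
print(f"median = {q2_median}, mode(s) = {q2_modes}")
```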

            Take note of a couple of points about these statistics. Firstly, the mean, median and mode are quite different from each other. This suggests that the data are skewed - that is, they are not symmetrically distributed around the mean. Looking at the frequency distribution shows that this is clearly the case: in this example, low scores predominate, except for the two odd high scores. (We might question why these two high scores have arisen, since they are so different from the rest of the data. Was it because the experimenter timing the subjects also fell asleep during the Disney film? Or do these two high scores come from individuals who were different in some important way from the rest of the subjects, for example in being aesthetically challenged individuals not representative of the normal population?) The two high scores have distorted the mean, median and mode to different extents: they have biased the mean upwards, and the mode is totally unrepresentative of the bulk of the data. One virtue of the median is that it is relatively unaffected by extreme scores.

            Part (c)

            Recalculating the statistics with the two scores of 998 omitted produces a mean of 512.1, an s.d. of 70.7, and a median of 502.0. Things you should have noticed are:

            (i) There is now no mode, since each score occurs equally frequently.

            (ii) The mean is now smaller, since the distorting effect of the two 998 scores has been removed. The s.d. is also much smaller than it was, since the remaining scores are much less spread out.

            (iii) The median is now the average of the middle two scores: (500 + 504)/2 = 502. Removing the two extreme high scores has affected the median much less than it affected the other statistics.

            (iv) The mean and median are still different from each other, but there is less skew than before.
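
            The effect of dropping the two scores of 998 can be seen by recomputing the statistics side by side. The sketch below uses illustrative names (times, trimmed, describe); because every remaining score occurs exactly once, multimode returns all fourteen of them, which is how "no mode" shows up here.

```python
from statistics import mean, median, multimode, stdev

times = [400, 450, 504, 534, 486, 600, 615, 490,
         558, 449, 998, 500, 408, 590, 586, 998]
trimmed = [t for t in times if t != 998]   # omit the two extreme scores


def describe(data):
    """Return (mean, n-1 s.d., median) for a list of scores."""
    return mean(data), stdev(data), median(data)


for label, data in (("all 16 scores", times), ("998s omitted", trimmed)):
    m, s, med = describe(data)
    print(f"{label}: mean = {m:.1f}, s.d. = {s:.1f}, median = {med:.1f}")

# Every remaining score is tied for "most frequent", so there is no mode.
no_single_mode = len(multimode(trimmed)) == len(trimmed)
print("no single mode:", no_single_mode)
```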

 

            QUESTION 3.  

            Part (a)

            The formula for converting a raw score into a z-score is:

            $z = \frac{X - \bar{X}}{s}$

            Here, X is 75, $\bar{X}$ (the mean) is 86, and s is 6.5. So,

            $z = \frac{75 - 86}{6.5} = -1.692$


            Thus the z-score corresponding to 75 is -1.692. Patient X's score is more than one and a half standard deviations below the mean of the normal patients: his reading ability is well below average.
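
            The conversion is a one-line calculation; a minimal Python sketch, with z_score as an illustrative helper name, is:

```python
def z_score(x, mean, sd):
    """Convert a raw score into a z-score: (score - mean) / s.d."""
    return (x - mean) / sd


z = z_score(75, mean=86, sd=6.5)
print(f"z = {z:.3f}")
```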

 

            Part (b) 

            It always helps to draw a rough graph of these kinds of problems, so that you can keep track of what it is you are trying to do:

            [Figure: normal curve with a mean of 86 and s = 6.5; the area below X = 75 is marked "?".]

            We need to know the area under the normal curve which lies beyond the z-score corresponding to patient X's raw score of 75. This area corresponds to the proportion of patients who could be expected to score 75 or less.

            We already have the z-score, from part (a) above; all we need to do is consult a table of z-scores and areas under the normal curve, and look in the column which gives the "area beyond z". Our z-score is -1.692; since the normal curve is symmetrical, the area beyond a z-score of -1.692 is the same as the area beyond a z-score of 1.692: 0.0455. Thus the proportion of patients who would be expected to have scored 75 or less is 0.0455. To express this as a percentage, simply multiply the proportion by 100:  4.55% of patients would be expected to score 75 or less.
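
            If no table is to hand, the same area can be obtained in Python with statistics.NormalDist (Python 3.8 or later). The sketch below gives two versions: one using the z-score rounded to two decimal places, as a printed table would, and one using the unrounded score, which comes out very slightly smaller. The variable names are illustrative.

```python
from statistics import NormalDist

MEAN, SD = 86, 6.5
standard = NormalDist()            # the standard normal curve: mean 0, s.d. 1

# Table-style lookup: round z to two decimal places first.
z_rounded = round((75 - MEAN) / SD, 2)
p_table = standard.cdf(z_rounded)  # area below z, i.e. "area beyond" a negative z

# Unrounded version, working directly with the raw scores.
reading = NormalDist(mu=MEAN, sigma=SD)
p_exact = reading.cdf(75)

print(f"z (rounded) = {z_rounded}")
print(f"proportion scoring 75 or less: {p_table:.4f} (table-style), {p_exact:.4f} (unrounded)")
```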

 

            Part (c)

            This one is easy, and can be worked out without doing any calculations! Since the normal distribution is symmetrical around its mean, 50% of scores would be expected to fall above the mean and 50% below it. 86 just happens to be the same as the mean of the distribution of patients' scores, so 50% (.50, expressed as a proportion) of patients would be expected to score 86 or above.

            [Figure: normal curve centred on 86; the area above 86 is marked "? = 0.5".]


            Part (d)  

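            This one needs to be done in a number of stages. As before, graphing the problem makes it more obvious what it is we are trying to do:

            [Figure: normal curve with 84, 86 and 92 marked on the horizontal axis; the area between 84 and 92 is marked "?".]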

 

            (i) Find the z-score for 84, and then use the "area between the mean and z" column to find the area under the normal curve between 84 and 86.

            (ii) Find the z-score for 92, and then use the "area between the mean and z" column to find the area under the normal curve between  86 and 92.

            (iii) Add together the two areas you have just found: this represents the area between the two extremes of 84 and 92, which in turn corresponds to the proportion of scores falling between 84 and 92.

            The combined area is .4429; in other words, about 44.29% of patients would be expected to score between 84 and 92 on the reading test.
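
            The three stages can be sketched in Python as follows, again using statistics.NormalDist in place of the printed table. The z-scores are rounded to two decimal places to mimic a table lookup; area_mean_to_z is an illustrative helper, not a library function.

```python
from statistics import NormalDist

MEAN, SD = 86, 6.5
standard = NormalDist()   # the standard normal curve: mean 0, s.d. 1


def area_mean_to_z(z):
    """Area under the standard normal curve between the mean and z."""
    return abs(standard.cdf(z) - 0.5)


# Stage (i): z-score for 84 and the area between 84 and 86.
z_84 = round((84 - MEAN) / SD, 2)
area_84_to_86 = area_mean_to_z(z_84)

# Stage (ii): z-score for 92 and the area between 86 and 92.
z_92 = round((92 - MEAN) / SD, 2)
area_86_to_92 = area_mean_to_z(z_92)

# Stage (iii): add the two areas together.
combined = area_84_to_86 + area_86_to_92

print(f"z(84) = {z_84}, area = {area_84_to_86:.4f}")
print(f"z(92) = {z_92}, area = {area_86_to_92:.4f}")
print(f"combined area = {combined:.4f}")
```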

 

            Part (e)

            To get the number of patients expected to obtain scores between 84 and 92, multiply the proportion doing so by the total number of patients. Thus, in this case, .4429 * 173 = 76.6; in other words, approximately 77 patients would be expected to score between 84 and 92.

 

            Part (f)

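            We already know the area beyond the z-score corresponding to a raw score of 75. In this context, "area beyond" means "area below". Since the total area under the normal curve is 1, the area above 75 must correspond to the total area (1) minus the area below 75 (0.0455). Thus the proportion of patients scoring 75 or above must be 1 - 0.0455 = 0.9545: over 95% of the normal patients would be expected to score higher than patient X. To turn this proportion into a number of patients, multiply it by the total number of patients: 0.9545 * 173 = 165.1, i.e. approximately 165 patients.

            [Figure: the area under the normal curve above 75 (marked "?") equals 1, the whole area under the curve, minus the area below 75. The values 75 and 86 are marked on the horizontal axis.]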
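
            The arithmetic for parts (e) and (f) can be checked with a few lines of Python. This sketch takes the proportions found in parts (b) and (d) as given; the variable names are illustrative.

```python
N_PATIENTS = 173

p_between_84_and_92 = 0.4429   # from part (d)
p_75_or_less = 0.0455          # from part (b)

# Part (e): expected number of patients scoring between 84 and 92.
n_between = p_between_84_and_92 * N_PATIENTS

# Part (f): area above 75 = total area (1) minus the area below 75.
p_75_or_more = 1 - p_75_or_less
n_above = p_75_or_more * N_PATIENTS

print(f"part (e): {n_between:.1f}, i.e. about {round(n_between)} patients")
print(f"part (f): {n_above:.1f}, i.e. about {round(n_above)} patients")
```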