De Cremer, D., Pillutla, M. M., & Folmer, C. R. (2011). How Important Is an Apology to You? Forecasting Errors in Evaluating the Value of Apologies. Psychological Science, 22(1), 45-48.
http://pss.sagepub.com/content/22/1/45.full

[Note the answers below are meant to be instructive so are not so much model answers as indications of points you may wish to bear in mind in answering the assignment about any paper. The discussion below is far too long and involved to be a model answer!]

 Section A. Popper (25 marks)

1) Concisely state the theory that the authors present as being put up to test?

People who view apologies as socially desirable following interpersonal transgressions will overestimate the value of an apology when they imagine receiving it compared to when they actually receive it; hence people believe they will be more trusting after receiving an apology than they really would be.

 

2) What pattern of results, if any, would falsify the theory?

The theory would be falsified if we have statistical evidence that people overestimate the value of an apology by less than a minimal meaningful amount (not specified by the authors).

Without specifying this minimal amount, the theory would be falsified if people rated the apology as significantly more valuable in reality than in imagination. If the amount is specified, then the theory is falsified if the amount by which people over-estimate is *significantly* less than this minimal amount; i.e. if the upper limit of the confidence interval on the over-estimation is less than or equal to this minimal amount.

Note that a non-significant result in itself would not count against the theory; it only counts if the test was sensitive, i.e. if the limits of the confidence interval exclude values consistent with the theory, as just stated. The orthodox way of planning a sensitive experiment is to determine the number of subjects in advance so that the power to pick up the minimal interesting value is high.

If the minimal meaningful amount is hard to distinguish from zero, a non-significant result can never in itself falsify (or even count against) a theory, using orthodox statistics. But in this case we could use a Bayes Factor pitting the theory against the null. The theory is falsified if the Bayes Factor gives a value below 1/3 (or whatever cut-off we conventionally agree is sufficient evidence). Such a Bayes Factor could falsify the theory whether or not a minimal meaningful difference other than zero can be clearly specified.

Incidentally, the above point indicates the relative strengths and weaknesses of confidence intervals and Bayes Factors for making use of non-significant results. Confidence intervals (or their Bayesian equivalents, credibility intervals) are inferentially useful for falsifying a theory when a clear minimal meaningful difference can be specified – and they do this without having to represent any other characteristics of the theory. On the other hand, Bayes Factors require representing the predictions of a theory more completely, but can falsify a theory even when no minimal amount can be specified.

3) What background knowledge inspired this theory but is not being directly tested?

People's poor affective and behavioural forecasting in other domains. People are socialised into giving and accepting apologies.

 

4). What background knowledge must be assumed in order for the test to be a test of the theory in (1)?

Some of the assumptions are (students came up with other very relevant ones):

I That the ratings of how valuable and reconciling the apology was reflect people's subjective feelings.

II Subjects did not determine the point of the experiment and respond according to demand characteristics.

III People imagined the apology as having objective characteristics equivalent to the real one that subjects received.

IV Subjects in the two groups did not differ systematically in mood or other ways that might affect the dependent variables.

V The subjects are representative enough of a population to which the theory applies.

 

5) How safe is the background knowledge in (4)?

I The scales have face validity, and highly correlate with each other in Study 1, so the assumption seems safe.

II The between subjects design means subjects did not know what other conditions the experiment involved. It is unlikely the subjects realised the contrast was imagination versus reality, so this assumption is fairly safe.

III In imagining an apology, people probably imagined one with all its normative features (e.g. sincerity) unless instructed otherwise, so there is a danger that subjects imagined receiving an apology that was objectively different from the one given in the real condition (which might have been short and offhand, for example, and so might have seemed insincere). The authors say the subjects were instructed to imagine an 'identical' apology to the one given in reality, so they were aware of the point; however, in the absence of a more precise description, the point is hard to evaluate for sure.

IV By the magic of random assignment of subjects to groups, the authors have ensured there is no reason to expect any systematic difference between groups in any pre-existing characteristic by which subjects may vary – including mood, interpretation of what counts as valuable, importance of apologies, pride, and so on. This is why random assignment of subjects to groups is regarded as so important by psychologists. Groups may vary on these characteristics only by chance – and it is exactly that chance we calculate and take into account by our statistical analysis. Thus this assumption is safe.

V It might be objected that the theory is meant to apply to people in general but the study was done only on undergraduates. That is, subjects were not randomly selected from the population of people generally. However, there is no need to go to such impossible lengths as randomly selecting subjects from people in the world. The theory can be tested by taking any population for which it is meant to apply. The theory says the results will hold for people who value apologies as socially desirable. The introduction reviews evidence that undergraduates at least satisfy this criterion on average. (Statistically speaking we can only generalise to the population who made themselves available to the experiment, but we can plausibly generalise to relevantly similar populations  - as a conjecture, unless we have reason to believe otherwise). Thus this assumption is safe – logically, we can attempt to falsify the theory by testing only on undergraduates. (These final two points indicate why psychologists typically find randomly assigning subjects to groups desperately important but randomly selecting subjects not important for experimental research.)

 Section B. Neyman-Pearson (25 marks)

6) Have the authors determined what difference (or range of differences) would be expected if the theory were true?

No.

 

7) If not, do you know any results or other papers that could allow you to state an expected size of difference? Provide an expected raw difference and an expected standardised difference and state your reasons.

We can use the pilot study to provide an expectation for Study 1, and both these studies to provide an expectation for Study 2.

For Study 1:
The mean difference on a 7-point scale was 0.75 between imagined and real conditions (in the right direction). The Pilot Study differed from Study 1 only in that the transgression was imagined in the former but enacted in the latter – the dependent variable used was exactly the same. It is not clear whether this difference would increase or decrease the effect. On the one hand, one might expect emotions to be stronger with a real transgression; on the other hand, the forecasting literature the authors review indicates precisely that people often over-estimate emotions in imagination. Thus, 0.75 is about our best guess for the expected raw difference, but the effect could well be smaller or larger than this.
To standardise the effect size, we can divide the raw effect size by the within-group standard deviation (this is Cohen's d). So first we need a within-group standard deviation for the Pilot, pooling the data from both groups. The rule for "averaging" standard deviations is that you actually need to average the variances, not the standard deviations. A variance is the square of a standard deviation. So first square the standard deviation in each group, average, then take the square root to bring it back to a standard deviation. Thus the pooled SD from the Pilot = sqrt((1.57^2 + 1.23^2)/2) = 1.41. Thus the expected standardised effect for Study 1 = Cohen's d from the Pilot = 0.75/1.41 = 0.53. This effect is no longer in units of a 7-point scale; it is in units of standard deviations. The predicted difference between groups is half a standard deviation.
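If you want to check the arithmetic, here is a rough sketch of the calculation in Python (not something you needed for the assignment; the group SDs of 1.57 and 1.23 and the 0.75 difference are just the Pilot figures quoted above):

```python
import math

# Pilot study figures quoted above
sd_group1, sd_group2 = 1.57, 1.23   # group standard deviations
raw_difference = 0.75               # mean difference on the 7-point scale

# Pool by averaging the variances (squared SDs), then square-rooting back.
# A simple unweighted average is fine here because the groups are nearly equal in size.
pooled_sd = math.sqrt((sd_group1**2 + sd_group2**2) / 2)   # about 1.41

# Cohen's d: the raw difference in units of the pooled within-group SD
cohens_d = raw_difference / pooled_sd                      # about 0.53

print(f"pooled SD = {pooled_sd:.2f}, Cohen's d = {cohens_d:.2f}")
```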

For Study 2:
The mean difference for the Pilot was 0.75 and for Study 1 it was 1.75; the mean of both studies is 1.25 on a 7 point scale. (This is a straight average; one might prefer to weight the studies according to the number of subjects or give more weight to the Pilot on the grounds the manipulation was more similar to Study 2 than Study 1 was.) Study 2 used a different dependent variable – Euros (or dollars according to the results section), and Euros and ratings of value are on the face of it like chalk and cheese. So how can we make them comparable? We will consider two methods for making a first guess: a) considering the extent of each scale and b) considering the standard deviations. In terms of a) the scale in Study 2 is 0-10 Euros, i.e. it is 11 units long. So a quick guess of an expected raw effect size in terms of the new scale is 1.25*11/7 = 2.0 Euros. In terms of b), we will first find a predicted standardised difference.
The Cohen's d for the Pilot was 0.53; for Study 1 it was 1.16. Thus, a rough expected standardised effect for Study 2 is (0.53 + 1.16)/2 = 0.85.
The pooled SD for Study 2 was 2.85. Thus another guess at the expected raw effect size is 0.85*2.85 = 2.4 Euros.
The 2.4 Euros estimate is fairly close to the 2 Euros estimate which is reassuring. Translating between dependent variables can never be done purely mechanically, and we have to ask if the estimates are plausible given common sense. My judgment is that a 2 – 2.5 Euro difference between conditions is plausible given the paradigm, so I am happy to stick with the answers.
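Again purely for illustration, the two routes can be laid out in a few lines of Python, using only the numbers derived above:

```python
# Route (a): rescale the average raw difference by the ratio of scale lengths
mean_raw_diff = (0.75 + 1.75) / 2      # average of Pilot and Study 1 on the 7-point scale
estimate_a = mean_raw_diff * 11 / 7    # 0-10 Euro scale treated as 11 units: about 2.0 Euros

# Route (b): average the standardised effects, then convert back with Study 2's pooled SD
mean_d = (0.53 + 1.16) / 2             # about 0.85
estimate_b = mean_d * 2.85             # about 2.4 Euros

print(f"scale-length estimate: {estimate_a:.1f} Euros; standardised estimate: {estimate_b:.1f} Euros")
```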

 

8. Have the authors established their sensitivity to pick up such a difference, through power or confidence intervals? If not, provide a calculation of both yourself.

No they did not.

For Study 1:

Power:
A typical textbook recommendation for researchers would be to use the 0.53 we estimated from the Pilot as a basis for the power determination for the main study. However, this would only estimate the Type II error rate if we make the following assumption: if an effect exists it is plausibly at least this size, and any possible smaller effect in the population would render the theory uninteresting or false. This assumption is surely false. I am guessing a raw effect of, say, 0.5 units would be counted as meaningful and as confirming the theory. It is probably around 0.25 raw units where researchers might think the effect is beginning to be too small to be interesting (though I think 0.25 units would still be accepted, if only just). (Ideally a researcher would consider previous studies where emotional and behavioural forecasting errors have been found and ask, if the same mechanism were working here, just how small the effect could be and still be the same mechanism at work.) 0.25 raw units corresponds to a Cohen's d of 0.53*(0.25/0.75) = 0.18.
In sum, unless one calculates power using a *minimally* interesting effect, one is not determining the Type II error rate of one's decision procedure. (But I didn't count any answer wrong that used 0.53 or some other prior estimate.)
It is wrong to calculate power based on the Cohen's d obtained in the experiment itself. I know it is tempting to use the d of 1.16 obtained in Study 1 to estimate power for Study 1 because it is the best estimate we have of the effect size for the exact manipulation used in the experiment. But it is completely uninformative. It does not tell us the Type II error rate of the procedure. If the effect was non-significant such “obtained power” is guaranteed to be less than 0.5, even if the procedure was very sensitive to pick up any interesting effect; and if the effect was significant, the power is guaranteed to be above 0.5, and may be very high (as here), even if the experiment was insensitive to the small size of effects we would be interested in if they were the population value. Such power tells nothing that the p value doesn't  - and may make us think we have controlled Type II error when we have not at all. Type II error is about theoretically-relevant effect sizes, not the obtained effect size, because our interest is in testing theories.
One might be tempted to use 0.53 because at least it is fixed by a relevant study and not plucked from the air. If we are quite confident in it as an estimate, we could regard it as the smallest effect that could *plausibly* be expected (even if lower ones would still be interesting). If we calculate power using this value, with 29 subjects in one group and 28 in the other, remember to choose the between-subjects option and the two-tailed option (as the authors used a two-tailed test). The power is then 50%. This is too low to be acceptable. It is also an over-estimate of the real power, given the comments above; if even the over-estimate is too low, then the real power is lower still. In conclusion, the experiment was not designed to be sensitive.
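If you prefer to compute power in code rather than with a power calculator, here is a rough sketch using statsmodels in Python, assuming a two-tailed alpha of .05 and the group sizes of 29 and 28 given above:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power to detect d = 0.53 (the Pilot-based estimate) with 29 vs 28 subjects, two-tailed alpha = .05
power_plausible = analysis.power(effect_size=0.53, nobs1=29, ratio=28/29,
                                 alpha=0.05, alternative='two-sided')   # about 0.50

# Power for the minimally interesting effect (d of roughly 0.18) is far lower still
power_minimal = analysis.power(effect_size=0.18, nobs1=29, ratio=28/29,
                               alpha=0.05, alternative='two-sided')

print(f"power at d = 0.53: {power_plausible:.2f}; power at d = 0.18: {power_minimal:.2f}")
```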

 

Confidence interval:

The main purpose of power is to pick a suitable number of subjects. After data are collected, a more informative measure of test sensitivity is the confidence interval (or Bayesian equivalent, but I didn't teach you how to do that). The mean difference was 1.75. The standard error of the difference is 0.42. (The obtained t value was sqrt(F) = sqrt(17) = 4.12; thus SE = 1.75/4.12.) The 5% TWO TAILED critical value of t with 51 degrees of freedom (using http://www.psychstat.missouristate.edu/introbook/tdist.htm ) is 2.01. Thus the 95% CI is 1.75 +- 2.01*0.42 = [0.91, 2.59]. The sensitivity of the test is shown by the lower limit: it says that the only differences we accept as possible population ones are all greater than 0.9; thus all values we decided above would be trivial have been ruled out. The only values we accept are interesting ones. Thus the test is in fact sensitive. (Note: the obtained mean difference is ALWAYS included in the interval by definition, so its inclusion means nothing – other than that the sample mean difference is a possible population mean difference.)
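Here is a rough sketch of the same interval reconstructed in Python from the reported mean difference and F (scipy supplies the critical t value; the same function is reused for Study 2 below):

```python
import math
from scipy import stats

def ci_from_f(mean_diff, f_value, df, conf=0.95):
    """CI for a two-group mean difference, reconstructed from the reported mean difference and F."""
    t_obtained = math.sqrt(f_value)               # t = sqrt(F) for a single-df effect
    se = mean_diff / t_obtained                   # standard error of the difference
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df)  # two-tailed critical t
    return mean_diff - t_crit * se, mean_diff + t_crit * se

# Study 1: mean difference 1.75, F = 17 on 51 degrees of freedom
print(ci_from_f(1.75, 17, 51))   # roughly (0.90, 2.60)
```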

For Study 2:

Power:
We will calculate power for the estimate of the expected standardised effect we came up with in question 7, namely 0.85. The power is 77%. This is an over-estimate of the power of the decision procedure: the theory is quite consistent with smaller effect sizes – in fact the study obtained a smaller one and the authors treated it as confirming their theory. Thus, if even the over-estimate of power is too low, the real power is too low. In sum, the study was not designed to be sensitive.

Confidence interval:
The mean difference was 1.89. The standard error of the difference was 0.88. (Obtained t = sqrt(4.66) = 2.16; SE = 1.89/2.16.) 95% CI = 1.89 +- 2.02*0.88 = [0.11, 3.67]. The interval excludes zero and includes our expected effect size if the theory were true (about 2 Euros), which shows the study is to that degree sensitive. However, the interval also includes as possible population values mean differences as small as 11 cents. Such a small difference, if it were the actual population difference, would render the theory uninteresting. So (unlike Study 1) greater sensitivity is needed to fully evaluate the theory. (This is a harsh judgment by conventional standards - but this style of thinking would likely resolve many disputes in various literatures by making people establish their effects with precision!)
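The function defined above reproduces this interval too. The degrees of freedom for Study 2 are not restated here, so a value of about 40 is assumed below purely to match the critical t of 2.02 used in the calculation:

```python
# Study 2: mean difference 1.89, F = 4.66; df of about 40 assumed (see note above)
print(ci_from_f(1.89, 4.66, 40))   # roughly (0.12, 3.66)
```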

9. Was the test of the theory in (1) severe (in Popper’s sense)?

The power was low for both studies so the tests were not severe in that statistical sense.  On the other hand, no other theories were identified from background knowledge that would make the same predictions. So the experiment was severe in that sense. [Some of you did identify other theories and I am not disagreeing with your evaluation, just presenting the answer consistent with my responses in questions 4 and 5.]

Section C Lakatos (25 marks)

10. State the hard core of the research programme the authors are working in

The authors consider broadly two literatures in the introduction, one on the role of apologies in human interaction, and the other on behavioural and emotional forecasting. One might be tempted to identify two hard cores, the relevant one from each literature, and suggest the paper brings together two research programmes. This strikes me as a plausible starting point for a Lakatosian analysis. It also raises questions about Lakatos' ideas – temporarily fusing programmes would be an interesting extension of his analysis of science. I suspect it happens a lot and is part of how science works, but Lakatos did not discuss this process.
Another view is to say the key motivating idea for the paper is the claim that people systematically mispredict their behavioural and emotional reactions in a broad range of domains – and it is this claim that we should therefore treat as the hard core. We just happen to apply it to apologies. That is, the hard core comes specifically from the forecasting literature.
(You might argue that the study was motivated specifically by an underlying claim from the literature on apologies; I wouldn't treat that as wrong.) Make sure when you specify a hard core that you specify an actual claim, a proposition (or a set of them). Try to state a claim that is sufficiently rich that it is plausibly what attracted a group of people and kept them busy over an extended time because it regularly generated new hypotheses.

11. Does the paper contribute to the research programme in a progressive or degenerating way? State your reasons.

The prediction that people over-estimate the effects of apologies seems to be temporally novel. We did not know whether or not people over-estimated the effect of apologies before the study was done, and thus the fact was not used in constructing the hard core of the programme. Therefore the programme is theoretically progressive. The fact was corroborated, and so the programme is empirically progressive as well.

 

Section D  Bayes (25 marks)

12) What was the mean difference obtained in the study?

1.75 (for Study 1)
1.89 (for Study 2)

 

13) What was the standard error of this difference?

0.42 (for Study 1)
0.88 (for Study 2)

14) Extending your answer in (7) concerning an expected raw difference, specify a probability distribution for the difference expected by the theory and justify it.

For Study 1:
The vaguest specification of the theory, which takes into account only that the theory says there will be an over-estimation, is a uniform distribution between 0 and 6 (the biggest difference possible given the scale).
A more reasonable specification would be to say, based on the answer to 7, that 0.75 is our best guess of the population value given other evidence relevant to the theory. We could round this up to 1 (“when in doubt spread it out”). In order for the distribution to give negligible plausibility to differences below 0, and otherwise keep our minds open about possible effects, we could use a normal with mean 1 and SD of 0.5. (We could also e.g. use a half-normal with an SD of 1, and it would give virtually the same answer.)

Note that it is cheating to use the mean obtained in the study (1.75) to help form a prediction about that very mean. So we cannot use the 1.75 to help specify the distribution asked for in this question. We can use previous or later studies, or indeed other results from the same study (e.g. the standard deviations, as we did to find the estimate for Study 2 in question 7, or other means), but not the very mean itself.

A number of distributions would be reasonable representations of the theory  – they all give qualitatively the same answer, which is reassuring and indicates we don't have to worry about exactly how we represent the predictions of the theory. Bayes gives a consistent answer.

For Study 2:
The vague distribution would be a uniform from 0 to 10. The more informed distribution would use the 2.5 Euros estimated from question 7 as the mean of a normal, with SD of half the mean, i.e. 1.25, so that negligible plausibility is given to differences below zero (as for Study 1). The latter distribution implies population differences between roughly 0 and 5 Euros can be regarded as plausible effects produced by the theory. (Or we could use a half-normal with an SD of 2.5 – it will give virtually the same Bayes Factor.)

 

15) What is the Bayes factor in favour of the theory over the null hypothesis?

Study 1:
Vague distribution:  Bayes Factor is 1033.
Normal with mean 1:  Bayes Factor is 1958.

As the vague theory is strongly supported over the null, there is little need to worry about a more informed variant, which will be even more strongly supported. (You only need to specify one Bayes Factor for the assignment; I give you a couple just for illustration.)

Study 2:
Vague distribution: BF = 2.18
Normal with mean 2.5: BF = 5.34

The vague theory is only anecdotally supported over the null.  But if we incorporate some reasonable restrictions on the effect, the theory is strongly supported over the null, as shown by the second result. In a real journal article I would report the second result, as I am willing to defend its assumptions – they follow simply from the other studies.
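For anyone curious, here is a rough sketch in Python of the kind of calculation the Bayes factor calculator performs. It assumes the data can be summarised as a normal likelihood centred on the sample mean difference with SD equal to its standard error, and it uses the distributions specified in question 14; the exact values depend slightly on how the standard errors are rounded:

```python
import numpy as np
from scipy import stats

def bayes_factor(mean_diff, se, prior_pdf):
    """BF = P(data | theory) / P(data | null), integrating the likelihood over the theory's prior."""
    theta = np.linspace(-20, 20, 200001)                 # grid of possible population differences
    d_theta = theta[1] - theta[0]
    likelihood = stats.norm.pdf(mean_diff, theta, se)    # probability of the data given each theta
    p_data_theory = np.sum(likelihood * prior_pdf(theta)) * d_theta
    p_data_null = stats.norm.pdf(mean_diff, 0, se)       # likelihood under the null (theta = 0)
    return p_data_theory / p_data_null

uniform_0_6  = lambda t: stats.uniform.pdf(t, 0, 6)      # Study 1, vague: uniform 0 to 6
normal_1     = lambda t: stats.norm.pdf(t, 1, 0.5)       # Study 1, informed: normal(mean 1, SD 0.5)
uniform_0_10 = lambda t: stats.uniform.pdf(t, 0, 10)     # Study 2, vague: uniform 0 to 10
normal_2_5   = lambda t: stats.norm.pdf(t, 2.5, 1.25)    # Study 2, informed: normal(mean 2.5, SD 1.25)

print(bayes_factor(1.75, 0.42, uniform_0_6))    # about 1033
print(bayes_factor(1.75, 0.42, normal_1))       # about 1958
print(bayes_factor(1.89, 0.88, uniform_0_10))   # about 2.2
print(bayes_factor(1.89, 0.88, normal_2_5))     # about 5.3
```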

16. What does this Bayes factor tell you that the t-test does not?

In this case, both provide support for the theory rather than the null. But the Bayes Factor quantifies the amount of evidence in favour of our specific theory over the null, in contrast to the t-test, which just allows a black and white decision to reject the null. The large size of the BF indicates it could discriminate the theories sensitively; the t-test by contrast does not provide a sensitive test.

Note: The Bayes Factor allows a sensitive test of the theory in a way the t-test does not. A sample mean difference of zero would have given for Study 1 (with the non-vague representation of the theory) a BF of 0.20, strong evidence against the theory and for the null. Bayes allows us to have a sensitive test even without specifying a minimal interesting effect size – the minimal interesting effect is sufficiently small that we can treat it as being zero. Using a t-test, under these conditions we cannot calculate power. When we do try to specify a minimal interesting effect (in question 8) we get very low power for this test. In this sense, the Bayes Factor seems to extract more information from the data, being sensitive in a situation where significance testing cannot be.