Left: 7. The Role of Up: Developing the Attributes of Right: 9. Implications for Policy

Subsections

8. Assessment and Revalidation

Assessment is not the prime focus of this report, but it plays an important part in the development of competence and judgement. Three types of purpose can be usefully distinguished:

Certification or qualification of individual doctors
Provision of information to guide learners and/or those who supervise and support them
Quality assurance and improvement of practice.

Certification and Qualification, particularly when examinations are involved, are often described as ``high stakes'' decisions. For examinations, the risk to candidates will depend on pass rates, the possibility of reassessment and the probabilities of first time and second time success. The cost of failure will depend on whether their career progression is significantly affected, as well as the need to engage in further training or study. However, even for candidates least at risk, the effect of any high stakes assessment is a considerable increase in attention to what is being assessed , with a concomitant decrease in attention to what is not being assessed. Hence the quality of an examination has to be judged not only by whether the right people pass (false positives being of particular concern whenever certification is involved) but also by its effect on candidates' direction of their learning effort. This washback effect of examinations may be positively valued; Wakeford and Southgate (1992), for example, report how the introduction of a Critical Reading Question paper into the Membership Examination of the Royal College of General Practitioners increased time spent by candidates on critically discussing papers and reading two key journals. Where negative views are expressed, it is usually because the examination is thought to overemphasise knowledge at the expense of competence (see Chapter 4), to neglect a holistic approach by focusing mainly on components of competence rather than their integration, or to emphasise competence at the expense of performance (candidates exhibit their best behaviour under ideal conditions).

8.1 Assessment Methods

Excellent overviews of assessment methods used in examinations are provided by van der Vleuten and Newble (1994), Shannon et al. (1995), and Fowell and Bligh (1998); and advice on the development of specific tests by Dauphinee et al. (1994); Newble (1994). Although all the methods reviewed are used at postgraduate level, much of the published research refers only to undergraduates. However, some development are clearly relevant at both levels. For example, Page and Bordage (1995) have taken advantage of recent research on clinical decision-making, supporting the view that problem-solving skills are case or problem specific rather than generic to replace written Patient Management Problems whose validity was increasingly in doubt (Norman et al., 1985a; Bordage and Page, 1987) by sets of problems focused only on those key features that are crucial to their successful resolution. Not only does this cover more cases in less time but it achieves high validity by adopting a more holistic construct of competence.

Several different groups have reported developing Objective Structured Clinical Examinations (OSCEs) for use at postgraduate level: Fraser et al. (1994) for GP Consultations; Sloan et al. (1993), Schwartz et al. (1995) for surgical interns; Hodges et al. (1998) for psychiatry. Hodges et al. (1996) paper on using OSCEs specifically targeted at difficult communication skills suggests that these also have a strong case-specific element, with consequent implications for going beyond generic training in communications. Another development in Canada involved lengthening the time spent at each station on a multi-case examination to 30 minutes in order to make a more comprehensive examination of the more sophisticated knowledge of senior surgical residents (MacRae et al., 1997). The resulting Patient Assessment and Management Examination (PAME) had better psychometric properties than other measures, when using 6 cases. Patient satisfaction ratings were included with global ratings by the examiner for each of the four phases: - initial patient assessment (8 mins), ordering and interpretation of investigations (4), a second interaction with the patient to discuss diagnosis and management (10) and a structured oral examination (6). A parallel development in Ireland led to a objective structured long Examination Record (OSLER) in which the candidates spend 20-30 minutes with the examiner alone, having already been observed examining and taking a history from the patient (Gleeson, 1997). Including a sufficient number of cases is still important for reliability; because case content is the most significant variable (van der Vleuten, 1996) and results should not be left to ``the luck of the draw''.

Both OSCEs and PAMEs/OSLERs require the use of simulated standardised patients (van der Vleuten and Swanson, 1990), though Barrows's (1993) review points to other, less structured use of standardised patients to assess a doctor's general approach to clinical examination rather than specific skills. Stillman et al. (1991) estimated that 2 half-days interacting with 19 Standardised patients was sufficient for evaluating the data gathering and interviewing skills of residents.

Pieters et al. (1994) found that GP trainees performed better with actor-patients than in recorded consultations with real patients in daily practice, but this evidence nevertheless provided good predictions of weak performance from trainees. They interpreted this as a distinction between competence and performance rather than lack of validity when using standardised patients: Finlay et al. (1995) used actor-patients for testing communication skills at the end of a Diploma course in Palliative Medicine; and showed that, although their ratings of the doctors were a little higher than those by the ``official'' examiners, they correlated well. Indeed standardised patients have been trained to give formative feedback to medical students and can play an important role in this aspect of their training (Stillman et al., 1990; Barrows, 1993).

Reznick et al. (1997) and Martin et al. (1997), extended the use of OSCEs to test technical skills with their objective structured Assessments of Technical Skill (OSATS) for measuring surgical residents' technical skill using bench model simulations outside the operating room. Jansen et al. (1996) have developed a test for technical clinical skills of general practitioners in the CME context. A more sophisticated but not highly expensive simulator (£3000) has been used by Byrne and Jones (1997) to assess anaesthetists responses to a range of simulated emergencies; and further work showed that under simulated critical conditions chart recording errors markedly increased (Byrne et al., 1998), thus raising questions about the accuracy of records of such incidents in real situations.

Although associated with formal examinations, these assessment techniques can also be used for formative purposes, to provide feedback to the trainee and/or trainer. Thus Pieters et al. (1994) used assessment with simulated patients to identify trainee GPs needing more support. Gleeson (1997) reported how immediate feedback from an OSLER led to a significant improvement in performance. Sloan et al. (1996) found that feedback could be given to participants during an OSCE without perturbing test reliability. Garibaldi et al. (1994) describe the increasing popularity of a voluntary in-training examination for residents in internal medicine. 47% took this opportunity to compare their performance on a comprehensive written examination with national norms, of whom 45% were second year residents.

8.2 Assessment of Performance

The term performance assessment is often applied to any assessment involving real or simulated patients; but, mindful of the important distinction between competence and performance, we shall refer to all the methods discussed above as either competence assessment or assessment of knowledge. The term performance assessment can then be confined to assessment of real on-the-job performance under working conditions. Such assessment is the only form of assessment under a ``pure'' apprenticeship system and has always played a significant role in the UK. Dauphinee's (1995) review of the increasing use of performance assessment in North America supports the trend towards improved validity but also analyses the reliability problems of new approaches. The least satisfactory of all formal assessments in the UK must surely be that associated with admission to the register. Though subject to ratification, the ``satisfactory completion'' of a doctor's pre-registration year depends on global judgements by their supervisors, made with varying degrees of rigour and varying amounts of unaggregated evidence.

Registration as a general practitioner now involves a formal summative assessment based partly on videorecordings of patient consultations. Since these are selected by the candidates as representing their best practice, they are sometimes described as assessments of competence rather than performance. A pilot study in Scotland (Campbell et al., 1995) concluded that assessors reached firm judgements by the time four consultations had been viewed, and estimated a 95% probability of identifying a non-competent trainee. It also includes a trainer's report informed by a checklist which is an overall rating of performance and takes into account dispositions and attitudes as well as clinical competence, a written examination based on Multiple Choice Question (MCQs) and an audit marked against set criteria.

The newly introduced Certificate of Completion of Specialist Training (CCST) represents a major change in policy for hospital doctors. Regional training programmes with built-in rotations are the responsibility of the postgraduate dean, together with the annual progress review and final assessment of specialist registrars (SPRs). The annual review is conducted by members of the regional specialist training committee not directly responsible for the trainee on the basis of inspecting each trainee's logbook and cross-examining them on it contents. For final assessments 1-2 specialists from outside the region are added. The logbooks are designed by the appropriate Royal College and specialism, and typically include a list of competencies, each of which can be assessed at several levels. Two examples are shown below for the Royal College of Obstetricians and Gynaecologists and the Royal College of Physicians. Entries are made jointly by the registrar and supervisor.

The RCOG levels of competence range from observation (1) to independent practice (5). The following list specifies what is meant by each level.

Level 1 Observes Observes the clinical activity performed by a

colleague

Level 2 Assists Assists a colleague perform the clinical activity

Level 3 Direct Supervision Performs the entire activity under direct supervision

of a senior colleague

Level 4 Indirect Supervision Performs the entire activity with indirect supervision

of a senior colleague

Level 5 Independent Performs the entire activity without need for

supervision

Level 1 (observes) and Level 2 (assists) include the presentation of basic and clinical knowledge, exhibition of clinical reasoning and identification of relevant principles associated with the target activity.

The RCP levels are

Level 0 Insufficient theoretical knowledge

Level 1 Theoretical knowledge but not competent

Level 2 Some competence

Level 3 Fully experienced and competent

Some colleges and specialisms also include written examinations in their requirements for CCST, though most regard their Membership Examinations taken before entering higher specialist training as providing a sufficient foundation of basic knowledge. Further knowledge, especially of recent research, is assumed to be included within the assessment of clinical performance.

Until recently, the traditional practice of using global ratings by supervisors as the main indicators of satisfactory progress has been a major weakness in the UK system of specialist training; because this approach has been repeatedly shown to suffer from halo affects, evaluator leniency and restricted use of the grading range (Norman et al., 1985a). As a result weaknesses in performance which ought to be picked up and dealt with are allowed to continue with adverse effects (Littlefield and Terrell, 1997). The use of other methods, a wider range of raters, improving the rating instrument and training the assessors (Wakeford et al., 1995) are all advocated as remedies, preferably in combination. Thus the quality of assessments in the new SPR logbooks will require careful monitoring and the will to take further action if necessary. The most obvious extension would be periodic structured ratings of observed performance using purpose-developed rating scales (e.g., Winckel et al., 1994).

Given the importance of progressing doctors more rapidly through what is now a shorter but more structured training programme, there is a need for some form of progress chasing which also improves quality of performance. The credentialing of internal medicine residents for undertaking procedures (Gabryel et al., 1991) was discussed in Section 5.1. Another approach developed by Brennan and Norman (1997) in Obstetrics used ``encounter cards'' with scales for knowledge, professional skills, manual skills and overall performance. At periodic intervals, residents were asked to have 6 to 8 encounters scored during a particular week (reducing the selection bias) by faculty who observed them with the patient. Immediate feedback was given, and the full sequence of cards reviewed a little later. This process is less unwieldy and probably more valid than an examination, and its purpose is purely formative. An alternative strategy, not requiring observation but more focused in its target, is to review the case notes of the last 10-15 patients with a particular condition to be managed by the trainee (Pietroni, 1993a) (Pietroni 1993).

A generic quantitative approach developed in Scotland (Potter et al., 1996; Milne et al., 1996) uses a workload system, which weights cases for complexity and degree of supervision to estimate the effective operative experience of surgical trainees. Anderson et al. (1989) suggest that for some pre-surgical candidates performance outcomes can be used as indicators of clinical acumen. Thus residents were asked to predict the likelihood of 101 patients having appendicitis, while evidence of outcomes was collected from pathological inspection after appendectomy or records of the alleviation of abdominal-pain without appendectomy. This enables the construction of a Diagnostic Ability Score (DAS). A similar approach can be used for regularly performed procedures, such as colonoscopy (for surgery) or central venous cannulation (for anaesthetists); and made more flexible by using cusum analysis, a cumulative average failure rate. This gives both (1) an objective account of trainees progress for planning further training or taking remedial action and (2) a cut off point for credentialing (Williams et al., 1992; Kestin, 1996,1995; van Rij et al., 1995).

While examinations can and should provide feedback to candidates and trainees to enable them to adjust learning goals and the provision of learning opportunities, the link is usually close for performance assessment. Through most of the papers reviewed above, there is a major concern for assessments which provide feedback appropriate for remedial action by the various stakeholders; and the corollary is sufficient commitment to remedial action to justify not just conducting assessment but implementing them with a disposition and skill that preserves their effectiveness and ensures their contribution to the quality of training. These considerations are even more important when we attend to the problems of developing and maintaining the competence of doctors who are already qualified. This may be in the context of quality assurance, needs assessment and/or evaluation of CME programmes and policies, or recertification/revalidation.

Pollock's (1996) review of measuring the performance of established surgeons covers communication, mortality and morbitity meetings, CME , peer review and audit. Except for communication these are context categories, the evidence used may come from statistical monitoring, a thematic review of a particular category of cases, or a review of individual cases. Local statistical data is available from medical records, but often additional data are collected for thematic reviews. Their utility is improved by the increasing availability of regional or national norms or relevant research findings linking data with patient outcomes. Such evidence is more clearly attributable to individuals in surgery than in most other specialisms.

Lockyer and Harrison's (1994) analysis of chart review noted its value in collaborative efforts to improve patient outcomes in a particular community; but also reported its limitations as the sole method for assessing a physicians's competence. The interpretation of chart data can be variable so a large number of cases (or judges) are needed to establish reliability. Whereas audit may be a useful trigger for quality improvement, it is too blunt an instrument to provide reliable assessment of individuals unless they show extreme deficiencies (McAuley et al., 1990; Nash et al., 1993).

8.3 Recertification or Revalidation

The Canadian repsonse to this problem has been to develop a three tier system for the monitoring and enhancement of physician performance which is expected to become mandatory. This involves:

Initial screening of all physicians (using fee for service billing patterns, peer assessment questionnaires and patient satisfaction questionnaires) on which feedback is provided (Page et al., 1995b; Jacques et al., 1995);
Assessment of about 10% of practitioners identified by the first screen as being at risk, (using hospital or office audit and structured interview) followed by appropriate CME programmes for groups and individuals;
Detailed assessment of high risk physicians- (1-2%) by a mixture of methods, followed by a specific remedial programme and possible removal from practice.

Assessment techniques appropriate for the third stage were reviewed by Norman et al. (1993) in connection with a similar programme operating in Ontario for Primary Care Physicians. Their 7 hour package, incorporating chart-stimulated recall from their records, an oral examination, of 3 cases, 3 standardised patients, a 5 station OSCE and a Multiple Choice Examination, proved to be highly reliable and fit for its purpose. The Ontario programme has been screening a random sample of 450 physicians a year, of whom 0.5% eventually required removal from practice.

In the US doctors have to get recertified every seven to ten years if they want to retain their membership of a specialist Board (the US equivalent of a Royal College). This does not affect their license to practice medicine in general, but their right to claim specialist status and fees. The process entails passing an examination which is increasingly becoming computer-based. Although based mainly on propositional knowledge, its importance should not be underestimated.

McAuley et al.'s (1990) survey revealed disturbing evidence about some doctors failure to keep up-to-date; and Tracey et al. (1997) showed that GPs self-assessment of their level of knowledge was highly unreliable. The scope of these examinations is also likely to increase as the technology enables it to cover various forms of simulated practice. However, on-site review of practice has recently been discontinued because of the expense; but their cost would be much less if they adopted the Canadians; three tier system whereby only 10% of those doctors screened receive later on-site visits.

British policy is currently in a state of rapid change. Southgate and Dauphinee (1998) report the progress of the first new policy to be implemented, the performance assessment of practitioners screened by the General Medical Council (GMC) as the result of a complaint. If the issue is one of poor performance then a peer review procedure not unlike the Canadian second stage is triggered. This entails a visit by two medical and one lay assessor from the GMC who engage in the following activities:

Assessment of medical record keeping
Discussion of the management of the doctor's own cases and clinical work
Observation of aspects of actual practice
Audit of clinical outcomes
Interviews with third parties
Structured interview with the doctor
Site tour to determine the circumstances of practice.

If the visit cannot rule out a serious deficiency of performance, then it is rapidly followed by a second stage involving formal tests of competence: - a written or oral test of practice related knowledge and clinical thinking, a test of consultation skills (clinical thinking and communication skills), and a structured clinical examination of practical clinical skills.

Mindful of increasing public concern about the competence of a significant minority of doctors and the small number of evidence-based complaints reaching them, the GMC recently announced proposals for the revalidation of specialists and general practitioners, which they intend in due course to extent to all doctors (GMC, 1999). The term ``revalidation'' appears to have been chosen to signify a more normal, less threatening event than ``recertification''. The central policy problem, given the inevitable cost of such a process, is to achieve three separate goals:

To detect, and where possible remedy the deficiencies found in, incompetent doctors;
To advise doctors about weak elements in their current practice which might lead to incompetence if not remedied, thus forestalling a more serious verdict at a later date; and
To contribute to the professional development of doctors whose competence is not in doubt.

The first of these goals is a reasonable proposition given the GMC's recent experience of performance assessment and recent developments in Canada. The second is more problematic. Although Irvine (1997) emphasises the need for ``sound local arrangements for recognising dysfunctional doctors early and for taking appropriate action'' (page 1613), Donaldson's (1994) review of 49 problem cases concluded that ``existing procedures for hospital doctors within the NHS are inadequate'' (page 1277).

However, this is now being given high priority at local level as the fourth main component of the new policy on Clinical Governance (National Health Service Executive, 1999). The Standing Committee in Postgraduate Medical and Dental Education (SCOPME) recommends a combination of clear standards and peer review, recognising that each is in need for further development (Standing Committee on Postgraduate Medical and Dental Education, 1999). As implied by our earlier definition of competence (Chapter 2), the standards must reflect the expectations of a particular doctor in a particular job. They cite the example of the self-assessment manual and standards of the Faculty of General Dental Practitioners and its use by individuals and voluntary peer review schemes. Given its brief, SCOPME's perspective is primarily educational and they have argued for some time that CPD, review and support should not be linked to regulatory or disciplinary arrangement. Thus they are concerned that: ``revalidation could easily be seen as a threat and continuing registration could become the sole focus of clinicians' responses, utilising their energy and available resources at the expense of their professional development'' (page 14).

SCOPME rightly argues for more research and development into the causes, assessment and educational remediation of poor performance; but nevertheless conclude with a tinge of euphemism that: ``the main challenge for everyone will be to find ways of making the revalidation process both a valued part of professional life and, for the overwhelming majority of doctors, an important stimulus to further development'' (page 14).

The factor ignored in this debate is the greater frequency of annual reviews associated with Personal Development Plans and Practice-Based Development Plans. If these incorporate an element of peer review (see Bourdillon (1999) for a description of the Dutch system), then periodic revalidation every 5 to 10 years should not weaken personal commitment to CPD. Indeed, if followed by a universal entitlement to a period of educational leave with a mutually agreed focus, achievement of the third goal of revalidation would be greatly enhanced.

8.4 Summary

Hitherto, most of the research effort has focussed on the assessment of competence linked to certification decisions rather than the assessment of performance on-the-job. This emphasis is gradually changing as public demand for robust quality assurance grows. Revalidation is about to be developed with very close attention to research on performance assessment. A range of assessment methods is reviewed, and the unreliability of the traditional practice of using global ratings by supervisors noted. Persistent conclusions from assessment research are the need to use several methods, to refine all instruments or protocols, to train assessors and to use several assessors.
The Canadian three-tier system for the monitoring and enhancement of physician performance is now well developed and familiar to those exploring revalidation in the UK. The need for good assessment practice linked with effective strategies for the improvement of practice (see Chapter 6) is critical, both for revalidation and for formative and summative assessments during postgraduate education.

Left: 7. The Role of Up: Developing the Attributes of Right: 9. Implications for Policy

Benedict du Boulay, DOH Report pages updated on Friday 9 February 2001

Level 1	Observes	Observes the clinical activity performed by a
		colleague
Level 2	Assists	Assists a colleague perform the clinical activity
Level 3	Direct Supervision	Performs the entire activity under direct supervision
		of a senior colleague
Level 4	Indirect Supervision	Performs the entire activity with indirect supervision
		of a senior colleague
Level 5	Independent	Performs the entire activity without need for
		supervision

Level 0	Insufficient theoretical knowledge
Level 1	Theoretical knowledge but not competent
Level 2	Some competence
Level 3	Fully experienced and competent