Reliability and diagnosis

classification of disorders NOV 2018

Bingo: Reliability of diagnosis – WordMint

My Bingo Clues: Bingo Clues reliability of diagnosis

Next, we need to think about the reliability of the different versions of the DSM and ICD. But what does “a reliable diagnosis” mean, and how would we measure it?

  • What does reliability mean when we use it to think about research findings in psychology?
  • What would it mean to say a diagnosis was reliable?
  • How could we determine whether a diagnosis is reliable?
  • Why do you think it might be hard for practitioners to reach a reliable diagnosis? Think about how we extract information from service-users. Why might different people come to different decisions about the same person?
  • If 100 is perfect agreement between practitioners regarding a particular diagnosis and 0 is no agreement at all, what number do you think would be acceptable for us to say that a particular diagnosis can be made reliably?

How is reliability measured?

In order for a diagnosis to be considered reliable, it should remain relatively constant over time, assuming the symptoms have not changed. This can be established using ‘test-retest’, meaning the same patient will be diagnosed twice, a number of weeks apart. If their symptoms are highly changeable, it may be impossible to make a reliable diagnosis. Furthermore, an individual should receive the same diagnosis when re-diagnosed by another practitioner, assuming of course that they are using the same version of the same classification system!

In practice, psychiatrists make their diagnoses having gathered information about their patients through unstructured clinical interviews, meaning patients may give differing descriptions to different practitioners depending on many factors. Given that psychiatrists base their diagnosis on a subjective interpretation of what a patient has said, it is understandable that the process may lead to unreliable labelling.

Cohen’s Kappa

Robert Spitzer, the chairman of the DSM-III task force, introduced the use of a statistic called Cohen’s Kappa to improve the reliability of this version of the DSM. Cohen’s Kappa is expressed as a decimal: 1 means perfect agreement and 0 means no more agreement than would be expected by chance. It indicates how far two different practitioners give the same people the same diagnosis, over and above the agreement they would reach by chance alone. At the time, Spitzer and his colleagues considered that a value of 0.7 would denote ‘good agreement’ (Spitzer et al. 2012). While Cohen’s Kappa has been used in subsequent versions of the DSM, the threshold value has plummeted.
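To see why Kappa is stricter than simple percentage agreement, it can help to work one through by hand. The sketch below is only an illustration: the function names and the ten diagnoses are invented for this example and do not come from any DSM field trial. It compares the agreement two practitioners actually reached with the agreement they would be expected to reach by chance.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of patients given the same diagnosis by both raters."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters who each diagnose the same patients."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must diagnose the same, non-empty set of patients")
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    # Chance agreement: for each diagnosis, multiply the proportion of
    # patients each rater gave that diagnosis, then add the results up.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_chance == 1:
        raise ValueError("Kappa is undefined when both raters use a single label")
    return (p_observed - p_chance) / (1 - p_chance)


# Invented data: ten patients, each seen by two psychiatrists.
psychiatrist_1 = ["MDD"] * 5 + ["GAD"] * 5
psychiatrist_2 = ["MDD"] * 4 + ["GAD"] + ["GAD"] * 4 + ["MDD"]

print(percent_agreement(psychiatrist_1, psychiatrist_2))       # 0.8
print(round(cohens_kappa(psychiatrist_1, psychiatrist_2), 2))  # 0.6
```

In this made-up case the two psychiatrists agree on 8 out of 10 patients, but because they would agree on half of them by chance alone, Kappa is only 0.6, which falls short of Spitzer’s 0.7.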

Time to work on some terminology. Make sure you understand the following terms, look them up in your textbooks and other sources as required:

  • inter-rater reliability
  • test-retest
  • PPV scores (you might need to search a little harder for this term :)!)
  • Cohen’s Kappa

Be sure to use these terms when describing, explaining, applying and evaluating issues regarding reliability and diagnosis.

How reliable is DSM 5?

‘Mad in America’ are a not-for-profit organisation calling for ‘profound change’ to the ‘current drug-based paradigm of care’, which they say has ‘failed our society’. On their website, Rachel Cooper (2014) explains that the DSM III required much higher Kappa scores than the DSM 5. She explains that in the DSM 5 field trials (where the checklists were checked to ensure reliability), figures which previously would have been deemed ‘poor’ or ‘unacceptable’ were now seen as ‘good’. Robert Spitzer, the chairman of the DSM III task force, chose 0.7 as the threshold for ‘good agreement’, and some of the most common disorders seen in adults achieved values of 0.8 and over in DSM III (e.g. schizophrenia, mood disorders and substance abuse disorder). However, the DSM5 task force suggested that values as high as 0.8 would be ‘miraculous’ and noted that values of 0.4-0.6 are ‘realistic’ but values of 0.2-0.4 are ‘acceptable’. This has led to concerns about the reliability of DSM5.

With regard to research and clinical practice, does it really matter whether a diagnosis is reliable?



Before you go any further make sure you have answered the following questions.

  1. What Kappa score should be obtained according to Cooper if a diagnosis is seen to be reliable?
  2. How has this changed since 1974?
  3. Which disorders seem to be most/least reliably diagnosed using DSM5?
  4. Why does Kupfer say that it is difficult to make a reliable diagnosis sometimes?
  5. Why does Cooper argue that problems with reliability may not be as worrying as they first appear?

Evaluating the reliability of the DSM

Diamond 9 Activity; separate into S and W to begin with, put them into the diamond pattern; top and tail the fragments to create PET chains; integrate with AO1 to make an essay 🙂 reliability of DSM diamond 9

Gambling disorder still reliably diagnosed with DSM 5

One strength of the DSM 5 is that test-retest data has shown high levels of agreement for certain disorders.

For example, Randy Stinchfield and colleagues have been researching gambling disorder for the past 15 years. Using patients recruited from a treatment programme in Ontario and members of the local community, they were able to accurately identify 91% of participants as either having or not having gambling disorder (Stinchfield et al. 2016).

This is an important study that shows the DSM 5 is highly reliable despite changes to the criteria for diagnosis, notably dropping the number of symptoms necessary from five to four.

Kappas in decline for many disorders
One weakness relating to reliability of diagnosis is that, according to Rachel Cooper (2014), only 15% of the disorders evaluated in the DSM5 field trials achieved a Kappa score of more than 0.6, let alone the original threshold of 0.7 identified by Spitzer in the review of DSM III.

This is concerning as two of the disorders which were considered to be least reliable were major depressive disorder and generalised anxiety disorder, two of the most prevalent disorders in the US.

This suggests that the DSM5 may be distinctly less reliable than previous versions.

Competing argument: That said, it is not really appropriate to compare the reliability data for DSM5 and previous versions, as different methods were employed to evaluate DSM III. For example, David Kupfer, the chair of the DSM 5 task force, and Helena Kraemer, the chief methodologist responsible for the field trials (2012), explain that during the DSM5 field trials practitioners were asked to ‘work as they usually would’ and ‘take patients as they come’ in order to ‘mirror’ normal practice, whereas previously ‘test’ patients were carefully screened and practitioners were given detailed instructions and training. Therefore, it is unsurprising that the reliability data was less impressive than in previous trials.

Working with the DSM in the real world

A strength of DSM-5 is that the team responsible for the revisions (called the DSM-5 Task-force) worked collaboratively with the WHO to ‘harmonise’ the DSM and the ICD.

The previous lack of consistency between the two systems has made international research difficult, as well as treatment of people who have been diagnosed under different systems. This is because people with similar symptoms may have been diagnosed with different disorders, depending on whether the DSM or ICD was used.

This is an important advance in terms of the way disorders are classified and should lead to greater reliability.

A further weakness of the use of the DSM5 in clinical practice is that it is still highly likely to result in unreliable diagnoses.

Associate Professor of Psychiatry Ahmed Aboraya and colleagues (2006) draw attention to many factors which prevent practitioners from making reliable diagnoses, including the fact that they simply do not have sufficient time to spend on the necessary structured interviews and rating scales. They also note that practitioners prefer not to use these instruments as they interfere with the development of the therapeutic rapport necessary for successful treatment.

These issues demonstrate that although it may be possible to make a reliable diagnosis, in practice, there are many reasons why this may be difficult.

Issues and debates
One relevant issue is the use of psychological knowledge in society.

In order to improve the reliability of diagnosis in clinical practice, Aboraya has developed the SCIP (Standard for Clinicians’ Interview in Psychiatry), which has two phases: an unstructured phase to allow the practitioner to build rapport and a general picture of the patient’s symptoms, followed by a structured phase using 24 carefully chosen questions which ensure some standardisation across practitioners.

This has led to improved inter-rater reliability in clinical practice (Aboraya et al. 2006).


Using evidence to evaluate the reliability of the DSM

Now we have some idea about how psychologists talk about reliability and diagnosis, let’s see what research evidence there is on this topic:

Chop up the studies in the following worksheet and sort them into two piles according to whether you think they are about reliability or validity of diagnosis. Then re-sort the pile of studies that you think are about reliability according to whether they suggest diagnosis can be made reliably or not:

Validity and Reliability – Studies for sorting activity: R and V study squares

We will come back to the validity pile later! As you consider each of the studies on reliability, think about possible GRAVE points that you could make about them. You can now use these studies to answer the following question. As the question asks you to ASSESS, you need to make a judgment about reliability; it’s an 8 marker and so needs to follow ATCHOOBC.

Assessment Tasks:

  1. Assess the reliability of the DSM4TR or DSM5 with reference to research evidence (8)
  2. This is direct from the SAMS: If a person visited two different psychiatrists, they might receive two different diagnoses of their medical condition. Assess the reliability of mental disorder diagnosis using research evidence (8)

You should also be able to answer questions such as:

3. Explain ONE issue regarding the reliability of diagnosis using classification systems such as the DSM4TR or DSM5 (3)

In these questions, be careful that you only use studies looking at the DSM4TR or DSM5; some of the studies you have looked at are for the ICD. On the revision area there is a table to help you sort the studies so that you know which questions to use them in. You should start to fill this in now: r-and-v-icd-and-dsm-table-sort-studies

Evaluating the Reliability of ICD

Improvements between ICD-9 and 10

One strength of the reliability of the ICD-10 is the research evidence provided by Alexander Ponizovsky and colleagues (2006).

This large-scale longitudinal study found that PPV scores (the proportion of people who retain the same diagnosis when reassessed) increased by 26% for schizophrenia, 16% for mood disorders and 8% for anxiety disorders.

This clearly shows that the major expansion in the number of disorders from ICD-9 to ICD-10 has not detracted from the reliability of those diagnoses. It is also important to add that mood disorders have a notoriously poor track record with regard to reliability of diagnosis in the DSM-5 (0.28 according to Regier, 2013) and thus Ponizovsky’s findings suggest that the ICD-10 may be better than the DSM-5 in terms of reliably diagnosing these disorders.

Competing argument: Despite the impressive PPVs for those three disorders, the study did reveal some less satisfactory results for the categories of ‘childhood disorders’ and ‘personality disorders’, where PPVs were as low as 55% and 56% respectively. It should also be noted that these PPVs may seem high in comparison with some of the figures quoted for the DSM, but they are based on agreement at the category level, whereas the DSM figures are for more specific diagnoses. This suggests that the ICD may not be more reliable for more specific diagnoses.
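If you are unsure how a PPV score of the kind described above is worked out, here is a rough sketch. It follows the definition used in these notes (the proportion of people who retain the same diagnosis when reassessed). The function name and the eight patients are invented for illustration and are not Ponizovsky’s data.

```python
def retained_diagnosis_ppv(first_assessment, reassessment, diagnosis):
    """Proportion of patients first given `diagnosis` who keep it when reassessed."""
    # Keep only the later diagnoses of patients who started with `diagnosis`.
    later_labels = [
        later
        for earlier, later in zip(first_assessment, reassessment)
        if earlier == diagnosis
    ]
    if not later_labels:
        raise ValueError(f"No patients were first diagnosed with {diagnosis!r}")
    return sum(later == diagnosis for later in later_labels) / len(later_labels)


# Invented data: eight patients diagnosed on two separate occasions.
first_assessment = ["schizophrenia"] * 4 + ["mood disorder"] * 4
reassessment = (
    ["schizophrenia"] * 3 + ["mood disorder"]
    + ["mood disorder"] * 2 + ["anxiety disorder"] * 2
)

print(retained_diagnosis_ppv(first_assessment, reassessment, "schizophrenia"))  # 0.75
print(retained_diagnosis_ppv(first_assessment, reassessment, "mood disorder"))  # 0.5
```

Notice that the labels here are broad categories. As the competing argument points out, agreement at category level is easier to achieve than agreement on a specific diagnosis.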


In your textbook, there is one extra study about the ICD by Galeazzi (2004), p37.

There is also information about the reliability and validity of the diagnosis of schizophrenia, p.39 and major depressive disorder, p.85. If you are interested or feel that you want/need more, you will also find information about the R and V of the diagnosis of OCD, p.71 and anorexia nervosa, p.57.

Wider Reading

The following articles will help you to enrich your answers even further: This is Cooper’s article in full, a useful read.

Letter from Spitzer about reliability of DSM5: Standards_for_DSM-5_Reliability (1)