PCL-R Demonstrates Inadequate Field Reliability and Validity

The PCL-R demonstrates low rater agreement and inadequate field reliability and validity for predictions of recidivism in prison and hospital settings. This is the bottom line of a recently published article in Law and Human Behavior. Below is a summary of the research and findings as well as a translation of this research into practice.

Featured Article | Law and Human Behavior | 2017, Vol. 41, No. 1, 29-43

PCL-R Field Validity in Prison and Hospital Settings



Inge Jeandarme, Knowledge Center Forensic Psychiatric Care
John Edens, Texas A&M University
Petra Habets, Knowledge Center Forensic Psychiatric Care
Liesbeth Bruckers, University of Hasselt
Karel Oei, Tilburg University
Stefan Bogaerts, Tilburg University and The Kijvelanden


Recent field studies have questioned the interrater reliability (IRR) and predictive validity regarding (violent) recidivism of the Psychopathy Checklist-Revised (PCL-R). Using a forensic psychiatric sample, the current study investigated discrepancies in scoring between hospital and prison settings, as well as differences in predictive validity across these two settings. PCL-R information was collected from prison and hospital files, resulting in 224 PCL-R total scores and 74 double scores. When examining repeated measurements, large individual differences were found together with an intraclass correlation coefficient (ICC(A,1)) of .42 for the total score. Discrepant results were found for Factor 2, with repeated scores within the same setting having an ICC(A,1) of .28 versus an ICC(A,1) of .57 for repeated scores between settings. However, areas under the curve (AUCs) from receiver operating characteristic (ROC) analyses for total, factor and facet scores did not differ between settings. For the whole sample, Factor 2 scores marginally predicted violent and general recidivism after 2 years (AUC .62 and .63), whereas Factor 1 did not predict (violent) recidivism. Consistent with recent studies from other countries, these results suggest inadequate field reliability and validity in prison and hospital settings in Flanders (Belgium).


Keywords: forensic psychiatric patient, PCL-R, field validity, interrater reliability, psychopathy

Summary of the Research

“The Psychopathy Checklist Revised (PCL-R; Hare, 2003) is an extensively used and researched instrument for diagnosing psychopathy. Early factor analyses suggested that the PCL-R consisted of two factors: Factor 1 representing the interpersonal and affective component and Factor 2 capturing the socially deviant and behavioral aspects (Hare, Clark, Grann, & Thornton, 2000). Later, Hare (2003) argued for the existence of a superordinate factor of psychopathy, underpinned by two factors (interpersonal/affective and social deviance) and four facets (interpersonal, affective, lifestyle, and antisocial; cf. Cooke, Michie, & Hart, 2006). The PCL-R is also frequently introduced in the legal arena to inform violence risk assessment (DeMatteo et al., 2014b), either in isolation or included as an important component within risk assessment instruments, such as the Violence Risk Appraisal Guide (VRAG; Quinsey, Harris, Rice, & Cormier, 2006), and the Historical Clinical Risk Management–20 (HCR-20; Webster, Douglas, Eaves, & Hart, 1997)” (p. 29).

“Although scoring the PCL-R requires at least some subjective judgment, strong interrater reliability (IRR) with good to excellent intraclass correlation coefficients (ICCs) for the total score (.86 to .94), Factor 1 (.69 to .95) and Factor 2 (.74 to .94) have been reported in early validation studies as well as in independent controlled research (e.g., Cooke, Hart, & Michie, 2004; Gacono & Hutton, 1994; Hare, 2003; Ismail & Looman, 2016; Kroner & Mills, 2001; Laurell & Daderman, 2007; Porter, Woodworth, Earle, Drugge, & Boer, 2003)… However, there is growing evidence (e.g., DeMatteo et al., 2014a; Edens, Cox, Smith, DeMatteo, & Sorman, 2015; Lloyd, Clark, & Forth, 2010; Murrie, Boccaccini, Johnson, & Janke, 2008; Murrie et al., 2009) that PCL-R scoring is affected by the evaluation context, with adversarial settings such as contested criminal or civil commitment cases producing scores that diverge much more so than would be expected based on the ICC statistics reported in the professional manual (Hare, 2003)” (p. 30).
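For readers who want to see what sits behind an ICC figure, the sketch below computes an absolute-agreement, single-rating ICC (often written ICC(A,1)) from a table of repeated scores, using the standard two-way ANOVA mean squares. It is only an illustration: the function name `icc_a1` and the scores in the example are invented for this post and are not the study's data, and the code assumes a complete table in which every person has the same number of ratings, which may not match how the authors handled their repeated measurements.

```python
import numpy as np


def icc_a1(scores):
    """Absolute-agreement, single-rating ICC for an n x k table.

    Rows are the people being rated, columns are the repeated ratings.
    Assumes a complete table (no missing scores) with n >= 2 and k >= 2.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()

    # Sums of squares for people (rows), ratings (columns), and residual.
    ss_rows = k * ((x.mean(axis=1) - grand_mean) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand_mean) ** 2).sum()
    ss_total = ((x - grand_mean) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols

    # Mean squares from the two-way layout.
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Systematic differences between ratings (ms_cols) count against
    # agreement, which is what makes this an "absolute agreement" ICC.
    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    return (ms_rows - ms_error) / denominator


# Hypothetical PCL-R total scores for five people, each scored twice.
# These numbers are made up purely to show how the function is called.
example_scores = [
    [12, 18],
    [25, 21],
    [30, 27],
    [8, 15],
    [19, 24],
]
print(round(icc_a1(example_scores), 2))
```

Because this version penalizes systematic differences between ratings, a rater or setting that scores everyone a few points higher lowers the coefficient even when the rank order of the people is identical.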

“Given the significant role that individual raters seem to play in the PCL-R scores they provide, it is not surprising that reliability across examiners is not particularly high even if those examiners are retained by the same side in a given case” (p. 31). “Although field reliability studies comparing examiners retained by the same side of a case have focused primarily on the PCL-R total score, some of this research has been able to investigate this topic at the factor, facet, and even item level… suggesting that Factor 1 and its two facets are typically significantly less reliable than Factor 2 and its two facets” (p. 31).

“In summary, the small but accumulating body of literature suggests considerably attenuated reliability and predictive validity when the PCL-R is used in applied forensic settings. Although the first field studies were based on relatively small samples from one jurisdiction (Texas) involving a specific population (sex offenders), other studies have provided further evidence of lower reliability based on larger samples in other U.S. jurisdictions (DeMatteo et al., 2014b; Levenson, 2004; C. S. Miller et al., 2012; Neal et al., 2015) as well as in Canadian and European samples (Edens et al., 2015; Sturup et al., 2014)” (p. 32).

“The current study extends to the existing research concerning the field reliability and validity of the PCL-R by examining this topic in a relatively large sample of Belgian offenders found ‘not guilty by reason of insanity (NGRI)’ (in Belgium referred to as ‘internees’) who were classified within a medium security risk level. PCL-R assessments were conducted while the patients resided in prison and/or in hospital. For a large minority of the sample, multiple scores were available for analysis… Data regarding level of education, psychiatric diagnosis, criminal history, hospitalization/imprisonment periods, risk assessment scores, and IQ scores were gathered by accessing both CPS files and psychiatric hospital records. Diagnoses were based on the Diagnostic and Statistical Manual of Mental Disorder-IV text revision (DSM–IV–TR; American Psychiatric Association, 2000)” (p. 34).


“The general conclusion of the current study in this sample of forensic psychiatric patients was that the PCL-R in real world settings conducted by real world raters in Belgium is fairly unreliable although there was some evidence of modest to moderate predictive validity for Factor 2 scores” (p. 37).

“The average total PCL-R score found in all patients with a PCL-R score was similar to the mean score for forensic psychiatric patients reported by Hare (2003), and there was not a significant difference found in patients scored within a prison versus a hospital setting. However, when comparing repeated measures for the same offender across settings, mean prison scores were lower than mean hospital scores, suggesting contextual pressure as expected” (p. 37).

Rater agreement was also poor. This could be due to the large sample of non-sexual offenders, an increased number of complex cases, the differences between hospital raters and prison raters, or the differences between criminologists’ ratings and psychologists’ ratings on the PCL-R.

“Overall, the predictive validity was poor, especially for total PCL-R score and Factor 1, which did not predict general or violent recidivism. Factor 2 scores significantly predicted general recidivism for all groups, whereas Factor 2 scores predicted violence only for the combined population (prison and hospital scores). On the facet level, surprisingly, Facet 3 scores were the only significant predictors of general (all and hospital scores) and violent recidivism (all scores)… Although some AUCs reached statistical significance, the level ranged from small to moderate effect sizes, with moderate effect sizes in Factor 2 prison and hospital scores for predicting general recidivism and small effect sizes for all scores (Factor 2 and Facet 3) for predicting general and violent recidivism and for Facet 3 hospital scores for general recidivism” (p. 39).
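To put the AUC values in context: an AUC can be read as the probability that a randomly chosen person who reoffended received a higher score than a randomly chosen person who did not, so .50 is chance level and 1.0 is perfect separation. The short sketch below computes an AUC that way, by comparing every recidivist with every non-recidivist and counting ties as half. The function name `auc_from_pairs` and the example scores are invented for illustration; they are not the study's data, and this is not necessarily the software procedure the authors used.

```python
def auc_from_pairs(scores, reoffended):
    """AUC as the share of (recidivist, non-recidivist) pairs in which
    the recidivist has the higher score; tied pairs count as half.

    `scores` holds one score per person, `reoffended` holds 1 or 0.
    """
    positives = [s for s, y in zip(scores, reoffended) if y == 1]
    negatives = [s for s, y in zip(scores, reoffended) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one person in each outcome group")

    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical Factor 2 scores and 2-year outcomes for eight people.
# The values are made up only to show how the function is called.
example_scores = [4, 9, 12, 6, 15, 11, 7, 13]
example_outcomes = [0, 0, 1, 0, 1, 0, 1, 0]
print(auc_from_pairs(example_scores, example_outcomes))
```

Read this way, an AUC of .62 means that in roughly 62 of every 100 such pairs the person who reoffended had the higher score, which is why values in this range are described as small effects.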

Translating Research into Practice

“Field validity studies such as the current one are important for researchers to consider when developing and refining new instruments and for clinicians to be aware of when conducting assessments in practice… When discussing scores, raters and judges should be aware of the fact that potential biases of the rater could have an important impact” (p. 40).

In order to improve rater agreement, the authors suggest either not using the PCL-R if the rater does not frequently use the instrument, or having multiple raters score the PCL-R. Further, “Forensic examiners should provide a comprehensive report of their PCL-R findings, including a discussion on the cut-off used and the profile of the facet scores. Depending on the assessment context, they might also consider not reporting Factor 1 scores at all unless there is some compelling reason for their inclusion” (p. 41).

Overall, the results of the current study suggest “that the high levels of reliability reported in many controlled research studies are not generalizable to practice settings” (p. 37).

Other Interesting Tidbits for Researchers and Clinicians

For future research, the authors suggest examining whether the PCL-R would be more useful for violence risk assessment if raters scored the offender without conducting an interview, which might reduce the potential for individual rater bias.

Authored by Sara Hartigan

Sara Hartigan is a second-year Forensic Psychology Master’s student at John Jay who hopes to obtain a Ph.D. in Clinical Forensic Psychology in the future. Her main areas of interest include clinical evaluations and developing treatment interventions within the forensic population.