ChatGPT can assess risk of bias of medical trials
ChatGPT assesses the risk of bias in trials with agreement rates similar to those reported for human reviewers. LLM-based systems for RoB assessment may help streamline and enhance evidence synthesis.
Background
Assessing the risk of bias (RoB) is a complex and time-intensive task that requires expertise and is prone to human error. Previous automation tools for RoB evaluation have relied on machine learning models trained on relatively small, task-specific datasets. In contrast, large language models (LLMs), such as ChatGPT, are sophisticated systems trained on vast, non-task-specific datasets from the internet. These models exhibit human-like capabilities and could potentially assist in tasks like RoB assessment.
Methods
Following a peer-reviewed protocol, we randomly selected 100 Cochrane reviews. Eligible reviews were new or updated, assessed medical interventions, included at least one eligible trial, and provided human consensus RoB evaluations using either Cochrane RoB1 or RoB2. One trial was randomly chosen from each review. Trials employing individual- or cluster-randomized designs met the inclusion criteria. We extracted human consensus RoB evaluations from the reviews and gathered methodological descriptions from the trials. A subset of 25 review-trial pairs was used to develop a ChatGPT prompt for assessing RoB based on trial methodology. The prompt was then applied to the remaining 75 review-trial pairs to measure the level of agreement between human assessments and ChatGPT for "Overall RoB" (primary outcome) and "RoB due to the randomization process." Additionally, we evaluated the consistency of ChatGPT’s own assessments (intrarater agreement) for "Overall RoB."
Results
The 75 reviews were drawn from 35 different Cochrane review groups, all of which used RoB1. The 75 trials spanned five decades, and all but one were published in English. Agreement between human assessments and ChatGPT for "Overall RoB" was 50.7% (95% CI 39.3%–62.0%), significantly higher than expected by chance (P = 0.0015). For "RoB due to the randomization process," human–ChatGPT agreement was 78.7% (95% CI 69.4%–88.0%; P < 0.001). ChatGPT demonstrated a self-consistency rate of 74.7% (95% CI 64.8%–84.6%; P < 0.001) in its "Overall RoB" assessments.
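The agreement statistics above can be reproduced in outline with standard formulas: a Wald (normal-approximation) confidence interval for the observed proportion, and a one-sided exact binomial test against a chance agreement rate. The sketch below assumes 38/75 agreements (≈50.7%) and a chance rate of 1/3 (three equally likely judgments: low, unclear, high under RoB1); the abstract does not state its exact chance model, so the computed P-value may differ slightly from the reported 0.0015.

```python
import math

def wald_ci(k, n, z=1.96):
    """Normal-approximation (Wald) 95% CI for a proportion k/n."""
    p = k / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

def binom_p_greater(k, n, p0):
    """One-sided exact binomial tail P(X >= k) under chance rate p0."""
    return sum(math.comb(n, i) * p0**i * (1 - p0)**(n - i)
               for i in range(k, n + 1))

# Assumed inputs: 38/75 agreements for "Overall RoB"; chance = 1/3
# if the three RoB1 judgments (low / unclear / high) were guessed
# uniformly at random.
p, lo, hi = wald_ci(38, 75)
p_value = binom_p_greater(38, 75, 1 / 3)
print(f"agreement {p:.1%} (95% CI {lo:.1%}-{hi:.1%}), P = {p_value:.4f}")
```

This reproduces the reported point estimate and interval (50.7%, roughly 39.3%–62.0%); the exact P-value depends on the chance model and test chosen, which the abstract does not specify.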
Conclusions
ChatGPT exhibits some capability in evaluating RoB and does not appear to be making random guesses or fabricating results. The observed agreement for "Overall RoB" surpasses some reported levels of agreement among human reviewers but falls short of the highest reported estimates. LLM-based tools for RoB assessment could potentially contribute to making evidence synthesis more efficient and improving its overall quality.
Authors:
Jose F Meneses-Echavez, Christopher James Rose, Julia Bidonde, Martin Ringsten, Julie Glanville, Thomas Potrebny, Chris Cooper, Ashley Elizabeth Muller, Hans Bugge Bergsund, Rigmor C Berg
Theme:
1. Knowledge strategies: from words to action
Type:
Research
Institution(s):
Division of Health Services, Norwegian Institute of Public Health, Oslo, Norway; Facultad de Cultura Física, Deporte y Recreación, Universidad Santo Tomás, Bogotá, Colombia; Center for Epidemic Interventions Research, Norwegian Institute of Public Health, Oslo, Norway; School of Rehabilitation Science, University of Saskatchewan, Saskatoon, Canada; Cochrane Sweden, Lund University, Skåne University Hospital, Lund, Sweden; Glanville.info, York, United Kingdom; Section Evidence-Based Practice, Western Norway University of Applied Sciences, Bergen, Norway; Bristol Medical School, University of Bristol, Bristol, United Kingdom; UiT The Arctic University of Norway, Tromsø, Norway
Presentation format:
Oral
Presenting author(s):
Jose F. Meneses-Echavez