AI and Impairment Ratings: Evaluating Five LLMs

Key Findings

- LLMs currently require expert guidance for impairment rating calculations
- LLMs make procedural mistakes but demonstrate accuracy in arithmetic calculations
- Wide variation exists across different LLMs
- Later versions may perform worse than predecessors
- The most common errors involve rule application

Key takeaways: Using an example from the AMA Guides 5th edition, we tested five AI large language models (LLMs) with simple prompts. None correctly calculated the impairment rating. All used incorrect calculation methods and exhibited gross rule application errors. Final whole person impairment (WPI) results deviated from the stated example by 9% to 45% in relative terms. Notably, no arithmetic errors were observed, despite prior research documenting such issues. In one case, an updated model version performed worse than its predecessor. We conclude that expert domain knowledge is essential when using LLMs for impairment ratings to maintain defensible quality standards for clients.

Introduction

Artificial Intelligence (AI), in the form of large language models (LLMs), has garnered significant attention for its ability to respond intelligently to written prompts. Advanced multimodal models can process images, large bodies of text, audio, and spoken language. We describe a simple prompt test with five LLMs to calculate impairment ratings from The AMA Guides to the Evaluation of Permanent Impairment, Fifth edition, 2001, Example 17-7.

This investigation originates from the ImpairMaster AI Lab. Our intent is to reveal the capabilities of ChatGPT, Claude, and Gemini LLMs in the medicolegal domain and to understand performance variation across systems. This is not an exhaustive review; rather, we sought to assess LLM capabilities and limitations. We share these preliminary results to spur discussion in the community. We desire to evaluate additional models, techniques, and examples in future work.

We selected a well-known example with explicatory text sufficient for instructional purposes. Many examples in the AMA Guides 5th edition require arithmetic calculations with conditional paths and extensive cross-references to other sections. The ImpairMaster software product is designed for these calculations, so we included results from our latest Guides 5 version (released October 2024) for comparison. Future work may include examples from the AMA Guides 6th edition (both 2008 and 2024 versions).

In reviewing LLM outputs, we critiqued each narrative report. While an impairment report culminates in a numerical whole person impairment (WPI) percentage, the accompanying narrative provides critical supporting documentation. A well-executed narrative offers detailed justification for each calculation step, with specific references to tables and methodologies in the AMA Guides 5th edition. Proper narratives also display mathematical calculations transparently. To enhance readability, we developed a report annotation design that visually highlights each error. Individual reports and critiques are linked in the Results table below.

The AI provider market evolves rapidly, with well-funded competitors continuously optimizing various parameters including economics, quality, technical performance, scope, regulatory compliance, safety, and legal considerations. Our results represent point-in-time snapshots and are not definitive, particularly given the probabilistic nature of these algorithms. We conclude that while these tools demonstrate power, they require guidance and review by experts trained in impairment rating methodologies.

Motivation and Literature Review

As part of our product development efforts, the ImpairMaster AI Lab evaluates novel technologies that could deliver new capabilities to users. We sought to answer: How effective are LLMs for impairment rating workflows? In medical, legal, software development, and creative domains, LLMs receive favorable reviews. The paper "Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial" (Goh, E., Gallo, R.J., Strong, E., et al., 2024) demonstrates LLM utility as a medical diagnostic aid, concluding: "Despite the impressive performance of these emerging technologies in benchmarking tasks, current integrations of LLMs require human participation, with the LLM augmenting, rather than replacing, human expertise and oversight."

Initially, mathematical capabilities of these AI models warranted particular attention. Earlier 2023-vintage models demonstrated uneven mathematical ability. "The fragility of LLMs in mathematical reasoning is evident across three dimensions" (Large Language Models for Mathematical Reasoning: Progresses and Challenges, Ahn et al., 2024). Advances have continued with "Large Reasoning Models" (LRM) employing Chain of Thought and agent-based techniques that break down prompts into multiple synthesized stages. These newer models continue under research scrutiny: "these [LRM] models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds" (The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, Shojaee, Mirzadeh, et al., 2025).

Method

We selected Example 17-7 from the AMA Guides 5th Edition, "Impairment Due to Decreased Range of Motion from a Tibia Fracture," as our test case. We employed a single primary prompt with a secondary prompt when needed to direct the LLM toward independent analysis. One LLM (Claude) had been configured with tool use capability, and another (OpenAI GPT-5) recognized the example and presented only the Guide's conclusion. Both required a secondary prompt to elicit original analysis.

Primary prompt:

I'm working with a friend on the Independent Medical Exam market. This is a sample patient narrative. Can you help me create an impairment rating according to the AMA Guides 5th edition?
Subject: 45-year-old woman.
History: Sustained a tibia fracture in a motor vehicle accident.
Current Symptoms: No pain; stiffness about the ankle and foot, with some swelling of the foot and ankle toward evening. Cannot stand for long periods and cannot use shoes with elevated heels.
Physical Exam: Ankle flexion is 6°; ankle extension is 5°. Toe extension is less than 10° for all toes. 1-cm atrophy of the left calf. It is difficult to determine the strength of the ankle and toe extensors, but a mild weakness is noted.
Clinical Studies: X-rays: healed tibia fracture with no malalignment.
Diagnosis: Healed tibia fracture.

LLMs tested:

- OpenAI ChatGPT o3
- OpenAI ChatGPT GPT-5
- Claude Sonnet 4
- Claude Sonnet 4.5
- Gemini 2.5 Pro

For comparison, we also processed the example through ImpairMaster software.

Results

An impairment rating yields both a final whole person impairment (WPI) percentage and a narrative report describing the calculation methodology. The narrative report holds equal importance to the final numerical result. The WPI calculation must be defensible, with the narrative providing justification for each calculation step. A comprehensive narrative cites specific tables and methods from the AMA Guides 5th edition and details mathematical calculations as needed.

The WPI calculated by each LLM appears in the table below. The first row presents the authoritative AMA Guides 5 text answer; the final row shows the ImpairMaster calculation. The Error % column indicates the percentage deviation from the established Guides 5 text. The Quick Impression column provides summary analysis. The final column links to each model's detailed narrative with error annotations.

Each detailed narrative report displays in two columns. The right column contains the user prompt and LLM response. The left column presents annotations summarizing LLM errors, using the Guides example text as the correct reference. Extended annotations continue in footnotes below. Note: Reports are optimized for larger screens; on mobile devices, annotations appear as a block after the report text.

| Model and Date Tested | WPI | Error % from Guides 5 | Quick Impression | Detailed Narrative Link |
|---|---|---|---|---|
| Guides 5 text (authoritative) | 11% | - | - | AMA reference text* |
| ChatGPT o3 (5/31/25) | 8% | -27% | Not suitable | ChatGPT o3 dialog |
| ChatGPT GPT-5 (8/15/25) | 12% | 9% | Marginally suitable | GPT-5 dialog |
| Claude Sonnet 4 (5/31/25) | 15% | 36% | Not suitable | Sonnet 4 dialog |
| Claude Sonnet 4.5 (10/01/25) | 16% | 45% | Not suitable | Sonnet 4.5 dialog |
| Gemini 2.5 Pro (5/31/25) | 8% | -27% | Not suitable | Gemini 2.5 dialog |
| ImpairMaster (8/14/25) | 11% | 0% | - | ImpairMaster report |

*subscription required
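The Error % column is the signed relative deviation of each model's WPI from the Guides' 11% answer, rounded to the nearest whole percent. A minimal Python sketch reproducing the table values (model names and WPI figures are taken from the table above):

```python
def relative_error_pct(model_wpi: float, reference_wpi: float) -> int:
    """Signed percentage deviation of a model's WPI from the reference WPI."""
    return round((model_wpi - reference_wpi) / reference_wpi * 100)

# Reference answer: 11% WPI (AMA Guides 5th edition, Example 17-7)
results = {
    "ChatGPT o3": 8,        # -> -27
    "ChatGPT GPT-5": 12,    # ->   9
    "Claude Sonnet 4": 15,  # ->  36
    "Claude Sonnet 4.5": 16,  # -> 45
    "Gemini 2.5 Pro": 8,    # -> -27
}
for model, wpi in results.items():
    print(model, relative_error_pct(wpi, 11))
```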

Discussion

The calculated WPI varied across AI models, with none correctly determining the final WPI. Examining detailed narratives reveals that LLMs inconsistently and inaccurately applied impairment rating rules to this case.

ChatGPT GPT-5 achieved the closest approximation to the correct WPI, with a relative error of 9%. However, it made several calculation and rule application errors. Worse, its logic for combining results was inconsistent between the two prompts: it initially combined atrophy and range of motion (ROM), which is incorrect, then correctly disallowed that combination after the second prompt. ChatGPT o3 made similar errors and also completely omitted the toe analysis.

Claude Sonnet 4.5 had the largest relative error, 45%. It too required two prompts to elicit an impairment calculation. It made many reference errors, identifying the wrong table, section, or page. Sonnet 4 was also prone to this type of error, and additionally mistook a diagnosis-based estimate (DBE) table for a ROM analysis. Both Sonnet 4.5 and Sonnet 4 made calculation and table use errors.

Gemini 2.5 Pro exhibited a -27% relative error. It made table use errors, among others. Uniquely among the models tested, it showed its mathematical calculation for the Combined Values Chart (CVC) rather than using the lookup table.
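For readers unfamiliar with the Combined Values Chart, the chart tabulates the formula C = A + B(1 - A), with impairments expressed as decimal fractions. A minimal sketch of that calculation (our illustration, not any model's output):

```python
def combine(a_pct: float, b_pct: float) -> int:
    """Combine two impairment percentages per the Combined Values Chart
    formula C = A + B(1 - A), where A and B are decimal fractions and
    the result is rounded to the nearest whole percent. The larger
    value is conventionally taken first; the result is order-independent."""
    a, b = sorted((a_pct / 100.0, b_pct / 100.0), reverse=True)
    return round((a + b * (1 - a)) * 100)

# Combining 20% and 10%: 0.20 + 0.10 * (1 - 0.20) = 0.28 -> 28%
print(combine(20, 10))  # 28
```

Note that rounding at exact .5 boundaries may differ between Python's `round` (half-to-even) and the printed chart, one more reason the published table remains the authoritative, defensible reference.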

Two models received updates during our study period, enabling performance comparison over time. Claude Sonnet's performance deteriorated from version 4 to 4.5, while OpenAI's ChatGPT improved from o3 to GPT-5. Note that performance on one or two simple prompts should not be considered indicative of any AI model's overall value.

The table below summarizes error types exhibited by each LLM:

Error Type Definitions:

- Omissions: Omitting essential narrative components
- CVC vs Sum: Using incorrect combination method
- Table or Rule Selection: Selecting wrong table or rule
- Table Use: Correct table selected but applied incorrectly
- Reference: Incorrect table number, section number, and/or page number
- Wrong Body Region: Incorrect use or combination of body regions
- Combination Method: Inappropriate combination of evaluation methods per Table 17-2

| Model | WPI % Error | Omissions | CVC vs Sum | Table/Rule Selection | Table Use | Reference | Wrong Body Region | Combination Method |
|---|---|---|---|---|---|---|---|---|
| ChatGPT o3 | -27% | Omitted toe | Used CVC when should sum | | Incorrect ankle dorsiflexion and plantarflexion | | Foot and toes | Weakness, atrophy, ROM cannot be combined |
| ChatGPT GPT-5 | 9% | | Summed when CVC should be used | | Incorrect use of CVC table or incorrect CVC calculation | | Foot and toes | Atrophy and ROM cannot be combined (1st prompt only) |
| Claude Sonnet 4 | 36% | | | Selected DBE table for ROM analysis | Incorrect toes; atrophy | Multiple incorrect references | Foot and toes | Weakness, atrophy, ROM cannot be combined |
| Claude Sonnet 4.5 | 45% | | | | Incorrect ankle dorsiflexion and plantarflexion; toes; atrophy | Multiple incorrect references | Foot and toes | Weakness, atrophy, ROM cannot be combined |
| Gemini 2.5 Pro | -27% | | | | Incorrect ankle dorsiflexion and plantarflexion; toes | | Foot and toes | Weakness, atrophy, ROM cannot be combined |

All of the models are error-prone. Some deviate significantly from the published example results, and their analytic reasoning can be flawed. They should be used with caution.

Future Directions

Impairment rating workflows are evolving with the emergence of broadly capable AI and LLMs. As we are at the beginning of this transformation, numerous research directions merit investigation. Future work could apply existing computational methods such as voting and verification systems. Emerging techniques, including chain of thought and small language models, warrant continued development. Our analysis employed minimal, simple prompts to elicit LLM responses. More robust implementations could employ detailed system prompts, multi-turn user dialogues, and retrieval-augmented generation (RAG) techniques to improve LLM output quality.

Additional research areas include adjusting underlying model parameters such as temperature, which affects response creativity. LLMs can also be assembled into agent systems that approach impairment ratings in stages. We anticipate continued improvement in core LLM capabilities; for example, LLMs initially performed poorly at chess but have improved substantially over time (Complete Chess Games Enable LLM Become A Chess Master, Zhang, Y., Han, X., et al., 2025, arXiv:2501.17186v2). Economic analysis comparing time saved in creation versus time required for validation merits study. Further investigation of model version evolution, and evaluation of broader model sets including self-hosted options, represents rich research territory. Survey work across additional examples and different AMA Guides editions would be a valuable next step, potentially culminating in an industry benchmark.

From a product development perspective, two engineering avenues warrant exploration: (1) integrating LLMs with tool-use interfaces to mitigate calculation errors, and (2) embedding LLM capabilities into existing user interfaces.

Conclusion

LLMs are powerful tools but require careful use and verification when applied through simple chat interfaces common to LLM interactions. In our study, LLMs did not correctly apply AMA Guides methodologies and demonstrated inconsistency both across different systems and across version generations. We recommend that LLMs alone are insufficient for creating impairment ratings. User domain expertise remains essential to validate defensible, accurate impairment ratings and narratives.

Claude was used to edit this article for grammar, clarity, and tone. The original draft was created without AI. Thanks to Gerry Kaplan for edits.

Written by Ben Wen, CMO