How to measure medical coding accuracy
A single coding “accuracy %” is ambiguous and easy to game. The rigorous way to measure coding accuracy is with precision and recall. Precision is the share of the codes a system recommends that are correct; it tracks over-coding (audit risk). Recall is the share of the codes that should have been billed that the system actually caught; it tracks missed charges (revenue leakage). Together, often summarized as their harmonic mean (F1), they describe how a coder or coding system really behaves.
Why a single “accuracy %” misleads
Picture every code that could theoretically apply to a visit. For any one encounter, the vast majority of them don't apply — and a system gets “credit” for correctly leaving each of those off. That's a classic class-imbalance problem: an overall accuracy number is dominated by the easy true negatives. A system that barely recommends anything can post an impressive accuracy figure while quietly missing real billable charges. Precision and recall throw out the easy true negatives and measure the codes that actually matter.
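To make the imbalance concrete, here is a minimal sketch in Python. The universe size of 10,000 candidate codes and the four example codes are illustrative assumptions, not figures from any real encounter; the point is only that a system recommending nothing still scores near-perfect accuracy.

```python
# Illustrative assumption: 10,000 candidate codes, 4 of which truly apply.
UNIVERSE_SIZE = 10_000
truth = {"99213", "90471", "90686", "96127"}  # codes the chart supports (example values)
recommended = set()                           # a system that recommends nothing

tp = len(truth & recommended)         # correct recommendations
fp = len(recommended - truth)         # unsupported recommendations
fn = len(truth - recommended)         # supported codes that were missed
tn = UNIVERSE_SIZE - tp - fp - fn     # everything correctly left off the claim

accuracy = (tp + tn) / UNIVERSE_SIZE
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy = {accuracy:.2%}")   # dominated by the true negatives
print(f"recall   = {recall:.2%}")     # every billable code was missed
```

Under these assumptions accuracy comes out at 99.96% while recall is 0%, which is exactly the gap the headline number hides.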
Precision: are the recommended codes correct?
Precision = true positives ÷ (true positives + false positives) — of the codes the system recommended, how many were right. Low precision means the system is over-coding: suggesting codes the chart doesn't support. In a coding context, that's the failure mode that drives denials, audit exposure, and clawback risk.
Recall: were the right codes caught?
Recall = true positives ÷ (true positives + false negatives) — of the codes that should have been billed, how many the system actually found. Low recall means the system is missing charges: leaving billable, documented work off the claim. That's the failure mode that drives revenue leakage — the silent kind that never generates a denial.
Suppose a visit truly supports 4 billable codes, and a system recommends 5 — of which 3 are correct and 2 are unsupported.
Precision = 3 ÷ 5 = 60% (3 of the 5 recommendations were right).
Recall = 3 ÷ 4 = 75% (it caught 3 of the 4 codes it should have).
F1 = 2 × (0.60 × 0.75) ÷ (0.60 + 0.75) = 0.90 ÷ 1.35 ≈ 67% (the harmonic mean that balances the two).
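The same arithmetic can be sketched in a few lines of Python. The function name `precision_recall_f1` and the placeholder code labels are hypothetical, chosen only to mirror the example above (4 supported codes, 5 recommended, 3 of them correct).

```python
def precision_recall_f1(truth: set[str], recommended: set[str]) -> tuple[float, float, float]:
    """Compare recommended codes against the codes the chart supports."""
    tp = len(truth & recommended)   # recommended and supported
    fp = len(recommended - truth)   # recommended but unsupported (over-coding)
    fn = len(truth - recommended)   # supported but missed (under-coding)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Placeholder labels, not real codes: A-D are supported, X and Y are not.
truth = {"A", "B", "C", "D"}
recommended = {"A", "B", "C", "X", "Y"}

p, r, f1 = precision_recall_f1(truth, recommended)
print(f"precision={p:.0%} recall={r:.0%} f1={f1:.0%}")
```

Treating each encounter as two sets keeps the true negatives out of the calculation entirely, which is the property the metrics rely on.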
The trade-off, and why F1 exists
Precision and recall pull against each other. Recommend more aggressively and you catch more real codes (recall up) but also fire more wrong ones (precision down); recommend conservatively and the reverse happens. F1, the harmonic mean of the two, is the single number that rewards being good at both. Many programs also set an explicit floor — for example, a recall target so a minimum share of billable codes is always caught — alongside a precision target.
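A rough sketch of that trade-off, assuming the system attaches a confidence score to each recommendation (the scores, labels, and the 0.75 recall floor below are all made up for illustration). Lowering the threshold recommends more codes; the last step picks the most precise threshold that still meets the recall floor.

```python
# Hypothetical scored recommendations: (code label, confidence, truly supported?)
scored = [
    ("A", 0.95, True),
    ("B", 0.90, True),
    ("X", 0.80, False),
    ("C", 0.70, True),
    ("Y", 0.50, False),
    ("D", 0.30, True),
]
total_supported = sum(ok for _, _, ok in scored)

def metrics_at(threshold: float) -> tuple[float, float, float]:
    """Precision, recall, and F1 when only codes scoring >= threshold are recommended."""
    kept = [ok for _, score, ok in scored if score >= threshold]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 0.0  # convention: 0 when nothing is recommended
    recall = tp / total_supported
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

thresholds = (0.9, 0.6, 0.2)
for t in thresholds:
    p, r, f1 = metrics_at(t)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Assumed policy: meet a recall floor first, then maximize precision.
RECALL_FLOOR = 0.75
eligible = [t for t in thresholds if metrics_at(t)[1] >= RECALL_FLOOR]
best = max(eligible, key=lambda t: metrics_at(t)[0])
print(f"chosen threshold: {best}")
```

With this toy data the conservative threshold is perfectly precise but catches only half the codes, and the aggressive one catches everything at lower precision; the floor-then-precision rule lands on the middle setting.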
Scope-aware measurement: the part most vendors skip
A precision/recall number is only meaningful if you're clear about which codes it covers. Scope-aware measurement evaluates a coding system only on the codes it's actually responsible for, not on every code on the claim. Without that discipline, out-of-scope codes the system was never meant to handle quietly dilute the metrics, and a headline figure such as 95% can mean almost anything. With it, the numbers mean what they say.
The ground truth also matters. Measuring against what coders actually billed — real, adjudicated claims — is a far stronger test than measuring against a synthetic answer key.
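One way to picture scope-aware scoring is to intersect both the billed codes and the recommendations with an explicit scope set before computing anything. This is a sketch under assumptions: the `IN_SCOPE` set, the billed codes, and the recommendations are illustrative values, not a description of any vendor's actual scope.

```python
def precision_recall(truth: set[str], recommended: set[str]) -> tuple[float, float]:
    tp = len(truth & recommended)
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical scope: the only codes this system is responsible for.
IN_SCOPE = {"90471", "90472", "96127"}

# Hypothetical encounter: what coders billed vs. what the system recommended.
billed = {"99213", "90471", "90472", "96127", "36415"}
recommended = {"90471", "96127"}

# Naive: scored against every billed code, including ones it never handles.
naive_p, naive_r = precision_recall(billed, recommended)

# Scope-aware: both sides restricted to in-scope codes first.
scoped_p, scoped_r = precision_recall(billed & IN_SCOPE, recommended & IN_SCOPE)

print(f"naive : precision={naive_p:.2f} recall={naive_r:.2f}")
print(f"scoped: precision={scoped_p:.2f} recall={scoped_r:.2f}")
```

In this example the naive recall is 0.40 because two out-of-scope codes count as misses, while the scoped recall of about 0.67 reflects the one in-scope code the system really did miss.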
How the two metrics map to the two business risks
| Metric | What low values mean | Business risk |
|---|---|---|
| Precision | Over-coding — recommending unsupported codes | Audit & clawback exposure |
| Recall | Under-coding — missing billable codes | Revenue leakage (missed charges) |
This is exactly why a single accuracy number is dangerous for a coding buyer: it can hide a precision problem (audit risk) or a recall problem (lost revenue), and you can't tell which.
How Capsa measures accuracy
Capsa reports scope-aware precision and recall, measured against what your coders actually billed, with per-CPT, per-template, and per-rule breakdowns — so you can see not just a headline number but exactly where agreement and disagreement live. Capsa's 93–96% figures are internal, validated results on cases teams already coded, across its two live skills (vaccine administration and health screening). They are not an external certification, and they describe specific measured cohorts rather than a guarantee for every environment.
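As a generic illustration of what a per-code breakdown involves (this is not Capsa's implementation, and the encounters below are invented), true positives, false positives, and false negatives can be tallied per code across encounters and then turned into per-code precision and recall.

```python
from collections import Counter

# Hypothetical encounters: (codes the coders billed, codes the system recommended)
encounters = [
    ({"90471", "96127"}, {"90471", "96127"}),
    ({"90471", "90472"}, {"90471"}),
    ({"96127"}, {"96127", "90471"}),
]

tp, fp, fn = Counter(), Counter(), Counter()
for billed, recommended in encounters:
    tp.update(billed & recommended)   # agreed with the coders
    fp.update(recommended - billed)   # recommended but not billed
    fn.update(billed - recommended)   # billed but not recommended

per_code = {}
for code in sorted(set(tp) | set(fp) | set(fn)):
    p_den = tp[code] + fp[code]
    r_den = tp[code] + fn[code]
    # None marks a metric that is undefined for this code (zero denominator).
    precision = tp[code] / p_den if p_den else None
    recall = tp[code] / r_den if r_den else None
    per_code[code] = (precision, recall)

for code, (p, r) in per_code.items():
    print(code, "precision:", p, "recall:", r)
```

A table like this shows where disagreement concentrates: here one code is over-recommended once, and another is never recommended at all, which a single blended figure would average away.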
Frequently asked questions
Why is a single coding accuracy percentage misleading?
What's the difference between precision and recall in coding?
What is scope-aware measurement?
How does Capsa measure coding accuracy?
Sources
- Precision and recall — standard definitions (true/false positives and negatives, F-measure). en.wikipedia.org
- Peer-reviewed medical-coding research routinely reports accuracy as precision, recall, and F1 — see, e.g., systematic reviews of automatic ICD coding with large language models. medRxiv