How to measure medical coding accuracy

Q: Why is a single coding accuracy percentage misleading?

Most of the codes that could apply to any given visit don't apply, so an overall accuracy number is dominated by the many codes a system correctly leaves off. A system that rarely recommends anything can post a high accuracy figure while still missing real billable charges. Precision and recall expose that behavior; a single accuracy percentage hides it.

By the Capsa Coding team · Last updated June 3, 2026 · Standard precision/recall definitions; see sources

Quick answer

A single coding “accuracy %” is ambiguous and easy to game. The more rigorous way to measure coding accuracy is with precision and recall. Precision is the share of the codes a system recommends that are correct; low precision signals over-coding (audit risk). Recall is the share of the codes that should have been billed that the system actually caught; low recall signals missed charges (revenue leakage). Together, often summarized as their harmonic mean (F1), they describe how a coder or coding system really behaves.

Why a single “accuracy %” misleads

Picture every code that could theoretically apply to a visit. For any one encounter, the vast majority of them don't apply — and a system gets “credit” for correctly leaving each of those off. That's a classic class-imbalance problem: an overall accuracy number is dominated by the easy true negatives. A system that barely recommends anything can post an impressive accuracy figure while quietly missing real billable charges. Precision and recall throw out the easy true negatives and measure the codes that actually matter.
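To make the imbalance concrete, here is a minimal Python sketch. Every number in it is a hypothetical chosen for illustration (a universe of 10,000 candidate codes, 4 truly billable codes, and a system that recommends just one of them), not a measured result.

```python
# Hypothetical numbers, chosen only to illustrate class imbalance.
CANDIDATE_CODES = 10_000   # assumed size of the code universe for one visit
truly_billable = 4         # codes the chart actually supports
recommended_correct = 1    # the system recommends a single, correct code
recommended_wrong = 0      # and nothing unsupported

tp = recommended_correct            # true positives
fp = recommended_wrong              # false positives
fn = truly_billable - tp            # billable codes the system missed
tn = CANDIDATE_CODES - tp - fp - fn # codes correctly left off

accuracy = (tp + tn) / CANDIDATE_CODES
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2%}")  # accuracy = 99.97%
print(f"recall   = {recall:.2%}")    # recall   = 25.00%
```

The accuracy figure looks excellent even though three of the four billable codes were missed, because almost all of the "correct" decisions are true negatives.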

Precision: are the recommended codes correct?

Precision = true positives ÷ (true positives + false positives) — of the codes the system recommended, how many were right. Low precision means the system is over-coding: suggesting codes the chart doesn't support. In a coding context, that's the failure mode that drives denials, audit exposure, and clawback risk.

Recall: were the right codes caught?

Recall = true positives ÷ (true positives + false negatives) — of the codes that should have been billed, how many the system actually found. Low recall means the system is missing charges: leaving billable, documented work off the claim. That's the failure mode that drives revenue leakage — the silent kind that never generates a denial.

Worked example

Suppose a visit truly supports 4 billable codes, and a system recommends 5 — of which 3 are correct and 2 are unsupported.

Precision = 3 ÷ 5 = 60% (3 of the 5 recommendations were right).
Recall = 3 ÷ 4 = 75% (it caught 3 of the 4 codes it should have).
F1 ≈ 67% (the harmonic mean that balances the two).
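The same arithmetic can be reproduced in a few lines of Python. This is a sketch, not a reference implementation: the function name `precision_recall_f1` and the code labels ("A", "B", "X", and so on) are placeholders invented for the example, and returning 0.0 for an empty set is one possible convention among several.

```python
from __future__ import annotations


def precision_recall_f1(recommended: set[str], supported: set[str]) -> tuple[float, float, float]:
    """Compare recommended codes with the codes the chart supports."""
    tp = len(recommended & supported)  # codes that were both recommended and supported
    # Assumed convention: a metric with an empty denominator is reported as 0.0.
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(supported) if supported else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder labels standing in for real codes.
supported = {"A", "B", "C", "D"}         # the 4 codes the visit truly supports
recommended = {"A", "B", "C", "X", "Y"}  # 5 recommendations, 2 of them unsupported

p, r, f1 = precision_recall_f1(recommended, supported)
print(f"precision = {p:.0%}, recall = {r:.0%}, F1 = {f1:.0%}")
# precision = 60%, recall = 75%, F1 = 67%
```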

The trade-off, and why F1 exists

Precision and recall pull against each other. Recommend more aggressively and you catch more real codes (recall up) but also fire more wrong ones (precision down); recommend conservatively and the reverse happens. F1 = 2 × precision × recall ÷ (precision + recall), the harmonic mean of the two, is the single number that rewards being good at both, because a low value on either side drags it down. Many programs also set an explicit floor alongside a precision target: for example, a recall target so that a minimum share of billable codes is always caught.
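One way to see the trade-off is to sweep a confidence threshold over scored recommendations. The sketch below assumes a system that attaches a confidence score to each candidate code, which the article does not specify. The scores, the thresholds, and the 75% recall floor are all made-up values for illustration.

```python
# Hypothetical (confidence score, is the code actually supported?) pairs.
candidates = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.55, False), (0.40, True), (0.30, False),
]
total_supported = sum(ok for _, ok in candidates)  # 4 supported codes in this toy set


def metrics_at(threshold):
    """Precision, recall and F1 when only codes scoring >= threshold are recommended."""
    picked = [ok for score, ok in candidates if score >= threshold]
    tp = sum(picked)
    precision = tp / len(picked) if picked else 0.0
    recall = tp / total_supported
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


RECALL_FLOOR = 0.75  # assumed program target, not a value from the article
results = {t: metrics_at(t) for t in (0.85, 0.60, 0.35)}
for t, (p, r, f1) in results.items():
    print(f"threshold {t:.2f}: precision {p:.0%}, recall {r:.0%}, F1 {f1:.0%}")

# Among thresholds that meet the recall floor, keep the most precise one.
eligible = {t: m for t, m in results.items() if m[1] >= RECALL_FLOOR}
best = max(eligible, key=lambda t: eligible[t][0])
print(f"chosen threshold: {best:.2f}")
```

In this toy data, lowering the threshold raises recall from 50% to 100% while precision falls from 100% to about 67%. The recall floor rules out the strictest threshold, and the most precise remaining option is 0.60.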

Scope-aware measurement: the part most vendors skip

A precision/recall number is only meaningful if you're clear about which codes it covers. Scope-aware measurement evaluates a coding system only on the codes it's actually responsible for, not every code on the claim. Without that discipline, out-of-scope codes the system was never meant to handle quietly dilute the metrics, and a headline figure like 95% can mean almost anything. With it, the numbers mean what they say.

The ground truth also matters. Measuring against what coders actually billed — real, adjudicated claims — is a far stronger test than measuring against a synthetic answer key.
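As a rough illustration of the scoping step, the sketch below filters both the recommendations and the billed codes down to an in-scope set before computing recall. The labels (`VAX-1`, `EM-1`, and so on) and the `IN_SCOPE` set are hypothetical placeholders, not real codes or an actual scope definition.

```python
# Hypothetical labels: codes the system is responsible for.
IN_SCOPE = {"VAX-1", "VAX-2", "SCR-1"}

# What coders billed for one encounter, including codes outside the system's scope.
billed = {"VAX-1", "VAX-2", "SCR-1", "EM-1", "LAB-1"}
# What the system recommended.
recommended = {"VAX-1", "SCR-1"}


def recall(recommended, truth):
    """Share of the truth set that was recommended (0.0 if truth is empty)."""
    return len(recommended & truth) / len(truth) if truth else 0.0


# Naive: every billed code counts, even ones the system never handles.
naive_recall = recall(recommended, billed)
# Scope-aware: restrict both sides to in-scope codes first.
scoped_recall = recall(recommended & IN_SCOPE, billed & IN_SCOPE)

print(f"naive recall  = {naive_recall:.0%}")   # naive recall  = 40%
print(f"scoped recall = {scoped_recall:.0%}")  # scoped recall = 67%
```

The naive figure is pulled down by two codes the system was never meant to produce, while the scoped figure isolates the one in-scope code it actually missed.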

How the two metrics map to the two business risks

Metric	What low values mean	Business risk
Precision	Over-coding — recommending unsupported codes	Audit & clawback exposure
Recall	Under-coding — missing billable codes	Revenue leakage (missed charges)

This is exactly why a single accuracy number is dangerous for a coding buyer: it can hide a precision problem (audit risk) or a recall problem (lost revenue), and you can't tell which.

How Capsa measures accuracy

Capsa reports scope-aware precision and recall, measured against what your coders actually billed, with per-CPT, per-template, and per-rule breakdowns — so you can see not just a headline number but exactly where agreement and disagreement live. Capsa's 93–96% figures are internal, validated results on cases teams already coded, across its two live skills (vaccine administration and health screening). They are not an external certification, and they describe specific measured cohorts rather than a guarantee for every environment.
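For readers who want to picture what a per-code breakdown involves, here is a generic sketch of tallying precision and recall code by code across encounters. It is not Capsa's implementation. The encounter data and labels are invented, and a real pipeline would also need the scope filter and the billed-claims ground truth described above.

```python
from collections import Counter

# Each pair is (codes recommended, codes the coders billed) for one encounter.
# All labels are hypothetical placeholders.
encounters = [
    ({"VAX-1", "SCR-1"}, {"VAX-1", "SCR-1"}),
    ({"VAX-1"}, {"VAX-1", "SCR-1"}),
    ({"VAX-1", "SCR-1"}, {"SCR-1"}),
]

tp, fp, fn = Counter(), Counter(), Counter()
for recommended, billed in encounters:
    tp.update(recommended & billed)  # recommended and billed
    fp.update(recommended - billed)  # recommended but not billed
    fn.update(billed - recommended)  # billed but not recommended

breakdown = {}
for code in sorted(set(tp) | set(fp) | set(fn)):
    rec_total = tp[code] + fp[code]     # times the code was recommended
    billed_total = tp[code] + fn[code]  # times the code was billed
    breakdown[code] = (
        tp[code] / rec_total if rec_total else 0.0,        # precision
        tp[code] / billed_total if billed_total else 0.0,  # recall
    )

for code, (p, r) in breakdown.items():
    print(f"{code}: precision {p:.0%}, recall {r:.0%}")
# SCR-1: precision 100%, recall 67%
# VAX-1: precision 67%, recall 100%
```

A breakdown like this shows which codes carry the disagreement, something a single pooled number cannot do.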

Frequently asked questions

Why is a single coding accuracy percentage misleading?

Most codes that could apply to a visit don't, so an overall accuracy number is dominated by the many codes a system correctly leaves off. A system that rarely recommends anything can post a high accuracy figure while still missing real billable charges. Precision and recall expose that; a single accuracy percentage hides it.

What's the difference between precision and recall in coding?

Precision is the share of recommended codes that are correct; low precision signals over-coding and audit risk. Recall is the share of codes that should have been billed that were actually caught; low recall signals missed charges and revenue leakage. F1 is their harmonic mean and balances the two.

What is scope-aware measurement?

Scope-aware measurement evaluates a coding system only on the codes it is actually responsible for, rather than every code on a claim. This keeps precision and recall meaningful instead of being diluted by out-of-scope codes the system was never meant to handle.

How does Capsa measure coding accuracy?

Capsa compares its recommended codes against what coders actually billed, on a scope-aware basis, and reports precision and recall with per-CPT, per-template, and per-rule breakdowns. Its 93–96% figures are internal, validated results on cases teams already coded — not an external certification.

Sources

Precision and recall — standard definitions (true/false positives and negatives, F-measure). en.wikipedia.org
Peer-reviewed medical-coding research routinely reports accuracy as precision, recall, and F1 — see, e.g., systematic reviews of automatic ICD coding with large language models. medRxiv