NDC Normalization Guide
The National Drug Code is distributed in at least four distinct formats across federal data sources. Failing to normalize NDC before joining across datasets is one of the most common causes of silent data loss in drug analytics pipelines. This guide covers every format variant, the zero-padding rules, and production-ready SQL to normalize them all.
The Problem
The same physical drug — Lipitor 10 mg, 90-count bottle — may appear as all of the following depending on which federal dataset you're reading:
Five different string representations of the same NDC. A naive string join misses all but one.
A direct string join between a Medicaid claims file (flat 11-digit) and an FDA Orange Book file (4-4-2) will match zero rows — not because the data is wrong, but because the formats differ. The consequence is silently under-counted utilization, missing pricing matches, and incorrect formulary analytics.
Format Reference by Source
| Data Source | Format | Example |
|---|---|---|
| FDA Orange Book | 4-4-2 | 0071-0213-40 |
| FDA NDC Directory | 5-3-2 | 00071-213-40 |
| HIPAA / NDC Directory | 5-4-2 | 00071-0213-40 |
| Medicaid claims (CMS) | Flat 11-digit | 00071021340 |
| Medicare Part D | Flat 11-digit | 00071021340 |
| Some EHR exports | 10-digit | 0071021340 |
The flat 11-digit format (no hyphens) is the canonical normalization target used in CMS claims data.
Zero-Padding Rules
The key insight: 10-digit NDCs are 11-digit NDCs with one leading zero omitted from either the labeler segment or the product segment. The segment pattern (which zero is missing) determines which segment to pad.
Pad labeler segment from 4 to 5 digits
Pad product segment from 3 to 4 digits
Remove hyphens
Detecting the Format Automatically
When the format is unknown or mixed within a column, you can detect it from the string length and hyphen positions:
DuckDB — detect and normalize any NDC format
-- Normalize NDC to flat 11-digit regardless of input format
CREATE OR REPLACE FUNCTION normalize_ndc(raw_ndc VARCHAR) AS (
CASE
-- Already flat 11-digit
WHEN LENGTH(raw_ndc) = 11 AND NOT CONTAINS(raw_ndc, '-')
THEN raw_ndc
-- Flat 10-digit: left-pad to 11 (only if clearly 10 digits)
WHEN LENGTH(raw_ndc) = 10 AND NOT CONTAINS(raw_ndc, '-')
THEN LPAD(raw_ndc, 11, '0')
-- 5-4-2 hyphenated (len=13): just remove hyphens
WHEN LENGTH(raw_ndc) = 13 AND REGEXP_MATCHES(raw_ndc, '^d{5}-d{4}-d{2}$')
THEN REPLACE(raw_ndc, '-', '')
-- 4-4-2 hyphenated (len=12): pad labeler to 5, remove hyphens
WHEN LENGTH(raw_ndc) = 12 AND REGEXP_MATCHES(raw_ndc, '^d{4}-d{4}-d{2}$')
THEN LPAD(SPLIT_PART(raw_ndc, '-', 1), 5, '0')
|| SPLIT_PART(raw_ndc, '-', 2)
|| SPLIT_PART(raw_ndc, '-', 3)
-- 5-3-2 hyphenated (len=12): pad product to 4, remove hyphens
WHEN LENGTH(raw_ndc) = 12 AND REGEXP_MATCHES(raw_ndc, '^d{5}-d{3}-d{2}$')
THEN SPLIT_PART(raw_ndc, '-', 1)
|| LPAD(SPLIT_PART(raw_ndc, '-', 2), 4, '0')
|| SPLIT_PART(raw_ndc, '-', 3)
ELSE NULL -- unrecognized format; investigate separately
END
);NDC9 — Labeler + Product Without Package
In some analyses you want to join on drug identity without caring about package size. The first 9 digits of the flat NDC11 (labeler + product) form NDC9, which groups all package variants together. This is particularly useful when joining NADAC pricing, where one NDC9 may have multiple NDC11s (different pack sizes) with slightly different prices.
-- Extract NDC9 from flat NDC11
SELECT
ndc11,
LEFT(ndc11, 9) AS ndc9, -- labeler (5) + product (4)
RIGHT(ndc11, 2) AS package_code
FROM ndc_rxnorm
LIMIT 5;Common Pitfalls
Simple left-padding without format detection
Left-padding any 10-digit NDC to 11 digits assumes the missing zero is always in the labeler segment (4-4-2). This produces wrong results for 5-3-2 format NDCs. Always detect the segment pattern first.
Mixed formats in a single column
Some data pipelines concatenate data from multiple sources into a single NDC column, resulting in mixed 10-digit and 11-digit values. Add a LENGTH check and normalize each format branch separately.
NDC with leading zeros stripped by spreadsheet tools
Excel and some BI tools will strip leading zeros from NDC fields when they are stored as numbers. Always store NDC as VARCHAR and validate that values are exactly 11 characters (or 13 with hyphens in 5-4-2 format).
Trusting a single crosswalk source
No single NDC-to-RxNorm mapping source is complete. The NLM RxNorm API, FDA Product Label file, and commercial crosswalks each have different coverage. A dual-path matching strategy (NDC11 first, then NDC9 fallback) improves match rates significantly.
Normalization Validation Checklist
How TwinFyRx handles NDC normalization
TwinFyRx stores all NDCs in canonical flat 11-digit format in the ndc11 column, with ndc9 pre-computed for fallback matching. The ingest pipeline applies dual-path matching (NDC11 first, NDC9 fallback) against the RxNorm crosswalk, with anomaly logging for unmatched NDCs. This results in match rates above 95% across CMS claims datasets.