Reference/NDC Normalization

NDC Normalization Guide

The National Drug Code is distributed in at least four distinct formats across federal data sources. Failing to normalize NDC before joining across datasets is one of the most common causes of silent data loss in drug analytics pipelines. This guide covers every format variant, the zero-padding rules, and production-ready SQL to normalize them all.

The Problem

The same physical drug — Lipitor 10 mg, 90-count bottle — may appear as all of the following depending on which federal dataset you're reading:

0071-0213-4000071-213-4000071-0213-40000710213400071021340

Five different string representations of the same NDC. A naive string join misses all but one.

A direct string join between a Medicaid claims file (flat 11-digit) and an FDA Orange Book file (4-4-2) will match zero rows — not because the data is wrong, but because the formats differ. The consequence is silently under-counted utilization, missing pricing matches, and incorrect formulary analytics.

Format Reference by Source

Data Source	Format	Example	Normalized	Action
FDA Orange Book	4-4-2	0071-0213-40	00071021340	Pad labeler: 4→5 digits
FDA NDC Directory	5-3-2	00071-213-40	00071021340	Pad product: 3→4 digits
HIPAA / NDC Directory	5-4-2	00071-0213-40	00071021340	Remove hyphens only
Medicaid claims (CMS)	Flat 11-digit	00071021340	00071021340	Already canonical
Medicare Part D	Flat 11-digit	00071021340	00071021340	Already canonical
Some EHR exports	10-digit	0071021340	00071021340	Left-pad with 1 zero (verify segment pattern)

The flat 11-digit format (no hyphens) is the canonical normalization target used in CMS claims data.

Zero-Padding Rules

The key insight: 10-digit NDCs are 11-digit NDCs with one leading zero omitted from either the labeler segment or the product segment. The segment pattern (which zero is missing) determines which segment to pad.

4-4-2 (10-digit)Flat 11-digit

Pad labeler segment from 4 to 5 digits

0071-0213-4000071-0213-40

5-3-2 (10-digit)Flat 11-digit

Pad product segment from 3 to 4 digits

00071-213-4000071-0213-40

5-4-2 (11-digit with hyphens)Flat 11-digit

Remove hyphens

00071-0213-4000071021340

Detecting the Format Automatically

When the format is unknown or mixed within a column, you can detect it from the string length and hyphen positions:

DuckDB — detect and normalize any NDC format

-- Normalize NDC to flat 11-digit regardless of input format
CREATE OR REPLACE FUNCTION normalize_ndc(raw_ndc VARCHAR) AS (
    CASE
        -- Already flat 11-digit
        WHEN LENGTH(raw_ndc) = 11 AND NOT CONTAINS(raw_ndc, '-')
            THEN raw_ndc

        -- Flat 10-digit: left-pad to 11 (only if clearly 10 digits)
        WHEN LENGTH(raw_ndc) = 10 AND NOT CONTAINS(raw_ndc, '-')
            THEN LPAD(raw_ndc, 11, '0')

        -- 5-4-2 hyphenated (len=13): just remove hyphens
        WHEN LENGTH(raw_ndc) = 13 AND REGEXP_MATCHES(raw_ndc, '^d{5}-d{4}-d{2}$')
            THEN REPLACE(raw_ndc, '-', '')

        -- 4-4-2 hyphenated (len=12): pad labeler to 5, remove hyphens
        WHEN LENGTH(raw_ndc) = 12 AND REGEXP_MATCHES(raw_ndc, '^d{4}-d{4}-d{2}$')
            THEN LPAD(SPLIT_PART(raw_ndc, '-', 1), 5, '0')
                 || SPLIT_PART(raw_ndc, '-', 2)
                 || SPLIT_PART(raw_ndc, '-', 3)

        -- 5-3-2 hyphenated (len=12): pad product to 4, remove hyphens
        WHEN LENGTH(raw_ndc) = 12 AND REGEXP_MATCHES(raw_ndc, '^d{5}-d{3}-d{2}$')
            THEN SPLIT_PART(raw_ndc, '-', 1)
                 || LPAD(SPLIT_PART(raw_ndc, '-', 2), 4, '0')
                 || SPLIT_PART(raw_ndc, '-', 3)

        ELSE NULL  -- unrecognized format; investigate separately
    END
);

NDC9 — Labeler + Product Without Package

In some analyses you want to join on drug identity without caring about package size. The first 9 digits of the flat NDC11 (labeler + product) form NDC9, which groups all package variants together. This is particularly useful when joining NADAC pricing, where one NDC9 may have multiple NDC11s (different pack sizes) with slightly different prices.

-- Extract NDC9 from flat NDC11
SELECT
    ndc11,
    LEFT(ndc11, 9) AS ndc9,   -- labeler (5) + product (4)
    RIGHT(ndc11, 2) AS package_code
FROM ndc_rxnorm
LIMIT 5;

Common Pitfalls

Simple left-padding without format detection

Left-padding any 10-digit NDC to 11 digits assumes the missing zero is always in the labeler segment (4-4-2). This produces wrong results for 5-3-2 format NDCs. Always detect the segment pattern first.

Mixed formats in a single column

Some data pipelines concatenate data from multiple sources into a single NDC column, resulting in mixed 10-digit and 11-digit values. Add a LENGTH check and normalize each format branch separately.

NDC with leading zeros stripped by spreadsheet tools

Excel and some BI tools will strip leading zeros from NDC fields when they are stored as numbers. Always store NDC as VARCHAR and validate that values are exactly 11 characters (or 13 with hyphens in 5-4-2 format).

Trusting a single crosswalk source

No single NDC-to-RxNorm mapping source is complete. The NLM RxNorm API, FDA Product Label file, and commercial crosswalks each have different coverage. A dual-path matching strategy (NDC11 first, then NDC9 fallback) improves match rates significantly.

Normalization Validation Checklist

All NDC values are VARCHAR — never INTEGER or FLOAT

All values are exactly 11 characters (after normalization)

No hyphens remain in the normalized column

No NULL values introduced unexpectedly (check unrecognized formats separately)

Match rate from NDC → RXCUI crosswalk is >90% (flag outliers for investigation)

NDC9 column available for fallback joins when NDC11 has no match

Historical NDCs (retired packages) handled with a status flag, not dropped

How TwinFyRx handles NDC normalization

TwinFyRx stores all NDCs in canonical flat 11-digit format in the ndc11 column, with ndc9 pre-computed for fallback matching. The ingest pipeline applies dual-path matching (NDC11 first, NDC9 fallback) against the RxNorm crosswalk, with anomaly logging for unmatched NDCs. This results in match rates above 95% across CMS claims datasets.

See the RxNorm–NDC crosswalk →

What is RxNorm?What is NDC?What is NADAC?What is ATC Classification?RxNorm–NDC Crosswalk RxNorm vs NDC NADAC vs AWP Pricing Drug Pricing Datasets