1,000 Systems, 1.2M+ Nodes, 321K Crosswalk Edges
TL;DR: Three tables, one ingestion pipeline, three ingester patterns, and a hub-and-spoke crosswalk topology. This is how we built a knowledge graph connecting every major classification system in the world.
The data model
The entire graph fits in three tables:
erDiagram
classification_system ||--o{ classification_node : contains
classification_node ||--o{ equivalence : "source"
classification_node ||--o{ equivalence : "target"
classification_system {
string id PK
string name
string region
string authority
int node_count
}
classification_node {
string system_id FK
string code PK
string title
string parent_code
boolean is_leaf
}
equivalence {
string source_system FK
string source_code FK
string target_system FK
string target_code FK
string match_type
}
| Table | Rows | What it stores |
|---|---|---|
classification_system |
1,000+ | System metadata (name, region, authority) |
classification_node |
1,212,000+ | Every code in every system (hierarchy via parent_code) |
equivalence |
321,000+ | Crosswalk edges between systems |
The simplicity is intentional. Classification systems are trees (codes have parents). Crosswalks are edges (this code maps to that code). Three tables capture everything.
The ingestion pipeline
graph TD
subgraph Sources["Authoritative Sources"]
CSV["Census Bureau CSV"]
XLSX["Eurostat XLSX"]
XML["WHO/CMS XML"]
ZIP["Regenstrief ZIP"]
end
subgraph Pipeline["Ingestion"]
PARSE["Parse + validate"]
UPSERT["Upsert nodes"]
XWALK["Build crosswalks"]
LEAF["Compute leaf flags"]
end
subgraph DB["PostgreSQL"]
SYS["classification_system"]
NODE["classification_node"]
EQUIV["equivalence"]
end
CSV --> PARSE
XLSX --> PARSE
XML --> PARSE
ZIP --> PARSE
PARSE --> UPSERT --> NODE
PARSE --> XWALK --> EQUIV
PARSE --> LEAF
UPSERT --> SYS
Each system has a dedicated ingester. Every ingester downloads directly from the authoritative source - Census Bureau, UN, WHO, Eurostat - and loads into PostgreSQL.
Three ingester patterns
Pattern A: Standalone (~15 systems)
Custom parser for a specific source file. Handles CSV, XLSX, XML, JSON, or HTML.
Examples: NAICS (Census CSV), LOINC (Regenstrief ZIP), ICD-10-CM (CMS XML), HS (WCO structure)
Pattern B: NACE-derived (~30 systems)
European national systems that copy NACE Rev 2 exactly. The ingester copies all 996 NACE nodes and creates 1:1 equivalence edges. About 15 lines of code.
Examples: WZ 2008 (Germany), ATECO 2007 (Italy), NAF Rev 2 (France), PKD 2007 (Poland)
Pattern C: ISIC-derived (~80 systems)
National adaptations of ISIC Rev 4. Same copy-and-link pattern. Covers most of Latin America, Africa, Asia, and the Middle East.
Examples: CIIU (Colombia, Chile, Peru), VSIC (Vietnam), BSIC (Bangladesh)
Scale challenges
| System | Codes | Challenge |
|---|---|---|
| NCI Thesaurus | 211,072 | Largest system - pushed full-text search indexing limits |
| NDC (Drug Codes) | 112,077 | Product-level granularity, frequent updates |
| LOINC | 102,751 | ZIP streaming, batch insert for memory management |
| ICD-10-CM | 97,606 | 7-level deep hierarchy, complex parent chain logic |
| ICD-10-PCS | 79,987 | Procedure codes with combinatorial structure |
| Patent CPC | 254,249 | Widest system - custom XML parser required |
Crosswalk topology
The 321,000 edges follow a hub-and-spoke pattern:
graph TD
ISIC["ISIC Rev 4\n(industry hub)"]
NAICS["NAICS"] --> ISIC
NACE["NACE"] --> ISIC
NIC["NIC"] --> ISIC
ANZSIC["ANZSIC"] --> ISIC
WZ["WZ 2008"] --> NACE
ATECO["ATECO"] --> NACE
NAF["NAF"] --> NACE
CIIU["CIIU (LATAM)"] --> ISIC
HS["HS 2022\n(trade hub)"]
HTS["HTS (US)"] --> HS
CN["CN (EU)"] --> HS
AHTN["AHTN (ASEAN)"] --> HS
ISCO["ISCO-08\n(occupation hub)"]
SOC["SOC"] --> ISCO
ANZSCO["ANZSCO"] --> ISCO
NOC["NOC"] --> ISCO
Three hubs:
- ISIC Rev 4 for industry (most national systems have a direct crosswalk)
- HS 2022 for trade (national tariff schedules extend HS)
- ISCO-08 for occupations (national systems have concordances)
Translation between any two systems typically routes through a hub: Source -> Hub -> Target.
Provenance
Every system has a provenance tier so consumers can decide how much to trust each mapping:
| Tier | Meaning | Example |
|---|---|---|
official_source |
Parsed directly from the authority | NAICS from Census Bureau |
official_derived |
National adaptation of a reference | WZ 2008 from NACE Rev 2 |
expert_curated |
Built from authoritative references | Domain taxonomies |
ai_assisted |
Structure verified by automated tools | Some domain expansions |
Key design decisions
- Leaf flags computed dynamically - never hard-coded, always derived from the hierarchy
- Idempotent ingestion - every ingester uses
ON CONFLICT ... DO UPDATE, safe to re-run - TDD enforced - every ingester has a test file, tests run in an isolated
test_wotschema - No ORMs - raw SQL via asyncpg for performance at 1.2M rows