Contributing a New Classification System
TL;DR: Adding a classification system to WorldOfTaxonomy takes about 2 hours. Three paths (NACE-derived, ISIC-derived, or standalone), strict TDD, and idempotent ingestion. This guide walks through every step.
Three paths to contribution
graph TD
START["Pick a system"] --> CHECK{"What is it\nbased on?"}
CHECK -->|Copies NACE Rev 2| A["Path A: NACE-Derived\n~15 lines of code"]
CHECK -->|Adapts ISIC Rev 4| B["Path B: ISIC-Derived\n~15 lines of code"]
CHECK -->|Own source data| C["Path C: Standalone\nCustom parser"]
A --> TEST["Write failing test"]
B --> TEST
C --> TEST
TEST --> IMPL["Write ingester"]
IMPL --> GREEN["Run test green"]
GREEN --> CLI["Wire into CLI"]
CLI --> XWALK["Add crosswalk edges"]
XWALK --> PR["Open PR"]
| Path | When to Use | Effort | Examples |
|---|---|---|---|
| A: NACE-Derived | System copies all NACE Rev 2 codes | ~15 lines | WZ 2008, ATECO 2007, NAF Rev 2, PKD 2007 |
| B: ISIC-Derived | National adaptation of ISIC Rev 4 | ~15 lines | CIIU (Colombia), VSIC (Vietnam), BSIC (Bangladesh) |
| C: Standalone | Own source file (CSV, XLSX, JSON, XML) | Custom parser | NAICS, LOINC, ICD-10-CM, HS |
Before you start
- Check open issues for systems already requested
- Find the official source - government statistical office, standards body, or international organization
- Never use third-party copies (not GitHub mirrors, not Kaggle, not Wikipedia)
Step by step
1. Write the failing test first
TDD is non-negotiable. A test that was never red proves nothing.
Create tests/test_ingest_my_system.py:
import pytest
@pytest.mark.asyncio
async def test_ingest_my_system(test_conn):
from world_of_taxonomy.ingest.my_system import ingest
await ingest(test_conn)
# Verify system registered
row = await test_conn.fetchrow(
"SELECT * FROM test_wot.classification_system WHERE id = 'my_system_2024'"
)
assert row is not None
assert row['name'] == 'My Classification System 2024'
# Verify nodes created
count = await test_conn.fetchval(
"SELECT count(*) FROM test_wot.classification_node WHERE system_id = 'my_system_2024'"
)
assert count > 0
# Verify hierarchy integrity (no orphan nodes)
orphans = await test_conn.fetchval("""
SELECT count(*) FROM test_wot.classification_node n
WHERE n.system_id = 'my_system_2024'
AND n.parent_code IS NOT NULL
AND NOT EXISTS (
SELECT 1 FROM test_wot.classification_node p
WHERE p.system_id = n.system_id AND p.code = n.parent_code
)
""")
assert orphans == 0, f"Found {orphans} orphan nodes"
Run it. Confirm it fails. Then write the implementation.
2. Create the ingester
SYSTEM = {
"id": "my_system_2024",
"name": "My Classification System 2024",
"authority": "Issuing Body",
"region": "Global",
"version": "2024",
"description": "What this system classifies",
}
NODES = [
("A", "Section A", "Description", None),
("A01", "Subsection A01", "Description", "A"),
]
async def ingest(conn) -> None:
# Upsert system
await conn.execute("""
INSERT INTO classification_system (id, name, authority, region, version, description)
VALUES ($1, $2, $3, $4, $5, $6)
ON CONFLICT (id) DO UPDATE SET
name = EXCLUDED.name, authority = EXCLUDED.authority,
region = EXCLUDED.region, version = EXCLUDED.version,
description = EXCLUDED.description
""", SYSTEM["id"], SYSTEM["name"], SYSTEM["authority"],
SYSTEM["region"], SYSTEM["version"], SYSTEM["description"])
# Compute leaf flags dynamically
codes_with_children = {parent for (_, _, _, parent) in NODES if parent}
for code, title, desc, parent in NODES:
is_leaf = code not in codes_with_children
await conn.execute("""
INSERT INTO classification_node (system_id, code, title, description, parent_code, is_leaf)
VALUES ($1, $2, $3, $4, $5, $6)
ON CONFLICT (system_id, code) DO UPDATE SET
title = EXCLUDED.title, description = EXCLUDED.description,
parent_code = EXCLUDED.parent_code, is_leaf = EXCLUDED.is_leaf
""", SYSTEM["id"], code, title, desc, parent, is_leaf)
3. Run the test green
python3 -m pytest tests/test_ingest_my_system.py -v
4. Wire into the CLI
Add your system to world_of_taxonomy/__main__.py.
5. Add crosswalk edges
CROSSWALKS = [
("A01", "isic_rev4", "011", "exact"),
("A02", "isic_rev4", "012", "broad"),
]
for source_code, target_system, target_code, match_type in CROSSWALKS:
await conn.execute("""
INSERT INTO equivalence (source_system_id, source_code, target_system_id, target_code, match_type)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT DO NOTHING
""", SYSTEM["id"], source_code, target_system, target_code, match_type)
6. Update CLAUDE.md and open a PR
One system per PR. Include: test file, ingester, CLI wiring, CLAUDE.md update.
The ingestion flow
sequenceDiagram
participant DEV as Developer
participant TEST as pytest
participant ING as Ingester
participant PG as PostgreSQL
DEV->>TEST: Write failing test
TEST-->>DEV: Red (expected)
DEV->>ING: Write ingester
ING->>PG: Upsert system metadata
ING->>PG: Upsert nodes with hierarchy
ING->>PG: Insert crosswalk edges
DEV->>TEST: Run test again
TEST->>PG: Query test_wot schema
PG-->>TEST: Verify data
TEST-->>DEV: Green
DEV->>DEV: Open PR
Key rules
| Rule | Why |
|---|---|
| Never hard-code leaf flags | Compute from hierarchy - parent relationships change |
| Never skip the red step | A test that was never red proves nothing |
| Use test_wot schema | Production data is never touched |
| Download from authoritative sources | Provenance and accuracy matter |
| Idempotent ingestion | ON CONFLICT ... DO UPDATE makes re-runs safe |
| No em-dashes | CI enforces this project-wide |
Systems most wanted
- National industry codes from the Middle East, Sub-Saharan Africa, and Central Asia
- UNSD and Eurostat statistical classifications
- Commodity classifications (agricultural, mineral, pharmaceutical)
- US state-level occupation codes
- Professional licensing classifications
Check the GitHub issues for the current list. Pick one. Write the test. Ship the PR.