Data Quality and Provenance
TL;DR: Every system is tagged with one of four provenance tiers, from official downloads published by standards bodies (Tier 1) to expert-curated domain vocabularies (Tier 4). Source URLs and dates are stored for audit, along with SHA-256 file hashes for Tier 1 sources. This guide explains the framework and how to verify data.
Four-tier provenance framework
graph TD
subgraph Tier1["Tier 1: Official Download"]
T1["Source file from standards body\nSHA-256 hash stored\nExamples: NAICS, ISIC, LOINC, HS"]
end
subgraph Tier2["Tier 2: Structural Derivation"]
T2["Derived from official system\n1:1 structural mapping verified\nExamples: WZ, NAF, ATECO (NACE variants)"]
end
subgraph Tier3["Tier 3: Manual Transcription"]
T3["Transcribed from official publications\nSource URL and date recorded\nExamples: SIC 1987, some Asian/African systems"]
end
subgraph Tier4["Tier 4: Expert Curated"]
T4["Domain expert knowledge\nPeer-reviewed structure\nExamples: All domain_* taxonomies"]
end
T1 --> T2
T2 --> T3
T3 --> T4
| Tier | Label | Description | Verification |
|---|---|---|---|
| 1 | official_download | Data downloaded directly from the authoritative source | File hash stored for audit |
| 2 | structural_derivation | Derived from an official system (e.g., NACE national variants) | 1:1 structural mapping verified |
| 3 | manual_transcription | Transcribed from official publications (PDF, HTML, print) | Cross-checked against source |
| 4 | expert_curated | Curated by domain experts based on industry knowledge | Peer-reviewed structure |
Tier 1: Official download
The gold standard. Data files (CSV, Excel, XML) are downloaded directly from the standards body's website. A SHA-256 hash of the source file is stored in the source_file_hash column for reproducibility.
Examples: NAICS 2022 (Census Bureau CSV), HS 2022 (WCO), ISIC Rev 4 (UN CSV), LOINC (Regenstrief download), ICD-10-CM (CMS), NCI Thesaurus (NCI)
Tier 2: Structural derivation
Systems that reuse the structure of an official system with localized naming. For example, all EU NACE Rev 2 national variants (WZ 2008, NAF Rev 2, ATECO 2007, etc.) share an identical code structure.
Examples: WZ 2008 (Germany), ONACE 2008 (Austria), NOGA 2008 (Switzerland), all EU NACE national variants, all ISIC Rev 4 national adaptations
Tier 3: Manual transcription
Data transcribed from official documents that do not provide machine-readable downloads. The original source URL and date are recorded for audit.
Examples: SIC 1987 (transcribed from OSHA HTML), some Asian and African national classifications
Tier 4: Expert curated
Domain-specific vocabularies created by subject matter experts. These fill gaps where no official standard exists.
Examples: All domain_* taxonomies (truck freight, agriculture, mining, construction, cybersecurity, AI, etc.)
Provenance metadata fields
Each classification system carries these audit fields:
| Field | Description | Example |
|---|---|---|
| data_provenance | Provenance tier | official_download |
| source_url | URL of the authoritative data source | https://www.census.gov/naics/ |
| source_date | Date the source data was accessed/published | 2024-01-15 |
| license | License terms for the data | Public Domain |
| source_file_hash | SHA-256 hash of the original file (Tier 1 only) | a3f2b7c... |
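As a minimal sketch, the audit fields above could be modeled as a typed record in Python. The class name `ProvenanceRecord`, the field types, and the validation rule are assumptions made for illustration; only the field names and tier labels come from the tables in this guide.

```python
from dataclasses import dataclass
from typing import Optional

# Tier labels from the provenance framework table, mapped to tier numbers.
TIER_LABELS = {
    "official_download": 1,
    "structural_derivation": 2,
    "manual_transcription": 3,
    "expert_curated": 4,
}


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical container for the audit fields of one system."""
    data_provenance: str
    source_url: Optional[str] = None
    source_date: Optional[str] = None  # assumed to be an ISO date string
    license: Optional[str] = None
    source_file_hash: Optional[str] = None  # expected for Tier 1 only

    def __post_init__(self):
        # Reject labels that are not one of the four documented tiers.
        if self.data_provenance not in TIER_LABELS:
            raise ValueError(f"unknown provenance tier: {self.data_provenance!r}")

    @property
    def tier(self) -> int:
        return TIER_LABELS[self.data_provenance]

    def missing_hash(self) -> bool:
        """True for a Tier 1 record that has no stored file hash."""
        return self.tier == 1 and not self.source_file_hash
```

Keeping the tier label as a string (rather than the number) mirrors the `data_provenance` example value shown in the table.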
Querying provenance via API
Get provenance for a system
curl https://worldoftaxonomy.com/api/v1/systems/naics_2022
Response includes data_provenance, source_url, source_date, license, and source_file_hash.
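The same request can be sketched in Python using only the standard library. This assumes the response is a JSON object with the five provenance fields as top-level keys, which the text above suggests but does not spell out; the helper names `extract_provenance` and `fetch_provenance` are hypothetical.

```python
import json
from urllib.request import urlopen

API_BASE = "https://worldoftaxonomy.com/api/v1"
PROVENANCE_FIELDS = (
    "data_provenance",
    "source_url",
    "source_date",
    "license",
    "source_file_hash",
)


def extract_provenance(system: dict) -> dict:
    """Pick the provenance fields out of a decoded system response.

    Fields that are absent come back as None instead of raising.
    """
    return {name: system.get(name) for name in PROVENANCE_FIELDS}


def fetch_provenance(system_id: str) -> dict:
    """Fetch one system and return its provenance fields (needs network access)."""
    with urlopen(f"{API_BASE}/systems/{system_id}", timeout=30) as resp:
        return extract_provenance(json.load(resp))
```

Calling `fetch_provenance("naics_2022")` would issue the same GET request as the curl command above.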
Audit report
# Full provenance audit across all systems
curl https://worldoftaxonomy.com/api/v1/audit
The audit report shows:
- Breakdown by provenance tier (system count, node count per tier)
- Tier 1 systems missing a file hash
- Tier 2 structural derivation count and node coverage
- Skeleton systems (placeholder entries awaiting full data)
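To make the first two report items concrete, here is a rough sketch of how such a breakdown could be computed from a list of system records. It is not the service's actual implementation; the record keys `id` and `node_count` are assumptions, and the real audit response format may differ.

```python
from collections import defaultdict


def summarize_audit(systems):
    """Group system records by provenance tier and flag Tier 1 gaps.

    Each record is a dict with the hypothetical keys "id",
    "data_provenance", "node_count", and optionally "source_file_hash".
    """
    by_tier = defaultdict(lambda: {"systems": 0, "nodes": 0})
    missing_hash = []
    for record in systems:
        tier = record["data_provenance"]
        by_tier[tier]["systems"] += 1
        by_tier[tier]["nodes"] += record.get("node_count", 0)
        # Only Tier 1 (official_download) systems are expected to carry a hash.
        if tier == "official_download" and not record.get("source_file_hash"):
            missing_hash.append(record["id"])
    return {"by_tier": dict(by_tier), "tier1_missing_hash": missing_hash}
```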
MCP audit tool
# Via MCP
tools/call get_audit_report {}
Returns the same audit data in a format suitable for AI agent consumption.
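MCP messages are JSON-RPC 2.0, so the call above corresponds to a request with method `tools/call` and the tool name in `params`. The sketch below only builds that message; the helper name `build_tool_call` is hypothetical, and how the message is delivered (stdio or HTTP) depends on how the MCP server is set up, which this guide does not specify.

```python
import json


def build_tool_call(name, arguments=None, request_id=1):
    """Build the JSON-RPC 2.0 message for an MCP tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


# get_audit_report takes no arguments, so the arguments object is empty.
payload = json.dumps(build_tool_call("get_audit_report"))
```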
Data verification practices
Hash verification (Tier 1)
For official download systems, the source_file_hash lets you verify data integrity:
- Download the original file from source_url
- Compute its SHA-256 hash
- Compare against the stored source_file_hash
- If they match, the file WorldOfTaxonomy ingested is identical to the one the source currently publishes
sequenceDiagram
participant You
participant WOT as WorldOfTaxonomy API
participant Source as Standards Body
You->>WOT: GET /systems/naics_2022
WOT-->>You: source_url, source_file_hash
You->>Source: Download original CSV
Source-->>You: naics_2022.csv
You->>You: sha256sum naics_2022.csv
Note over You: Compare hash with source_file_hash
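The hash comparison can also be done in Python instead of `sha256sum`. This is a small sketch that assumes the stored `source_file_hash` is the full hex digest (the table example is truncated) and that the file has already been downloaded to a local path; the function names are illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_stored_hash(path, stored_hash):
    """Compare a local file's digest with the stored source_file_hash.

    Surrounding whitespace and letter case in the stored value are ignored.
    """
    return sha256_of_file(path) == stored_hash.strip().lower()
```

Reading in chunks keeps memory use flat for large source files. A mismatch does not say which side changed: the source may simply have published a newer file since `source_date`.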
Structural verification (Tier 2)
For structural derivation systems, you can verify:
- The code structure matches the parent system exactly
- Crosswalk edges are 1:1 (every code in the derived system maps to exactly one code in the parent)
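The second check can be sketched as follows, taking the definition above literally: every derived code must map to exactly one parent code. The input shapes (a list of codes and a list of `(derived_code, parent_code)` pairs) are assumptions about how you have exported the data, not a documented API format.

```python
from collections import defaultdict


def check_one_to_one(derived_codes, edges):
    """Check that every derived code maps to exactly one parent code.

    derived_codes: iterable of codes in the derived system.
    edges: iterable of (derived_code, parent_code) crosswalk pairs.
    Returns the problem codes; two empty lists mean the check passed.
    """
    parents_by_code = defaultdict(set)
    for derived, parent in edges:
        parents_by_code[derived].add(parent)
    unmapped = sorted(c for c in derived_codes if c not in parents_by_code)
    ambiguous = sorted(c for c, p in parents_by_code.items() if len(p) > 1)
    return {"unmapped": unmapped, "ambiguous": ambiguous}
```

This does not check the reverse direction (whether several derived codes share one parent code); add that test if you need a strict bijection.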
Cross-reference verification
For any system, you can cross-reference node counts and structure against the authoritative publication.
Skeleton systems
Some systems are included as structural placeholders where the full dataset is not freely available (e.g., SNOMED CT, CPT). These are marked with low node counts and are included to preserve the crosswalk topology. Full data requires a license from the respective standards body.
| Skeleton System | Reason | License Holder |
|---|---|---|
| SNOMED CT | Proprietary license | SNOMED International |
| CPT | Copyright protected | American Medical Association |
| RxNorm | Partial skeleton | NLM (some data freely available) |
| DSM-5 | Copyright protected | American Psychiatric Association |
Reporting data quality issues
If you find incorrect data, missing codes, or wrong crosswalk mappings:
- GitHub: File an issue on the project repository with system ID, code, expected vs actual value, and a link to the authoritative source
- API: Include the report_issue_url from any API response for direct reporting
All classification data in WorldOfTaxonomy is provided for informational purposes only. It should not be used as a substitute for official government or standards body publications. For regulatory, legal, or compliance purposes, always verify codes against the authoritative source.