Building Taxonomies for AI Training Data: Why Classification Still Matters
There’s an irony in AI development that doesn’t get discussed enough. The most advanced machine learning models — systems that can generate text, recognize images, and reason about complex problems — depend fundamentally on a very old-fashioned discipline: taxonomy. How training data is classified, labelled, and organized determines what a model learns. And poor taxonomy design produces models that are subtly but pervasively flawed.
Organizations building or fine-tuning AI models are rediscovering what knowledge management professionals have known for decades: classification matters, and it’s harder than it looks.
Why Taxonomy Matters for Training Data
Machine learning models learn patterns from labelled data. A sentiment analysis model learns what “positive” and “negative” mean from examples that humans have classified into those categories. An image recognition model learns what a “cat” is from images labelled as containing cats.
The taxonomy — the classification scheme applied to training data — shapes what the model can learn and how it distinguishes between categories. If the taxonomy is too coarse (only “positive” and “negative” sentiment, with no “neutral” or “mixed” categories), the model can’t express nuanced judgments. If it’s too granular (50 different sentiment categories that even human annotators can’t reliably distinguish), the model learns noise rather than signal.
Three taxonomy design problems recur across AI training data projects:
Ambiguous category boundaries. When human annotators can’t reliably agree on which category an item belongs to, the model receives contradictory signals. This is measured by inter-annotator agreement rates. Categories with agreement rates below 80% typically produce unreliable model behaviour for those categories. The problem usually isn’t the annotators — it’s the taxonomy defining categories that overlap or whose boundaries are insufficiently specified.
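One simple way to locate ambiguous boundaries is to compare two annotators' labels category by category and flag any category whose agreement falls below the 80% threshold mentioned above. The sketch below uses hypothetical customer-service annotations; the category names and the per-category agreement definition (items where both chose the category, over items where either did) are illustrative choices, not a standard.

```python
from collections import defaultdict

def per_category_agreement(labels_a, labels_b):
    """Per-category agreement between two annotators on the same items.

    For each category: (# items where both annotators chose it) /
    (# items where at least one annotator chose it).
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [agreed, total]
    for a, b in zip(labels_a, labels_b):
        for cat in {a, b}:  # one entry when they agree, two when they don't
            counts[cat][1] += 1
            if a == b:
                counts[cat][0] += 1
    return {cat: agreed / total for cat, (agreed, total) in counts.items()}

# Hypothetical annotations of the same ten customer queries.
ann_1 = ["billing", "returns", "billing", "security", "returns",
         "billing", "security", "returns", "billing", "security"]
ann_2 = ["billing", "returns", "billing", "billing", "returns",
         "billing", "security", "returns", "billing", "billing"]

rates = per_category_agreement(ann_1, ann_2)
flagged = [cat for cat, rate in rates.items() if rate < 0.8]
```

In this toy sample, "security" and "billing" fall below the threshold, which is exactly the pattern described above: the boundary between two adjacent categories, not the annotators, is the likely culprit.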
Missing categories. Training data taxonomies often reflect the designer’s assumptions about what categories exist in the data rather than what’s actually present. A customer service query taxonomy might include “billing,” “technical support,” and “returns” but miss “account security” — meaning all security-related queries get forced into an adjacent category, teaching the model incorrect associations.
Hierarchical inconsistency. When taxonomies have multiple levels (as most practical ones do), inconsistency in how depth and specificity are applied creates problems. If “electronics” has five subcategories but “clothing” has fifteen, the model’s classification granularity will be uneven across those domains.
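A coarse check for this kind of unevenness is to compare subcategory counts across top-level branches. The taxonomy below and the 2x ratio threshold are illustrative assumptions; the point is only that the check is mechanical and cheap to run.

```python
# Hypothetical two-level product taxonomy with uneven granularity.
taxonomy = {
    "electronics": ["phones", "laptops", "audio", "cameras", "wearables"],
    "clothing": ["shirts", "trousers", "dresses", "coats", "shoes",
                 "hats", "socks", "underwear", "swimwear", "scarves",
                 "gloves", "belts", "sportswear", "sleepwear", "suits"],
}

def granularity_report(taxonomy, max_ratio=2.0):
    """Flag a taxonomy whose top-level branches differ in subcategory
    count by more than max_ratio -- a rough signal of uneven granularity."""
    sizes = {branch: len(subs) for branch, subs in taxonomy.items()}
    smallest, largest = min(sizes.values()), max(sizes.values())
    return sizes, largest / smallest > max_ratio

sizes, uneven = granularity_report(taxonomy)
```

A flagged ratio doesn't prove the taxonomy is wrong (some domains genuinely are richer than others), but it identifies branches worth a second look.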
Principles for AI-Ready Taxonomies
Effective taxonomies for AI training data share several characteristics that differ from traditional knowledge management taxonomies.
Operationalizability. Every category must be definable in terms that allow consistent human annotation. This means providing clear, written definitions with examples and counter-examples for each category — not just a label. “Negative sentiment” isn’t a definition; “expression of dissatisfaction, frustration, or complaint about a product, service, or experience” is closer to one. Including boundary cases (items that are close to the category but shouldn’t be included) is essential for annotation consistency.
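In practice this means an annotation guideline is structured data, not a list of labels. A minimal sketch of one guideline entry, using the "negative sentiment" definition above (the example texts and boundary note are invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class CategoryDefinition:
    """One entry in an annotation guideline: a label plus everything an
    annotator needs to apply it consistently."""
    label: str
    definition: str
    examples: list = field(default_factory=list)          # clearly in
    counter_examples: list = field(default_factory=list)  # near-misses, out
    boundary_notes: str = ""                              # how to decide edge cases

negative = CategoryDefinition(
    label="negative",
    definition=("Expression of dissatisfaction, frustration, or complaint "
                "about a product, service, or experience."),
    examples=["This update broke my workflow and support never replied."],
    counter_examples=["The price went up this year."],  # factual, not a complaint
    boundary_notes=("Neutral factual statements about changes are not negative "
                    "unless the author expresses dissatisfaction with them."),
)
```

Keeping definitions in a structured form like this also makes them versionable, which pays off later when the taxonomy is revised.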
Testable boundaries. Before deploying a taxonomy for large-scale annotation, test it with a sample of annotators on representative data. Measure inter-annotator agreement. Revise categories where agreement is low. This iterative process — design, test, measure, revise — is time-consuming, but it is far cheaper than discovering downstream that a model trained on poorly labelled data underperforms.
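Raw percent agreement overstates reliability when some categories dominate, so pilot tests usually also report a chance-corrected statistic. A minimal sketch of Cohen's kappa for two annotators (libraries such as scikit-learn provide a tested implementation; this version just makes the formula visible):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance given each annotator's own
    category frequencies. Undefined when p_e == 1 (a single shared label).
    """
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```

Common rules of thumb treat kappa above roughly 0.8 as strong agreement, but the right bar depends on how costly misclassification is for the model's task.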
Appropriate granularity. The right level of granularity depends on the model’s intended use case. A general customer routing model might need 10-15 categories. A specialized medical triage model might need 200. The temptation is always toward more categories (greater precision), but each additional category requires more training examples and creates more opportunities for classification errors.
Extensibility. Taxonomies for AI training data need to accommodate new categories as the domain evolves. A product classification taxonomy designed in 2023 probably doesn’t include product types that emerged in 2025. Building in mechanisms for adding categories — and retraining models when categories change — is a design consideration from the start.
Lessons from Knowledge Management
The knowledge management field has decades of experience with taxonomy design that AI practitioners would benefit from studying. Several principles translate directly:
Faceted classification. Rather than forcing items into a single hierarchical taxonomy, faceted approaches allow multiple independent classification dimensions. A customer query might be classified by topic (billing, technical), urgency (high, medium, low), and channel (phone, email, chat) simultaneously. This produces richer training data than a single-dimension taxonomy and allows models to learn multiple independent classification tasks.
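Concretely, a faceted label is one record with several independent dimensions, and a single annotation pass can feed several classification tasks. The facet names and values below follow the customer-query example above and are illustrative:

```python
# One faceted label: independent dimensions instead of a single hierarchy.
query_label = {
    "text": "I was double-charged and need this fixed before Friday.",
    "facets": {
        "topic": "billing",
        "urgency": "high",
        "channel": "email",
    },
}

def training_pairs(item):
    """Expand one multi-facet record into one (text, task, label) training
    example per facet, so each facet becomes its own classification task."""
    return [(item["text"], facet, value)
            for facet, value in item["facets"].items()]

pairs = training_pairs(query_label)
```

The design choice here is that facets stay orthogonal: adding an "urgency" value never forces a restructuring of the "topic" hierarchy, which is precisely the flexibility a single-dimension taxonomy lacks.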
Warrant. In taxonomy theory, “warrant” refers to the justification for including a category. Literary warrant (it appears in the source material), user warrant (users search for it), and organizational warrant (the organization needs to distinguish it) are traditional types. For AI training data, the relevant warrant is model warrant: does including this category improve the model’s ability to perform its intended task? Categories without model warrant add complexity without benefit.
Maintenance processes. Knowledge management recognizes that taxonomies require ongoing maintenance. Categories become obsolete, new concepts emerge, usage patterns shift. The same is true for AI training data taxonomies, but AI teams often treat the taxonomy as a fixed artefact created once during project setup. Establishing regular review cycles — ideally informed by model performance data showing where misclassifications cluster — keeps the taxonomy aligned with reality.
Research from organizations like the International Association for Ontology and its Applications provides frameworks for these design decisions that translate well to AI contexts.
Common Mistakes
Designing taxonomy in isolation from annotators. Taxonomy designers (typically data scientists or domain experts) sometimes create classification schemes without consulting the people who’ll actually apply them. Annotators often identify ambiguities and practical problems that designers miss. Involving annotators in taxonomy design and testing is essential.
Ignoring class imbalance. In many real-world datasets, some categories are far more common than others. A customer service taxonomy might have 60% of queries in “billing” and 2% in “accessibility.” Models trained on imbalanced data perform poorly on minority classes. The taxonomy design should consider whether extremely rare categories should be merged with related categories or handled through data augmentation strategies.
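Imbalance is easy to surface before training. The sketch below flags categories under a minimum share of the data as candidates for merging or augmentation; the 5% cutoff and the label distribution (echoing the billing/accessibility example above) are illustrative assumptions.

```python
from collections import Counter

def imbalance_report(labels, min_share=0.05):
    """Report each category's share of the data and flag categories below
    min_share as candidates for merging or targeted augmentation."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {cat: n / total for cat, n in counts.items()}
    rare = sorted(cat for cat, share in shares.items() if share < min_share)
    return shares, rare

# Hypothetical distribution over 100 customer queries.
labels = (["billing"] * 60 + ["technical"] * 25 +
          ["returns"] * 13 + ["accessibility"] * 2)
shares, rare = imbalance_report(labels)
```

Whether a flagged category should be merged, up-sampled, or kept as-is is a judgment call; the report only makes the trade-off visible at taxonomy-design time rather than after training.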
Confusing taxonomy with ontology. A taxonomy classifies items into categories. An ontology defines relationships between concepts. Both are useful for AI, but they serve different purposes. Using an ontology when a taxonomy would suffice adds unnecessary complexity. Using a taxonomy when relationships between concepts matter produces incomplete training data.
Practical Recommendations
Start with the model’s task, not the data. Define what the model needs to distinguish, then design categories that support those distinctions. Work backward from model requirements to taxonomy design.
Test with real annotators before committing. A taxonomy that looks logical on a whiteboard may be unusable in practice. Pilot annotation with 500-1000 items, measure agreement, and iterate before scaling.
Document everything. Category definitions, boundary cases, revision history, and design rationale should all be captured and maintained. When model performance degrades months or years later, this documentation helps diagnose whether taxonomy drift is the cause.
Treat taxonomy as a living system, not a project deliverable. Budget for ongoing maintenance, periodic review, and evolution as the domain changes and model requirements shift.