Trainable Classifier Lab
What trainable classifiers are, when to use one instead of a SIT, EDM, or fingerprint, and what building your own really takes.
Pattern matching vs machine learning
Purview classifies content in two fundamentally different ways. Knowing which problem you have is most of the decision.
Sensitive Information Types
Pattern matching: a regex or keyword finds the data, nearby evidence confirms the context. If you can describe the data as a pattern, this is your tool.
- Detects: credit cards, IDs, account numbers, codenames
- You control: every part of the pattern
- Feedback loop: instant, test and adjust in minutes
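The pattern-plus-evidence idea can be sketched in a few lines of Python. This is an illustration of the technique only, not how Purview implements its built-in types: the regex, the keyword list, the 300-character window, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical primary pattern: 16 digits, optionally grouped in fours.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")
# Hypothetical supporting keywords; real SITs define their own lists.
KEYWORDS = ("card", "visa", "mastercard", "expiry")
# Assumed proximity window, in characters before and after the match.
WINDOW = 300


def luhn_valid(digits: str) -> bool:
    """Checksum test that rejects most random 16-digit strings."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text: str) -> list[tuple[str, str]]:
    """Return (digits, confidence) for each checksum-valid match."""
    results = []
    lowered = text.lower()
    for m in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if not luhn_valid(digits):
            continue
        before = lowered[max(0, m.start() - WINDOW):m.start()]
        after = lowered[m.end():m.end() + WINDOW]
        nearby = before + " " + after
        # Nearby evidence raises confidence; the pattern alone stays low.
        confidence = "high" if any(k in nearby for k in KEYWORDS) else "low"
        results.append((digits, confidence))
    return results
```

Every part of the match is visible and adjustable, which is why the feedback loop is fast: change the regex or the keyword list, rerun against a test document, and compare.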
Trainable Classifiers
Machine learning: the classifier learns a category from hundreds of examples and recognises a contract or a CV the way a person does, by the content as a whole.
- Detects: contracts, source code, financials, resumes
- You control: the examples it learns from
- Feedback loop: slow, training cycles take days
The custom classifier lifecycle
Pretrained classifiers work immediately. A custom one is a small project, and most of it is collecting good examples.
Check the built-in catalog
Time: minutes. Purview comes with pretrained classifiers (source code, resumes, financial statements and more) that work straight away. If one covers your category, stop here: custom classifiers are real work.
Collect seed content
Time: days to weeks. Collect 50 to 500 samples that clearly belong in the category and 150 to 1,500 that clearly do not. A human picks these, and their quality decides the project. Only the 2,000 newest samples are processed.
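Before uploading anything, the sample counts can be checked locally. The sketch below is a hypothetical helper, not a Purview tool: the function names and the `(name, created)` tuple shape are assumptions, and only the numeric limits come from the step above. How the 2,000-sample cap is counted across the two sets is not specified here, so `newest_samples` simply works on whatever list it is given.

```python
from datetime import datetime

# Limits taken from the step above.
POSITIVE_RANGE = (50, 500)
NEGATIVE_RANGE = (150, 1500)
PROCESSED_CAP = 2000


def check_counts(positive: int, negative: int) -> list[str]:
    """Return a list of problems; an empty list means the counts are in range."""
    problems = []
    for label, count, (low, high) in (
        ("positive", positive, POSITIVE_RANGE),
        ("negative", negative, NEGATIVE_RANGE),
    ):
        if count < low:
            problems.append(f"{label}: {count} samples, need at least {low}")
        elif count > high:
            problems.append(f"{label}: {count} samples, above the {high} maximum")
    return problems


def newest_samples(samples: list[tuple[str, datetime]], cap: int = PROCESSED_CAP):
    """Return at most `cap` (name, created) pairs, newest first."""
    return sorted(samples, key=lambda s: s[1], reverse=True)[:cap]
```

For example, `check_counts(30, 400)` reports that the positive set is too small, while `check_counts(120, 400)` returns an empty list.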
Stage in SharePoint
Time: hours. Positive and negative samples go in separate, dedicated SharePoint folders containing nothing else. Use a Communication site, not a Teams folder, and allow an hour for indexing if the folders are new.
Create and train
Time: up to 24 hours. Point the portal at the positive folder, then the negative one. The model builds within 24 hours, and automated testing (in preview) has cut the whole workflow from around 12 days to about two.
Review predictions
Time: days. Work through the test results, confirming each prediction. Poor accuracy is cheap to fix here: add seed data and retrain. After publishing it is not.
Publish and monitor
Time: ongoing. Published classifiers become conditions in auto-labelling, auto-apply retention, and DLP. The catch: a published classifier cannot be retrained. To improve one, delete it and rebuild with bigger sample sets.
Where classifiers can be used
Trainable classifiers are conditions, the same as SITs: policies reference them. One gap catches people out.
| Solution | Built-in classifiers | Custom classifiers |
|---|---|---|
| Auto-labelling with sensitivity labels | ✓ | ✓ |
| Auto-apply retention label policies | ✓ | ✓ |
| Data Loss Prevention policies | ✓ | ✓ |
| Communication Compliance (Microsoft-provided classifiers only; custom trainable classifiers are not supported) | ✓ | ✗ |
The constraints that shape projects
Five facts worth knowing before anyone commits to a custom classifier. Each one has derailed a real project.
Custom classifiers are English only
Built-in classifiers evaluate multiple languages, but custom trainable classifiers only support English content.
Encrypted items are invisible
Classifiers only work with items that are not encrypted. Content protected with encrypting sensitivity labels will not be evaluated.
No retraining after publish
Retraining a published custom classifier is not supported. If accuracy is poor in production, you remove the classifier and rebuild it with larger, better sample sets. Get it right before you publish.
Creator-only by default
By default, only the account that creates a custom classifier can train it and review its predictions. Pick the owner deliberately, not whoever happened to be logged in.
E5-level licensing
Trainable classifiers sit in the Microsoft 365 E5 / E5 Compliance feature set. Check the Microsoft 365 licensing guidance for security and compliance for specifics.
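The five constraints, together with the support table above, can be turned into a pre-flight check to run before a project starts. This is a minimal sketch: the `ClassifierPlan` fields, the solution names, and the `blockers` function are hypothetical names chosen for the example, and the check reflects only what this page states.

```python
from dataclasses import dataclass

# Solutions that accept custom classifiers, per the table above.
CUSTOM_SUPPORTED = {"auto-labelling", "auto-apply retention", "dlp"}


@dataclass
class ClassifierPlan:
    content_language: str    # e.g. "en"
    content_encrypted: bool  # protected by an encrypting sensitivity label?
    target_solution: str     # where the classifier will be used
    has_e5_compliance: bool  # E5 / E5 Compliance licensing in place?
    owner_account: str       # account that will create the classifier


def blockers(plan: ClassifierPlan) -> list[str]:
    """Return the constraints this plan runs into; empty means none found."""
    found = []
    if plan.content_language.lower() != "en":
        found.append("custom classifiers support English content only")
    if plan.content_encrypted:
        found.append("encrypted items are not evaluated")
    if plan.target_solution.lower() not in CUSTOM_SUPPORTED:
        found.append(
            f"{plan.target_solution} is not listed as accepting custom classifiers"
        )
    if not plan.has_e5_compliance:
        found.append("trainable classifiers sit in the E5 / E5 Compliance feature set")
    if not plan.owner_account:
        found.append("no owner chosen; only the creating account can train and review")
    return found
```

The no-retraining constraint has no field here because it is a process rule rather than a property of the plan: it argues for spending longer in review before publishing.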