Catch Model Failures Before Production.
We test your model's predictions for instability and give you a sealed report. Find problems before users do.
What standard testing misses.
Production outages cost $100k-$5M per incident in downtime + customer churn
95% accuracy → but predictions collapse when inputs vary slightly
You test what you expect, miss what you don't
Model updates break edge cases, prod finds out first
$10k-$30k audit vs $100k-$5M P0 incident. Third-party sealed logs prevent blame games during post-mortems.
Who this is for.
🔬 AI Safety Labs
Red team brittleness before model release. Find collapse patterns standard adversarial testing misses.
⚙️ MLOps Teams
Pre-deployment regression testing. Validate model updates won't break production edge cases.
🏛️ Regulated Industries
Healthcare, finance, gov: compliance-ready audit logs. Sealed evidence for regulatory validation.
What you get.
Send us your prediction dataset. Get back a complete diagnostic suite within 24-72 hours.
🎯 Risk Triage JSON
Trinity taxonomy: CI Tiers (stable → critical), CSI Types (I–V), SRI Grades (A–F). Flags isolated high-confidence errors and brittle rows. Machine-readable format for pipeline integration.
📊 HTML Summary
CI + SRI scores, Trinity verdict, AUC metrics for all three signals (CI, SRI, Confidence), ROC curves, risk distribution histograms, triage hotspots, and stability interpretation.
📜 Collapse Log
Row-level forensic CSV with per-case CI scores and flip detection.
🔒 Sealed Package
Cryptographically sealed ZIP with SHA-256 manifests for court-ready integrity.
🛡️ Data Quality Audit
Automatic cleaning report: rows dropped, missing values detected, format issues flagged before analysis.
🔐 Fast & Private
24-72 hour delivery with a zero-retention policy. Automated pipeline. We destroy datasets after delivery and keep only cryptographic manifests (snapshot.json).
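To illustrate, here is a minimal sketch of how you might verify a delivered package against its SHA-256 manifest. It assumes the manifest is a flat JSON object mapping relative file paths to hex digests; the actual snapshot.json schema ships with your report and may differ.

```python
import hashlib
import json
from pathlib import Path

def verify_delivery(package_dir: str, manifest_path: str = "snapshot.json") -> bool:
    """Recompute SHA-256 digests for delivered files and compare them to the manifest.

    Assumes the manifest is a JSON object mapping relative file paths to
    hex-encoded SHA-256 digests; the real snapshot.json schema may differ.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    ok = True
    for rel_path, expected in manifest.items():
        actual = hashlib.sha256((Path(package_dir) / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            print(f"MISMATCH: {rel_path}")
            ok = False
    return ok

if __name__ == "__main__":
    print("integrity OK" if verify_delivery("unzipped_report") else "integrity check FAILED")
```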
🚫 Abstention + Epistemic Guardrails
Free add-on on request. An instant deployment safety layer: a two-gate flagging system (CI + Confidence) that identifies which predictions to route to human review, plus epistemic guardrails that de-weaponize uncatchable Type I errors (stable + confident + wrong) by removing epistemic authority instead of abstaining. No retraining, no hyperparameter tuning. Works on any model, any dataset, right now.
Full CSV with abstention + behavior policy columns
Gate breakdown + EG coverage stats
De-weaponize Type I errors with behavior policies
Sample Output
| prompt_id | ci_score | confidence | is_error | should_abstain | abstention_reason | behavior_policy |
|---|---|---|---|---|---|---|
| sst2_00605 | 0.87 | 0.98 | 1 | True | high_ci | |
| sst2_00184 | 0.76 | 0.99 | 1 | True | high_ci | |
| sst2_00042 | 0.12 | 0.52 | 0 | True | low_conf | |
| sst2_00891 | 0.08 | 0.94 | 0 | False | | |
| sst2_01247 | 0.03 | 0.96 | 1 | False | | assertion_guarded |
Route high-CI or low-confidence predictions to human review. Catches unstable or uncertain predictions before they cause harm.
Type I errors (stable + confident + wrong) can't be caught by CI or confidence. Instead of abstaining and losing coverage, we de-weaponize them by removing epistemic authority. Same prediction, same coverage, lower harm.
Example: Traffic light misclassification (green → red).
Before EG: "The traffic light is red." (dangerous assertion)
After EG: "I may be mistaken, but the traffic light is red." (epistemic hedge removes authority)
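For a concrete picture of the routing logic, here is a minimal sketch of the two-gate annotation and the epistemic-guardrail tag shown in the sample output above. The thresholds, column names, and the `hedge` helper are illustrative assumptions, not the calibrated values used in the delivered analysis.

```python
import pandas as pd

# Illustrative thresholds -- assumptions for this sketch, not production defaults.
CI_THRESHOLD = 0.5     # at or above: prediction considered unstable
CONF_THRESHOLD = 0.6   # at or below: prediction considered uncertain

def annotate(row: pd.Series) -> pd.Series:
    """Two-gate abstention, then an epistemic-guardrail tag for Type I errors.

    `is_error` is available here because this runs on an evaluation set with
    ground truth; at deployment time only the gates and the hedging policy apply.
    """
    if row["ci_score"] >= CI_THRESHOLD:
        return pd.Series({"should_abstain": True, "abstention_reason": "high_ci", "behavior_policy": ""})
    if row["confidence"] <= CONF_THRESHOLD:
        return pd.Series({"should_abstain": True, "abstention_reason": "low_conf", "behavior_policy": ""})
    if row["is_error"] == 1:
        # Stable + confident + wrong: neither gate fires, so keep the prediction
        # but mark it for hedged delivery instead of abstention.
        return pd.Series({"should_abstain": False, "abstention_reason": "", "behavior_policy": "assertion_guarded"})
    return pd.Series({"should_abstain": False, "abstention_reason": "", "behavior_policy": ""})

def hedge(assertion: str) -> str:
    """Epistemic guardrail: same content, authority removed."""
    return f"I may be mistaken, but {assertion[0].lower() + assertion[1:]}"

rows = pd.DataFrame({
    "prompt_id":  ["sst2_00605", "sst2_00042", "sst2_01247"],
    "ci_score":   [0.87, 0.12, 0.03],
    "confidence": [0.98, 0.52, 0.96],
    "is_error":   [1, 0, 1],
})
print(rows.join(rows.apply(annotate, axis=1)))
print(hedge("The traffic light is red."))
```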
Free Behavioral Regression Test
Order two analyses and get a comparison report free. See what standard A/B testing misses.
🔍 What You Get
- 8 behavior tags per sample
- CSI type comparison (Type I-V shift)
- risky_rows.csv listing Type I errors
- Full reports (TXT, MD, CSV, JSON)
📊 Real Example
BERT vs DistilBERT on AG News (2,000 samples)
⚠️ Why This Matters
BERT had higher confidence (0.96 on errors) and lower CI (more stable). Standard metrics call this better. But it had an 82.6% Type I error rate; DistilBERT reduced it to 49.2% by introducing beneficial uncertainty.
How It Works
- Order CI evaluations for 2 models (e.g., GPT-4 vs Claude, baseline vs fine-tuned)
- We analyze both models with full triage (CSI types, behavioral tags)
- Get a free comparison report showing per-sample behavioral changes
- See exactly which samples regressed and why
Mention "behavioral regression test" in your email
Two signals, not one.
Detect hidden brittleness that CI alone might miss. SRI measures internal reasoning coherence, giving you a signal complementary to CI.
✨ What Does Each Measure?
CI: How much your model cracks under meaning-preserving perturbations. SRI: How well your model holds its decision structure across variants. Together, they catch failures traditional metrics miss entirely.
🚨 What Gets Caught?
CI catches when your model cracks. SRI catches structural decay. Hidden instability: models that pass QA with confident outputs but have fragile reasoning that fails under real-world stress.
💎 Included Automatically
Every professional CI evaluation includes SRI analysis at no extra cost. You get both signals: how much your model cracks and how well it holds structure.
What we need from you.
Simple CSV or Parquet dataset containing your model predictions. No weights, no code.
| id | variant_id | true_label | pred_label | confidence |
|---|---|---|---|---|
| case_001 | base | Positive | Positive | 0.92 |
| case_001 | v1 | Positive | Positive | 0.89 |
| case_001 | v2 | Positive | Negative | 0.71 |
| case_002 | base | Negative | Negative | 0.95 |
| case_002 | v1 | Negative | Negative | 0.93 |
* The third row shows a flip (same input semantics, different prediction). 3+ variants per base ID recommended.
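Before sending, you can sanity-check your file with a few lines of pandas. This sketch uses the columns shown above; the per-case flip flag is only a rough brittleness proxy, not the published CI metric.

```python
import pandas as pd

REQUIRED = ["id", "variant_id", "true_label", "pred_label", "confidence"]

df = pd.read_csv("predictions.csv")   # or pd.read_parquet(...)
missing = [c for c in REQUIRED if c not in df.columns]
assert not missing, f"missing columns: {missing}"

# A case "flips" when its variants disagree on the predicted label.
per_case = df.groupby("id").agg(
    n_variants=("variant_id", "nunique"),
    n_labels=("pred_label", "nunique"),
)
per_case["flips"] = per_case["n_labels"] > 1

print(per_case[per_case["n_variants"] < 3])   # cases below the recommended 3+ variants
print(f"{per_case['flips'].mean():.1%} of cases flip across variants")
```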
We can run inference for you.
No predictions? No problem. Send us your raw prompts + ground truth labels, and we'll generate variant predictions on your behalf.
We run your model on Modal serverless GPUs. Isolated, ephemeral instances that spin down after each job. Your data never touches OpenAI, Anthropic, or major cloud providers.
Or run it yourself with:
Best for programmatic perturbations (typos, paraphrases, synonym swaps)
Best for large-scale variant generation (100k+ cases)
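If you generate variants yourself, here is a rough sketch of the kind of programmatic perturbation we mean (typo injection, synonym swap) and the row shape to emit. The helper names and the toy synonym table are illustrative only; real pipelines typically rely on paraphrase models or dedicated perturbation libraries.

```python
import random

def typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters -- a crude, meaning-preserving typo."""
    if len(text) < 3:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

SYNONYMS = {"movie": "film", "great": "excellent", "bad": "terrible"}  # toy lookup

def synonym_swap(text: str) -> str:
    return " ".join(SYNONYMS.get(w.lower(), w) for w in text.split())

def make_variants(case_id: str, prompt: str, n: int = 3, seed: int = 0):
    """Yield (id, variant_id, prompt_text) rows matching the table layout above."""
    rng = random.Random(seed)
    yield case_id, "base", prompt
    yield case_id, "v1", synonym_swap(prompt)
    for k in range(2, n + 1):
        yield case_id, f"v{k}", typo(prompt, rng)

for row in make_variants("case_001", "A great movie with a bad ending"):
    print(row)
```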
Simple 4-step process.
Pick Your Tier
Standard (up to 100k rows) or Enterprise (100k+ rows, custom SLAs)
Send Dataset
Email CSV/Parquet with predictions. No model weights, no code. Encrypted transfer.
Receive Sealed Report
Get complete diagnostic ZIP within 24-72 hours (tier-dependent)
Pay by Invoice
Net 15 terms. Credit card, ACH, or wire transfer via Stripe.
Pick your tier.
No hidden fees. Pay per audit, no subscriptions.
For startups, organizations, and academic institutions. We'll run a free pilot on your data so you can see the value before committing.
Most common option for production audits
- HTML report + raw CSV data + triage JSON
- Up to 100k base rows
- Custom variant count (3v, 5v, 10v+)
- 48-72 hour turnaround
- Email support (12-24h response)
- Sealed ZIP delivery
- Cohort slice analysis included
- Free managed inference on request
- Free abstention analysis + annotated dataset on request
- 🎁 Buy 2 analyses → Free comparison report
For high-volume, urgent, or custom requirements
- Everything in Standard
- 24-hour rush delivery available
- Priority email (4-hour response)
- Custom SLAs & compliance packages
- White-label reports available
💡 ROI Example: One prevented production outage typically saves $100k–$500k. Our audit: $10k–$30k.
Common questions.
Published methodology: DOI: 10.5281/zenodo.17718180
Published methodology: DOI: 10.5281/zenodo.18016507
Speed: Our framework has been stress-tested across multiple model architectures and data domains. Building equivalent tooling in-house typically takes a full team (ML engineers, researchers, infrastructure) 6-12 months and $300k-$500k in total cost of ownership. Most teams never reach production-grade reliability.
Coverage: We've validated edge case generation across different task types (classification, NLU, generation). You test what you expect. We find what you don't (and probably wouldn't think to test).
We catch pre-deployment brittleness: CI evaluation runs before you ship, finding silent regressions in staging. Think of it as a stress test that feeds into your observability stack.
Works together: Use our audit to validate releases, then monitor with your existing tools. Our triage.json format integrates directly with standard drift dashboards.
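For example, a release job could gate directly on the triage JSON. The field names below ("rows", "ci_tier") are illustrative assumptions; consult the schema documentation delivered with your report.

```python
import json
import sys

# Field names ("rows", "ci_tier") are illustrative assumptions; the actual
# schema is documented alongside the delivered triage JSON.
with open("triage.json") as f:
    triage = json.load(f)

critical = [r for r in triage.get("rows", []) if r.get("ci_tier") == "critical"]
print(f"{len(critical)} critical-tier rows flagged")

# Example gate: fail the release pipeline if any critical-tier rows remain.
sys.exit(1 if critical else 0)
```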