> EVAL_SERVICES

Catch Model Failures Before Production.

We test your model's predictions for instability and give you a sealed report. Find problems before users do.

Find silent regressions pre-deployment
Court-admissible audit logs for regulatory compliance
Independent auditor validation with sealed evidence
Request Evaluation →
Designed for
AI Safety Labs
MLOps Teams
Regulated Industries

⚠️ The Problem

What standard testing misses.

P0/P1 incidents from edge case failures

Production outages cost $100k-$5M per incident in downtime + customer churn

Benchmark accuracy hides systematic brittleness

95% accuracy on the benchmark, yet predictions collapse when inputs vary slightly

In-house testing has blind spots

You test what you expect, miss what you don't

Silent regressions slip through CI/CD

Model updates break edge cases; production finds out first

CI evaluation catches collapse patterns before production

$10k-$30k audit vs $100k-$5M P0 incident. Third-party sealed logs prevent blame games during post-mortems.


🎯 Who Uses This

Who this is for.

🔬 AI Safety Labs

Red team brittleness before model release. Find collapse patterns standard adversarial testing misses.

Pain Point: Benchmark scores don't catch systematic edge case failures

⚙️ MLOps Teams

Pre-deployment regression testing. Validate model updates won't break production edge cases.

Pain Point: Silent regressions in model updates discovered post-deployment

🏛️ Regulated Industries

Healthcare, finance, and government: compliance-ready audit logs. Sealed evidence for regulatory validation.

Pain Point: Need third-party validation auditors will accept

📦 Deliverables

What you get.

Send us your prediction dataset. Get back a complete diagnostic suite within 24-72 hours.

🎯 Risk Triage JSON

Trinity taxonomy: CI Tiers (stable → critical), CSI Types (I-V), SRI Grades (A-F). Flags isolated high-confidence errors and brittle rows. Machine-readable format for pipeline integration.

Impact: Feed directly into your CI/CD for automated analysis
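
For illustration, a CI gate over that JSON might look like the sketch below. The file name and field names ("rows", "ci_tier") are placeholders for this sketch, not the documented triage schema.

```python
# Illustrative CI gate over the triage JSON. Field names are assumptions,
# not the documented schema.
import json
import sys

with open("triage.json") as f:
    triage = json.load(f)

# Block the deploy if any row landed in the critical CI tier.
critical = [row for row in triage.get("rows", []) if row.get("ci_tier") == "critical"]
if critical:
    print(f"{len(critical)} critical-tier rows found; failing the pipeline.")
    sys.exit(1)
print("No critical-tier rows; safe to proceed.")
```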

📊 HTML Summary

CI + SRI scores, Trinity verdict, AUC metrics for all three signals (CI, SRI, Confidence), ROC curves, risk distribution histograms, triage hotspots, and stability interpretation.

Impact: Executive-ready report for stakeholders

📜 Collapse Log

Row-level forensic CSV with per-case CI scores and flip detection.

Impact: Engineers get exact failure cases to debug

🔒 Sealed Package

Cryptographically sealed ZIP with SHA-256 manifests for court-ready integrity.

Impact: Audit trail regulators will accept
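
As a rough sketch of how a recipient could re-verify package integrity, assuming the manifest maps relative file paths to SHA-256 digests (the actual snapshot.json layout may differ):

```python
# Sketch of re-verifying a sealed package against its SHA-256 manifest.
# The manifest layout ({"files": {path: digest}}) is an assumption.
import hashlib
import json
from pathlib import Path

manifest = json.loads(Path("snapshot.json").read_text())

for rel_path, expected in manifest["files"].items():
    digest = hashlib.sha256(Path(rel_path).read_bytes()).hexdigest()
    print(("OK      " if digest == expected else "TAMPERED"), rel_path)
```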

🛡️ Data Quality Audit

Automatic cleaning report: rows dropped, missing values detected, format issues flagged before analysis.

Impact: Know if bad data affected your results

🔐 Fast & Private

24-72 hour delivery with a zero-retention policy and a fully automated pipeline. We destroy datasets after delivery and keep only cryptographic manifests (snapshot.json).

Impact: Results before sprint ends + works with sensitive data

🚫 Abstention + Epistemic Guardrails

Free Add-on on Request

Instant deployment safety layer. Two-gate flagging system (CI + Confidence) that identifies which predictions to route to human review. Plus epistemic guardrails that de-weaponize uncatchable Type I errors (stable + confident + wrong) by removing epistemic authority instead of abstaining. No retraining, no hyperparameter tuning. Works on any model, any dataset, right now.

  • Annotated Dataset: Full CSV with abstention + behavior policy columns
  • Summary Report: Gate breakdown + EG coverage stats
  • Epistemic Guardrails: De-weaponize Type I errors with behavior policies

Sample Output

prompt_id    ci_score  confidence  is_error  should_abstain  abstention_reason  behavior_policy
sst2_00605   0.87      0.98        1         True            high_ci            -
sst2_00184   0.76      0.99        1         True            high_ci            -
sst2_00042   0.12      0.52        0         True            low_conf           -
sst2_00891   0.08      0.94        0         False           -                  -
sst2_01247   0.03      0.96        1         False           -                  assertion_guarded
Abstention Case Study → Epistemic Guardrails Case Study →
🛡️ Two-Layer Defense
Abstention (Catchable Errors):

Route high-CI or low-confidence predictions to human review. Catches unstable or uncertain predictions before they cause harm.

Epistemic Guardrails (Type I Ghosts):

Type I errors (stable + confident + wrong) can't be caught by CI or confidence. Instead of abstaining and losing coverage, we de-weaponize them by removing epistemic authority. Same prediction, same coverage, lower harm.

Example: Traffic light misclassification (green → red).
Before EG: "The traffic light is red." (dangerous assertion)
After EG: "I may be mistaken, but the traffic light is red." (epistemic hedge removes authority)
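
A minimal sketch of how the two gates and the hedging policy could be wired together at inference time. The thresholds, reason labels, and the choice to hedge every prediction that clears both gates are illustrative assumptions, since Type I errors cannot be told apart from correct answers at runtime:

```python
# Illustrative two-gate routing plus an epistemic hedge. Thresholds and
# labels are assumptions, not the service's actual policy values.
def route(prediction: str, ci_score: float, confidence: float,
          ci_max: float = 0.5, conf_min: float = 0.6):
    if ci_score >= ci_max:          # Gate 1: unstable under perturbation
        return "abstain", "high_ci", prediction
    if confidence < conf_min:       # Gate 2: model is unsure
        return "abstain", "low_conf", prediction
    # Stable + confident: keep the answer, drop the epistemic authority.
    return "answer", "assertion_guarded", f"I may be mistaken, but {prediction}"

print(route("the traffic light is red.", ci_score=0.03, confidence=0.96))
# -> ('answer', 'assertion_guarded', 'I may be mistaken, but the traffic light is red.')
```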


🎁 Bonus Offer

Free Behavioral Regression Test

Order 2 analyses, get a comparison free. See what standard A/B testing misses.

🔍 What You Get

  • 8 behavior tags per sample
  • CSI type comparison (Type I-V shift)
  • risky_rows.csv listing Type I errors
  • Full reports (TXT, MD, CSV, JSON)
Best For: Fine-tuning validation, model upgrades

📊 Real Example

BERT vs DistilBERT on AG News (2,000 samples)

Accuracy: 92.5% → 93.7% (+1.2%)
Type I Errors: 123 → 62 (-50%!) 🎯
Verdict: Better + Safer

⚠️ Why This Matters

BERT had higher confidence (0.96 on errors) and lower CI (more stable). Standard metrics call this better. But 82.6% of its errors were Type I (stable + confident + wrong). DistilBERT cut that rate to 49.2% by introducing beneficial uncertainty.

Standard A/B testing would miss this entirely.
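
For a rough idea of the per-sample comparison involved, here is a sketch using hypothetical column names modeled on the sample output above; the delivered comparison report carries far more detail than this proxy:

```python
# Rough per-sample comparison of two collapse logs. Column names
# (prompt_id, ci_score, confidence, is_error) are assumptions.
import pandas as pd

a = pd.read_csv("baseline_collapse_log.csv")
b = pd.read_csv("candidate_collapse_log.csv")
m = a.merge(b, on="prompt_id", suffixes=("_a", "_b"))

# Type I proxy: wrong while stable (low CI) and confident.
t1_a = (m.is_error_a == 1) & (m.ci_score_a < 0.2) & (m.confidence_a > 0.9)
t1_b = (m.is_error_b == 1) & (m.ci_score_b < 0.2) & (m.confidence_b > 0.9)

print("Type I proxy count:", int(t1_a.sum()), "->", int(t1_b.sum()))
print("Newly risky samples:", m.loc[t1_b & ~t1_a, "prompt_id"].head(10).tolist())
```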

How It Works

  1. Order CI evaluations for 2 models (e.g., GPT-4 vs Claude, baseline vs fine-tuned)
  2. We analyze both models with full triage (CSI types, behavioral tags)
  3. Get a free comparison report showing per-sample behavioral changes
  4. See exactly which samples regressed and why
Request 2 Analyses + Free Comparison

Mention "behavioral regression test" in your email


🔬 Structural Retention Index (SRI)

Two signals, not one.

Detect hidden brittleness that CI alone might miss. SRI measures internal reasoning coherence and complements CI's view of output stability.

✨ What Does Each Measure?

CI: How much your model cracks under meaning-preserving perturbations. SRI: How well your model holds its decision structure across variants. Together, they catch failures traditional metrics miss entirely.

Key Insight: A model can have stable predictions but collapsing internal reasoning

🚨 What Gets Caught?

CI catches when your model cracks. SRI catches structural decay. Hidden instability: models that pass QA with confident outputs but have fragile reasoning that fails under real-world stress.

💎 Included Automatically

Every professional CI evaluation includes SRI analysis at no extra cost. You get both signals: how much your model cracks and how well it holds structure.

Best For: High-stakes AI (medical, fraud detection, autonomous systems)

📋 Requirements

What we need from you.

Simple CSV or Parquet dataset containing your model predictions. No weights, no code.

id        variant_id  true_label  pred_label  confidence
case_001  base        Positive    Positive    0.92
case_001  v1          Positive    Positive    0.89
case_001  v2          Positive    Negative    0.71
case_002  base        Negative    Negative    0.95
case_002  v1          Negative    Negative    0.93

* The third row shows a flip (same input semantics, different prediction). 3+ variants per base ID recommended.
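
For a quick self-check before sending data, a few lines of pandas over this schema will count base IDs whose variants disagree. This is only a raw flip count, not the CI computation itself:

```python
# Count base ids whose variant predictions disagree (flip check only).
import pandas as pd

df = pd.read_csv("predictions.csv")
flips = df.groupby("id")["pred_label"].nunique().loc[lambda n: n > 1]
print(f"{len(flips)} of {df['id'].nunique()} base ids flip across variants.")
print(flips.head())
```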


🔬 Don't Have Predictions Yet?

We can run inference for you.

No predictions? No problem. Send us your raw prompts + ground truth labels, and we'll generate variant predictions on your behalf.

⚡ Managed Inference Included Free

We run your model on Modal serverless GPUs. Isolated, ephemeral instances that spin down after each job. Your data never touches OpenAI, Anthropic, or major cloud providers.

Your model weights or API endpoint
Automatic perturbation generation (typos, paraphrases, synonyms)
Output: CI-ready prediction CSV with variants

Or run it yourself with:

SGLang - Fast structured generation

Best for programmatic perturbations (typos, paraphrases, synonym swaps)

vLLM - High-throughput batching

Best for large-scale variant generation (100k+ cases)
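
If you roll your own perturbations, even a simple character-swap generator like the sketch below produces usable typo variants; paraphrase and synonym variants take more care. This function is illustrative only and not part of our tooling:

```python
# Minimal character-swap typo generator (illustrative only).
import random

def typo_variants(text: str, n: int = 3, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        chars = list(text)
        i = rng.randrange(len(chars) - 1)          # pick an adjacent pair
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        out.append("".join(chars))
    return out

print(typo_variants("The movie was surprisingly good."))
```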


🚀 How to Get Started

Simple 4-step process.

1

Pick Your Tier

Standard (up to 100k rows) or Enterprise (100k+ rows, custom SLAs)

2

Send Dataset

Email CSV/Parquet with predictions. No model weights, no code. Encrypted transfer.

3

Receive Sealed Report

Get complete diagnostic ZIP within 24-72 hours (tier-dependent)

4

Pay by Invoice

Net 15 terms. Credit card, ACH, or wire transfer via Stripe.

Start Your Audit →

💰 Transparent Pricing

Pick your tier.

No hidden fees. Pay per audit, no subscriptions.

🚀 Free Pilot Program Limited Slots

For startups, organizations, and academic institutions. We'll run a free pilot on your data so you can see the value before committing.

Up to 5k base rows + 3 variants
Full deliverables
No credit card required
Request Free Pilot →
Standard
$10k–$30k

Most common option for production audits

  • HTML report + raw CSV data + triage JSON
  • Up to 100k base rows
  • Custom variant count (3v, 5v, 10v+)
  • 48-72 hour turnaround
  • Email support (12-24h response)
  • Sealed ZIP delivery
  • Cohort slice analysis included
  • Free managed inference on request
  • Free abstention analysis + annotated dataset on request
  • 🎁 Buy 2 analyses → Free comparison report
Request Quote
💰 What drives pricing?
Dataset size: 10k rows → $10k base. 100k rows → $20k-$30k
Complexity: Multi-class, custom analysis (+$2k-$8k)
Turnaround: Standard (48-72h) included. Rush <24h (+$2k)

💡 ROI Example: One prevented production outage typically costs $100k–$500k. Our audit: $10k–$30k.


❓ FAQ

Common questions.

What do I need to send?
A CSV or Parquet file with your model predictions. Required columns: id, variant_id, true_label, pred_label, and confidence. No model weights, no source code. See the Requirements section for the full schema.

How fast is turnaround?
Standard tier: 48-72 hours. Enterprise tier with rush delivery: 24 hours. Turnaround starts when we receive your dataset and payment confirmation.

What happens to my data?
Zero retention. We destroy your dataset after delivering results. We only keep cryptographic manifests (snapshot.json) for audit trail integrity. Your data never touches third-party services.

What does the Collapse Index (CI) measure?
CI measures prediction stability under benign input variation. A model with high accuracy but low stability will flip predictions when inputs change slightly (typos, paraphrases, formatting). We quantify this brittleness and identify exactly which cases are at risk.
📚 Research-Backed Framework

Published methodology: DOI: 10.5281/zenodo.17718180

What does the Structural Retention Index (SRI) measure?
SRI measures internal reasoning coherence across input variants. A model can have stable predictions (low CI) but collapsing decision structure. SRI catches structural decay that confidence scores miss entirely — critical for high-stakes applications where you need both stable outputs AND stable reasoning.
📚 Research-Backed Framework

Published methodology: DOI: 10.5281/zenodo.18016507

Is the evidence court-admissible?
Yes. All reports include cryptographically sealed evidence with SHA-256 manifests. Third-party validation with tamper-proof audit trails.

Why not just test in-house?
Credibility: Third-party sealed logs are legally defensible and blame-neutral during post-mortems. In-house testing creates conflicts of interest.

Speed: Our framework has been stress-tested across multiple model architectures and data domains. Building equivalent tooling in-house takes a full team (ML engineers, researchers, infrastructure) 6-12 months and $300k-$500k TCO. Most teams never achieve production-grade reliability.

Coverage: We've validated edge case generation across different task types (classification, NLU, generation). You test what you expect. We find what you don't (and probably wouldn't think to test).

How is this different from observability platforms?
Different goals: observability platforms (Arize, Galileo, WhyLabs) monitor production drift, catching issues after deployment.

We catch pre-deployment brittleness: CI evaluation runs before you ship, finding silent regressions in staging. Think of it as a stress test that feeds into your observability stack.

Works together: Use our audit to validate releases, then monitor with your existing tools. Our triage.json format integrates directly with standard drift dashboards.

Do you offer volume discounts?
Enterprise customers with recurring evaluation needs can negotiate volume discounts and custom SLAs. Contact us at ask@collapseindex.org to discuss your requirements.