> EVAL_SERVICES

Catch Model Failures Before Production.

We test your model's predictions for instability and give you a sealed report. Find problems before users do.

Find silent regressions pre-deployment
Court-admissible audit logs for regulatory compliance
Independent auditor validation with sealed evidence
Request Evaluation →
Designed for
AI Safety Labs
MLOps Teams
Regulated Industries

⚠️ The Problem

What standard testing misses.

P0/P1 incidents from edge case failures

Production outages cost $100k-$5M per incident in downtime + customer churn

Benchmark accuracy hides systematic brittleness

95% accuracy on the benchmark, yet predictions collapse when inputs vary slightly

In-house testing has blind spots

You test what you expect, miss what you don't

Silent regressions slip through CI/CD

Model updates break edge cases; production finds out first

CI evaluation catches collapse patterns before production

$10k-$30k audit vs $100k-$5M P0 incident. Third-party sealed logs prevent blame games during post-mortems.


🎯 Who Uses This

Who this is for.

🔬 AI Safety Labs

Red team brittleness before model release. Find collapse patterns standard adversarial testing misses.

Pain Point: Benchmark scores don't catch systematic edge case failures

⚙️ MLOps Teams

Pre-deployment regression testing. Validate model updates won't break production edge cases.

Pain Point: Silent regressions in model updates discovered post-deployment

🏛️ Regulated Industries

Healthcare, finance, and government: compliance-ready audit logs. Sealed evidence for regulatory validation.

Pain Point: Need third-party validation auditors will accept

📦 Deliverables

What you get.

Send us your prediction dataset. Get back a complete diagnostic suite within 24-72 hours.

🎯 Risk Triage JSON

Trinity taxonomy: CI Tiers (stable → critical), CSI Types (I-V), SRI Grades (A-F). Flags isolated high-confidence errors and brittle rows. Machine-readable format for pipeline integration.

Impact: Feed directly into your CI/CD for automated analysis
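
For illustration, a CI gate over that JSON might look like the sketch below. The file name and field names ("rows", "ci_tier") are placeholders for this sketch, not the documented triage schema.

```python
# Illustrative CI gate over the triage JSON. Field names are assumptions,
# not the documented schema.
import json
import sys

with open("triage.json") as f:
    triage = json.load(f)

# Block the deploy if any row landed in the critical CI tier.
critical = [row for row in triage.get("rows", []) if row.get("ci_tier") == "critical"]
if critical:
    print(f"{len(critical)} critical-tier rows found; failing the pipeline.")
    sys.exit(1)
print("No critical-tier rows; safe to proceed.")
```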

📊 HTML Summary

CI + SRI scores, Trinity verdict, AUC metrics for all three signals (CI, SRI, Confidence), ROC curves, risk distribution histograms, triage hotspots, and stability interpretation.

Impact: Executive-ready report for stakeholders

📜 Collapse Log

Row-level forensic CSV with per-case CI scores and flip detection.

Impact: Engineers get exact failure cases to debug

🔒 Sealed Package

Cryptographically sealed ZIP with SHA-256 manifests for court-ready integrity.

Impact: Audit trail regulators will accept
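
As a rough sketch of how a recipient could re-verify package integrity, assuming the manifest maps relative file paths to SHA-256 digests (the actual snapshot.json layout may differ):

```python
# Sketch of re-verifying a sealed package against its SHA-256 manifest.
# The manifest layout ({"files": {path: digest}}) is an assumption.
import hashlib
import json
from pathlib import Path

manifest = json.loads(Path("snapshot.json").read_text())

for rel_path, expected in manifest["files"].items():
    digest = hashlib.sha256(Path(rel_path).read_bytes()).hexdigest()
    print(("OK      " if digest == expected else "TAMPERED"), rel_path)
```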

🛡️ Data Quality Audit

Automatic cleaning report: rows dropped, missing values detected, format issues flagged before analysis.

Impact: Know if bad data affected your results

🔐 Fast & Private

24-72 hour delivery with a zero-retention policy and a fully automated pipeline. We destroy datasets after delivery and keep only cryptographic manifests (snapshot.json).

Impact: Results before sprint ends + works with sensitive data

🚫 Abstention + Epistemic Guardrails

Free Add-on on Request

Instant deployment safety layer. Two-gate flagging system (CI + Confidence) that identifies which predictions to route to human review. Plus epistemic guardrails that de-weaponize uncatchable Type I errors (stable + confident + wrong) by removing epistemic authority instead of abstaining. No retraining, no hyperparameter tuning. Works on any model, any dataset, right now.

  • Annotated Dataset: Full CSV with abstention + behavior policy columns
  • Summary Report: Gate breakdown + EG coverage stats
  • Epistemic Guardrails: De-weaponize Type I errors with behavior policies

Sample Output

prompt_id    ci_score  confidence  is_error  should_abstain  abstention_reason  behavior_policy
sst2_00605   0.87      0.98        1         True            high_ci            -
sst2_00184   0.76      0.99        1         True            high_ci            -
sst2_00042   0.12      0.52        0         True            low_conf           -
sst2_00891   0.08      0.94        0         False           -                  -
sst2_01247   0.03      0.96        1         False           -                  assertion_guarded
Abstention Case Study → Epistemic Guardrails Case Study →
🛡️ Two-Layer Defense
Abstention (Catchable Errors):

Route high-CI or low-confidence predictions to human review. Catches unstable or uncertain predictions before they cause harm.

Epistemic Guardrails (Type I Ghosts):

Type I errors (stable + confident + wrong) can't be caught by CI or confidence. Instead of abstaining and losing coverage, we de-weaponize them by removing epistemic authority. Same prediction, same coverage, lower harm.

Example: Traffic light misclassification (green → red).
Before EG: "The traffic light is red." (dangerous assertion)
After EG: "I may be mistaken, but the traffic light is red." (epistemic hedge removes authority)
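
A minimal sketch of how the two gates and the hedging policy could be wired together at inference time. The thresholds, reason labels, and the choice to hedge every prediction that clears both gates are illustrative assumptions, since Type I errors cannot be told apart from correct answers at runtime:

```python
# Illustrative two-gate routing plus an epistemic hedge. Thresholds and
# labels are assumptions, not the service's actual policy values.
def route(prediction: str, ci_score: float, confidence: float,
          ci_max: float = 0.5, conf_min: float = 0.6):
    if ci_score >= ci_max:          # Gate 1: unstable under perturbation
        return "abstain", "high_ci", prediction
    if confidence < conf_min:       # Gate 2: model is unsure
        return "abstain", "low_conf", prediction
    # Stable + confident: keep the answer, drop the epistemic authority.
    return "answer", "assertion_guarded", f"I may be mistaken, but {prediction}"

print(route("the traffic light is red.", ci_score=0.03, confidence=0.96))
# -> ('answer', 'assertion_guarded', 'I may be mistaken, but the traffic light is red.')
```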


🎁 Bonus Offer

Free Behavioral Regression Test

Order 2 analyses, get a comparison free. See what standard A/B testing misses.

🔍 What You Get

  • 8 behavior tags per sample
  • CSI type comparison (Type I-V shift)
  • risky_rows.csv listing Type I errors
  • Full reports (TXT, MD, CSV, JSON)
Best For: Fine-tuning validation, model upgrades

📊 Real Example

BERT vs DistilBERT on AG News (2,000 samples)

Accuracy: 92.5% → 93.7% (+1.2%)
Type I Errors: 123 → 62 (-50%!) 🎯
Verdict: Better + Safer

⚠️ Why This Matters

BERT had higher confidence (0.96 on errors) and lower CI (more stable). Standard metrics call this better. But 82.6% of its errors were Type I (stable + confident + wrong). DistilBERT cut that rate to 49.2% by introducing beneficial uncertainty.

Standard A/B testing would miss this entirely.
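
For a rough idea of the per-sample comparison involved, here is a sketch using hypothetical column names modeled on the sample output above; the delivered comparison report carries far more detail than this proxy:

```python
# Rough per-sample comparison of two collapse logs. Column names
# (prompt_id, ci_score, confidence, is_error) are assumptions.
import pandas as pd

a = pd.read_csv("baseline_collapse_log.csv")
b = pd.read_csv("candidate_collapse_log.csv")
m = a.merge(b, on="prompt_id", suffixes=("_a", "_b"))

# Type I proxy: wrong while stable (low CI) and confident.
t1_a = (m.is_error_a == 1) & (m.ci_score_a < 0.2) & (m.confidence_a > 0.9)
t1_b = (m.is_error_b == 1) & (m.ci_score_b < 0.2) & (m.confidence_b > 0.9)

print("Type I proxy count:", int(t1_a.sum()), "->", int(t1_b.sum()))
print("Newly risky samples:", m.loc[t1_b & ~t1_a, "prompt_id"].head(10).tolist())
```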

How It Works

  1. Order CI evaluations for 2 models (e.g., GPT-4 vs Claude, baseline vs fine-tuned)
  2. We analyze both models with full triage (CSI types, behavioral tags)
  3. Get a free comparison report showing per-sample behavioral changes
  4. See exactly which samples regressed and why
Request 2 Analyses + Free Comparison

Mention "behavioral regression test" in your email


🔬 Structural Retention Index (SRI)

Two signals, not one.

Detect hidden brittleness that CI alone might miss. SRI measures internal reasoning coherence and complements CI's view of output stability.

✨ What Does Each Measure?

CI: How much your model cracks under meaning-preserving perturbations. SRI: How well your model holds its decision structure across variants. Together, they catch failures traditional metrics miss entirely.

Key Insight: A model can have stable predictions but collapsing internal reasoning

🚨 What Gets Caught?

CI catches when your model cracks. SRI catches structural decay. Hidden instability: models that pass QA with confident outputs but have fragile reasoning that fails under real-world stress.

💎 Included Automatically

Every professional CI evaluation includes SRI analysis at no extra cost. You get both signals: how much your model cracks and how well it holds structure.

Best For: High-stakes AI (medical, fraud detection, autonomous systems)

📋 Requirements

What we need from you.

Simple CSV or Parquet dataset containing your model predictions. No weights, no code.

id        variant_id  true_label  pred_label  confidence
case_001  base        Positive    Positive    0.92
case_001  v1          Positive    Positive    0.89
case_001  v2          Positive    Negative    0.71
case_002  base        Negative    Negative    0.95
case_002  v1          Negative    Negative    0.93

* The third row shows a flip (same input semantics, different prediction). 3+ variants per base ID recommended.
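
For a quick self-check before sending data, a few lines of pandas over this schema will count base IDs whose variants disagree. This is only a raw flip count, not the CI computation itself:

```python
# Count base ids whose variant predictions disagree (flip check only).
import pandas as pd

df = pd.read_csv("predictions.csv")
flips = df.groupby("id")["pred_label"].nunique().loc[lambda n: n > 1]
print(f"{len(flips)} of {df['id'].nunique()} base ids flip across variants.")
print(flips.head())
```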


🔬 Don't Have Predictions Yet?

We can run inference for you.

No predictions? No problem. Send us your raw prompts + ground truth labels, and we'll generate variant predictions on your behalf.

⚡ Managed Inference Included Free

We run your model on Modal serverless GPUs. Isolated, ephemeral instances that spin down after each job. Your data never touches OpenAI, Anthropic, or major cloud providers.

Your model weights or API endpoint
Automatic perturbation generation (typos, paraphrases, synonyms)
Output: CI-ready prediction CSV with variants

Or run it yourself with:

SGLang - Fast structured generation

Best for programmatic perturbations (typos, paraphrases, synonym swaps)

vLLM - High-throughput batching

Best for large-scale variant generation (100k+ cases)
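
If you roll your own perturbations, even a simple character-swap generator like the sketch below produces usable typo variants; paraphrase and synonym variants take more care. This function is illustrative only and not part of our tooling:

```python
# Minimal character-swap typo generator (illustrative only).
import random

def typo_variants(text: str, n: int = 3, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        chars = list(text)
        i = rng.randrange(len(chars) - 1)          # pick an adjacent pair
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        out.append("".join(chars))
    return out

print(typo_variants("The movie was surprisingly good."))
```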


🚀 How to Get Started

Simple 4-step process.

1

Pick Your Tier

Standard (up to 100k rows) or Enterprise (100k+ rows, custom SLAs)

2

Send Dataset

Email CSV/Parquet with predictions. No model weights, no code. Encrypted transfer.

3

Receive Sealed Report

Get complete diagnostic ZIP within 24-72 hours (tier-dependent)

4

Pay by Invoice

Net 15 terms. Credit card, ACH, or wire transfer via Stripe.

Start Your Audit →

💰 Transparent Pricing

Pick your tier.

No hidden fees. Pay per audit, no subscriptions.

🚀 Free Pilot Program Limited Slots

For startups, organizations, and academic institutions. We'll run a free pilot on your data so you can see the value before committing.

Up to 5k base rows + 3 variants
Full deliverables
No credit card required
Request Free Pilot →
Standard
$10k–$30k

Most common option for production audits

  • HTML report + raw CSV data + triage JSON
  • Up to 100k base rows
  • Custom variant count (3v, 5v, 10v+)
  • 48-72 hour turnaround
  • Email support (12-24h response)
  • Sealed ZIP delivery
  • Cohort slice analysis included
  • Free managed inference on request
  • Free abstention analysis + annotated dataset on request
  • 🎁 Buy 2 analyses → Free comparison report
Request Quote
💰 What drives pricing?
Dataset size: 10k rows → $10k base. 100k rows → $20k-$30k
Complexity: Multi-class, custom analysis (+$2k-$8k)
Turnaround: Standard (48-72h) included. Rush <24h (+$2k)

💡 ROI Example: One prevented production outage typically costs $100k–$500k. Our audit: $10k–$30k.


❓ FAQ

Common questions.

What do I need to send?
A CSV or Parquet file with your model predictions. Required columns: id, variant_id, true_label, pred_label, and confidence. No model weights, no source code. See the Requirements section for the full schema.

How fast is turnaround?
Standard tier: 48-72 hours. Enterprise tier with rush delivery: 24 hours. Turnaround starts when we receive your dataset and payment confirmation.

What happens to my data?
Zero retention. We destroy your dataset after delivering results. We only keep cryptographic manifests (snapshot.json) for audit trail integrity. Your data never touches third-party services.

What does the Collapse Index (CI) measure?
CI measures prediction stability under benign input variation. A model with high accuracy but low stability will flip predictions when inputs change slightly (typos, paraphrases, formatting). We quantify this brittleness and identify exactly which cases are at risk.
📚 Research-Backed Framework

Published methodology: DOI: 10.5281/zenodo.17718180

What does the Structural Retention Index (SRI) measure?
SRI measures internal reasoning coherence across input variants. A model can have stable predictions (low CI) but collapsing decision structure. SRI catches structural decay that confidence scores miss entirely — critical for high-stakes applications where you need both stable outputs AND stable reasoning.
📚 Research-Backed Framework

Published methodology: DOI: 10.5281/zenodo.18016507

Is the evidence court-admissible?
Yes. All reports include cryptographically sealed evidence with SHA-256 manifests. Third-party validation with tamper-proof audit trails.

Why not just test in-house?
Credibility: Third-party sealed logs are legally defensible and blame-neutral during post-mortems. In-house testing creates conflicts of interest.

Speed: Our framework has been stress-tested across multiple model architectures and data domains. Building equivalent tooling in-house takes a full team (ML engineers, researchers, infrastructure) 6-12 months and $300k-$500k TCO. Most teams never achieve production-grade reliability.

Coverage: We've validated edge case generation across different task types (classification, NLU, generation). You test what you expect. We find what you don't (and probably wouldn't think to test).

How is this different from observability platforms?
Different goals: observability platforms (Arize, Galileo, WhyLabs) monitor production drift, catching issues after deployment.

We catch pre-deployment brittleness: CI evaluation runs before you ship, finding silent regressions in staging. Think of it as a stress test that feeds into your observability stack.

Works together: Use our audit to validate releases, then monitor with your existing tools. Our triage.json format integrates directly with standard drift dashboards.

Do you offer volume discounts?
Enterprise customers with recurring evaluation needs can negotiate volume discounts and custom SLAs. Contact us at ask@collapseindex.org to discuss your requirements.