Evidence Map

What Nerviq has proven, what it has not, and where to check.

Nerviq publishes several kinds of evidence — a public benchmark corpus, a human beta evaluation, case studies, a reproducible reference corpus, and a before/after proof repo. This page ties them together in one place, explains what each one proves and does not prove, and states clearly what the full chain still does not add up to.

Public proof · Maintained in the open · Report gaps via `nerviq feedback`

The hierarchy at a glance

Six artefacts, from broad (corpus-level robustness) to sharp (one worked example).

All repos Nerviq has ever run against
Public benchmark corpus: 61 repos · 6 platforms
Internal dev + smoke: uncounted
Per-platform calibrations: Claude · Cursor · Codex · Copilot · Gemini · Windsurf · Aider
Beta evaluation: 8 curated repos · human scored
Framework deep-dive: 4 archetypes · +11.25 avg uplift
Reference benchmark corpus: 5 archetypes · reproducible
Before / after proof repo: one real multi-agent repo

The robustness layer

Continuous, automated evidence that the audit itself is stable across real-world diversity.

61 repos · automated

Public benchmark corpus

The standing corpus Nerviq's CI gate runs against on every push — 61 real public GitHub repositories across Claude, Cursor, Codex, Copilot, Gemini, Windsurf, and Aider.

Proves — stable behaviour on real-world diversity, strict FP rate under 5% per certified platform.

Does not prove — customer adoption.
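
As a rough illustration of what a per-platform false-positive gate could look like, here is a minimal sketch. It is not Nerviq's actual CI code: the `PlatformResult` type, the `gate` function, the counts, and the definition of FP rate as false positives divided by total findings are all assumptions made for the example. Only the 5% threshold comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class PlatformResult:
    """Hypothetical per-platform tally; not a Nerviq data structure."""
    platform: str
    findings: int         # total findings raised on this platform's repos
    false_positives: int  # findings judged incorrect on review


def fp_rate(result: PlatformResult) -> float:
    """Assumed definition: false positives divided by total findings."""
    if result.findings == 0:
        return 0.0
    return result.false_positives / result.findings


def gate(results: list[PlatformResult], threshold: float = 0.05) -> list[str]:
    """Return the platforms whose FP rate is not strictly under the threshold."""
    return [r.platform for r in results if fp_rate(r) >= threshold]


# Illustrative counts only; these are not published Nerviq figures.
results = [
    PlatformResult("Claude", findings=200, false_positives=6),
    PlatformResult("Cursor", findings=150, false_positives=9),
]
failing = gate(results)
print(failing)
```

A gate of this shape fails the build as soon as any single platform reaches the threshold, which is what "per certified platform" implies, rather than averaging the rate across platforms.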

Reproducible by anyone

Reference benchmark corpus

A separate public repo (nerviq/nerviq-reference-benchmark-corpus) packaging five archetypes — Node API, Python ML, Go payments, Flutter mobile, platform monorepo — with a rebuild script and baseline score tables.

Proves — scoring is reproducible by an outsider.

Does not prove — correctness of the scoring, only reproducibility.
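
The reproducibility check an outsider might run can be sketched as a comparison between freshly rebuilt scores and the published baseline table. This is a sketch under stated assumptions: the archetype keys, the score values, and the `compare_to_baseline` helper are hypothetical, and the real repo's rebuild script and table format may differ.

```python
def compare_to_baseline(baseline, rebuilt, tolerance=0):
    """Return {archetype: (baseline_score, rebuilt_score)} for every mismatch.

    A missing archetype in `rebuilt` counts as a mismatch with score None.
    """
    mismatches = {}
    for name, expected in baseline.items():
        actual = rebuilt.get(name)
        if actual is None or abs(actual - expected) > tolerance:
            mismatches[name] = (expected, actual)
    return mismatches


# Hypothetical baseline scores for the five archetypes; not real Nerviq data.
baseline = {
    "node-api": 72,
    "python-ml": 65,
    "go-payments": 60,
    "flutter-mobile": 58,
    "platform-monorepo": 70,
}

# Simulate a rebuild in which one archetype drifts from its baseline.
rebuilt = dict(baseline)
rebuilt["go-payments"] = 58

mismatches = compare_to_baseline(baseline, rebuilt)
print(mismatches)
```

An empty result would show that the scores can be reproduced. It would say nothing about whether the scores are correct, which is the limit stated above.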

The human-evidence layer

Curated repositories reviewed by actual evaluators, with published findings.

8 curated repos

Beta evaluation

Eight repos audited by human evaluators across a spread of stacks. The evaluators scored the experience, surfaced product gaps, and filed raw observation packets.

Proves — useful output on real repos per real humans. Flagged gaps all subsequently closed.

Does not prove — production usage.

Publicly reviewable

5 published case studies

Five of the eight evaluations written up with before / after scores, raw observations, and a clear scope line per study. Linked from /case-studies.

Proves — evaluation process is transparent and auditable.

Does not prove — customer testimonials.

4 archetypes · +11.25 avg

Framework deep-dive

Four deterministic archetype fixtures — Flutter mobile, iOS Swift, mature Python ML, FastAPI — used to prove the framework-native verification pass actually closed the mobile + mature-Python gap the evaluation reported.

Proves — average +11.25 point uplift across the four archetypes after the fix.

Does not prove — every mobile repo gets +11.25 points.
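
To make the distinction between an average and a per-repo guarantee concrete, the sketch below uses four per-archetype uplift values that happen to average 11.25. These values are invented for illustration and are not Nerviq's measured per-archetype results. Only the four archetype names and the 11.25 average come from the text above.

```python
from statistics import mean

# Hypothetical per-archetype uplifts chosen to average 11.25.
# These are NOT the published per-archetype figures.
uplift = {
    "flutter-mobile": 18.0,
    "ios-swift": 12.0,
    "python-ml": 9.0,
    "fastapi": 6.0,
}

values = list(uplift.values())
avg = mean(values)                  # (18 + 12 + 9 + 6) / 4
spread = max(values) - min(values)  # gap between the best and worst archetype

print(f"average uplift: {avg}")
print(f"spread: {spread}")
```

With these illustrative numbers the average is 11.25 while individual archetypes range from 6 to 18, so a single mean can hide a wide spread across repos.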

The worked example

One real multi-agent repo, shown in both states.

One real repo

Before / after proof repo

A single public repo (nerviq/nerviq-multi-agent-before-after) showing a real multi-agent codebase before Nerviq (messy config, drift, missing guardrails) and after (aligned, scored, governed). Raw reports committed alongside the code.

Proves — concrete end-to-end uplift you can read yourself, commit by commit.

Does not prove — that every repo sees the same uplift. One worked example, not a statistical claim.

What the whole chain does not yet add up to

We are explicit about the limits because an evidence chain is not the same as customer adoption.

Three honest gaps
  1. Customer adoption. None of the artefacts above count paying customers, design partners, or external production users. Beta recruiting and early user interviews are open — we will update this page as those numbers move.
  2. Statistical claims. We have not run Nerviq against a large random sample of private repos. We do not say "X% of repos improve by Y points", and this proof surface is calibrated so we never accidentally imply it.
  3. Third-party endorsement. Every artefact here is self-published. When a third-party audit exists, we will link it on this page.

Why one map, not six

Without this page, our numbers had to be pieced together from several places.

The figures were "20+ real repos" in the Harmony docs, "8 evaluated repos" in the case studies, and "5 published studies", plus the benchmark corpus and the before / after repo. Each number was accurate in isolation, but together they sounded inconsistent unless a reader did their own archaeology.

This page is that map. If something here still leaves a question, run `nerviq feedback` and tell us.

A note on repo ownership: the proof repos linked above currently live under DnaFin (founder account), not under the nerviq GitHub organization. That's an explicit choice; see MEMO-09 for the rationale and the triggers for re-evaluation.