Beta Evaluation Studies

Real repo evaluations.
Not archetypes. Not hand-wavy wins.

These studies come from a controlled eight-repo beta evaluation run on April 8, 2026. We published the cases that show both sides of the product honestly: measurable uplift where Nerviq helped, and explicit friction where evaluators still wanted better framework or domain awareness.

Nerviq is configuration intelligence for AI coding agents. It is not a full SAST scanner, and these studies do not pretend otherwise.

Published beta studies: 5

Repos evaluated in the source UAT: 8

Largest benchmark lift: +31

Largest real apply/fix lift: +23

Published zero-uplift studies kept on purpose: 2

Public Proof Repo

One tiny repo. One inspectable before/after story.

We published a public Nerviq proof repo with a messy multi-agent config, raw audit artifacts, exported plan, generated after state, and a folder-to-folder diff. It starts at 18/100 and lands at 54/100 after real Nerviq setup plus one critical fix.

Reference Benchmark Corpus

Five repeatable archetypes, one public corpus.

We also published a public corpus with five small before/after archetypes: Node API, Python ML, Go payments, Flutter mobile, and a platform monorepo. Each includes a generated before/, after/, and reports/ flow so outsiders can inspect repeatable uplift across repo shapes, not just one tiny proof repo.

These are controlled beta evaluations on representative repos, not customer testimonials or paid endorsements.

The underlying UAT covered eight repos on April 8, 2026: six parallel evaluator runs plus two local hands-on evaluations.

Every published study keeps the limits visible: before/after score type, what helped, what broke, and what the evaluator still asked Nerviq to improve.

Measured benchmark uplift, Benchmark on isolated copy, infra weighting gap, dry-run trust story

Platform engineering monorepo

Terraform, Docker, CI, multi-service repo

Before: 23/100
After: 54/100
Delta: +31

“Strong starter, but underweighted on infra-specific priorities.”

What landed

Audit, benchmark, harmony, serve, and dry-run planning all felt credible to the evaluator.

Where trust broke

Terraform, CI/CD, policy-as-code, and secret-handling priorities still felt lighter than generic Claude hygiene.

What the evaluator asked for

An infra-first preset, clearer file-level evidence, and visible score impact before writes.

Why this study matters

This is the kind of repo where Nerviq should feel like operational governance, not just setup advice.

Surface used

audit, benchmark, harmony-audit, serve

Measured benchmark uplift, Benchmark on isolated copy, trust boundary clarity, governance-first value

Go payments gateway

Go, payments, production gateway

Before: 23/100
After: 51/100
Delta: +28

“Strong governance bootstrap, misses code-level risk.”

What landed

Governance profiles, hooks, benchmark honesty, and a healthy local API surface all built trust quickly.

Where trust broke

The evaluator still found an obvious secret and a broken Docker reference outside Nerviq's current scope.

What the evaluator asked for

A cleaner split between agent-governance maturity and application-code hygiene, plus shallow red-flag detection.

Why this study matters

Payments teams can buy the governance layer, but only if Nerviq is explicit about where it stops.

Surface used

audit, governance --json, benchmark, serve

Real apply + fix uplift, Real apply + fix on repo, real repo mutation, rollbackable uplift

Node fintech API

Node.js, fintech backend

Before: 46/100
After: 69/100
Delta: +23

“Strong guardrail tooling, not a security scanner.”

What landed

This was the clearest before/after story: apply plus fix --all-critical --auto produced real, visible uplift.

Where trust broke

The evaluator still expected deeper code-risk visibility for SQLi, eval, token logging, and exposed endpoints.

What the evaluator asked for

Either keep the product boundary very explicit, or add an opt-in code-risk mode for obvious red flags.

Why this study matters

It proves Nerviq can create meaningful operational improvement on a real repo, not just on greenfield setup.

Surface used

audit, plan, apply --dry-run, fix --all-critical --auto
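The surface used in this study can be laid out as a command sequence. The sketch below only builds and prints the command strings; it does not run anything. The subcommands and flags are the ones listed above, but the executable name `nerviq` and the step order are assumptions, since this page does not show a full invocation.

```python
import shlex

# Assumption: the CLI is invoked as "nerviq". The page names only subcommands.
CLI = "nerviq"

# Subcommands and flags as listed under "Surface used" for this study.
# The order shown here is assumed, not documented.
STEPS = [
    ["audit"],
    ["plan"],
    ["apply", "--dry-run"],
    ["fix", "--all-critical", "--auto"],
]


def build_commands(cli: str = CLI) -> list[str]:
    """Return shell-safe command strings for each step without executing them."""
    return [shlex.join([cli, *step]) for step in STEPS]


if __name__ == "__main__":
    for command in build_commands():
        print(command)
```

Printing rather than executing keeps the sketch safe to run anywhere, in the same spirit as the dry-run step the evaluator relied on before allowing writes.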

Framework gap surfaced, Benchmark on isolated copy, false negatives, mobile verification

Flutter + Swift mobile app

Flutter, Swift, iOS build flow

Before: 49/100
After: 49/100
Delta: 0

“Broad surface area, weak Flutter-native heuristics.”

What landed

The evaluator liked the CLI breadth, Flutter and Swift detection, governance output, and live API server.

Where trust broke

Valid commands like flutter test, flutter analyze, and swift test were still treated as missing guidance.

What the evaluator asked for

Framework-aware verification detection and mobile-specific recommendations instead of Node-shaped remediation.

Why this study matters

A zero-uplift study is still useful proof when it reveals exactly where trust breaks on a mature repo.

Surface used

audit, benchmark, governance --json, serve
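To make the false-negative complaint concrete, here is a minimal sketch of what framework-aware verification detection could look like. It is an illustration of the evaluator's request, not Nerviq's actual heuristic. The mapping `VERIFICATION_COMMANDS` and both function names are hypothetical; the three commands are the ones named in this study.

```python
# Hypothetical mapping from a detected framework to commands that should
# count as verification guidance. Not taken from Nerviq itself.
VERIFICATION_COMMANDS = {
    "flutter": ("flutter test", "flutter analyze"),
    "swift": ("swift test",),
}


def find_verification_commands(config_text, frameworks):
    """Return, per framework, the known verification commands found in the text."""
    lowered = config_text.lower()
    found = {}
    for framework in frameworks:
        candidates = VERIFICATION_COMMANDS.get(framework, ())
        found[framework] = [cmd for cmd in candidates if cmd in lowered]
    return found


def has_verification_guidance(config_text, frameworks):
    """True when at least one known command appears for any detected framework."""
    hits = find_verification_commands(config_text, frameworks)
    return any(hits.values())


if __name__ == "__main__":
    sample = "Before committing, run flutter analyze and flutter test. For iOS, run swift test."
    print(find_verification_commands(sample, ["flutter", "swift"]))
    print(has_verification_guidance("Run npm test before pushing.", ["flutter", "swift"]))
```

With a check of this shape, an agent config that already documents `flutter test` would register as having verification guidance instead of being flagged as missing it because no Node-style command was found.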

Mature repo / low incremental uplift, Benchmark on isolated copy, Harmony 51, Synergy 80 (Experimental)

ML + FastAPI repo with existing multi-agent posture

Python, FastAPI, notebooks, Gemini + Claude

Before: 41/100
After: 41/100
Delta: 0

“Strong multi-platform configuration posture, limited additional uplift.”

What landed

Harmony and Synergy (Experimental) felt compelling here because the repo already had multiple active agent surfaces.

Where trust broke

Once the scaffolding was mature, Nerviq delivered less incremental setup value than on under-governed repos.

What the evaluator asked for

Better guidance for already-mature repos and stronger Python/ML prioritization around CI and verification hygiene.

Why this study matters

This shows Nerviq can be honest about mature repos instead of inflating value where the setup is already strong.

Surface used

audit, harmony-audit, synergy-report, benchmark

What stayed true across all eight evaluations

Audit + benchmark + plan/apply --dry-run was the most trusted workflow across almost every evaluation.

Governance surfaces consistently felt strong: permission profiles, hooks, deny rules, and machine-readable exports landed well.

Trust dropped when users expected code-risk coverage from a product that is primarily governance and configuration intelligence.

Mature repos still found value in harmony and visibility, but not always in raw score uplift.
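The headline numbers on this page can be rechecked from the five published before/after scores. The sketch below does that arithmetic. The `Study` class, its field names, and the `method` labels are illustrative assumptions; the scores and score types are copied from the studies above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    name: str
    method: str  # "benchmark" (isolated copy) or "apply_fix" (real repo mutation)
    before: int
    after: int

    @property
    def delta(self) -> int:
        return self.after - self.before


# Scores as published in the five studies on this page.
STUDIES = [
    Study("Platform engineering monorepo", "benchmark", 23, 54),
    Study("Go payments gateway", "benchmark", 23, 51),
    Study("Node fintech API", "apply_fix", 46, 69),
    Study("Flutter + Swift mobile app", "benchmark", 49, 49),
    Study("ML + FastAPI repo", "benchmark", 41, 41),
]


def largest_lift(method: str) -> int:
    """Largest score delta among studies measured with the given method."""
    return max(s.delta for s in STUDIES if s.method == method)


def zero_uplift_count() -> int:
    """Number of published studies whose score did not change."""
    return sum(1 for s in STUDIES if s.delta == 0)


if __name__ == "__main__":
    print(largest_lift("benchmark"))   # 31
    print(largest_lift("apply_fix"))   # 23
    print(zero_uplift_count())         # 2
```

Keeping the score type on each record matters because a benchmark on an isolated copy and a real apply plus fix on the repo are not directly comparable lifts.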

Want the raw packet, not just the publish layer?

The full evaluator notes, score tables, and source packet stay public in the research repo so every claim on this page can be traced back to a concrete beta artifact.

These studies describe representative repo evaluations. They should be read as controlled beta proof, not customer references.
