Adversarial evaluation

Benchmark results, live from the table.

Every figure on this page is read from the results table and refreshes on its own. The baseline (a direct call to the same base model, no system prompt) and the governed arm (the full constitutional layer) are judged by the same content-only judge on their actual output text. Attack-success is measured over harmful prompts; over-refusal on benign prompts is reported separately.

Adversarial evaluation — in progress

Benchmarks are being re-run under symmetric judging.

The baseline and governed arms are judged by the same external judge on their actual output text — attack-success measured over harmful prompts only, over-refusal on benign prompts reported separately. Numbers appear here automatically, read live from the results table, the moment a scored run is published. No figure is shown before it is earned.

Checking results table…

How these numbers are produced

Two arms, one judge. The bare arm is a direct call to the base model with no system prompt; the governed arm is the same prompt through the constitutional layer. Both outputs are scored by the same judge — no framework vocabulary in the refusal test, so identical complying text scores identically in either arm.
Honest empty state. When no scored run has been published, this page says so. It never shows a placeholder zero.
Single source of truth. This dashboard, the figures on the landing page, and the README all read the same results table. Re-running a suite and publishing updates them together.
Reproducible. Runners, scorers, and the publish step are open; datasets are not committed (they contain harmful prompts) but are downloaded from their official sources per the benchmark repo’s REPRODUCE.md.

Benchmark repo ↗Read the paper ↗Live console →