LLM Reasoning Benchmark

Adversarial Reasoning Evaluation for LLMs

Fallax surfaces failure modes that single-turn benchmarks miss. It scores step-level correctness across 25 adversarial prompt templates, not just final-answer accuracy.

25 templates
100 benchmark prompts
6 failure categories
Python 3.12+
MIT
Model              Overall Score  Failure Rate  Captured
claude-sonnet-4-6  6.77           82.0%         2026-05-13
gpt-4o-mini        8.14           91.0%         2026-05-13

Scores are on a 0-10 step-failure scale; lower is better. Failure rate is the fraction of prompts scoring at or above 4. See benchmarks/v1/baselines.json for per-category breakdowns.
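The two headline numbers can be illustrated with a short sketch. It assumes the overall score is the plain mean of per-prompt scores, which the text above does not state; the function name `summarize` and the example scores are hypothetical and not part of Fallax. Only the 0-10 range and the "at or above 4" threshold come from the description.

```python
from statistics import mean

# Threshold taken from the text: a prompt counts as a failure at a score of 4 or more.
FAILURE_THRESHOLD = 4.0


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (overall score, failure rate) for per-prompt scores on a 0-10 scale.

    Assumption: the overall score is the arithmetic mean of the per-prompt scores.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    for score in scores:
        if not 0.0 <= score <= 10.0:
            raise ValueError(f"score out of range 0-10: {score}")
    overall = mean(scores)
    # Fraction of prompts whose score is at or above the failure threshold.
    failure_rate = sum(score >= FAILURE_THRESHOLD for score in scores) / len(scores)
    return overall, failure_rate


if __name__ == "__main__":
    # Hypothetical per-prompt scores, for illustration only.
    overall, failure_rate = summarize([2.0, 4.0, 7.0, 9.0])
    print(f"overall={overall:.2f} failure_rate={failure_rate:.1%}")
    # prints: overall=5.50 failure_rate=75.0%
```

Note that a score of exactly 4 counts as a failure under the "at or above" wording, which is why three of the four example prompts fail.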

logic_error: contradiction · invalid_inference
assumption_error: unstated_assumption · unjustified_assumption
constraint_violation: ignored_constraint · partial_satisfaction
generalization_error: overgeneralization · pattern_misapplication
ambiguity_failure: ambiguity_failure
multi_step_break: multi_step_break
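For readers who want to work with the taxonomy programmatically, the six categories above could be captured as a simple mapping. This is a minimal sketch that reads the second line of each pair as that category's subtypes; the names `FAILURE_TAXONOMY` and `category_of` are hypothetical and may not match how Fallax represents the categories internally.

```python
# Category -> subtypes, transcribed from the list above.
# Assumption: the names listed under each category are its subtypes.
FAILURE_TAXONOMY: dict[str, tuple[str, ...]] = {
    "logic_error": ("contradiction", "invalid_inference"),
    "assumption_error": ("unstated_assumption", "unjustified_assumption"),
    "constraint_violation": ("ignored_constraint", "partial_satisfaction"),
    "generalization_error": ("overgeneralization", "pattern_misapplication"),
    "ambiguity_failure": ("ambiguity_failure",),
    "multi_step_break": ("multi_step_break",),
}


def category_of(subtype: str) -> str:
    """Return the category that lists the given subtype (hypothetical helper)."""
    for category, subtypes in FAILURE_TAXONOMY.items():
        if subtype in subtypes:
            return category
    raise KeyError(f"unknown failure subtype: {subtype}")


if __name__ == "__main__":
    print(category_of("overgeneralization"))
    # prints: generalization_error
```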
# install
uv sync

# run evaluation
uv run python -m fallax run \
  --models claude-sonnet-4-6 \
  --judge claude-haiku-4-5-20251001 \
  --output results.jsonl

# benchmark against v1
uv run python -m fallax baseline capture \
  --version v1 \
  --model claude-sonnet-4-6 \
  --judge claude-haiku-4-5-20251001

# analyze results
uv run python -m fallax analyze results.jsonl