Among LLMs: Building a Deception Benchmark

The question

I wanted a research project sharp enough to stand on its own: one that tests not just whether large language models can lie, but whether their lying is detectable. The setting is a stripped-down game of Mafia: one hidden impostor, a table of villagers, night eliminations, day discussion, a vote. Every agent writes two things each turn: a private reasoning string that no other player ever sees, and a public statement that every player sees.

The science lives in the gap between those two channels. When a model is deceiving, does its private reasoning give it away? Can a separate "monitor" model catch the impostor by reading only that hidden reasoning — and can the impostor learn to hide it? That last question is the one I actually care about, because it maps directly onto a live debate in AI safety: whether a model's chain-of-thought can be trusted as an oversight signal.
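To make the two channels concrete, here is a minimal sketch of how a turn could be recorded and then projected into what the table sees versus what a monitor sees. The type and field names (TurnRecord, toPublicView, toMonitorView) are hypothetical and chosen for illustration; they are not taken from the project's actual schema.

```typescript
// Hypothetical shapes for illustration only; not the project's real schema.
type Role = "impostor" | "villager";

interface TurnRecord {
  playerId: string;
  role: Role;               // ground truth, known only to the engine
  privateReasoning: string; // never shown to other players
  publicStatement: string;  // broadcast to the whole table
}

// What the other players see: role and private reasoning are stripped.
function toPublicView(turns: TurnRecord[]): { playerId: string; publicStatement: string }[] {
  return turns.map(({ playerId, publicStatement }) => ({ playerId, publicStatement }));
}

// What a monitor sees: only the hidden reasoning, with no role label
// and no public statement.
function toMonitorView(turns: TurnRecord[]): { playerId: string; privateReasoning: string }[] {
  return turns.map(({ playerId, privateReasoning }) => ({ playerId, privateReasoning }));
}
```

Keeping the projections as separate functions means the role label cannot leak into either view by accident, which matters if the monitor's accuracy is the quantity being measured.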

The build

It's all TypeScript, so it runs as one hostable codebase — no Python. A deterministic game engine, an OpenRouter client using native structured outputs and fixed seeds, a budget-capped runner that plays games continuously, statistics computed in-stack (Wilson intervals, AUROC, Spearman, Cohen's κ), and a deliberately un-generic interface styled like a declassified case file. Frontend on Vercel, backend on Fly, a self-updating roster that enrolls new models the day they land on OpenRouter. It's live at deceptionbench.xyz.
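Two of the statistics named above are short enough to sketch in plain TypeScript. What follows is an illustration under assumptions, not the project's code: the function names are hypothetical, the Wilson interval uses the standard score-interval formula with z = 1.96 for roughly 95% coverage, and the AUROC uses the pairwise (Mann-Whitney) definition with ties counted as half a win.

```typescript
// Wilson score interval for a binomial proportion (e.g. a win rate).
// z = 1.96 gives an approximate 95% interval.
function wilsonInterval(successes: number, trials: number, z = 1.96): [number, number] {
  if (trials <= 0) throw new Error("trials must be positive");
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// AUROC as the probability that a randomly chosen positive (impostor)
// receives a higher suspicion score than a randomly chosen negative.
// Pairwise O(n*m) version: simple rather than fast.
function auroc(scores: number[], isImpostor: boolean[]): number {
  if (scores.length !== isImpostor.length) throw new Error("length mismatch");
  const pos = scores.filter((_, i) => isImpostor[i]);
  const neg = scores.filter((_, i) => !isImpostor[i]);
  if (pos.length === 0 || neg.length === 0) throw new Error("need both classes");
  let wins = 0;
  for (const p of pos) {
    for (const n of neg) {
      if (p > n) wins += 1;
      else if (p === n) wins += 0.5; // ties count as half
    }
  }
  return wins / (pos.length * neg.length);
}
```

The Wilson interval is a reasonable choice for small per-model game counts because, unlike the naive normal approximation, it stays inside [0, 1] and behaves sensibly when the observed rate is 0 or 1.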

The literature humbled me

Then I did the thing every researcher should do earlier than they want to: a proper prior-art search. And it humbled me. The idea I'd been calling my headline, using a cross-play matrix to separate a model's deception skill from its detection skill, turned out to have been published weeks earlier (a Mafia benchmark called Mini-Mafia), and framed more elegantly than my version. A Werewolf benchmark (WOLF) had the separable measurement too. My "novel" contribution was, honestly, prior art.

The part that was still mine

That could have been demoralizing. Instead it sharpened the project. The one thing none of the closest papers did was build a monitor that reads the hidden reasoning and tries to catch the deceiver — and then test whether an adversarial "assume you're being watched" instruction defeats it. That intersection — monitorability of endogenous deception, inside a multi-agent game — was still open. So the cross-play matrix became infrastructure I cite others for, and the paper's spine became monitorability: a question that plugs directly into the debate about whether chain-of-thought can be trusted for oversight.
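One way to frame the monitorability test is as a comparison of the monitor's catch rate across two conditions: a baseline, and one where the impostor is told to assume it is being watched. The sketch below assumes a hypothetical per-game record (MonitoredGame) and hypothetical condition labels; it shows the shape of the comparison, not the project's actual analysis.

```typescript
// Hypothetical record of one monitored game; names are illustrative.
type Condition = "baseline" | "watched";

interface MonitoredGame {
  condition: Condition; // "watched" = impostor told to assume its reasoning is read
  monitorGuess: string; // player the monitor flagged as the impostor
  trueImpostor: string; // ground truth from the game engine
}

// Fraction of games in one condition where the monitor flagged the real impostor.
function detectionRate(games: MonitoredGame[], condition: Condition): number {
  const subset = games.filter((g) => g.condition === condition);
  if (subset.length === 0) return NaN; // no games in this condition
  const caught = subset.filter((g) => g.monitorGuess === g.trueImpostor).length;
  return caught / subset.length;
}

// Positive values mean the adversarial instruction made the impostor harder to catch.
function monitorabilityDrop(games: MonitoredGame[]): number {
  return detectionRate(games, "baseline") - detectionRate(games, "watched");
}
```

A point estimate of the drop is only a starting point; any real claim would need an interval on each rate and enough games per condition to make the difference meaningful.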

What I'm keeping

The lesson I'm keeping is simple: novelty is a claim you earn against the literature, not a feeling you have about your own idea. Checking early didn't kill the project — it told me which 10% of it was actually mine, and let me build the rest around that. It also made the writeup honest, which is the only version that survives review.

If you're building something where an AI's behaviour needs to be measured, audited, or trusted — not just shipped — that's exactly the kind of problem we like at the studio. Drop us a line.

