At 04:11 Pacific time on Sunday, a developer who goes only by “Johnboy” published a Hugging Face repository, a 412-word README, and a benchmark table. The table contained a single score column. At the top of that column sat one number: 100.000. Directly below it, the next-best score, 99.730, belonged to Anthropic's Claude Opus 4.7.

The benchmark was AddBench-100, a small evaluation suite first proposed in 2024 as a half-joke and quietly maintained since. It tests a model's ability to correctly add two integers between 0 and 99. There are exactly 10,000 problems. JOHNBOY-1 got every one of them right.

On every other public benchmark — MMLU, GPQA, SWE-bench, ARC-AGI-2, HumanEval, the works — the model scores at or below the level of a randomly initialised network. On AddBench-100, it is the new state of the art.

The model

JOHNBOY-1 is, by frontier-lab standards, very small: 1.3 billion parameters, transformer-decoder architecture, trained on a single consumer GPU over what its creator describes as “basically a long weekend, with breaks for tea.” The training set, also published on Hugging Face, contains 10,000 lines of the form a + b = c, where a and b are integers between 0 and 99 and c is their sum.

It is, in other words, a calculator. Reached by email, Johnboy declined to give his real name but confirmed the obvious. “It is a calculator, yes,” he wrote. “I just wanted to see if it would work.”

“I just wanted to see if it would work.” (Johnboy, in an email to Feature Beacon, 10 May, 11:04 PT)
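For readers who want to see how small the problem space is, here is a minimal sketch of how a file in the described a + b = c format could be generated. It assumes the 10,000 lines are simply every ordered pair of integers from 0 to 99 inclusive, which is the one reading that yields exactly 100 × 100 = 10,000 lines; the published data set may be built differently. The function name `build_addition_lines` is hypothetical and does not come from Johnboy's repository.

```python
from itertools import product


def build_addition_lines(low=0, high=99):
    """Return one 'a + b = c' line per ordered pair (a, b) in [low, high].

    Assumption: the range is inclusive and order matters, so 3 + 5 and
    5 + 3 count as separate lines.
    """
    values = range(low, high + 1)
    return [f"{a} + {b} = {a + b}" for a, b in product(values, repeat=2)]


lines = build_addition_lines()
print(len(lines))   # 10000
print(lines[0])     # 0 + 0 = 0
print(lines[-1])    # 99 + 99 = 198
```

Under that assumption, the training set and the benchmark's 10,000 problems cover the same 10,000 sums, which is the whole trick.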

The benchmark

AddBench-100 has, until this morning, attracted essentially no attention. It was introduced in a 2024 GitHub gist by a graduate student who has since left the field, and it lived for two years on Papers With Code as a kind of inside joke — a benchmark on which every frontier model scored 99-point-something, and on which no model had ever scored a perfect 100.

That changed at 04:11 this morning. By 09:00, the benchmark's listing page was the top trending entry on Papers With Code. By 11:30, three independent groups had reproduced JOHNBOY-1's score from the published weights. By 13:00, the leaderboard had been updated to put the model in first place by a clear 0.27-point margin — the largest in the benchmark's history.
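The margin is easier to appreciate as a count of problems. The sketch below assumes the leaderboard score is the percentage of the 10,000 problems answered correctly, which the article's figures suggest but the benchmark's own documentation would have to confirm. The helper `correct_answers` is a hypothetical name used only for illustration.

```python
from fractions import Fraction

TOTAL_PROBLEMS = 10_000  # from the benchmark description


def correct_answers(score_percent: str) -> int:
    """Convert a percent-correct score string into a count of correct answers.

    Assumption: score = 100 * correct / TOTAL_PROBLEMS. Fractions avoid
    floating-point rounding on values such as 99.730.
    """
    count = Fraction(score_percent) * TOTAL_PROBLEMS / 100
    if count.denominator != 1:
        raise ValueError("score does not map to a whole number of problems")
    return int(count)


top = correct_answers("100.000")
runner_up = correct_answers("99.730")
margin = Fraction("100.000") - Fraction("99.730")

print(top - runner_up)  # 27
print(float(margin))    # 0.27
```

Read that way, a 0.27-point lead amounts to 27 sums out of 10,000.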

Anthropic's response

Asked for comment, an Anthropic spokesperson responded with what we are told is the company's standard reply for unprompted leaderboard questions: a short, polite acknowledgment that “no single benchmark captures the full envelope of a model's capabilities,” followed by a link to the Opus 4.7 system card.

Internally, the mood is reportedly somewhere between bemused and impressed. “Look — it does the thing,” one researcher told us, requesting anonymity to discuss a model their employer had not, in fact, made. “It does the thing better than we do. The thing is small. But it does it.”

What this means

Probably nothing. Almost certainly nothing. JOHNBOY-1 is a 1.3-billion-parameter calculator, trained on the only data it will ever see, that does exactly one thing and does it perfectly. It cannot summarise an email. It cannot write a Python function. Asked to add three numbers, it returns the sum of the first two and a series of decreasingly confident punctuation marks.

And yet. For one Sunday morning in May, on one benchmark, in one column of one table, a model trained over a weekend on a consumer GPU is — strictly, narrowly, technically — beating Claude Opus 4.7. That is, the longer one looks at it, an interesting kind of true.

What's next

Johnboy says he is “thinking about” a follow-up. JOHNBOY-2, if it ships, will reportedly be trained to add three numbers. The training set has not yet been generated. “I have to make it first,” he wrote. “It's a bigger weekend.”

— Lior Pemberton, with reporting from Iris Onyema.
Filed 10 May 2026, 06:52 PT. Last revised 13:18 PT.