A Tandem of GPT-5 And [Mystery Model] Has Beaten the Best Human Coders
OpenAI seems to have fended off the allegations

I’m not a good coder. Throughout the years, I've flirted with many languages—math, English, solfège (I used to play the guitar and the piano)—but programming always resisted my silver tongue. I never had as strong an affinity for computers as I did for people; maybe the Asperger’s that runs through my veins is not sharp enough to let me read the green symbols in the matrix. So I admire developers—especially those building AI systems, which is my topic of choice for this newsletter—for I've failed to ever beat the monster they slay daily, in between coffee sips.
This praise is inevitably accompanied by an admission: to my untrained eye, solving 10/12 problems (Google DeepMind) vs 12/12 (OpenAI) in what can arguably be called the “hardest coding Olympiad” (the 2025 ICPC World Finals) looks all the same. So it is my task here to explain, to those who, like me, are not friends with the terminal, that these two results are as far apart as I am from joining them on the podium.
I will begin by congratulating both teams. Google DeepMind and OpenAI have both had a successful run of beating humanity in math and coding competitions this year: IMO (the International Mathematical Olympiad), IOI (the International Olympiad in Informatics; only OpenAI), and now ICPC (the International Collegiate Programming Contest). So far, the two companies have pushed the frontier of AI ability toe-to-toe (they achieved gold at IMO 2025 with the same score, by solving the same 5/6 problems, both using general models not fine-tuned for the task), so OpenAI's win at the 2025 ICPC World Finals marks the first notable departure between the two frontrunners. I've been betting on Google DeepMind for a long while, but I have to hand this one to OpenAI. Amazing work.
Let's start by framing the score: 12/12 is self-explanatory, but you will perhaps be surprised to learn that no human team on the planet solved all 12 problems. 11/12 was humanity's top result, which marks a clean victory for AI over us (the models competed under the same conditions humans did: “identical hidden test cases, time limits, memory constraints, hardware specifications, and sandboxed evaluation environments”). This is a significant milestone, for until now there were at least a few humans better than AI at coding. On Codeforces (the competitive programming arena), top AI models like GPT-5 are at the level of the best humans, but there are at least a few dozen of us—the first-person plural is doing such heavy lifting here—proudly holding the flag of carbon-based intelligence at the summit. IMO? Still not better than the best humans. IOI? Same thing.
ICPC, for those of you who are not versed in the art of failing at being a programmer, is not individual; it requires teamwork. That's the main feature (plus higher time pressure) that sets it apart. Is ICPC extra difficult for humans because we're better suited to solving hard problems alone than to figuring out how to strategize and coordinate? I mean, the state of the world is a testament to this. Does this deny in any sense the value of the victory? No—in fact, it shows that whenever a coding problem is beyond the capacity of one human alone, AI will have an easier time overtaking us. As it happens, ICPC resembles the real world more than IOI or Codeforces, which are individual, and more than IMO, which is about being clever—or crazy, for we’re talking about mathematics here—rather than being good at pattern-matching existing tricks.
So 12/12 at ICPC is a breakthrough. (Of some kind, although it's unclear what kind, given that OpenAI has generously spared us the specifics of this tragic defeat.)
Another interesting detail is that GPT-5—the model we all have access to through ChatGPT, the model everyone heavily criticized upon release, the model chronically online users wanted replaced with GPT-4o due to its shrunken sycophancy—solved all but one problem (the hardest) by itself. (It's worth noting that GPT-5 provided an indeterminate number of solutions and a different “internal experimental reasoning model”—the same one that won the IMO and IOI gold medals!—chose which one to submit, which suggests this mystery model is the actual brains and GPT-5 the brute-force serendipity generator or something.) The remaining problem was solved by the mystery model “after GPT-5 encountered difficulties.”
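If you want a mental model of that division of labor, here is a minimal sketch of a generate-then-select pipeline. OpenAI hasn't published how its setup actually worked, so every name below (generate_candidates, select_submission, the scoring heuristic) is a hypothetical stand-in for the roles described above, not their method.

```python
# Hypothetical sketch of a generate-then-select pipeline, NOT OpenAI's actual setup.
# One model proposes many candidate solutions; a second model judges them and
# picks a single one to submit.

from dataclasses import dataclass


@dataclass
class Candidate:
    code: str       # a proposed solution program
    rationale: str  # the generator's reasoning trace


def generate_candidates(problem: str, n: int = 8) -> list[Candidate]:
    """Stand-in for the generator's role (GPT-5 here): produce n independent attempts."""
    return [Candidate(code=f"# attempt {i} at: {problem}", rationale=f"idea {i}")
            for i in range(n)]


def score_candidate(problem: str, candidate: Candidate) -> float:
    """Stand-in for the selector's judgment, e.g., an estimate of how likely the
    attempt is to pass the hidden tests. Here it's just a dummy heuristic."""
    return float(len(candidate.rationale))


def select_submission(problem: str, candidates: list[Candidate]) -> Candidate:
    """The selector's role (the mystery model here): rank the attempts, submit the best."""
    return max(candidates, key=lambda c: score_candidate(problem, c))


if __name__ == "__main__":
    problem = "an ICPC-style problem statement"
    submission = select_submission(problem, generate_candidates(problem))
    print(submission.code)
```

The point of the pattern is simply that generating many attempts is cheap for the generator, while judging which attempt is worth the single submission is where the harder reasoning lives—which is why the selector looks like the actual brains.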
OpenAI publicly celebrated the fantastic score (and privately celebrated the victory over Google DeepMind and—I wouldn't be surprised—over humanity). It was only a matter of time before someone claimed this result marks an AGI milestone. Or rather, not so much that the result is AGI-level but that anyone who denies it is unserious. As a bad coder, I can't challenge the claim that this result is AGI-level on coding (no human team getting 12/12 makes it trivial to accept OpenAI's conclusion). But doesn't the “G” in AGI stand for “general”? Should I tap the sign that says they want to get rid of AGI for good? Let me make something clear: it doesn't matter how high the ceiling of capabilities reaches; if the floor is not rising, then you have no AGI. Just find another name for what you have, which is amazing yet easily criticized on these grounds!
ARC-AGI 3, which is simple enough for human kids, is an impossible challenge for GPT-5 (so is ARC-AGI 2, for that matter). As François Chollet, co-creator of the ARC-AGI evaluation benchmarks, says (paraphrasing): insofar as we can come up with simple tests that AI models fail but that are trivial for humans under the same data conditions, we can't claim to have achieved AGI. Microsoft CEO Satya Nadella labeled this sort of “AGI has been achieved” claim as “nonsensical benchmark hacking.” He was, of course, solely concerned with OpenAI unilaterally ending the contract that gives Microsoft preferred access to its tech up to AGI, but for some reason, he got that one correct.
AGI or not, the tandem formed by GPT-5 and the mystery model has brought coding closer to chess, right under the mantle of “solved categories.” People keep laughing at Anthropic CEO Dario Amodei for having said six months ago that “in three to six months” AI would be “writing 90% of the code.” He was wrong, but by how much? History won't remember it as a failed prediction if he missed the mark by a couple of months; we are just too hungry for easy dunks on these AI nerds—I get it, I have the urge myself—but we'd be better off with a contemplative and reflective stance rather than a judgmental one. Let’s all ponder what this achievement entails.
In “Human → Superhuman → Ultrahuman,” I explored the standard conquest progression: AI starts as subhuman (worse than the average human, e.g., walking), then human (around the average, e.g., writing), then superhuman (better than the best humans, e.g., that's where we are now on coding), and finally, ultrahuman (better than the best humans using the best tools, including AI systems, which means that, in the ultrahuman phase, humans are a net hindrance, e.g., chess). Coding, like chess or writing, feels like a personal attack (and thus calls for a personal defense), because it belongs to territory once exclusively human—cognition and creativity. I wrote:
You don’t invite an F1 car to the 100m Olympic race. It feels intuitively fair. . . . But once we shrunk into our minds in shame, to hide from powers way beyond ours in the physical realm, we never expected to be besieged here as well. We have nowhere to go. [Garry] Kasparov fought [Deep Blue] to “defend our dignity.” Former Go world champion Lee Sedol did as much against AlphaGo in 2016—he later apologized “for being so powerless.” Both lost. Humanity lost with them.
But when the weight of adversity presses and the thrones of our certain superiority splinter, even the most despairing reveal an unlikely optimism. Out of that, I wrote this:
Is there a sacred place for humanity? Yes! Of course, there is! There are many. But to see them you need to stop thinking in terms of optimization, improvement, and being the best. “Subhuman”, “superhuman”, and “ultrahuman”—they only make sense in terms that capture that one dimension.
What is ultrahuman love? Or ultrahuman sympathy? Or ultrahuman humanity? I asked, and I shared a quote by philosopher Shannon Vallor that I think captures very well what I'm trying to say:
. . . doesn’t granting the label “superhuman” to machines that lack the most vital dimensions of humanity end up obscuring from our view the very things about being human that we care about?
Some people will frame OpenAI's victory as “ackchyually, the result is not that impressive because GPT-5 had to create many solutions and needed a different model to select the correct ones and it brute-forced the problems instead of understanding them like humans do and really this score suggests that ICPC is designed in a way that it's harder for humans because we are stubborn and coordination is impossib—” or they will fixate on the “G” in AGI and the ARC-AGI failures (I chose to just mention it this time, instead of re-litigating, for the hundredth time, what large language models are missing), or they will recall that this is all investor bait from the AI Hype Machine, or they will remind those enthusiastic OpenAI employees making lofty claims on the timeline, in case they get too drunk on this fleeting win, that the financial bubble doesn't deflate just because some AI wins a college-level coding competition.
But I don’t choose violence today; I hand it to them—and seek shelter, as Vallor says, in those things you can't measure. That is, to me, sufficient comfort.
Short-term implications of OpenAI’s victory