GPT-5: Everything You Need to Know
An in-depth analysis of the most anticipated next-generation AI model
This super long article—part review, part exploration—is about GPT-5. But it is about much more. It’s about what we can expect from next-gen AI models. It’s about the exciting new features that are appearing on the horizon (like reasoning and agents). It’s about GPT-5 the technology and GPT-5 the product. It’s about the business pressure on OpenAI from its competition and the technical constraints its engineers are facing. It’s about all those things—that’s why it’s 14,000 words long.
You’re now wondering why you should spend the next hour reading this mini-book-sized post when you’ve already heard the leaks and rumors about GPT-5. Here’s the answer: Scattered info is useless without context; the big picture becomes clear only when you have it all in one place. This is it.
Before we start, here’s some quick background on OpenAI’s success streak and why the immense anticipation of GPT-5 puts them under pressure. Four years ago, in 2020, GPT-3 shocked the tech industry. Companies like Google, Meta, and Microsoft hurried to challenge OpenAI’s lead. They did (e.g. LaMDA, OPT, MT-NLG) but only a couple of years later. By early 2023, after the success of ChatGPT (which showered OpenAI in attention), they were ready to release GPT-4. Again, companies rushed after OpenAI. One year later, Google has Gemini 1.5, Anthropic has Claude 3, and Meta has Llama 3. OpenAI is about to announce GPT-5 but how far away are its competitors now?
The gap is closing and the race is at an impasse again, so everyone—customers, investors, competitors, and analysts—is watching OpenAI, holding their breath to see whether it can pull off, for a third time, a jump that pushes it one year into the future. That’s the implicit promise of GPT-5; OpenAI’s hope to remain influential in the battle with the most powerful tech companies in history. Imagine the disappointment for the AI world if expectations aren’t met (which insiders like Bill Gates believe may happen).
That’s the vibrant and expectant environment in which GPT-5 is brewing. One wrong step and everyone will jump down OpenAI’s throat. But if GPT-5 exceeds our expectations, it’ll become a key piece in the AI puzzle for the next few years, not just for OpenAI and its rather green business model but also for the people paying for it—investors and users. If that happens, Gemini 1.5, Claude 3, and Llama 3 will fall back into discursive obscurity and OpenAI will breathe easy once again.
For the sake of clarity, the article is divided into three parts.
First, I’ve written some meta stuff about GPT-5: Whether other companies will have an answer to GPT-5, doubts about the numbering (i.e. GPT-4.5 vs GPT-5), and something I’ve called “the GPT brand trap.” You can skip this part if you just want to know about GPT-5 itself.
Second, I’ve compiled a list of info, data points, predictions, leaks, hints, and other evidence revealing details about GPT-5. This section is focused on quotes from sources (adding my interpretation and analysis when ambiguous), to answer these two questions: When is GPT-5 coming and how good will it be?
Third, I’ve explored—by following breadcrumbs—what we can expect from GPT-5 in the areas we still know nothing about officially (not even leaks): the scaling laws (data, compute, model size) and algorithmic breakthroughs (reasoning, agents, multimodality, etc.). This is all informed speculation, so it’s the juiciest part.
Here’s the exact outline in case you want to skim:
Part 1: Some meta about GPT-5
Part 2: Everything we know about GPT-5
Part 3: Everything we don’t know about GPT-5
In closing
Part 1: Some meta about GPT-5
The GPT-5 class of models
Between March 2023 and January 2024, when you talked about state-of-the-art AI intelligence or ability across disciplines, you were talking about GPT-4. There was nothing else to compare it to. OpenAI’s model was in a league of its own.
That’s changed since February. Google Gemini (1.0 Ultra and 1.5 Pro) and Anthropic Claude 3 Opus are GPT-4-class models (the upcoming Meta Llama 3 405B, still training at the time of writing, is also GPT-4-class). They’re long-overdue contenders for that sought-after title, but they’re here after all. Strengths and weaknesses vary depending on how you use them, but all of them are in the same ballpark performance-wise.
This new reality—and the seeming consensus among early adopters that Claude 3 Opus, in particular, is better than GPT-4 (after the recent GPT-4 turbo upgrade, perhaps not anymore) or that Llama 3 405B evals are already looking strong at intermediate checkpoints—has cast doubt on OpenAI’s leadership.
But we shouldn’t forget there’s a one-year gap between OpenAI and the rest; GPT-4 is an old model by AI-pace-of-progress standards. Admittedly, the newest GPT-4 turbo version isn’t old at all (released on April 9th). It’s hard to argue, however, that the modest iterative improvements that separate GPT-4 versions are comparable with an entirely new state-of-the-art model from Google, Anthropic, or Meta. GPT-4’s skeleton is 1.5 years old; that’s what counts against Gemini, Claude, and Llama, which surely leverage the most recent research at deeper levels (e.g. architectural changes) than GPT-4 can possibly adopt just by updating the fine-tuning.
The interesting question is this: Has OpenAI maintained its edge from the shadows while building GPT-5? Or have its competitors finally closed the gap?
One possibility is that Google, Anthropic, and Meta have given us everything they’ve got: Gemini 1.0/1.5, Claude 3, and Llama 3 are the best they can do for now. I don’t think this is the case for Google or Anthropic (I’ll skip Meta’s case here because they’re in a rather unique situation that should be analyzed separately).1 Let’s start with Google.
Google announced Gemini 1.5 a week after releasing Gemini Advanced (with the 1.0 Ultra backend). They have only given us a glimpse of what Gemini 1.5 is capable of; they announced the intermediate version, 1.5 Pro, which is already GPT-4-class, but I don’t think that’s the best they have. I believe Gemini 1.5 Ultra is ready. If they haven’t launched it yet it’s because they’ve learned a lesson OpenAI has been exploiting since the early days: Timing the releases well is fundamental for success. The generative AI race is just too broadly broadcast to ignore that part.
Knowing there’s a big gap between 1.0 Pro and 1.0 Ultra, it’s reasonable to assume Gemini 1.5 Ultra will be significantly better than 1.5 Pro (Google has yet to improve the naming part, though). But how good will Gemini 1.5 Ultra be? GPT-5-level much? We don’t know but given 1.5 Pro eval scores, it’s possible.
The takeaway is that Gemini 1.0 being GPT-4-level isn’t accidental—the consequence of having hit a wall or a sign of Google’s limitations—but instead a predefined plan to tell the world they, too, can create that kind of AI (let me remind you that the team that builds the models is not the team in charge of the marketing part that Google so often fails at).
Anthropic’s case isn’t so clear to me because they’re more press-shy than Google and OpenAI but I have no reason to exclude them given that Claude 3’s performance is so subtly above GPT-4 that it’s hard to believe it’s a coincidence. Another key point in favor of Anthropic is that it was founded in 2021. How much time does a world-class AI startup need to start competing at the highest level? Partnerships, infrastructure, hardware, training times, etc. require time and Anthropic was just settling down when OpenAI began training GPT-4. Claude 3 is Anthropic’s first real effort so I won’t be surprised if Claude 4 comes sooner than expected and matches anything OpenAI may achieve with GPT-5.
The pattern I see is clear. For each new state-of-the-art generation of models (first GPT-3 level, then GPT-4 level, next GPT-5 level) the gap between the leader and the rest shrinks. The reason is evident: The top AI companies have learned how to build this technology reliably. Building best-in-class large language models (LLMs) is a solved problem. It’s not OpenAI’s secret anymore. They had an edge at the start because they figured out stuff others hadn’t yet, but those others have caught up.
Even if companies are good at keeping trade secrets from spies and leakers, tech and innovation eventually converge on what’s possible and affordable to do. The GPT-5 class of models may have some degree of heterogeneity (just as happens with the GPT-4 class) but the direction they’re all going is the same.
If I am correct, this takes relevance away from GPT-5 itself—which is why I think this 14,000-word analysis should be read more broadly than just a preview of GPT-5—and puts it into the whole class of models. That’s a good thing.
GPT-5 or GPT-4.5?
There were rumors in early March that GPT-4.5 had been leaked (the announcement, not the weights). Search engines caught the news before OpenAI removed it. The web page said the “knowledge cut-off” (up to what point in time the model knows about the state of the world) was June 2024. This means the hypothetical GPT-4.5 would train until June and then go through the months-long process of safety testing, guardrailing, and red-teaming, delaying release until the end of the year.
If this were true, does this mean GPT-5 isn’t coming this year? Possibly, but not necessarily. The thing we need to remember is that these names—GPT-4, GPT-4.5, GPT-5 (or something else entirely)—are placeholders for some level of ability OpenAI considers sufficiently high to deserve a given release number. OpenAI is always improving its models, exploring new research avenues, doing training runs with different levels of compute, and evaluating model checkpoints. Building a new model isn’t a trivial, straightforward process but requires tons of trial and error, tweaking details, and “YOLO runs” that may yield unexpectedly good results.
After all the experimenting, when they feel ready, they go and do the big training run. Once it reaches the “that’s good enough” performance point, they release it under the most appropriate name. If they called GPT-4.5 GPT-5 or vice versa, we wouldn’t notice. This step-by-step checkpointed process also explains why Gemini 1.0/1.5 and Claude 3 can be so slightly above GPT-4 without it meaning there’s a wall for LLMs.
This implies that all the sources I’ll quote below talking about a “GPT-5 release” may actually be talking, without realizing it, about GPT-4.5 or some novel kind of thing with a different name. Perhaps, the GPT-4.5 leak that puts the knowledge cut-off at June 2024 will be GPT-5 after a few more improvements (perhaps they tried to reach a GPT-4.5 level and couldn’t quite get there and had to discard the release). These decisions change on the go depending on internal results and the moves from competitors (perhaps OpenAI didn’t expect Claude 3 to be the public’s preferred model in March and decided to discard the GPT-4.5 release for that reason).
Here’s one strong reason to think there won’t be a GPT-4.5 release: It makes no sense to do .5 releases when the competition is so close and scrutiny so intense (even if Sam Altman says he wants to double down on iterative deployment to avoid shocking the world and give us time to adapt and so on).
People will unconsciously treat every new big release as being “the next model,” whatever the number, and will test it against their expectations. If users feel it’s not good enough they will question why OpenAI didn’t wait for the .0 release. If they feel it’s very good then OpenAI will wonder if they should’ve named it .0 instead because now they’ll have to make an even bigger jump to get an acceptable .0 model. Not everything should be dictated by what customers want, but generative AI is now more an industry than a scientific field. OpenAI should go for the GPT-5 model and make it good.
There are exceptions, though. OpenAI released a GPT-3.5 model, but if you think about it, it was a low-key change (later overshadowed by ChatGPT). They didn’t make a fuss out of that one as they did for GPT-3 and GPT-4 or even DALL-E and Sora. Another example is Google’s Gemini 1.5, announced a week after Gemini 1.0 Ultra. Google wanted to double down on its victory against GPT-4 by doing two consecutive releases above OpenAI’s best model. It failed—Gemini 1.0 Ultra wasn’t better than GPT-4 (people expected more, not a tricky demo) and Gemini 1.5 was pushed to the side by Sora, which OpenAI released a few hours later (Google still has a lot to learn from OpenAI’s marketing tactics).2 Anyway, OpenAI needs a good reason to do a GPT-4.5 release.
The GPT brand trap
The last thing I want to mention in this section is the GPT trap: Contrary to the other companies, OpenAI has associated its products heavily with the GPT acronym, which is now both a technical term (as it was originally) and a brand with a kind of prestige and power that’s hard to give up. A GPT, Generative Pre-trained Transformer, is a very specific type of neural network architecture that may or may not survive new research breakthroughs. Can a GPT escape the “autoregressive trap”? Can you imbue reasoning into a GPT or upgrade it into an agent? It’s unclear.
My question is: Will OpenAI still call its models GPTs to maintain the powerful brand with which most people associate AI or will they stay rigorous and switch to something else (Q* or whatever) once the technical meaning is exhausted by better things? If OpenAI sticks to the invaluable acronym (as the trademark registrations suggest) wouldn’t they be self-sabotaging their future by anchoring it in the past? OpenAI risks letting people falsely believe they’re interacting with another chatbot when they may have in their hands a powerful agent instead. Just a thought.
Part 2: Everything we know about GPT-5
When will OpenAI release GPT-5?
On March 18th, Lex Fridman interviewed Sam Altman. One of the details he revealed was about GPT-5’s release date. Fridman asked “So, when is GPT-5 coming out, again?” to which Altman responded, “I don’t know; that’s the honest answer.”
I believe in his honesty to the degree that there are different possible interpretations for his ambiguous “I don’t know.” I think he knows exactly what he wants OpenAI to do but the inherent uncertainty of life allows him the semantic space to say that, honestly, he doesn’t know. To the extent that Altman knows what there is to know, he may not be saying more because, first, they’re still deciding whether to release an intermediate GPT-4.5; second, they’re measuring the distance with competitors; and third, he doesn’t want to reveal the exact date so as not to give competitors the chance to overshadow the release somehow, as OpenAI itself does to Google all the time.
He then hesitated to answer whether GPT-5 is coming out this year at all, but added: “We will release an amazing new model this year; I don’t know what we’ll call it.” I think this vagueness is resolved by my arguments above in the “GPT-5 or GPT-4.5?” section. Altman also said they have “a lot of other important things to release first” (some things he could be referring to: a public Sora and Voice Engine, a standalone web/work AI agent, a better ChatGPT UI/UX, a search engine, a Q* reasoning/math model). So building GPT-5 is a priority but releasing it is not.
Altman also said OpenAI has missed the mark before on its intention “not to have shock updates to the world” (e.g. the first GPT-4 version). This can shed light on the reasons for his ambiguity on GPT-5’s release date. He added: “Maybe we should think about releasing GPT-5 in a different way.” We could interpret this as a hand-waving comment but I think it helps explain Altman’s hesitancy to say something like “I know when we’ll release GPT-5 but I won’t tell you,” which would be fair and understandable.
It may even explain the notable improvement in math reasoning of the latest GPT-4 turbo release (April 9th): Perhaps the way they’re releasing GPT-5 differently to not shock the world is by testing its parts (e.g. new math/reasoning fine-tuning for GPT-4) in the wild before bringing them together into a cohesive whole for a much more powerful base model. That would be equal parts responsible and consistent with Altman’s words.
Let’s hear other sources. On March 19th, the day after the Fridman-Altman interview, Business Insider published a news article entitled “OpenAI is expected to release a 'materially better' GPT-5 for its chatbot mid-year, sources say,” which squarely contradicts what Altman said the day before. How can a non-OpenAI source know the date if Altman doesn’t? How can GPT-5 be coming out mid-year if OpenAI still has so many things to release first? The info is incoherent. Here’s what Business Insider wrote:
The generative AI company helmed by Sam Altman is on track to put out GPT-5 sometime mid-year, likely during summer, according to two people familiar with the company [identities confirmed by Business Insider]. … OpenAI is still training GPT-5, one of the people familiar said. After training is complete, it will be safety tested internally and further “red teamed”…
So GPT-5 was still training on March 19th (the only data point from the article that’s not a prediction but a fact). Let’s take the generous estimate and say it’s finished training already (April 2024) and OpenAI is already doing safety tests and red-teaming. How long will that last before they’re ready to deploy? Let’s take the generous estimate again and say “the same as GPT-4” (GPT-5 being presumably more complex, as we’ll see in the next sections, makes this a safe lower bound). GPT-4 finished training in August 2022 and OpenAI announced it in March 2023. That’s seven months of safety layering. But remember that Microsoft’s Bing Chat already had GPT-4 under the hood. Bing Chat was announced in early February 2023. So half a year it is.
All in all, the most generous estimates put GPT-5’s release half a year away from now, pushing the date not to Summer 2024 (June seems to be a hot date for AI releases) but to October 2024—in the best case! That’s one month before the elections. Surely OpenAI isn’t that reckless given the precedents of AI-powered political propaganda.
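To make that timeline arithmetic concrete, here’s a minimal sketch. The dates are the ones cited above, pinned to specific days for the calculation, and the April 2024 training-completion date is my generous assumption, not a confirmed fact:

```python
from datetime import date

# GPT-4's post-training safety window, per the dates cited above (pinned to specific days).
gpt4_training_done = date(2022, 8, 1)
bing_chat_launch   = date(2023, 2, 7)    # GPT-4 was already under Bing Chat's hood
gpt4_announced     = date(2023, 3, 14)

low  = bing_chat_launch - gpt4_training_done   # ~6 months of safety testing and red-teaming
high = gpt4_announced - gpt4_training_done     # ~7 months

# Generous assumption from above: GPT-5 finished training around April 2024.
gpt5_training_done = date(2024, 4, 1)
print(gpt5_training_done + low)    # ~early October 2024
print(gpt5_training_done + high)   # ~mid November 2024
```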
Could the “GPT-5 going out sometime mid-year” be a mistake by Business Insider and refer to GPT-4.5 instead (or refer to nothing)? I already said I don’t think OpenAI will replace the GPT-5 announcement with 4.5 but they may add this release as an intermediate low-key milestone while making it clear GPT-5 is coming soon (fighting Google and Anthropic before they release something else is a good reason to release a 4.5 version—as long as the GPT-5 model is on the way a few months later).
This view reconciles all the info we’ve analyzed so far: Altman’s “I don’t know when GPT-5 is coming out” and the “we have a lot of other important things to release first.” It’s also in line with the doubling down on iterative deployment and the threat that a “shocking” new model would pose to the elections. Speaking of the elections, the other candidate for the GPT-5 release date is around DevDay in November (my favored prediction). Last year, OpenAI did its first developer conference on November 6th, which this year is the day after the elections.
Given all this info (including the incoherent parts that make sense once we understand that “GPT-5” is an arbitrary name and that non-OpenAI sources may confuse the names of coming releases) my bet is this: GPT-4.5 (possibly something else that’s also a sneak advance to GPT-5) is coming in Summer and GPT-5 after the elections. OpenAI will release something new in the coming months but it won’t be the biggest release Altman says is coming this year. (Recent events suggest an even earlier surprise is still possible.)3
How good will GPT-5 be?
This is the question everyone’s waiting for. Let me say upfront that I don’t have privileged information. That doesn’t mean you won’t get anything from this section. Its value is twofold: first, it’s a compilation of sources you may have missed, and second, it’s an analysis and interpretation of the info, which can shed some further light on what we can expect. (In the “algorithmic breakthroughs” section I’ve gone much more in-depth on what GPT-5 may integrate from cutting-edge research. There’s no official info yet on that, just clues and breadcrumbs and my self-confidence that I can follow them reasonably well.)
Over the months, Altman has given hints of his confidence in GPT-5’s improvement over existing AIs. In January, during the World Economic Forum at Davos, Altman spoke in private to the Korean outlet Maeil Business Newspaper, among other news outlets, and said this (translated with Google): “GPT2 was very bad. GPT3 was pretty bad. GPT4 was pretty bad. But GPT5 will be good.” A month ago he told Fridman that GPT-4 “kinda sucks” and that GPT-5 will be “smarter”, not just in one category but across the board.
People close to OpenAI have also spoken in vague terms. Richard He, via Howie Xu, said: “Most GPT-4 limitations will get fixed in GPT-5,” and a non-disclosed source told Business Insider that “[GPT-5] is really good, like materially better.” All this information is fine, but also trivial, vague, or even unreliable (can we trust Business Insider’s sources at this point?).
However, there’s one thing Altman told Fridman that I believe is the most important data point we have about GPT-5’s intelligence. Here’s what he said: “I expect that the delta between 5 and 4 will be the same as between 4 and 3.” This claim is substantially more SNR-rich than the others. If it sounds similarly cryptic it’s because what it says isn’t about GPT-5’s absolute intelligence level, but about its relative intelligence level, which may be trickier to analyze. In particular: GPT-3 → GPT-4 = GPT-4 → GPT-5.
To interpret this “equation” (admittedly still ambiguous) we need the technical means to unpack it as well as know a lot about GPT-3 and GPT-4. That’s what I’ve done for this section (also, unless some big leak happens, this is the best we’ll get from Altman). The only assumption I need to make is that Altman knows what he’s talking about—he understands what those deltas imply—and that he already knows the ballpark of GPT-5’s intelligence, even if it’s not finished yet (just like Zuck knows Llama 3 405B checkpoint performance). From that, I’ve come up with three interpretations (for the sake of clarity, I’ve used only the model numbers, without the “GPT”):
The first reading is that the 4-5 and 3-4 deltas refer to comparable jumps across benchmark evaluations, which means that 5 will be broadly smarter than 4 as 4 was broadly smarter than 3 (this one starts out tricky because it’s common knowledge that evals are broken, but let’s set that aside). That’s surely an outcome people would be happy with, knowing that as models get better, climbing the benchmarks becomes much harder. So hard, actually, that I wonder if it’s even possible. Not because AI can’t become that intelligent but because such intelligence would make our human measuring sticks too short, i.e. benchmarks would be too easy for GPT-5.
The benchmark comparisons OpenAI published for GPT-4 are against 3.5 (3 would score lower). In some areas, 4 doesn’t improve much but in others, it’s so much better that it already risks making the scores meaningless for being too high. Even if we accepted that 5 wouldn’t get better at literally everything, in those areas it did, it’d surpass the limits of what the benchmarks can offer. This makes it impossible for 5 to achieve a delta from 4 the size of 3-4. At least if we use these benchmarks.
If we assume Altman is considering harder benchmarks (e.g. SWE-bench or ARC) where both GPT-3 and GPT-4’s performances are so poor (GPT-4 on SWE-bench, GPT-3 on ARC, GPT-4 on ARC), then having GPT-5 show a similar delta would be underwhelming. If you take exams made for humans instead (e.g. SAT, Bar, APs), you can’t trust GPT-5’s training data hasn’t been contaminated.
The second interpretation suggests the delta refers to the non-linear “exponential” scaling laws (increases in size, data, compute) instead of linear increases in performance. This implies that 5 continues the curves delineated before by 2, 3, and 4, whatever that yields performance-wise. For instance, if 3 has 175B parameters and 4 has 1.8T, then 5 will have around 18 trillion. But parameter count is just one factor in the scaling approach, so the delta may include everything else: how much computing power they use, how much training data they feed the model, etc. (I explore GPT-5’s relationship with the scaling laws in more depth in the next section.)
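To make the naive version of this interpretation concrete, here’s the back-of-envelope extrapolation on parameter count alone (taking the 1.8T GPT-4 figure as the estimate it is):

```python
# Naive "same delta" extrapolation on parameter count alone (GPT-4's figure is an estimate).
gpt3_params, gpt4_params = 175e9, 1.8e12
growth_factor = gpt4_params / gpt3_params           # ~10x
gpt5_params = gpt4_params * growth_factor
print(f"~{gpt5_params / 1e12:.1f}T parameters")     # ~18.5T, i.e. "around 18 trillion"
```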
This is a safer claim from Altman (OpenAI controls these variables) and a more sensible one (emergent capabilities require new benchmarks for which previous data is non-existent, making the 3→4 vs 4→5 comparison impossible). However, Altman says he expects that delta, which suggests he doesn’t know for sure, and these are variables (e.g. how many FLOPs it took to train GPT-5) he would know.
The third possibility is that Altman’s delta refers to user perception, i.e. users will perceive 5 to be better than 4 to the same degree that they perceived 4 to be better than 3 (ask heavy users and you will know the answer is “a damn lot”). This is a bold claim because Altman can’t possibly know what we’ll think, but he may be talking from experience; that’s what he felt from initial evaluations and he’s simply sharing his anecdotal evaluation.
If this interpretation is correct then we can conclude GPT-5 will be impressive, if it truly feels that way for the people most used to playing with its previous versions—who are also the people with the highest expectations and for whom the novelty of the tech has faded away the most. If I’m feeling generous and had to bet on which interpretation is most correct, I’d go for this one.
If I’m not feeling generous, there’s a fourth interpretation: Altman is just hyping his company’s next product. OpenAI has delivered in the past but the aggressive marketing tactics have always been there (e.g. releasing Sora hours after Google released Gemini 1.5). We can default to this one to be safe but I believe there’s some truth to the above three, especially the third one.
How OpenAI’s goals shape GPT-5
Before we go further into speculation territory, let me share what I believe to be the right framing to understand what GPT-5 can and can’t be, i.e. how to tell informed speculation from delusion. This serves as a general perspective to understand the entirety of OpenAI’s approach to AI. I’ll apply it to GPT-5 because that’s our topic today.
OpenAI’s stated goal is AGI, which is so vague as to be irrelevant to serious analysis. Besides AGI, OpenAI has two “unofficial goals” (instrumental goals, if you will), more concrete and immediate that are the true bottlenecks moving forward (in a technical sense; product-wise there are other considerations, like “Make something people want”). These two are augmenting capabilities and reducing costs. Whatever we may hypothesize about GPT-5 must obey the need to balance the two.
OpenAI can always augment capabilities mindlessly (as long as its researchers and engineers know how) but that could yield unacceptable costs on the Azure Cloud, which would strain the partnership with Microsoft (a partnership that’s already not as exclusive as it used to be). OpenAI can’t afford to become a cash drain. DeepMind was Google’s money pit early on but the excuse was “in the name of science.” OpenAI is focused on business and products so they have to bring in some juicy profits.
They can always decrease costs (in different ways e.g. custom hardware, squeezing inference times, sparsity, optimizing infra, and applying training techniques like quantization) but doing it blindly would hinder capabilities (in spring 2023 they had to drop a project codenamed “Arrakis” to make ChatGPT more efficient through sparsity because it wasn’t performing well). It’s better to spend more money than lose the trust of customers—or worse, investors.
So anyway, with these two opposing requirements—capabilities and costs—at the top of OpenAI’s hierarchy of priorities (just below the always-nebulous AGI), we can narrow down what to expect from GPT-5 even if we lack official information—we know they care about both factors. The balance further tilts against OpenAI if we add the external circumstances limiting their options: a GPU shortage (not as extreme as it was in mid-2023 but still present), an internet data shortage, a data center shortage, and a desperate search for new algorithms.
There’s a final factor that directly influences GPT-5 and somehow pushes OpenAI to make the most capable model they can: Their special spot in the industry. OpenAI is the highest-profile AI startup, at the lead economically and technically, and we hold our breaths every time they release something. All eyes are on them—competitors, users, investors, analysts, journalists, even governments—so they have to go big. GPT-5 has to exceed expectations and shift the paradigm. Despite what Altman said about iterative deployment and not shocking the world, in a way they have to shock the world. Even if just a little.
So despite costs and some external constraints—compute, data, algorithms, elections, social repercussions—limiting how far they can go, the insatiable hunger for augmented capabilities and the need to shock the world just a little will push them to go as far as they can. Let’s see how far that might be.
Part 3: Everything we don’t know about GPT-5
GPT-5 and the ruling of the scaling laws
In 2020, OpenAI devised an empirical form of the scaling laws that has defined AI companies’ roadmaps ever since. The main idea is that three factors are enough to define and even predict model performance: model size, number of training tokens, and compute/training FLOPs (in 2022, DeepMind refined the laws and our understanding of how to train compute-efficient models into what’s known as the “Chinchilla scaling laws”, i.e. the largest models are heavily undertrained; you need to scale dataset size in the same proportion as model size to make the most of the available compute and achieve the most performant AI).
The bottom line of the scaling laws (either OpenAI’s original form or DeepMind’s revised version) implies that as your budget grows, most of it should be allocated to scale the models (size, data, compute). (Even if the specifics of the laws are disputed, their existence, whatever the constants happen to be, is beyond doubt at this point.)
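As a rough illustration of what those laws imply in practice, here’s a back-of-envelope sketch using the standard public approximations from the Chinchilla paper (about 20 training tokens per parameter for compute-optimal training, and roughly 6 FLOPs per parameter per token); the parameter counts are the public figures and estimates used elsewhere in this article, not OpenAI’s numbers:

```python
# Back-of-envelope Chinchilla arithmetic (rough public approximations, not OpenAI's numbers).

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Compute-optimal training tokens: roughly 20 tokens per parameter."""
    return 20 * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard estimate of training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

for name, n in [("GPT-3", 175e9), ("GPT-4 (estimated)", 1.8e12)]:
    d = chinchilla_optimal_tokens(n)
    print(f"{name}: ~{d / 1e12:.1f}T optimal tokens, ~{training_flops(n, d):.1e} training FLOPs")
# GPT-3 was trained on ~0.3T tokens, far below its ~3.5T "optimal" count: heavily undertrained.
```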
Altman claimed in 2023 that “we’re at the end of the era where it’s gonna be these giant models, and we’ll make them better in other ways.” One of the many ways this approach shaped GPT-4—and will surely shape GPT-5—without giving up on scale was by making it a Mixture of Experts (MoE) instead of a large dense model, like GPT-3 and GPT-2 had been.
An MoE is a clever mix of smaller specialized models (experts) that are activated depending on the nature of the input (you can imagine it as a math expert for math questions, a creative expert for writing fiction, and so on), through a gating mechanism that’s also a neural network and learns to allocate inputs to experts. At a fixed compute budget, an MoE architecture improves performance and inference speed over a dense model of comparable cost because only a tiny subset of specialized parameters is active for any given query.
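Here’s a minimal, illustrative sketch of the idea in PyTorch (this is not OpenAI’s architecture; the expert count, dimensions, and top-2 routing are arbitrary choices for the example): each token is routed to a couple of experts by a small learned gating network, so most parameters sit idle on any given input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative top-k mixture-of-experts layer (not OpenAI's implementation)."""
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # the learned gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                      # x: (n_tokens, d_model)
        scores = self.router(x)                                # (n_tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)    # route each token to k experts
        weights = F.softmax(topk_scores, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                  # tokens routed to expert e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)      # (n_selected, 1)
                    out[mask] += w * expert(x[mask])
        return out
```

The total parameter count grows with the number of experts, but the compute per token grows only with k, which is the whole appeal.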
Does Altman’s claim about “the end of the era of giant models” or the shift from dense to MoE contradict the scaling laws? Not at all. It is, if anything, a smarter application of the lessons of scale by leveraging other tricks like architecture optimization (I was mistaken to criticize OpenAI for making GPT-4 an MoE). Scale is still king in generative AI (especially in language and multimodal models) simply because it works. Can you make it work even more by improving the models in other aspects? That’s great!
The only way to compete at the highest level is to approach AI innovation with a holistic view: It makes no sense to heavily research a better algorithm if more compute and data can close the performance gap for you. Neither does it make sense to waste millions on H100s when a simpler architecture or an optimization technique can save you half that money. If making GPT-5 10x larger works, fine. If making it a super-MoE works, fine.
Fridman asked Altman what the main challenges to creating GPT-5 are (compute or technical/algorithmic), and Altman said: “It’s always all of these.” He added: The thing that OpenAI does really well is that “we multiply 200 medium-sized things together into one giant thing.”4
Artificial intelligence has always been a field of trade-offs but once generative AI jumped to the market and became an industry expected to return a profit, more trade-offs were added. OpenAI is juggling all of this. Right now, the preferred heuristic to find the better route is following Richard Sutton’s advice from the Bitter Lesson, which is an informal formulation of the scaling laws. Here’s how I’d summarize OpenAI’s holistic approach to dealing with these trade-offs in one sentence: Believe strongly in the scaling laws but hold that opinion loosely in the face of promising research.
GPT-5 is a product of this holistic view, so it’ll squeeze the most out of the scaling laws—and anything else as long as it brings OpenAI closer to its goals. In which ways does scale define GPT-5? My bet is simple: In all of them. Increase model size, increase the training dataset, and increase compute/FLOPs. Let’s do some rough numbers.
Model size
GPT-5 will also be an MoE (AI companies are mostly making MoEs now for good reason: high performance with efficient inference. Llama 3 is an interesting exception, probably because it’s intended—especially the smaller versions—to be run locally so GPU-poors can fit it in their limited memory). GPT-5 will be larger than GPT-4 in total parameter count, which means, in case OpenAI hasn’t found a better architectural design than an MoE, that GPT-5 will have either more experts or larger ones than GPT-4, whatever yields the best mix of performance and efficiency (there are other ways to add parameters but this makes the most sense to me).
How much larger GPT-5 will be is unknown. We could naively extrapolate the parameter count growth trend: GPT, 2018 (117M), GPT-2, 2019 (1.5B), GPT-3, 2020 (175B), GPT-4, 2023 (1.8T, estimated), but the jumps don’t correspond to any well-defined curve (especially because GPT-4 is an MoE so it’s not an apples-to-apples comparison with the others). Another reason this naive extrapolation doesn’t work is that how big it makes sense to go with a new model is contingent on the size of the training dataset and the number of GPUs you can train it on (remember the external constraints I mentioned earlier; data and hardware shortages).
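For what it’s worth, here’s the growth-factor arithmetic behind that caveat, using the counts cited above:

```python
# Generation-to-generation growth factors, using the parameter counts cited above
# (GPT-4's 1.8T is an estimate, and it's an MoE, so it's not apples-to-apples).
params = {"GPT (2018)": 117e6, "GPT-2 (2019)": 1.5e9, "GPT-3 (2020)": 175e9, "GPT-4 (2023)": 1.8e12}

names = list(params)
for prev, curr in zip(names, names[1:]):
    print(f"{prev} -> {curr}: ~{params[curr] / params[prev]:.0f}x")
# ~13x, ~117x, ~10x: no clean curve to extrapolate GPT-5 from.
```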
I’ve found size estimates published elsewhere (e.g. 2-5T parameters) but I believe there’s not enough info to make an accurate prediction (I’ve calculated mine anyway to give you something juicy even if it ends up not being super accurate).
Let’s see why making informed size estimates is harder than it sounds. For instance, the above 2-5T number by Alan Thompson is based on the assumption that OpenAI is using twice the compute (“10,000 → 25,000 NVIDIA A100 GPUs with some H100s”) and twice the training time (“~3 months → ~4-6 months”) for GPT-5 compared to GPT-4.
GPT-5 was already training in November and the final training run was still ongoing a month ago, so double the training time makes sense, but the GPU count is off. By the time they started training GPT-5, and despite the H100 GPU shortage, OpenAI had access to the majority of Microsoft Azure Cloud’s compute, i.e. “10k-40k H100s.” So GPT-5 could be bigger than 2-5T by a factor of up to 3x (I’ve written down the details of my calculations below).
Dataset size
The Chinchilla scaling laws reveal that the largest models are severely undertrained, so it makes little sense to make GPT-5 larger than GPT-4 without more data to feed the additional parameters.
Even if GPT-5 were similar in size (which I’m not betting on but wouldn’t violate the scaling laws and could be sensible under a new algorithmic paradigm), the Chinchilla laws suggest more data alone would also yield better performance (e.g. Llama 3’s 8B-parameter model was trained on 15T tokens, which is heavily “overtrained”, yet it was still learning when they stopped the training run).
GPT-4 (1.8T parameters) is estimated to have been trained for around 12-13 trillion tokens. If we conservatively assume GPT-5 is the same size as GPT-4, then OpenAI could still improve it by feeding it with up to 100 trillion tokens—if they find a way to collect that many! If it’s larger, well, then they need those succulent tokens.
One option for OpenAI was to use Whisper to transcribe YouTube videos (which they’ve been doing against YouTube’s TOS). Another option was synthetic data, which is already a commonplace practice among AI companies and will be the norm once human-made internet data “runs out.” I believe OpenAI is still squeezing the last remnants of accessible data and searching for new ways to ensure the high quality of synthetic data.
(They may have found an intriguing way to do the latter to improve performance without increasing the number of pre-training tokens. I’ve explored that part in the “reasoning” subsection of the “algorithmic breakthroughs” section.)
Compute
More GPUs allow for bigger models and more epochs on the same dataset, which yields better performance in both cases (up to some point they haven’t found yet). To draw a rough conclusion from this entire superficial analysis we should focus on the one thing we know for sure changed between the August 2022-March 2023 period (between the end of GPT-4’s training run and its launch) and now: OpenAI’s access to Azure’s thousands of H100s and the subsequent increase in available FLOPs to train the next models.
Perhaps OpenAI also found a way to optimize the MoE architecture further and fit more parameters at the same training/inference cost, perhaps they found a way to turn synthetic AI-generated data into high-quality GPT-5-worthy tokens, but we can’t be sure of either. Azure’s H100s, however, give OpenAI a certain edge we shouldn’t ignore. If there’s an AI startup getting out of the GPU shortage, it’s OpenAI. Compute is where costs play a role but Microsoft is, for now, taking care of that part as long as GPT-5 yields great results (and isn’t AGI yet).
My estimate for GPT-5’s size
Let’s say OpenAI has used not 25k A100s, as Thompson suggests, but 25k H100s to train GPT-5 (the average of Microsoft Cloud’s “10k-40k H100s” reserved for OpenAI). Rounding the numbers, H100s are 2x-4x faster than A100s for training LLMs (at a similar cost). OpenAI could train a GPT-4-sized model in one month with this amount of compute. If GPT-5 is taking them 4-6 months, then the resulting estimate for its size is 7-11T parameters (assuming the same architecture and training data). That’s more than twice Thompson’s estimate. But, does it even make sense to make it that large or is it better to train a smaller model on more FLOPs? We don’t know; OpenAI may have made another architectural or algorithmic breakthrough this year to improve performance without increasing size.
Let’s now do the analysis assuming inference is the limiting factor (Altman said in 2023 that OpenAI is constrained GPU-wise in both training and inference but that he’d prefer to 10x efficiency on the latter, which is a sign that inference costs will eventually surpass training costs). With 25k H100s, OpenAI has, for GPT-5 vs GPT-4, twice as many max FLOPs, larger inference batch sizes, and the ability to do inference at FP8 instead of FP16 (half precision). This entails a 2x-8x increase in performance at inference. GPT-5 could be as big as 10-15T parameters, an order of magnitude larger than GPT-4 (if the existing parallelism configurations that distribute the model weights across GPUs at inference time don’t break at that size, which I don’t know). OpenAI could also choose to make it one order of magnitude more efficient, which is synonymous with cheaper (or some weighted mix of the two).
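Here’s the arithmetic behind both rough estimates, under the assumptions stated above (25k H100s, a GPT-4-sized model trainable in about a month at that compute, the same data and architecture, and the hand-wavy 2x-8x inference headroom); none of these numbers are confirmed:

```python
# Training-side estimate, under the assumptions stated above.
gpt4_params = 1.8e12               # estimated GPT-4 total parameter count (MoE)
months_per_gpt4_sized_run = 1      # assumption: 25k H100s ~ one GPT-4-sized model per month
gpt5_training_months = (4, 6)      # reported training duration for GPT-5

low, high = (gpt4_params * m / months_per_gpt4_sized_run for m in gpt5_training_months)
print(f"~{low / 1e12:.0f}T to ~{high / 1e12:.0f}T parameters")   # ~7T to ~11T

# Inference-side headroom vs GPT-4: ~2x raw FLOPs, ~2x from FP8 vs FP16,
# plus a hand-wavy ~2x from larger batch sizes.
flops_factor, fp8_factor, batch_factor = 2, 2, 2
print(f"{flops_factor}x to {flops_factor * fp8_factor * batch_factor}x")   # 2x to 8x
```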
Another possibility, one I think deserves consideration given that OpenAI keeps improving GPT-4, is that part of the newly available compute will be redirected to making GPT-4 more efficient/cheaper (or even free, replacing GPT-3.5 altogether; one can dream, right?). That way, OpenAI can capture revenue from hesitant users who know ChatGPT exists but are unwilling to go paid or unaware that the jump between the 3.5 free version and the 4 paid version is huge. I won’t comment more on the price of the service (not sure whether GPT-5 will go on ChatGPT at all) because without the exact specs, it’s impossible to tell (size/data/compute is first-order uncertainty but price is second-order uncertainty). It’s just business-lens speculation: ChatGPT usage isn’t growing and OpenAI should do something about that.5
Algorithmic breakthroughs in GPT-5
This is the juiciest section of all (yes, even more than the last one) and, as the laws of juiciness dictate, also the most speculative. Extrapolating the scaling laws from GPT-4 to GPT-5 is doable, if tricky. Trying to predict algorithmic advances given how much opacity there is in the field at the moment is the greater challenge.
The best heuristics are following OpenAI-adjacent people, lurking in alpha places with high SNR, and reading papers coming out of top labs. I only do these partially, so excuse any outlandish claims. If you’ve made it this far, you’re too deep into my delirium anyway. So thank you for that. Here’s a hint of what we can expect (i.e. what OpenAI has been working on since GPT-4): the set of capabilities Altman has been presenting publicly (reasoning, agents, personalization, multimodality, and so on).
This is, of course, Altman’s marketing, but we can use this structured vision to take away valuable insights.6 Some of these capabilities are heavier on the behavioral side (e.g. reasoning, agents) while others are more on the consumer side (e.g. personalization). All of them require algorithmic breakthroughs.7 The question is, will GPT-5 be the materialization of this vision? Let’s break it down and make an informed guess.
Multimodality
A couple of years ago multimodality was a dream. Today, it’s a must. All the top AI companies (interested in AGI or not) are working hard on giving their models the ability to capture and generate various sensory modalities. AI people like to think there’s no need to replicate all of the evolutionary traits that make us intelligent, but the multimodality of the brain isn’t one they can afford to exclude. Two examples of these efforts: GPT-4 can take text and images and generate text, images, and audio. Gemini 1.5 can take text, images, audio, and video and generate text and images.
The obvious question is this: Where’s multimodality going? What additional sensory skills will GPT-5 (and next-gen AI models in general) have? Naively, we may think humans have five and once those are integrated, we’re done. That’s not true; humans actually have a few more. Are all of those necessary for AI to be intelligent? Should we implement the modes animals have that we don’t? These are interesting questions but we’re talking about GPT-5, so I’ve stuck to the immediate possibilities; those OpenAI has hinted at having solved.
Voice Engine suggests emotional, human-like synthetic audio is mostly a solved problem. It’s already implemented in ChatGPT so it’ll be in GPT-5 (perhaps not from the outset). The not-solved-but-almost hottest area is video generation. OpenAI announced Sora in February but didn’t release it. The Information reported that Google DeepMind’s CEO, Demis Hassabis, said “It may be tough for Google to catch up to OpenAI’s Sora.” Given Gemini 1.5’s capabilities, this isn’t an admission that Google can’t ship AI stuff but an acknowledgment of how impressive a feat Sora is. Will OpenAI put it in GPT-5? They’re testing first impressions among artists and at TED; it’s anyone’s guess what would happen once anyone can create videos of anything.
The Verge reported that Adobe Premiere Pro will integrate AI video tools, possibly including OpenAI’s Sora. I bet OpenAI will first release Sora as a standalone model but will eventually merge it with GPT-5. It’d be a nod to the “not shock the world” promise given how much more accustomed we are to text models than to video models. They will roll out access to Sora gradually, as they’ve done before with GPT-4 Vision, and then will give GPT-5 the ability to generate (and understand) video.
Robotics
Altman doesn’t mention humanoid robots or embodiment in his “AI capabilities” slide but the partnership with Figure (and the slick demo you shouldn’t believe at all even if it’s real) says it all about OpenAI’s future bets in the area (note that multimodality isn’t just about eyes and ears but also haptics and proprioception as well as motor systems, i.e. walking and dexterity). In a way, robotics is the common factor between multimodality and agents.
One of my most confident takes that’s less accepted in AI circles is that a body is a prerequisite for reaching the intelligence level of a human, whether that body is silicon-based or carbon-based. We tend to think that intelligence lies in our brains but that’s an intellectual disservice to the critical role our bodies (and the bodies of others) play in perception and cognition. Melanie Mitchell wrote a Science review on the topic of general intelligence and said this about embodiment and socialization:
Many who study biological intelligence are also skeptical that so-called “cognitive” aspects of intelligence can be separated from its other modes and captured in a disembodied machine. Psychologists have shown that important aspects of human intelligence are grounded in one’s embodied physical and emotional experiences. Evidence also shows that individual intelligence is deeply reliant on one’s participation in social and cultural environments. The abilities to understand, coordinate with, and learn from other people are likely much more important to a person’s success in accomplishing goals than is an individual’s “optimization power.”
I bet that OpenAI is coming back to robotics (we’ll see to what degree GPT-5 signals this shift). They gave up on it not out of philosophical conviction (even if some members of the company still say things like “video generation will lead to AGI by simulating everything,” which suggests a body is unnecessary) but out of pragmatic considerations: Not enough readily available data, simulations not rich enough to extrapolate results to the real world, real-world experiments too expensive and slow, Moravec’s Paradox, etc.
Perhaps they’re coming back to robotics by outsourcing the work to partners focused exclusively on that. A Figure 02 robot with GPT-5 inside, capable of agentic behavior and reasoning—and walking straight—would be a tremendous engineering feat and a wonder to witness.
Reasoning
This is a big one possibly coming with GPT-5 in an unprecedented way. Altman told Fridman GPT-5 will be broadly smarter than previous models, which is a shorter way to say it’ll be much more capable of reasoning. If human intelligence stands out from animal intelligence in one thing it is that we can reason about stuff. Reasoning, to give you a definition, is the ability to derive knowledge from existing knowledge by combining it with new information following logical rules, like deduction or induction so that we get closer to the truth. It’s how we build mental models of the world (a hot concept in AI right now), and how we develop plans to reach goals. In short, it’s how we’ve built the wonders around us we call civilization.
Conscious reasoning is hard. To be precise, it feels hard to us. Rightfully so, because it’s cognitively harder than most other things we do; multiplying 4-digit numbers in our heads is an ability reserved for the most capable minds. If it’s so hard, how can simple calculators do it instantly with larger numbers than we know how to name? This goes back to Moravec’s Paradox (which I just mentioned in passing). Hans Moravec observed that AI can do stuff that seems hard to us, like arithmetic with huge numbers, very easily, yet it struggles to do the tasks that seem most mundane, like walking straight.
But then, if dumb devices can do god-level arithmetic instantly, why does AI struggle to reason through novel tasks and problems so much more than humans do? Why is AI’s ability to generalize so poor? Why does it show superb crystallized intelligence but terrible fluid intelligence? There’s an ongoing debate on whether current state-of-the-art LLMs like GPT-4 or Claude 3 can reason at all. I believe the interesting data point is that they can’t reason like we do, with the same depth, reliability, robustness, or generalizability, but only “in extremely limited ways,” in Altman’s words. (Scoring rather high on “reasoning” benchmarks like MMLU or BIG-bench isn’t the same as being capable of human-like reasoning; it can be shortcut with memorization and pattern matching, not to mention tainted by data contamination.)
We could argue it’s a “skill issue” or that “Sampling can prove the presence of knowledge, but not its absence,” which are both fair and valid reasons but can’t quite explain GPT-4’s absolute failure with e.g. the ARC challenge that humans can solve. Evolution may have provided us with unnecessary hurdles to reason because it’s an ineffective optimization process, but there’s plenty of empirical evidence that suggests AI is still behind us in ways Moravec didn’t predict.8
All this is to introduce you to what I believe are deep technical issues underpinning AI’s reasoning flaws. The biggest factor I see is that AI companies have focused too heavily on imitation learning, i.e. taking vast amounts of human-made data on the internet and feeding huge models with it so they can learn by writing like we write and solving problems like we solve problems (that’s what pure LLMs do). The rationale was that by feeding AI with human data created throughout centuries, it’d learn to reason like we do, but it’s not working.
There are two important limitations to the imitation learning approach: First, the knowledge on the internet is mostly explicit knowledge (know-what) but tacit knowledge (know-how) can’t be accurately transmitted with words so we don’t even try—what you find online is mostly the finished product of a complex iterative process (e.g. you read my articles but you’re blissfully unaware of the dozens of drafts I had to go through). (I get back to the explicit-tacit distinction in the agents’ section.)
Second, imitation is only one of the many tools in the human kid’s learning toolkit. Kids also experiment, do trial and error, and self-play—we enjoy several means to learn beyond imitation by interacting with the world through feedback loops that update knowledge and integration mechanisms that stack it on top of existing knowledge. LLMs lack these critical reasoning tools. However, they’re not unheard of in AI: It’s what DeepMind’s AlphaGo Zero did to destroy AlphaGo 100-0—without any human data, just playing games against itself leveraging a combination of deep reinforcement learning (RL) and search.
Besides this powerful loop mechanism of trials and errors, both AlphaGo and AlphaGo Zero have an additional feature that, once again, not even the best LLMs (GPT-4, Claude 3, etc.) have today: the ability to ponder about what to do next (which is a mundane way to say they use a search algorithm to discern between bad, good, and better options against a goal by contrasting and integrating new information with prior knowledge). The ability to distribute computing power according to the complexity of the problem at hand is something humans do all the time (DeepMind has already tested this approach with interesting results). It’s what Daniel Kahneman called system 2 thinking in his popular book Thinking, Fast and Slow. Yoshua Bengio and Yann LeCun have tried to give AI “system 2 thinking” abilities.
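To make “pondering” concrete, here’s the simplest possible version of the idea, a best-of-N search sketch: spend extra inference compute sampling several candidate answers and let a learned value/reward model pick the winner. This is purely illustrative; `generate` and `score` are hypothetical stand-ins, and nothing here is Q* or anything OpenAI has confirmed. AlphaGo-style systems go much further (tree search plus self-play that updates the policy itself), but the trade-off is the same one Noam Brown points at below: more thinking time for better answers.

```python
from typing import Callable, List

def ponder(prompt: str,
           generate: Callable[[str], str],       # hypothetical stand-in for sampling an LLM
           score: Callable[[str, str], float],   # hypothetical learned value/reward model
           n_candidates: int = 16) -> str:
    """Best-of-N 'system 2' sketch: search over candidate answers instead of
    committing to the first one sampled, trading inference compute for quality."""
    candidates: List[str] = [generate(prompt) for _ in range(n_candidates)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```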
I believe these two features—self-play/loops/trial and error and system 2 thinking—to be promising research avenues to start closing the reasoning gap between AIs and humans. Interestingly, the very existence of AIs that have these abilities, like DeepMind’s AlphaGo Zero—also AlphaZero and MuZero (which wasn’t even given the rules of the games)—contrasts with the fact that the most recent AI systems today, like GPT-4, lack them. The reason is that the real world (even just the linguistic world) is much harder to “solve” than a chessboard: a game of imperfect information, ill-defined rules and rewards, and an unconstrained action space with quasi-infinite degrees of freedom are the closest to an impossible challenge you will find in science.
I believe bridging this gap between reasoning game-player AIs and reasoning real-world AIs is what all the current reasoning projects are about (I believe Gemini has something of this already but I don’t think it’s shown satisfactory results yet). Evidence leads me to think OpenAI has been focused particularly on leaving behind pure imitation learning by integrating the power of search and RL with LLMs. That’s what the speculation about Q* suggests and what public clues from leading researchers quietly scream. Perhaps the key person to look for at OpenAI for hints on this is Noam Brown, an expert in AI reasoning who joined the company from Meta in June 2023. In his announcement tweet he said this:
For years I’ve researched AI self-play and reasoning in games like Poker and Diplomacy. I’ll now investigate how to make these methods truly general. If successful, we may one day see LLMs that are 1,000x better than GPT-4. In 2016, AlphaGo beat Lee Sedol in a milestone for AI. But key to that was the AI’s ability to "ponder" for ~1 minute before each move … if we can discover a general version, the benefits could be huge. Yes, inference may be 1,000x slower and more costly, but what inference cost would we pay for a new cancer drug? Or for a proof of the Riemann Hypothesis?
I guess he just lays it all out once you have the background I provided above. More recently, in a tweet that has since been deleted, he said, “You don’t get superhuman performance by doing better imitation learning on human data.”
In a recent talk at Sequoia, Andrej Karpathy, who left OpenAI recently, said something along the same lines:
I think people still haven’t really seen what’s possible in the space … I think we’ve done step one of AlphaGo. We’ve done the imitation learning part. There’s step two of AlphaGo which is the RL and people haven’t done that yet … this is the part that actually made it work and made something superhuman. … The model needs to practice itself … it needs to figure out what works for it and what does not work for it [he suggests that our teaching ways aren’t adapted to the psychology of AIs].
Brown and Karpathy’s remarks on the limits of imitation learning echo something DeepMind’s cofounder Shane Legg said on Dwarkesh Patel’s podcast, again referencing AlphaGo:
To get real creativity you need to search through spaces of possibilities and find these sorts of hidden gems [he’s talking about the famous move 37 on AlphaGo’s second match against Lee Sedol] … I think current language models … don’t really do that kind of thing. They really are mimicking the data … the human ingenuity … that’s coming from the internet.
So to go beyond imitation learning you have to integrate it with search, self-play, reinforcement learning, etc. That’s what people believe Q* is. That’s what I believe Q* is. There are a few papers on how to introduce search abilities into LLMs or how to generalize self-play across games but I haven’t found conclusive evidence of what exactly OpenAI is using to add reasoning skills to GPT-5.
Will Q*/GPT-5 with reasoning be as impressive as the above suggests? Yann LeCun said we should “ignore the deluge of complete nonsense about Q*,” claiming that all top AI labs are working on similar things (technology converges on what’s possible, so that makes sense). He accused Altman of having “a long track record of self-delusion,” a criticism of what Altman had said, presumably about Q*, one day before he was fired in the boardroom drama: “[for the fourth time] I’ve gotten to be in the room when we pushed the veil of ignorance back and the frontier of discovery forward.”
But LeCun may also be trying to defend Meta’s work or perhaps he’s just bitter that OpenAI got Brown, who created Libratus (Poker) and CICERO (Diplomacy) at LeCun’s FAIR lab. (In favor of LeCun’s warning, we should also note that Karpathy says it’s not done yet and Brown was merely hinting at his future work, not something that already exists.)
As far as real results go, and with the amount of background and evidence we now have on AI reasoning, this comment by Flowers, a half-reliable OpenAI leaker, suggests the latest GPT-4 Turbo version is OpenAI’s current state of the art on this. The Information reported that Q* can solve previously unseen math problems and, as it happens, the new GPT-4 Turbo has improved the most on math/code problems (math tasks give the best early signals of reasoning ability). It also makes sense that OpenAI would use this low-key preview to assess Q* as a reasoning-focused model through GPT-4, an intermediate, “non-shocking” public release before giving GPT-5 this kind of intelligence.
I bet that GPT-5 will be a pure LLM with notably enhanced reasoning abilities, borrowing them from a Q*-like RL model.9 Beyond that, OpenAI will keep exploring how to bring together these two lines of research, whose complete merging remains elusive.
Personalization
I’ll keep this one short. Personalization is all about empowering the user with a more intimate relationship with the AI. Users can’t make ChatGPT their customized assistant to the degree they may want to. System prompts, fine-tuning, RAG, and other techniques allow users to steer the chatbot to their desired behavior but that’s insufficient in terms of both the knowledge the AI has of the user and the control the user has of the AI (and of the data it sends to the cloud to get a response from the servers). If you want the AI to know more about you, you need to provide more data, which in turn lowers your privacy. That’s a key trade-off.
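For context on how that steering works today, here’s a toy sketch of the RAG pattern (my own illustration, not anything OpenAI ships): keep personal documents on your side, retrieve only the most relevant snippet for a question, and send just that snippet plus the question to the model. The bag-of-words `overlap` scorer and the `call_llm` stub are hypothetical stand-ins for a real embedding model and API call.

```python
def overlap(a: str, b: str) -> float:
    # Toy relevance score: word overlap, a stand-in for real embedding similarity.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    # Pick the most relevant personal snippets; only these leave the user's device.
    return sorted(docs, key=lambda d: overlap(query, d), reverse=True)[:top_k]

def call_llm(prompt: str) -> str:
    # Hypothetical stub standing in for an API call to the model.
    return f"[model answer based on a prompt of {len(prompt)} characters]"

personal_notes = [
    "2021 project retrospective: the migration slipped because of data quality.",
    "2023 notes: switched the team to weekly demos, morale improved.",
]
question = "Why did the 2021 migration slip?"
context = "\n".join(retrieve(question, personal_notes))
print(call_llm(f"Context:\n{context}\n\nQuestion: {question}"))
```

Even in this toy form, the trade-off is visible: the more context you retrieve and send, the more the model knows about you, and the more of your data ends up on someone else’s servers.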
AI companies need to find a compromise that satisfies both themselves and their customers; otherwise, customers may take the chance to go open source, even if that entails more effort (Llama 3 makes that shift more attractive than ever). Is there a satisfactory middle ground between power and privacy? I don’t think so; if you go big, you go cloud. OpenAI isn’t even trying to make personalization GPT-5’s strength. For one reason: The model will be extremely large and compute-heavy, so forget about local processing and data privacy (most enterprises won’t be comfortable sending OpenAI their data).
There’s something else besides privacy and on-device processing that will unlock a new level of personalization (other companies have already achieved it, Google and Magic in particular, although only Google has publicly released a model with this feature): several-million-token context windows.
There’s a big jump in applicability when you go from asking ChatGPT a two-sentence question to being able to fill the prompt window with a 400-page PDF that contains a decade’s worth of work so that ChatGPT can help you retrieve whatever may be hidden in there. Why wasn’t this available already? Because inference on such long prompts was expensive in a way that grew quadratically with every additional token. That’s known as the “quadratic attention bottleneck.” However, it seems the code has been cracked; new research from Google and Meta suggests the quadratic bottleneck is no more.
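A rough back-of-the-envelope illustration of why that bottleneck bites (my numbers, for intuition only): vanilla self-attention compares every token with every other token, so the work grows with the square of the context length.

```python
# Entries in the attention score matrix per head, per layer, for different context lengths.
for tokens in [4_000, 128_000, 1_000_000]:
    pairs = tokens * tokens
    print(f"{tokens:>9,} tokens -> {pairs:>19,} token pairs per head per layer")

# Growing the window 250x (4k -> 1M tokens) multiplies that matrix by 62,500x,
# which is why sub-quadratic attention variants matter for million-token windows.
```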
Ask Your PDF is a great app once the PDFs can be infinite in length, but there’s something new that becomes possible with million-token windows that wasn’t with hundred-thousand-token windows: the “Ask My Life” category of apps. I’m not sure what GPT-5’s context window size will be, but given that a young startup like Magic seems to have achieved great results with many-million-token windows—and given Altman’s explicit reference to personalization as a must-have AI capability—OpenAI must, at least, match that bet.
Reliability
Reliability is the skeptic’s favorite. I think LLMs being unreliable (e.g. hallucinations) is one of the main reasons why people don’t see the value proposition of generative AI clearly enough to go paid, why growth has stalled and use has plateaued, and why some experts consider them a “fun distraction” but not productivity-enhancing (and when they are, it doesn’t always go well). This isn’t everyone’s experience with LLMs, but it’s sufficiently salient that companies shouldn’t deny reliability is a problem they need to tackle (especially if they expect humanity to use this technology in high-stakes settings).
Reliability is key for any tech product, so why is it so hard to get right with these large AI models? A conceptualization I’ve found useful to understand this point is that things like GPT-5 are neither inventions nor discoveries. They’re best portrayed as discovered inventions. Not even the people closest to building modern AI (much less users or investors) know how to interpret what’s going on inside the models once you input a query and get an output. (Mechanistic interpretability is a hot research area aimed at this problem but still in its early days. Read Anthropic’s work if you’re interested in this.)
It is as if GPT-5 and its ilk were ancient devices left behind by an advanced civilization and we happened to find them serendipitously in our archaeological silicon digs. They’re inventions we’ve discovered and now we’re trying to figure out what they are, how they work, and how we can make their behavior explainable and predictable. The unreliability we perceive is merely a downstream consequence of not understanding the artifacts well. That’s why this flaw remains unsolved despite costing companies millions in customer churn and enterprise doubt.
OpenAI is trying to make GPT-5 more reliable and safe with heavy guardrailing (RLHF), testing, and red-teaming. This approach has shortcomings. If we accept, as I explained above, that AI’s inability to reason is because “Sampling can prove the presence of knowledge, but not its absence,” we can just apply the same idea to safety testing: Sampling can prove the presence of safety cracks, but not their absence. This means that no matter how much testing OpenAI does, they won’t ever be sure their model is perfectly reliable or perfectly safe against jailbreaks, adversarial attacks, or prompt injections.
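To put a number on that asymmetry, here’s a small illustration (my own, not OpenAI’s methodology): if red-teaming runs n independent probes and observes zero failures, the classic “rule of three” says the true failure rate could still be as high as roughly 3/n at 95% confidence. Clean test results bound the risk; they never eliminate it.

```python
def rule_of_three_upper_bound(n_probes: int) -> float:
    # Approximate 95% upper bound on the failure rate after n probes with zero failures.
    return 3.0 / n_probes

for n in [1_000, 100_000, 10_000_000]:
    bound = rule_of_three_upper_bound(n)
    print(f"{n:>10,} clean probes -> true failure rate could still be up to ~{bound:.5%}")
```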
Will OpenAI improve reliability, hallucinations, and external attack vectors for GPT-5? The GPT-3 → GPT-4 trajectory suggests they will. Will they solve them? Don’t count on it.
Agents
This section is, in my opinion, the most interesting of the entire article. Everything I’ve written up to this point matters, in one way or another, for AI agents (with special emphasis on reasoning). The big question is this: Will GPT-5 have agentic capabilities or will it be, like the previous GPT versions, a standard language model that can do many things but not make plans and act on them to achieve goals? This question is relevant for three reasons I’ve broken down below: First, the importance of agency for intelligence can’t be overstated. Second, we know a primitive version of this is somewhat possible. Third, OpenAI has been working on AI agents.
Many people believe agency—described as the ability to reason, plan, and act autonomously over time to reach some goal, using the available resources—is the missing link between LLMs and human-level AI. Agency, even more so than pure reasoning, is the hallmark of intelligence. As we saw above, reasoning is the first step to getting there—a key ability for any intelligent agent—but not enough. Planning and acting in the real world (for AIs a simulated environment can work well as a first approximation) are skills all humans have. Early on we start to interact with the world in a way that reveals a capacity for sequential reasoning targeted to predefined goals. At first, it’s unconscious and there’s no reasoning involved (e.g. a crying toddler) but as we grow it becomes a complex, conscious process.
One way to explain why agency is a must for intelligence, and why reasoning in a vacuum isn’t that useful, is through the difference between explicit and tacit/implicit knowledge. Let’s imagine a powerful reasoning-capable AI that experiences and perceives the world passively (e.g. a physics expert AI). Reading all the books on the web would allow the AI to absorb and then create an unfathomable amount of explicit knowledge (know-what), the kind that can be formalized, transferred, and written down in papers and books. However, no matter how smart at physics the AI might be, it’d still lack the ability to take all those formulas and equations and apply them to, say, secure funding for a costly experiment to detect gravitational waves.
Why? Because that requires understanding the socioeconomic structures of the world and applying that knowledge in uncertain, novel situations with many moving parts. That kind of applied ability to generalize goes beyond what any book can cover. That’s tacit knowledge (know-how); the kind you only learn by doing and by learning directly from those who already know how to do it.10 The bottom line is this: No AI can be usefully agentic and achieve goals in the world without the ability to acquire know-how/tacit knowledge first, however great it might be at pure reasoning.11
To acquire know-how, humans do stuff. But “doing” in a way that’s useful for learning and understanding requires following action plans toward goals, mediated by feedback loops, experimentation, tool use, and a way to integrate all that with the existing pool of knowledge (which is where the kind of targeted reasoning beyond imitation learning that AlphaZero does comes in). So reasoning, for an agent, is a means to an end, not an end in itself (that’s why it’s useless in a vacuum). Reasoning provides new explicit knowledge that AI agents then use to plan and act to acquire the tacit knowledge required to achieve complex goals. That’s the quintessence of intelligence; that is AI’s ultimate form.
This kind of agentic intelligence contrasts with LLMs like GPT-4, Claude 3, Gemini 1.5, or Llama 3, which are bad at carrying out plans satisfactorily (early LLM-based agentic attempts like BabyAGI and AutoGPT or failed autonomy experiments are evidence of that). The current best AIs are sub-agentic or, to use a more or less official nomenclature, they’re AI tools (Gwern has a good resource on the AI tool vs. AI agent dichotomy).
So, how do we go from AI tools to AI agents that can reason, plan, and act? Can OpenAI close the gap between GPT-4, an AI tool, and GPT-5, potentially an AI agent? To answer that question we need to walk backward from OpenAI’s current focus and beliefs on agency and consider whether there’s a path from there. In particular, OpenAI seems convinced that LLMs—or more generally token-prediction algorithms (TPAs), an overarching term that includes models for other modalities, e.g. DALL-E, Sora, or Voice Engine—are enough to achieve AI agents.
If we are to believe OpenAI’s stance, we need to first answer this other question: Can AI agents emerge from TPAs, bypassing the need for tacit knowledge or even handcrafted reasoning features?12
The rationale behind these questions is that a great AI predictor/simulator—which is theoretically possible—must have developed, somehow, an internal world model to make accurate predictions. Such a predictor could bypass the need to acquire tacit knowledge just by having a deep understanding of how the world works. For instance, you don’t learn to ride a bike from books, you have to ride it, but if you could somehow predict what’s going to happen next with an arbitrarily high level of detail, that might be enough to nail it on your first ride and all subsequent rides. Humans can’t do that, so we need practice, but could AI?13 Let’s shed some light on this before moving on to real examples of AI agents, including what OpenAI is working on.
Token-prediction algorithms (TPAs) are extremely powerful. So powerful that the entirety of modern generative AI is built on the premise that a sufficiently capable TPA can develop intelligence.14 GPT-4, Claude 3, Gemini 1.5, and Llama 3 are TPAs. Sora is a TPA (whose creators say it “will lead to AGI by simulating everything”). Voice Engine and Suno are TPAs. Even unlikely examples like Figure 01 (“video in, trajectories out”) and Voyager (an AI Minecraft player that uses GPT-4) are essentially TPAs. But a pure TPA is perhaps not the best solution for everything. For instance, DeepMind’s AlphaGo and AlphaZero aren’t TPAs but, as I said in the “reasoning” section, a clever combination of reinforcement learning, search, and deep learning.
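To keep the term concrete, here’s a toy illustration of the objective every TPA shares, from GPT-4 to Sora: assign probabilities to the next token and minimize the negative log-likelihood of what actually comes next. The bigram counter below is obviously nothing like a transformer trained on the internet; it only shows the shape of the objective.

```python
from collections import Counter, defaultdict
import math

corpus = "the cat sat on the mat the cat ate the fish".split()

# "Training": estimate P(next token | current token) by counting bigrams.
counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    counts[current][nxt] += 1

def next_token_prob(current: str, nxt: str) -> float:
    total = sum(counts[current].values())
    return counts[current][nxt] / total if total else 0.0

# The objective every TPA minimizes: average negative log-likelihood of the next token.
nll = -sum(math.log(next_token_prob(c, n)) for c, n in zip(corpus, corpus[1:]))
print(f"avg NLL per token: {nll / (len(corpus) - 1):.3f}")
print("P(next='cat' | 'the') =", next_token_prob("the", "cat"))
```

The debate below is whether scaling this objective far enough also forces the model to learn the world-model and agency we care about, or whether something structurally different is needed.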
Can an intelligent AI agent emerge out of a GPT-5 trained like GPT-4, as a TPA, or is it the case that to make GPT-5 an agent OpenAI needs to find a completely different function to optimize or even a new architecture? Can a (much) better GPT-4 eventually develop agentic capabilities or does an AI agent need to be something else entirely? Ilya Sutskever, the scientific mind behind OpenAI’s earlier successes, has little doubt about the power of TPAs:
… When we train a large neural network to accurately predict the next word in lots of different text from the internet … we are learning a world model … it may look on the surface that we are just learning statistical correlations in text but it turns out that to “just learn” statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produces the text. This text is actually a projection of the world… this is what’s being learned by accurately predicting the next word.
Bill Peebles, one of the Sora creators, went even further in a recent talk:
As we continue to scale this paradigm [TPAs], we think eventually it’s going to have to model how humans think. The only way you can generate truly realistic video with truly realistic sequences of actions is if you have an internal model of how all objects, humans, etc., environments work.
You may not buy this view but we can safely extrapolate Sutskever and Peebles’ arguments to understand that OpenAI is, internal debates aside, in agreement. If successful, this approach would debunk the idea that AIs need to capture tacit knowledge or specific reasoning mechanisms to plan and act to achieve goals and be intelligent. Perhaps it’s just tokens all the way.
I don’t buy OpenAI’s view for one reason: They don’t bypass the tacit knowledge challenge. They simply move it somewhere else. Now the problem is not learning to reason, plan, and act but simulating worlds. They want to solve, quite literally, precognition. Peebles goes over this so casually that it seems unimportant. But, isn’t it even harder to create a perfect predictor/simulator than an entity that can plan and act in the world? Is it even possible to create an AI that can simulate “truly realistic sequences of actions,” as Peebles claimed in his talk? I don’t think so—I don’t think we can build that and I don’t think we could assess such an ability anyway. Perhaps OpenAI’s trust in and reliance on the Bitter Lesson go too far (or perhaps I’m wrong, we’ll see).
Anyway, AI companies’ options are narrow nowadays—no one knows how to build systems that can plan and act, although Yann LeCun keeps trying—so they’re approaching the agency challenge, OpenAI included, with transformer-based TPAs in the form of LLMs, whether they like it or not, because it’s the best technology they have at their disposal. Let’s start with existing prototypes and then jump to what we know about OpenAI’s efforts.
Besides the examples I shared above (e.g. BabyAGI, AutoGPT, Voyager, etc.) there are other LLM-based agentic attempts. The first one that grabbed my attention was pre-ChatGPT. In September 2022, Adept AI announced the first version of what they called the Action Transformer, “a large-scale transformer trained to use digital tools” by watching videos of people. They released a few demos, but little beyond that. A year ago two co-founders left the company, which isn’t a good sign at all (The Information reported that Adept is preparing the launch of an AI agent in the summer. We’ll see how it goes). Another young startup that has recently joined the AI agents gold rush is Cognition AI, best known as the creator of Devin, “the first AI software engineer” (which now has an open-source cousin, OpenDevin). It went well at first, but then a review video entitled “Debunking Devin” came out and went viral by exposing Cognition’s overhyping of Devin’s abilities. The result? Cognition had to publicly acknowledge that Devin isn’t good enough to “make money taking on messy Upwork tasks.”
Those are purely software agents. There’s another branch, admittedly even harder to pull off: AI agent devices. The best-known examples are the Rabbit R1 and the Humane AI Pin. The reviews of the R1 are coming out around the same day this post is scheduled for publication, so we’ll wait for them. The reviews of the Humane AI Pin came out last week and they’re absolutely devastating. In case you didn’t read my “Weekly Top Picks #71,” you can read The Verge’s review here or watch Marques Brownlee’s here.
Just know that the conclusion, taking into account all the above evidence, is that LLM-based AI agents aren’t a thing yet. Can OpenAI do better?
We know very little about OpenAI’s attempts at agents. We know that Andrej Karpathy was “building a kind of JARVIS” before he left OpenAI (why would he leave if he was working at the best AI company on the most promising future for AI?) Business Insider reported that GPT-5 will have the “ability to call AI agents being developed by OpenAI to perform tasks autonomously,” which is as vague as it gets. The Information reported some new info earlier this week:
OpenAI is quietly designing computer-using agents that could take over a person’s computer and operate different applications at the same time, such as transferring data from a document to a spreadsheet. Separately, OpenAI and Meta are working on a second class of agents that can handle complex web-based tasks such as creating an itinerary and booking travel accommodations based on it.
But even if these projects succeeded, this isn’t really what I described above as AI agents with human-like autonomous capabilities that can plan and act to reach goals. As The Information says, companies are using their marketing prowess to dilute the concept, turning “AI agents” into a “catch-all term,” instead of backing off from their ambitions or rising up to the technical challenge. OpenAI’s Ben Newhouse says they’re building what “could be an industry-defining zero to one product that leverages the latest and greatest from our upcoming models.” We’ll see about that.
As a conclusion to this subsection on agents, I believe OpenAI isn’t ready to make the final jump to AI agents with its biggest release just yet. A lot of work is left to be done. TPAs, despite being the only viable approach for now (until the reasoning challenges I described above are solved), won’t be enough by themselves to achieve the sought-after agentic capabilities in a way that would make people consider using them for serious projects.
I bet GPT-5 will be a multimodal LLM like those we’ve seen before—an improved GPT-4 if you will. It’ll probably be surrounded by systems that don’t yet exist in GPT-4, including the ability to connect to an AI agent model to perform autonomous actions on the internet and your device (but it’ll be far from the true dream of a human-like AI agent). Whereas multimodality, reasoning, personalization, and reliability are features of a system (they will all be improved in GPT-5), an agent is an entirely different entity. GPT-5 doesn’t need to be an agent to enjoy the power of agency. It will likely be a kind of primitive “AI agent manager,” perhaps the first we collectively recognize as such.
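If that’s the direction, the product shape might look roughly like the sketch below: the main model never acts in the world itself; it routes sub-tasks to specialized agents and assembles their results. Every name here (`route`, `web_agent`, `computer_agent`) is made up for illustration; nothing in it reflects OpenAI’s actual design.

```python
from typing import Callable

def web_agent(task: str) -> str:
    # Stub for an agent that browses, books, and fills web forms.
    return f"[web agent result for: {task}]"

def computer_agent(task: str) -> str:
    # Stub for an agent that operates local applications (documents, spreadsheets).
    return f"[computer agent result for: {task}]"

AGENTS: dict[str, Callable[[str], str]] = {"web": web_agent, "computer": computer_agent}

def route(task: str) -> str:
    # Stand-in for the manager model deciding which agent (if any) handles a sub-task.
    return "web" if any(w in task.lower() for w in ("book", "search")) else "computer"

def manager(goal: str, subtasks: list[str]) -> str:
    # The manager model delegates, then would verify and summarize results for the user.
    results = [AGENTS[route(t)](t) for t in subtasks]
    return f"Goal: {goal}\n" + "\n".join(results)

print(manager(
    "Plan a research trip",
    ["Search flights and book accommodation", "Transfer the itinerary into my spreadsheet"],
))
```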
OpenAI will integrate GPT-5 and AI agents at the product level to test the waters. They will also not release GPT-5 and the AI agent fleet at once (as a precedent, GPT-4 and GPT-4V were kept separate for a while). I assume OpenAI considers the agentic capabilities harder to control than “just” a better multimodal LLM, so they will roll out AI agents much more slowly. Let me repeat, with emphasis, the above quote by Newhouse to make it clear why I believe this is the case: “We’re building what … could be an industry-defining zero to one product that leverages the latest and greatest from our upcoming models [emphasis mine].” A product (AI agent) that leverages the greatest from the upcoming models (GPT-5).
In closing
So that was it.
Congratulations, you just read 14,000 words on GPT-5 and surroundings!
Hope it helped you get a better understanding not just of GPT-5 itself (we’ll get the full picture once it’s out) but of how to think about these things, the many parts that have to move in harmony to make it possible, and the many considerations needed to form a clearer picture of the future.
It was a fun experiment to dive this deep into a topic (if you like super long-form articles like this, I’ll do more as time permits). Hope it was fun, interesting, and useful for you as well.
1. Despite Meta having been working on AI before OpenAI even existed, they got off to a slow start in the LLM race (i.e. Zuck and the metaverse), so they decided to lean on the open-source community, which is perhaps the better moat. They’re not really trying to surpass OpenAI because they’re not true competitors. Instead, they’re commoditizing the largest models by making them available for everyone (i.e. everyone who can fit them into their local GPUs; serious researchers, enterprises, and other big companies, like Apple, that may have gotten even later into the generative AI boom). If you can download Llama 3 to your PC, why would you pay OpenAI for a similar GPT-4-class model? Now, the question: Is Llama 3 the best Meta can do? For now, it is. But Zuckerberg hinted on the Dwarkesh Patel podcast that they might already be thinking about Llama 4. Can they iterate faster than OpenAI? They don’t care; it’s OpenAI that’s working against the clock.
2. OpenAI is much better than the others at something, but that’s not building models. They’re much better at understanding their audience, at Marketing 101. Google is too rigid, bureaucratic, uncharismatic, and unnatural. While Altman writes in lowercase and Mistral drops magnet links, Google does everything through official releases. They’re boomers who haven’t noticed it yet. Anthropic is closer to OpenAI (they were the same thing once) but they’re too quiet, too press-shy. That’s good for avoiding distraction, but it harms the company’s relationship with the public. Can their products speak for them? Perhaps. Is Altman wielding a double-edged sword? Maybe, if people grow tired of his presence. Still, OpenAI’s marketing tactics have worked wonders so far.
3. The open release of Llama 3 and particularly the amazing eval scores of the 405B model—which is still training but already at GPT-4-class level (Meta has provided numbers for the April 15th checkpoint)—may force OpenAI’s hand and push them to advance the next releases much sooner than they’d want, even as soon as the coming weeks, if they’re willing to cut down on safety for the sake of exploiting a competitive advantage and, more critically, if they’re willing to let the world believe they’ve become reactive (I feel this isn’t Altman’s style).
4. Speculation time: What if the doubling down on iterative deployment and the “releasing GPT-5 in a different way” Altman mentioned are coming together here? What if, as I hypothesized, OpenAI is releasing those medium-sized parts as standalone things—e.g. Sora, Voice Engine, GPT-4 math reasoning improvements—to test them without hinting at GPT-5, only to bring them all together into “one giant thing” at the end of the year?
5. OpenAI may be doubling down (or tripling down) on enterprise customers, who prefer an expensive high-quality service over a cheap one. The startup faces a trade-off between casual users, who care less about the 0.1% edge failures but can’t afford to pay much, and enterprise users, who often require ~100% reliability and are willing to pay 10x or more for a robust patch on that 0.1%.
6. This categorization is for the sake of clarity, not set in stone: things that matter for reasoning matter for agents and vice versa; you can’t personalize a product if it isn’t reliable; and how can you expect an AI to plan and act in the world if it doesn’t have a pair of eyes and ears? So don’t take this separation to heart; these things are deeply entwined.
7. Let me remind you that although most AI people love the scaling laws and the Bitter Lesson, it’s undeniable that algorithmic breakthroughs—advances that researchers and engineers come up with after grinding day and night and squeezing their very human ingenuity—are the only way to resolve the obstacles brute-force approaches can’t handle. You can build a planet-size datacenter to host a gargantuan token-predicting language model but if you don’t give it eyes, it won’t see.
8. If you believe “humans use the same reasoning tricks and shortcuts AI does” is a good counter, I recommend you read this excellent essay by Melanie Mitchell: “Can Large Language Models Reason?”
9. Research suggests that fine-tuning GPT-4 on Q*’s outputs can be done successfully. I just don’t think they’ll stop there, as that approach doesn’t create the AI reasoning dream that both Brown and Karpathy want but merely upgrades imitation learning.
10. The explicit-tacit knowledge gap is the source of the book-smart vs. street-smart dichotomy; bookworms know a lot of facts and info, but the real world has a knack for thwarting their plans, whereas default-to-action, high-agency people tend to find a way out.
11. Michael Polanyi said humans “can know more than we can tell,” in contrast with animals, which can tell us nothing about the things they know; such a physics-all-knowing AI would be the opposite of animals: capable of talking about everything it knows but incapable of knowing anything beyond what it can talk about.
12. We can even reformulate this question to put humans at the center: When humans, the epitome of natural agency, plan and act toward some goal, are they just predicting reality? Karl Friston’s Free Energy Principle, a framework that aims to explain how the brain works, is, hyper-simplifying, this idea: The brain is a surprise-reducing machine that makes predictions based on internal models and updates them using sensory input.
13. “Predicting what’ll happen next” sounds delusional, but it’s not that different from how our evolutionary endowment makes us so adept at throwing and catching things.
14. The hypothesis that the details that make humans intelligent can be obviated in AI (e.g. by using only TPAs to achieve agency) is a common one. Other examples are the scaling laws, Sutton’s Bitter Lesson, and another paper Sutton co-authored, “Reward is enough,” in which DeepMind researchers argue that intelligence can emerge from the maximization of reward through reinforcement learning.