As an extended Christmas gift, OpenAI is holding a 12-day shipping streak: every day they will announce, demo, or release something new.
Today is day one and they’ve given us a heavyweight release: ChatGPT Pro and the full o1 model (here’s o1’s system card). The presentation was short, but they seem to have turned the tide on the competition (we’ll have to wait for independent testing results).
Let’s start with the new model, o1.
On September 12, OpenAI announced o1-preview, an AI model like GPT but with the ability to reason (thanks to reinforcement learning through chains of thought) and think through your questions (thanks to more test-time compute).
Most importantly (at least from a scientific point of view), o1-preview was the first step toward replacing the old pre-training scaling laws, which we’ve learned are plateauing, with a novel set of laws that scale test-time compute instead of training compute. The change is best depicted in this graph:
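For intuition, here’s a stylized rendition of the two regimes that graph contrasts (my notation and constants, not a published formula):

```latex
% Pre-training era: loss falls as a power law in training compute,
% with returns that flatten as C_train grows (the "plateau"):
\[
  L(C_{\text{train}}) \approx a \, C_{\text{train}}^{-\alpha}
\]
% o1 era: benchmark accuracy climbs roughly log-linearly in
% test-time compute, per the trend in OpenAI's plot:
\[
  \text{Acc}(C_{\text{test}}) \approx \beta \log C_{\text{test}} + \gamma
\]
% a, alpha, beta, gamma are illustrative constants, not published values.
```

The first curve flattens as training compute grows; the second keeps paying off as you let the model think longer at inference time.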
But o1-preview was never the definitive model. OpenAI insisted it paled in comparison to the full thing, o1, which they’re releasing today. Plus and Team users are getting it as a replacement for o1-preview right away; Enterprise and Edu users will get it next week. In terms of benchmark performance, this is the difference between the three o1 versions (we’ll get to o1 pro mode soon):
The above depicts pass@1 accuracy, which reflects how well the models did on their first attempt at the test questions. OpenAI repeated this comparison between the three o1 versions on a stricter metric they call “4/4 reliability,” which requires the model to answer correctly on all four of four attempts:
Here, o1 and o1 Pro shine compared to o1-preview. Remember that the previous best reasoning models were o1-preview and two Chinese models, DeepSeek’s R1-Lite-Preview and Qwen’s QwQ, both around o1-preview’s level. This graph puts them to shame, especially on math and coding problems.
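If you’re unfamiliar with these metrics, here’s a minimal sketch of how they’re computed (my toy harness with made-up results; OpenAI’s actual evaluation setup may differ):

```python
def pass_at_1(attempts_per_question):
    """Fraction of questions answered correctly on the first attempt.

    `attempts_per_question` maps each question to a list of booleans,
    one per sampled attempt (True = correct).
    """
    return sum(a[0] for a in attempts_per_question) / len(attempts_per_question)

def four_of_four(attempts_per_question):
    """Stricter '4/4 reliability': all four sampled attempts must be correct."""
    return sum(all(a[:4]) for a in attempts_per_question) / len(attempts_per_question)

# Hypothetical results: 3 questions, 4 graded attempts each.
results = [
    [True, True, True, True],     # reliably solved
    [True, False, True, True],    # solved on attempt 1, but not reliably
    [False, True, False, False],  # missed on attempt 1
]
print(pass_at_1(results))     # 0.67 -> questions 1 and 2 count
print(four_of_four(results))  # 0.33 -> only question 1 survives
```

The gap between the two numbers is the point: a model can get lucky once (pass@1) far more easily than four times in a row (4/4 reliability).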
Hands-on comparisons between o1-preview and o1 will start to arrive soon (if people dare pay the price!). OpenAI’s Noam Brown had o1 generate an entire essay without the letter “e”, probably to mock those still asking ChatGPT to count the R’s in “strawberry” (a task GPT-4o fails):
This one is also quite impressive:
What’s the value of this? I’m not sure yet. I will update this post once I’ve gathered more examples. For now, let me share o1’s qualitative features that OpenAI demoed during the live stream:
o1 is better. The benchmark results say it all. OpenAI also says that “o1 outperforms o1-preview, reducing major errors on difficult real-world questions by 34%.” It’s quite a vague statement (what counts as a “real-world question”?) but I imagine users, at least those for whom o1 is intended, will notice the improvement, especially on harder prompts.
o1 is adaptive. It answers easy questions faster and hard questions slower. This was one of o1-preview’s main flaws: I’m constantly switching between GPT-4o and o1-preview because the latter, regardless of the question, takes plenty of time to answer. That’s too much friction for me. If OpenAI plans to move forward with the family of reasoning models (eventually even deprecating the GPT line), then this unnecessary slowness had to be patched. They did just that: o1 can be 50-60% faster than o1-preview when the question doesn’t require much thinking.
o1 is multimodal. State-of-the-art models must be multimodal. GPT-4o is multimodal, as are Google’s Gemini 1.5 Pro, Anthropic’s Claude 3.5 Sonnet, and Meta’s Llama 3.2. Now, OpenAI o1 is as well. The power of o1’s multimodal capabilities is that it can reason with images as it does with text. OpenAI researchers showed how it solved a thermodynamics problem from a drawing.
o1 is not just for math. OpenAI researchers explicitly said that o1 is not just better at math or coding but also at everyday tasks (including writing). I wonder how this assertion translates to the average user. It may just be a way to lure in people who feel alienated by o1-preview’s focus on math and coding.
o1 has been a good model. From the system card: o1 shows “state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks.” This means it’s more compliant with OpenAI’s policies than its predecessors. It was also judged safer than GPT-4o 60% of the time. However, being more intelligent makes o1 more dangerous once you get past its stronger behavioral guardrails. I guess we gotta wait for Pliny.
The other news is ChatGPT Pro.
Pro is a new tier above Plus—way above—at $200/month (Plus stays at $20/month). The question we’re all asking ourselves is: How does OpenAI justify such a high price?
Unlimited access to the best models: o1, o1-mini, GPT-4o, and Advanced Voice.
Exclusive access to o1 pro mode. Do the benchmark results above justify the 10x price jump from Plus to Pro? I’m not sure. o1 pro mode is only slightly better than the standard o1, and mainly on problems that demand it (hard math, hard code, PhD-level science questions…). OpenAI didn’t disclose any specific difference, so I advise you to imagine there’s none. Average users just don’t need o1 pro mode.
OpenAI said they “expect to add more powerful, compute-intensive productivity features to this plan.” (Who’s paying $200 for a promise?)
Whether Pro is worth it remains to be seen. Unlimited access is attractive for power users who work with ChatGPT on hard problems and hit the message limit every single day. Otherwise, I don’t see it. And even for ChatGPT addicts, the API may be a much better option (of course, most people don’t know what an API even is, so for the average user that’s out of the question).
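For a rough sense of that tradeoff, here’s a back-of-the-envelope break-even sketch. The prices are the o1-preview API list prices at the time of writing ($15 per million input tokens, $60 per million output tokens; the full o1 wasn’t in the API on day one), and the usage numbers are hypothetical, so plug in your own:

```python
# Break-even sketch: ChatGPT Pro ($200/month) vs. calling o1-preview via the API.
# Prices are the o1-preview list prices at the time of writing; check before relying on them.

PRO_MONTHLY = 200.00    # USD, ChatGPT Pro subscription
INPUT_PER_M = 15.00     # USD per 1M input tokens (o1-preview)
OUTPUT_PER_M = 60.00    # USD per 1M output tokens (o1-preview)

def api_cost(input_tokens: int, output_tokens: int) -> float:
    """Monthly API cost in USD for the given token volumes."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# A hypothetical power user: 100 prompts/day, ~2k input and ~4k output tokens
# per prompt (hidden reasoning tokens are billed as output), 30 days a month.
monthly_in = 100 * 2_000 * 30   # 6M input tokens
monthly_out = 100 * 4_000 * 30  # 12M output tokens

cost = api_cost(monthly_in, monthly_out)
print(f"API: ${cost:.0f}/mo vs. Pro: ${PRO_MONTHLY:.0f}/mo")  # API: $810/mo vs. Pro: $200/mo
```

At that (extreme) volume, unlimited access wins by a wide margin; at a tenth of it, the API costs around $81 a month and wins instead. The break-even point is the only number that matters here.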
5 implications from these announcements
A release without context is useless. I want to give you a preview of what I think all this means for OpenAI, its competitors, and the future of the field.