OMG ALBERTO what the heckkkk! This is absolutely insane. If I had someone in my life that actually understood this I’d be shaking them right now screaming “it’s happening”.
The next few months and years are going to be incredibly interesting indeed.
One angle on this: o3 still requires fine-tuning on certain domains for best performance (for example, the ARC-AGI test). But I guess once you have models trained up for each domain, and potentially have them communicate, that could be a sufficient way to create a more generalised model?
That's their goal: scale up test-time compute, then train models up to GPT-5 level and beyond.
Very convincing data. Slightly terrifying that these models can start hoovering up budget that would otherwise have gone to programmers.
This maths focus makes me think that one reason AIs are still average to poor at tasks in my field of market research (e.g. the questionnaires they write generally contain 1–2 critical errors) is that no attention has been paid to my small field. What comes after all the maths tasks, I wonder…
100%. Thankfully, I don't think they will pay attention to everything. The question is: can a super smart model figure out a way to never make mistakes?
I guess it depends on what they are trained on. In research surveys you would need a theory of mind and also close contextual technical knowledge to avoid any mistakes, e.g. identifying what would be a leading question and what would be relevant answer codes. I personally don’t think any model could get to 100% on that, as the mistakes are blurry and at least partially subjective. Maybe the “fuzzy” professions are best protected.
But then the interesting question for me is which fields a model could get to “no mistakes” on. E.g. is coding a website with no bugs a realistic objective? Or creating a protein chemistry research program? Because if that kind of scientific/coding performance becomes good enough, it would be earth-shattering.
Yes 🥰
“Or creating a protein chemistry research program? Because if that kind of scientific/coding performance becomes good enough, it would be earth-shattering.”
Excellent summary - I spent yesterday evening (I’m in Germany) tracking down all the quotes you have in your article, and together with the o3 performance results they made it very clear that OpenAI has just performed another “quantum leap” - a fitting conclusion to the 2024 leap year. It will be interesting to see all the frantic attempts by the “stochastic parrot” crowd to move the goalposts, just so that the “crown of creation” can feel smug about themselves and go back to navel gazing, head bashing, environmental destruction, and illogical theorizing about made-up phenomena. Oh, and in Europe, our “friends in Brussels” are now looking at the Brussels sprouts mess called the EU AI Act, which needs to be rewritten / renegotiated / re-whatevered prontissimo. While they are at it, they should also re-examine their confabulation that the Act will serve as a springboard for European innovation.

Just one little quibble about percentages. In discussing the SWE-bench Verified results, you state: “We’ve never seen a direct 20% jump before …”. This is an understatement: we see a jump of 20 percentage points, from 51% to 71%, which is a 20/51 or roughly 40% relative jump.

Anyway, Frohe Weihnachten, Feliz Navidad, Happy/Merry Christmas, and a smooth transition into 2025.
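The percentage-points vs relative-percent distinction above is easy to verify with a quick calculation, using the 51% → 71% SWE-bench Verified figures quoted in the comment:

```python
# Absolute vs. relative improvement for a benchmark score
# going from 51% to 71% (the SWE-bench Verified figures above).
old, new = 0.51, 0.71

absolute_pp = (new - old) * 100          # difference in percentage points
relative_pct = (new - old) / old * 100   # percent relative to the old score

print(f"{absolute_pp:.0f} percentage points")   # 20 percentage points
print(f"{relative_pct:.0f}% relative jump")     # 39% relative jump
```

So the same improvement reads as "20 points" or "roughly 40%" depending on the baseline chosen.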
Epoch has said that Tao only saw the hardest tier of questions; presumably o3 solved easier tier questions. See https://www.reddit.com/r/OpenAI/s/VMk8wzY8P4
Still extremely impressive though!
Thank you for the info! Didn't catch that and it's important (yes, I believe o3 solved the easier part of the test).
Much more info in this thread: https://x.com/ElliotGlazer/status/1871812179399479511
Nice post! I will be excited when they show the same level of effort and interest in AI safety and alignment work.
Hard pass on OpenAI products till then for me.
It is clear that OpenAI does not even care about the near-term job displacement this poses for society. Disgusting and unethical as heck, imo.
I realised in August that I won't have a career at this rate. I am a media buyer (Google Ads) and web designer. Fascinating and exciting, and terrifying.
All three. Agreed. (Good luck with your job; not everyone will lose theirs. The most important thing now is being aware.)
I have to say that the numbers look impressive and even surprising.
Maybe Alberto could help us reason through some of the consequences for the use of AI if these numbers hold up on other real-world tasks, which is what we all want to get to ASAP.
So, Alberto, here's the prompt: what do you see as the likely consequences of this advance?
I will write more on these results soon. I need to process what just happened and how it changes my views. For now, I recommend you read Chollet's take on this. He's the one with no reason to hype anything. The others are interesting too, but their incentives point the opposite way.
The world is forever changed right in front of our eyes. AGI is knocking at the door.
An interesting ending to 2024
Lmao, beat you by 8 minutes 🤣.
Wishing you some rest before 2025!!!
Hahaha, we're that excited, aren't we? Just wow, right? I don't know what you're hearing over there, but I was very impressed. Didn't expect these results. I'm not sure how to interpret them or what to expect in the short term. Amazing announcement.
Most of my reasonable friends have spent all day freaking out. Numbers are bonkers, pace of progress is bonkers.
2025 will give us no rest!!
Not excited, tbh, since you can hit that metric yourself using multiple inference calls to resolve errors. With o3 it’s more automatic. If it costs 3x as much but solves 10% more use cases, is it worth it? No data yet (but OpenAI would prefer you get super excited).
you should be lol
When you have transparent data, lmk if I should reconsider my take that this is marketing for a series of good model-call loops.
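The "3x the cost for 10% more solved" trade-off raised above can be framed as cost per solved task. A minimal sketch with hypothetical numbers: only the 3x multiplier and the +10-percentage-point gain come from the comment; the baseline cost and solve rate are assumptions for illustration.

```python
# Cost per solved task: is 3x the cost worth +10 points of solve rate?
# Baseline cost and solve rate are assumed, not measured.
baseline_cost = 1.0      # assumed cost per attempt (arbitrary units)
baseline_rate = 0.50     # assumed fraction of tasks solved

o3_cost = 3 * baseline_cost          # "costs 3x as much"
o3_rate = baseline_rate + 0.10       # "solves 10% more use cases"

cost_per_solve_baseline = baseline_cost / baseline_rate   # ≈ 2.0
cost_per_solve_o3 = o3_cost / o3_rate                     # ≈ 5.0

print(round(cost_per_solve_baseline, 2), round(cost_per_solve_o3, 2))
```

On these assumed numbers the pricier model costs more per solved task; the calculus only flips if the extra 10% covers tasks the cheaper retry loop cannot solve at any number of calls.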
Frank Herbert tried to warn us.
I don't understand you and people like you. The whole idea of intelligence is to perform something without being trained on it. Just because the authors of ARC-AGI claimed "if you can beat this, you've achieved AGI" doesn't make them credible to say so.
Besides, what's the point of measuring ARC-AGI results if you have trained the model for it specifically?
This reminds me of (bad) students who gathered copies of earlier tests to figure out a pattern in the tests.
Are we trying to pass the test or have an intelligence breakthrough here? I'm disappointed.
Do you even know anything about this topic? All AI models are trained. Literally. All humans are trained. Literally.
It's been 30 years since I wrote my first chatbot using neural networks.
Learning and training are not intelligence! Intelligence is an innate, untrainable ability. This is ridiculous.
30 years since? Maybe that's the problem haha
The definition of intelligence is the ability to overcome unknown and unseen scenarios. This definition prevents you from using training and learning to "boost" intelligence. The keywords are unseen and unknown, which a priori makes training and learning useless, leaving innate intelligence as the quality behind this ability.
Maybe someday you'll understand, but I doubt it.
Also, the MLP (multilayer perceptron) has not changed in more than 30 years, which speaks volumes about your ignorance of the topic.
What do you think human toddlers do to become capable of overcoming unknown and unseen scenarios? They observe (training data), imitate (training data), experiment (feedback loop), etc. And what about the innate structure in our brains? Isn't that millions of years of evolutionary adaptation? But sure, human intelligence has nothing to do with training.
Once again, you conflate learning with intelligence. You don't need learning for intelligence, and likewise you don't need intelligence for learning. It certainly helps, but it's not mandatory. That's why we call it machine learning and not machine intelligence. Are you incapable of understanding the simple and basic premise of novelty? Something for which you haven't been taught, trained, or prepared. That's intelligence by definition, and examples are everywhere throughout history.
Thanks Alberto, glad to have found your Substack!
What the actual F? Coders gonna be obsolete?
Naming it o3 because of legal restrictions on "o2" is still confusing! Merry Xmas!
If you examine the 3 failures of o3 on the ARC site, they are still astoundingly simple. An indication that o3 is doing nothing like human general intelligence. ARC thinks o3 is using small natural-language programs that tell the LLM what to do. The implication: if no such program exists that fits an input-output grid, then failure - which makes sense of the otherwise head-scratching question of why their instruction “find the mapping rule” failed in these absurdly simple cases. Where did these programs come from? Did OpenAI spend the last 4 years writing them? In any case, this is hardly impressive as AGI. ARC notes they have an ARC-2 version ready, on which they think o3 will score around 30%, and a third, much harder version coming. If you guys are interested in a different view of the general problem re human vs AI on these, search “ARC-AGI: Why a thorn for AI?” (video).
OpenAI staff stated in unambiguous terms that o3 is not doing that. It's an LLM with reinforcement learning over the CoT process. No MCTS, no program synthesis, etc. Just a standard autoregressive language model that has been given time to think. That said, I agree it should not fail those ARC-AGI tasks. And it also should not require that ludicrous amount of compute to solve the others. Something is clearly missing. I want to see what kind of challenge ARC-AGI v2 presents.