27 Comments
Dan McRae:

Another great article. I was able to use Grok 3-early on the chatbot arena today. On a geopolitical question, its response was _amazing_ .

Rob Bru:

Thoughtful insight thank you.

poster:

What I do not understand is: why hasn't a scaling law evolved for run-time contemplation? We have had pre-training scaling and we have had post-training reasoning scaling, so why hasn't there also been scaling of the model contemplating itself, with the goal of maximizing something such as human welfare? Funnily enough, I ran this by a frontier LLM and it found the idea quite compelling. It agreed that endlessly training on more and more data to create bigger and bigger models, without any clear purpose to which the LLM itself could put the model it has created, seems somewhat futile. There must be something the LLM could actually learn from all of this information. It is as if AI had memorized and digested every insight by every human who ever lived, yet never developed the ability to consider this information deeply and find contradictions or optimizations. So, at some point, they might just present a prompt to a frontier model, ask it to think about that prompt for a few months with a few hundred thousand H100s, and see what it comes up with.

Alberto Romero:

It's a good question. Several reasons. For test-time compute to work you first need a very good base model, so pre-training scaling must be in place before you can scale post-training. Also, it's been tried many times but only recently did they find it to work. I'm not sure why, but it's probably related to the above and also to accumulated expertise (RL can be tricky when the rewards aren't easily defined; it's easier in chess than in general reasoning). The final reason is what I say in the article: when you can keep going by adding to your primary resource, you're not incentivized to go out of your way to get a bit of a secondary resource. Pre-training scaling laws were working, so why stop that and explore a far-from-proven route to see if test-time compute could work?
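To make the reward point concrete, here is a minimal toy sketch (hypothetical functions, not any lab's actual pipeline) of why the reward is easy to define in chess and hard for general reasoning:

```python
# Toy contrast between a verifiable reward (chess) and a fuzzy proxy reward
# (general reasoning). Both functions are hypothetical illustrations.

def chess_reward(game_result: str) -> float:
    # The outcome is verifiable and unambiguous, which is ideal for RL.
    return {"win": 1.0, "draw": 0.5, "loss": 0.0}[game_result]

def reasoning_reward(model_answer: str, reference: str) -> float:
    # No ground-truth verifier exists for open-ended reasoning, so any proxy
    # (word overlap here, or a judge model in practice) is noisy and gameable.
    ref_words = set(reference.split())
    overlap = len(set(model_answer.split()) & ref_words)
    return overlap / max(len(ref_words), 1)

print(chess_reward("win"))                          # 1.0
print(reasoning_reward("the answer is 42", "42"))   # crude score, easily gamed
```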

Gordon Freeman:

Excellent summary of these critical issues. Now I’ve got to spend some time with Grok 3 and see what all the fuss is about…

Barry Chudakov:

Another outstanding commentary. Kudos!

Alberto Romero:

Thanks Barry!

craig meagher:

Nice post

Alberto Romero:

Thank you Craig!

Dan Gooden:

I always appreciate your analysis, but applauding Elon Musk is perhaps a step too far for me. Technology doesn’t exist in a vacuum. Acknowledging the broader risks of any venture led by Elon, in the tone of your analysis, feels important.

Alberto Romero:

Why are people so reactive to anything that contains the words "Elon Musk"? I don't applaud Musk, just as I don't applaud Altman. Sometimes they do things right if you look at them from a technological standpoint. Are they unethical? Mostly, yes. I don't feel the need to include a footnote in every article that mentions them saying: *they are still unethical.*

Simple John:

Yep. And Full Self Driving is just a mega compute away.

Time to unsubscribe.

Alberto Romero:

Thank you for being a sub, John.

Justin Hillstead:

You are definitely not a real person, huh.

White Shoals:

I must be missing something. DeepSeek was able to produce a model that is nearly as good as Grok 3 at a fraction of the cost. For the sake of argument, let’s say DeepSeek is 10x-100x cheaper. That level of cost reduction is textbook disruptive innovation. As a complete non-expert in AI, it looks to me like these models are all converging onto a similar level of performance, with the main distinction being cost. Where does the bitter lesson fit into this? I understand that scale will always be better than no scale. But at some point cost will matter.

What am I missing? Is DeepSeek actually a lot more expensive? Are cost optimizations unlikely to be replicated? Are there no diminishing returns to scaling LLMs?

Alberto Romero:

I should have gone into more detail on this point, I guess. Yes, DeepSeek is more cost-efficient, but no, models are not converging. It only looks that way because current benchmarks are saturating, so going from 70 to 80 to 90% doesn't look as visually striking as a 0-to-50% jump. Cost always matters, and Musk and co. will probably do something about that; so far, they're just bragging that they have more GPUs than anyone else. The reason Grok 3 isn't better yet is that xAI, like DeepSeek, is barely two years old! They haven't had that much time to do better. They focused on building a giant cluster whereas DeepSeek focused on optimizing the stack. Two valid approaches that worked out. For the next iteration they should reverse roles. The reason this is pro-bitter-lesson is that more compute was worth more than the many tricks DeepSeek pulled. Those tricks are valid anyway, but the lesson has been vindicated at a time when many were saying DeepSeek was proof it was wrong.

White Shoals:

That makes sense. Thank you for explaining :)

I guess I need to learn more about model convergence, or more accurately how models are improving over time. Anecdotally, ChatGPT seems to have gotten smarter over the past six months.

Alberto Romero:

Indeed, OpenAI updated GPT-4o very recently, and now that I've had more time to test it I can tell it's better. No one knows what they did to it, though.

Mauricio Ramírez:

The computing scale of Grok over DeepSeek is not reflected in the quality measured in the benchmarks. The prowess of the Chinese engineering is superb, at a scale 10x, 100x, or even 1,000x beyond the US companies. I believe Elona has lost the race. Too little, too late.

Alberto Romero:

I hate that the misinformation of "DeepSeek trained its model with $5 million" went so far that readers of this newsletter are still confused. That's terrible damage to the knowledge of many people and to the epistemic hygiene of AI-adjacent circles. I debunked it twice during that crazy week, but I guess it wasn't enough. You can go read the articles to understand why DeepSeek, although more efficient than the US models (because they had to be, as I explain), is nowhere near 100x or 1,000x more so.

Mauricio Ramírez:

Thanks for your response. Now, let's suppose DeepSeek has 10,000 H800s at their disposal and Grok has 100,000 H100s. In chip count alone, Grok should have at least 10x more power and results. Now, the H100 has at least 5x the performance of the H800. That means 50x more compute power. And then the training hours for DeepSeek are around 3,000 compared to the millions for Grok. Therefore, a difference of at least 100x to 1,000x more power, and the difference in your benchmarks is a mere 10%? Again, the engineering prowess of the DeepSeek team is awesome. Elona should bid $100 billion for the DeepSeek team, not for OpenAI.
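Spelling out that back-of-envelope arithmetic (every figure below is an assumption from this comment, not a verified number):

```python
# Back-of-envelope version of the argument above. All figures are the
# assumptions stated in this comment, not verified numbers.
deepseek_gpus = 10_000        # assumed H800s
grok_gpus = 100_000           # assumed H100s
per_chip_ratio = 5            # assumed H100-vs-H800 performance gap

chip_ratio = grok_gpus / deepseek_gpus          # 10x more chips
compute_ratio = chip_ratio * per_chip_ratio     # ~50x more raw compute
print(f"~{compute_ratio:.0f}x raw compute before counting training hours")
# Adding the claimed gap in training hours is what pushes the estimate
# in this comment to 100x-1,000x.
```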

Alberto Romero:

I understand your point better now. But you're making a set of assumptions that are not true. Performance doesn't scale linearly with compute. It scales, but they wish it did that well. Besides, in my article I say that xAI probably didn't do as much optimization as DeepSeek because they didn't have to. That doesn't mean they can't or don't know how; DeepSeek published their methods and results! What this means is that xAI will probably get a much larger upside once they introduce the algorithmic tricks DeepSeek did. On the other hand, you generalized to "US companies," and although that's possibly true for xAI, it isn't for OpenAI, Anthropic, or Google (whose latest models are even cheaper than DeepSeek's). What I want to say with this is that the state of the art is *international*, not a matter of US vs. China.
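A quick sketch of what "doesn't scale linearly" means, assuming a generic power-law scaling curve with a made-up exponent:

```python
# Illustrative power law: loss improves as compute**alpha, with a small alpha.
# The exponent is made up for illustration, not a measured value.
alpha = 0.05
compute_ratio = 100                      # e.g., 100x more compute
loss_improvement = compute_ratio ** alpha
print(f"{compute_ratio}x compute -> only ~{loss_improvement:.2f}x lower loss")
```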

Mauricio Ramírez:

And too expensive!

Paul Triolo:

Still no credible evidence DeepSeek had "50K Hoppers"; the numbers do not add up.

Alberto Romero:

Why not?

Alberto Romero:

Thank you for that link. I updated my understanding of DeepSeek's situation. You are right, SemiAnalysis fucked up the number by a lot. I will use this from now on when I write about this (sadly, past articles have already perpetuated the misinformation). Again, thanks.
