Anthropic CEO: is DeepSeek-R1 a revolution in AI?
Dario Amodei's essay on the model that crashed Wall Street.
Dario Amodei, CEO of Anthropic, just published his view of the 10-day-old Chinese model that caused a panic on Wall Street this week, with NVIDIA losing 17% of its value in a single day. In short, Amodei says: "This model is not that exceptional, but it's good; let's not fear DeepSeek, but strengthen export restrictions on China, just in case".
Let's untangle this and put it in context 👇
Amodei starts with a high-level perspective, laying out 3 dynamics of AI development:
- Scaling laws: everyone knows them by now; they were first discovered for training: as you 10x compute (assuming you're not using it poorly), performance will reliably increase by a fixed step (see the toy sketch after this list).
- Shifting the curve: algorithmic efficiency keeps improving, which improves the results you get for a fixed cost. Back in 2020, a paper quantified the efficiency gains at 1.68x per year; Amodei now puts them at 4x per year. He also mentions the "Jevons paradox" that everyone was talking about on Twitter: the price/energy cost of a fixed level of performance keeps decreasing, but that gain is instantly spent on more performance, because the potential gains are immense.
- Paradigm shifts: back in 2020, the paradigm was training ever-bigger pretrained models. Now, adding RL to the mix has unlocked a new scaling law, letting model performance jump up.
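To make the first two dynamics concrete, here is a toy numerical sketch (mine, not Amodei's): a made-up power-law scaling curve, plus a yearly algorithmic-efficiency multiplier that acts like extra compute. Every constant below is illustrative, not a measured value.

```python
# Toy illustration of the scaling-law and curve-shifting dynamics.
# The exponent, scale, and compute figures are invented for demonstration only.

def loss(compute: float, alpha: float = 0.05, scale: float = 10.0) -> float:
    """Toy scaling law: loss falls as a small power of training compute."""
    return scale * compute ** (-alpha)

def effective_compute(raw_compute: float, years: float, efficiency_per_year: float = 4.0) -> float:
    """Algorithmic progress acts like extra compute (4x/year in Amodei's current estimate)."""
    return raw_compute * efficiency_per_year ** years

base = 1e24  # FLOPs, arbitrary reference point
print(f"loss at 1x compute:   {loss(base):.3f}")
print(f"loss at 10x compute:  {loss(10 * base):.3f}")   # one fixed step down per 10x
print(f"loss at 100x compute: {loss(100 * base):.3f}")  # another, similar step
# Same raw budget, one year of algorithmic progress => same loss as ~4x the compute:
print(f"loss at 1x compute + 1 year of efficiency gains: {loss(effective_compute(base, 1.0)):.3f}")
```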
This RL paradigm shift was really uncovered with o1. Amodei therefore does not regard R1 itself as an engineering breakthrough: its base model, DeepSeek-V3, released a few weeks earlier, was closer to a real breakthrough.
➡️ Here, let's not forget that this is the CEO of Anthropic talking: "reasoning is not that hard, the base model is what matters". Remember that Anthropic offers no reasoning model (yet); their flagship is still Claude-3.5-Sonnet (which is awesome), so they could have an interest in downplaying reasoning models.
The insane pace of progress of Chinese open models
Amodei acknowledges two great elements in DeepSeek-R1's engineering:
- Good KV cache management
- Good usage of Mixture of Experts (MoE), an architecture that dynamically routes each token to a small subset of specialized expert sub-networks: different areas of the network can specialize in different tasks, which in theory matches the accuracy of a dense model while activating far fewer parameters per token (see the routing sketch below).
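For intuition on the MoE point, here is a minimal sketch of generic top-k expert routing in PyTorch. This is not DeepSeek's actual implementation (which adds shared experts, load-balancing tricks, and many other refinements); all names and sizes are illustrative.

```python
# Minimal, generic top-k Mixture-of-Experts routing (illustrative, not DeepSeek's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each expert for each token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (n_tokens, d_model)
        scores = self.router(x)                              # (n_tokens, n_experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)  # keep only k experts per token
        weights = F.softmax(top_vals, dim=-1)                 # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                  # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(TinyMoE()(tokens).shape)  # torch.Size([16, 64]); only 2 of the 8 experts ran per token
```

The key property is that each token only pays the compute cost of its top-k experts, while the model's total parameter count (and thus its capacity) is much larger.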
But then he goes on to argue that DeepSeek-R1 is not that exceptional. Although the Twitter crowd made fun of him for coping so hard with reality, I think he makes some great points.
💰 Is DeepSeek-R1 really crazy cheap compared to OpenAI/Anthropic/Google/Meta models? How much did Claude-3.5-Sonnet cost to train? => "a few $10M's". So DeepSeek's reported training cost of ~$5.5M for the V3 base model is low, but not exceptionally low. Arguably, the bulk of big AI labs' spending goes into experimentation and research, not the final pre-training run. Also, did DeepSeek really train it on only a couple thousand GPUs, as claimed? Amodei seems to doubt it (and he's far from alone): the rumor is that DeepSeek probably has around 50k GPUs of the Hopper generation, not necessarily H100s but possibly H20s or H800s, depending on how DeepSeek managed to get around US export restrictions.
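For reference, here is a quick back-of-envelope check of where that headline figure comes from, using the commonly cited numbers from the DeepSeek-V3 technical report (GPU-hours times an assumed rental price); treat these as reported estimates, not audited accounting.

```python
# Back-of-envelope reconstruction of the ~$5.5M headline training cost
# (figures as commonly cited from the DeepSeek-V3 report; illustrative only).
gpu_hours = 2.788e6          # total H800 GPU-hours reported for the V3 training run
rental_price_per_hour = 2.0  # assumed $/GPU-hour rental rate used in that report

headline_cost = gpu_hours * rental_price_per_hour
print(f"headline training cost: ${headline_cost / 1e6:.2f}M")  # ~$5.6M

# What this figure leaves out: research experiments, failed runs, data pipelines,
# salaries, and the capital cost of owning (rather than renting) the cluster.
```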
Amodei then proposes to put model performance back in the context of the field's overall rapid progress: measured against that improvement curve, he argues, R1's accuracy gains are not exceptional; worse, they come late compared to American models. The achievement on price is impressive, but not a breakthrough either.
➡️ This is where the cope hits hardest. Amodei argues that DeepSeek-V3 was not as good as the original Sonnet 3.5, which was "7-10 months older". But the Sonnet he cites as being better at key tasks like coding is actually the more recent version of Sonnet-3.5, sometimes informally called 3.6. You always need a healthy dose of chauvinism!
➡️ Also, while his point about viewing models on a curve is a good one, I think there are actually two curves: the curve for Chinese, open models has now caught up with the one for American, closed models. And given their momentum, Chinese models might be taking the lead soon.
This is probably why, after saying "meh, DeepSeek's tech isn't that good", Amodei's essay ends up as strong advocacy for export controls, in the vein of "Let's not fear DeepSeek's researchers, but the autocratic government that controls them, and let's consolidate the West's advantage through export controls".
➡️ There is probably some validity to that, given the huge difference that a few years' lead in military tech can make (see the Gulf War).
Notwithstanding the caveats above, I found his short essay really interesting; you should go read it! 👉 https://darioamodei.com/on-deepseek-and-export-controls
(His previous essay, Machines of Loving Grace, is also amazing.)