The Chinese start-up DeepSeek has quietly released DeepSeek-V3-0324, an update of its eponymous open-source model DeepSeek-V3. This new version, with improved capabilities in mathematics and programming, could foreshadow DeepSeek R2, which is expected to be released soon. The model, published under the MIT license, one of the most permissive, is available on Hugging Face.
The launch was not accompanied by any statement from the start-up, which was founded in May 2023 in Hangzhou, is led by Liang Wenfeng, and is a subsidiary of the hedge fund High-Flyer.
While DeepSeek-V3 has 671 billion parameters, DeepSeek-V3-0324 boasts 685 billion and is powered by a cluster of 32,000 GPUs, making it one of the most powerful open-source models in its category. It relies on DeepSeekMoE, the Mixture-of-Experts architecture developed for its predecessors, which, as its name suggests, is composed of multiple specialized experts. These are activated according to the needs of each request through a routing mechanism, allowing the model to handle a variety of tasks efficiently while reducing computational load.
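To make the routing idea concrete, here is a minimal sketch, in PyTorch, of a Mixture-of-Experts layer with top-k routing. The `MoELayer` class, its layer sizes and its number of experts are illustrative assumptions, not DeepSeek's actual configuration.

```python
# Minimal sketch of a Mixture-of-Experts layer with top-k routing.
# Dimensions and expert counts are illustrative, not DeepSeek's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # One feed-forward "expert" per slot; only top_k are used per token.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        # The router scores each token against every expert.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out


tokens = torch.randn(4, 512)
print(MoELayer()(tokens).shape)  # torch.Size([4, 512])
```

Only the selected experts run for each token, which is what keeps the active compute per request far below the model's total parameter count.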
DeepSeek-V3-0324 is also expected to adopt its predecessors' Multi-head Latent Attention (MLA) architecture, an approach that jointly compresses attention keys and values, shrinking the Key-Value (KV) cache during inference and cutting memory usage while improving processing efficiency.
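The sketch below illustrates the latent KV compression idea behind MLA in a simplified PyTorch module: keys and values are projected down into a small shared latent, only that latent is cached, and per-head keys and values are re-expanded at attention time. The `LatentKVAttention` class, its dimensions, and the omission of rotary embeddings and causal masking are assumptions for illustration, not DeepSeek's implementation.

```python
# Minimal sketch of latent KV compression in the spirit of Multi-head Latent
# Attention: cache a small joint latent instead of full per-head keys and values.
# Shapes, and the absence of rotary embeddings and causal masking, are simplifications.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentKVAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Joint down-projection: one small latent replaces the full K and V cache.
        self.kv_down = nn.Linear(d_model, d_latent)
        # Up-projections reconstruct per-head keys and values from the latent.
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, kv_cache=None):        # x: (batch, seq, d_model)
        b, t, _ = x.shape
        latent = self.kv_down(x)                # (batch, seq, d_latent) -- cached instead of K and V
        if kv_cache is not None:
            latent = torch.cat([kv_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y), latent              # the latent is the compact KV cache


x = torch.randn(1, 16, 512)
y, cache = LatentKVAttention()(x)
print(y.shape, cache.shape)  # torch.Size([1, 16, 512]) torch.Size([1, 16, 64])
```

In this toy setup the cache per token shrinks from two full d_model vectors (keys and values) to a single 64-dimensional latent, which is the memory saving the approach targets.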
While DeepSeek presented this version on X as a minor update of DeepSeek-V3, early feedback, just a few hours after the launch, points to real advances, especially in mathematics and programming.
DeepSeek's performance continues to fuel speculation. DeepSeek R1, the start-up's first reasoning model, built on V3, surprised experts with advanced reasoning capabilities delivered at training and usage costs significantly lower than those of its American competitors, and managed to shake Wall Street.
According to an article published yesterday by La Tribune, the United States is looking for an explanation to the "DeepSeek mystery" in possible smuggling of Nvidia chips. No technical hypothesis should be ruled out, but it would be a mistake not to see DeepSeek as a new reference player in open-source AI, as evidenced by this extremely interesting GitHub repository.
To better understand
What is the Mixture-of-Experts architecture used in DeepSeek-V3-0324?
The Mixture-of-Experts (MoE) architecture involves using different specialized sub-models, or 'experts', activated based on task demands. This allows efficient computational resource allocation, optimizing performance for specific tasks without overloading the system.
How does Multi-head Latent Attention (MLA) work in DeepSeek-V3-0324?
The Multi-head Latent Attention (MLA) in DeepSeek-V3-0324 enhances processing by jointly compressing attention keys and values. This reduces the Key-Value cache size during inference, optimizing memory usage while maintaining high processing efficiency.