DeepSeek AI Made Quite the Splash 🐋

Image Credit: Imagen 3; prompt: generate an image of deepseek whale

What's New

• Mixture of Experts (MoE): Replaces the feed-forward block of the transformer with several parallel "expert" networks plus a router that sends each input token to a small number of experts, then recombines the expert outputs into a single sequence before the next attention block (see the routing sketch after this list).

"Mixture-of-Experts is a simple extension of transformers which is rapidly establishing itself as being the go-to architecture for mid-to-large size LLM (20B-600B params)."
— Thomas Wolf, Hugging Face

• Dynamic Load Balancing: In tandem with MoE, DeepSeek adds a bias term to each expert's router scores and adjusts it according to how heavily each expert is being used, steering expert loads toward balance as training progresses (also shown in the sketch below).

• Inferior Semiconductors: Without access to the latest US chips due to export bans, DeepSeek still trained its models at a fraction of the cost of other SOTA frontier models, reportedly for only ~$6M in compute. Combined with the lower cost of the chips it did use, this represents substantial savings relative to US companies.

• Reinforcement Learning Approach: Perhaps the most notable result came from largely skipping the traditional supervised fine-tuning stage when training R1. DeepSeek instead relied heavily on reinforcement learning (RL) to develop reasoning abilities: the model learns primarily through trial and error guided by rewards, without needing large amounts of pre-labeled data, significantly reducing training cost and time (a minimal reward-based sketch follows the list).

• Open-source: DeepSeek released its models to the open-source community, allowing everyone to build on these advancements, and it continues to publish open-source updates.
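
To make the first two bullets concrete, here is a minimal NumPy sketch of top-k expert routing with a per-expert bias used only when selecting experts. The dimensions, the sigmoid gating, and the sign-based bias update are illustrative assumptions for this sketch, not DeepSeek's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only (not DeepSeek's actual configuration).
num_tokens, d_model, num_experts, top_k = 16, 32, 8, 2

# Stand-ins for the router weights and the expert feed-forward networks.
router_w = rng.normal(size=(d_model, num_experts)) * 0.02
experts = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(num_experts)]

# Per-expert bias: the load-balancing knob, used only when picking experts.
bias = np.zeros(num_experts)

def moe_layer(x, bias, step_size=0.001):
    """Route each token to its top-k experts, recombine outputs, update bias."""
    gates = 1.0 / (1.0 + np.exp(-(x @ router_w)))         # token-expert affinities
    topk = np.argsort(gates + bias, axis=1)[:, -top_k:]   # bias affects selection only

    out = np.zeros_like(x)
    load = np.zeros(num_experts)
    for t in range(x.shape[0]):
        chosen = topk[t]
        weights = gates[t, chosen] / gates[t, chosen].sum()  # normalized gate weights
        for e, w in zip(chosen, weights):
            out[t] += w * (x[t] @ experts[e])  # weighted sum of chosen experts
            load[e] += 1

    # Toy balancing rule: lower the bias of overloaded experts, raise it for
    # underloaded ones, nudging future routing decisions toward balance.
    bias = bias - step_size * np.sign(load - load.mean())
    return out, bias, load

x = rng.normal(size=(num_tokens, d_model))
for step in range(3):
    y, bias, load = moe_layer(x, bias)
    print(f"step {step}: expert loads = {load.astype(int)}")
```

The design choice mirrored here is that the bias only influences which experts are chosen, not how their outputs are weighted, so balancing the load does not distort the layer's output.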

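And a toy illustration of the RL idea: sample candidate answers from the current policy, score them with a rule-based reward, and nudge the policy toward samples that beat the group average. This is a REINFORCE-style sketch over a made-up five-answer problem, not DeepSeek's actual training recipe; all names and numbers here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "policy": a distribution over 5 candidate answers to one hypothetical
# question; index 3 is the verifiably correct answer.
logits = np.zeros(5)
CORRECT = 3

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reward(answer):
    """Rule-based reward: 1 if the answer passes the correctness check, else 0."""
    return 1.0 if answer == CORRECT else 0.0

lr, group_size = 0.5, 8
for step in range(20):
    probs = softmax(logits)
    # Trial and error: sample a group of answers from the current policy.
    samples = rng.choice(len(logits), size=group_size, p=probs)
    rewards = np.array([reward(a) for a in samples])
    advantages = rewards - rewards.mean()   # how each sample compares to the group
    # REINFORCE-style update: raise log-probability of high-advantage samples.
    grad = np.zeros_like(logits)
    for a, adv in zip(samples, advantages):
        grad += adv * (np.eye(len(logits))[a] - probs)  # d/dlogits of log p(a)
    logits += lr * grad / group_size

print("final answer probabilities:", np.round(softmax(logits), 2))
```

The group-relative advantage (comparing each sample to the group mean rather than to a learned value function) loosely echoes the spirit of DeepSeek's approach, but the real system operates on full text generations with far more machinery.
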
Last week DeepSeek released DualPipe, its bidirectional pipeline-parallelism approach, which addresses inefficiencies in standard pipeline parallelism by fully overlapping the forward and backward computation-communication phases.

Check out their GitHub repo: https://lnkd.in/e3Yn7yBt


Callouts

While distilled versions of the models have since been released on Hugging Face, R1 and V3 produced some questionable responses that omitted key facts when asked about China's history (e.g., the 1989 Tiananmen Square protests).


Why This Matters

As US companies continue to ramp up Gen AI expenditures, often in the billions of dollars, DeepSeek proved that open-source alternatives can be developed at a fraction of the cost and at a much faster rate, renewing interest in the vast possibilities of small language models (SLMs) and the broader open-source ecosystem.
