DeepSeek AI Made Quite the Splash 🐋
Image Credit: Imagen 3; prompt: generate an image of deepseek whale
What's New
• Mixture of Experts (MoE): Modifies the feedforward block of the transformer by duplicating it into several experts, intelligently routing each input token to an expert, and recombining the expert outputs back into the sequence before the next attention block (see the first sketch after this list).
"Mixture-of-Experts is a simple extension of transformers which is rapidly establishing itself as being the go-to architecture for mid-to-large size LLM (20B-600B params)."
• Dynamic Load Balancing: In tandem with MoE, dynamic load balancing adds a bias term to each expert's router score and adjusts it as training progresses, nudging under-used experts up and over-used ones down so that expert loads are steered toward balance.
• Inferior Semiconductors: Without access to top-tier US chips due to export bans, DeepSeek still trained its models at a fraction of the cost of other SOTA frontier models, reportedly only ~$6M in compute. Combined with the lower price of the chips it did use, that amounted to substantial savings relative to US companies.
• Reinforcement Learning Approach: Perhaps the most notable result came from largely skipping the traditional supervised fine-tuning stage when training R1. DeepSeek instead relied heavily on reinforcement learning (RL) to develop reasoning abilities: the model learns primarily through rewards via trial and error, without needing large amounts of pre-labeled data, which significantly reduced training cost and time (a toy reward sketch follows this list).
• Open-source: The company released its models to the open-source community, which let everyone take advantage of these advancements, and it continues to release open-source updates.
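
To make the MoE and load-balancing bullets concrete, here is a minimal, self-contained PyTorch sketch of a MoE feedforward block whose router adds a per-expert bias and nudges that bias during training to even out expert loads. Every name, size, and the exact update rule below is an illustrative assumption, not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Toy MoE feedforward block with a bias-adjusted router (illustrative sketch only)."""

    def __init__(self, d_model=64, d_hidden=256, n_experts=4, top_k=1):
        super().__init__()
        # The single feedforward block is duplicated into several experts.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Per-expert bias used only when picking experts; nudged during training
        # to pull token load toward balance.
        self.register_buffer("route_bias", torch.zeros(n_experts))
        self.top_k = top_k
        self.bias_update_rate = 0.01

    def forward(self, x):                          # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.size(-1))         # route token by token
        scores = self.router(tokens)               # (n_tokens, n_experts)
        chosen = (scores + self.route_bias).topk(self.top_k, dim=-1).indices
        gates = F.softmax(scores, dim=-1)          # mixing weights stay bias-free

        out = torch.zeros_like(tokens)
        load = torch.zeros(len(self.experts), device=x.device)
        for e, expert in enumerate(self.experts):
            mask = (chosen == e).any(dim=-1)       # tokens routed to expert e
            if mask.any():
                out[mask] += gates[mask, e:e + 1] * expert(tokens[mask])
            load[e] = mask.float().mean()

        if self.training:
            # Raise the bias of under-used experts, lower it for over-used ones.
            self.route_bias -= self.bias_update_rate * (load - load.mean()).sign()

        return out.reshape_as(x)                   # stitch tokens back into the sequence


x = torch.randn(2, 8, 64)                          # (batch, seq, d_model)
print(MoEFeedForward()(x).shape)                   # torch.Size([2, 8, 64])
```

In this sketch the bias only influences which expert a token is sent to; the mixing weights still come from the unbiased router scores, so balancing the load does not distort the layer's output.

To show what "learning through rewards via trial and error, without pre-labeled data" can look like, here is a toy rule-based reward function of the sort an RL loop might use to score sampled completions. The tag and answer formats are assumptions for the sketch, not DeepSeek's exact reward code.

```python
import re

def rule_based_reward(model_output: str, reference_answer: str) -> float:
    """Score a sampled completion with simple, verifiable rules (illustrative only).
    Assumes the model is prompted to reason inside <think> tags and to put its
    final answer inside \\boxed{...}."""
    reward = 0.0

    # Format reward: did the model produce a reasoning block at all?
    if re.search(r"<think>.*?</think>", model_output, flags=re.DOTALL):
        reward += 0.1

    # Accuracy reward: does the final boxed answer match the reference exactly?
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0

    return reward

sample = "<think>2+2 is 4</think> The answer is \\boxed{4}."
print(rule_based_reward(sample, "4"))  # 1.1
```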
Last week DeepSeek released its dual pipeline parallelism approach, which addresses inefficiencies in pipeline parallelism with a bidirectional design: a full overlap of the forward and backward computation-communication phases (a rough sketch of the overlap idea follows the repo link).
Check out their GitHub repo: https://lnkd.in/e3Yn7yBt
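
As a rough illustration of the overlap idea (not DeepSeek's actual pipeline code), the sketch below runs a matrix multiply on one CUDA stream while a communication-like device-to-host copy proceeds on another, so neither waits for the other; the shapes and buffer names are made up for the example.

```python
import torch

# Requires a CUDA-capable GPU; all shapes and names here are illustrative.
assert torch.cuda.is_available(), "this sketch assumes a CUDA device"

compute_stream = torch.cuda.Stream()
comm_stream = torch.cuda.Stream()

x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")
activations = torch.randn(4096, 4096, device="cuda")           # stand-in for data bound for another pipeline stage
recv_buffer = torch.empty(activations.shape, pin_memory=True)   # pinned host buffer enables an async copy

# Make both side streams wait for the setup work on the default stream.
comm_stream.wait_stream(torch.cuda.current_stream())
compute_stream.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(comm_stream):
    # "Communication" phase: ship the activations off the GPU without blocking compute.
    recv_buffer.copy_(activations, non_blocking=True)

with torch.cuda.stream(compute_stream):
    # Compute phase for the next chunk of work runs concurrently on its own stream.
    y = x @ w

torch.cuda.synchronize()   # wait for both streams before using the results
print(y.shape, recv_buffer.shape)
```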
Callouts
While distilled versions of the model have since been released on Hugging Face, R1 and V3 gave some questionable responses that omitted key facts when asked about China's history (e.g., the Tiananmen Square protests in 1989).
Why This Matters
As US companies continue to ramp up Gen AI expenditures that often run into the billions of dollars, DeepSeek proved that open-source alternatives can be developed at a fraction of the cost and at a much faster rate, renewing interest in the vast possibilities of SLMs and other open-source models.