Combines Aspects of Large Language Models and Diffusion Models
LLaDA introduces a fundamentally different approach to language generation: it replaces traditional autoregression with a “diffusion-based” process (we will dive into why this is called “diffusion” later).
1. We fix a maximum length (similar to ARMs), typically 4096 tokens. 1% of the time, sequence lengths are instead sampled randomly between 1 and 4096 and the sequences are padded, so that the model is also exposed to shorter sequences.
2. We randomly choose a “masking rate”. For example, one could pick 40%.
3. We mask each token with a probability of 0.4 (see the sketch after this list). What does “masking” mean exactly? We simply replace the token with a special token: `<MASK>`. As with any other token, this token is associated with a particular index and embedding vector that the model can process and interpret during training.
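To make these steps concrete, here is a minimal PyTorch sketch of the masking procedure. This is an illustration under stated assumptions, not LLaDA's actual training code: `mask_token_id` and the toy vocabulary size are made-up values, and the masking rate is either passed in or sampled uniformly per sequence.

```python
import torch

def mask_tokens(input_ids, mask_token_id, t=None):
    """Randomly replace tokens with a mask token, as in steps 2-3 above.

    input_ids: (batch, seq_len) tensor of token ids.
    mask_token_id: id of the special <MASK> token (hypothetical value here).
    t: masking rate in (0, 1); if None, one rate is sampled per sequence.
    """
    batch, seq_len = input_ids.shape
    if t is None:
        # Step 2: pick a random masking rate for each sequence (e.g. 0.4).
        t = torch.rand(batch, 1)
    # Step 3: mask each token independently with probability t.
    mask = torch.rand(batch, seq_len) < t
    noisy_ids = input_ids.clone()
    noisy_ids[mask] = mask_token_id
    return noisy_ids, mask

# Toy usage: a made-up vocabulary of 1000 tokens with mask id 999.
ids = torch.randint(0, 999, (2, 16))
noisy, mask = mask_tokens(ids, mask_token_id=999, t=0.4)
```

With `t=0.4`, roughly 40% of positions end up as the mask id; the returned `mask` records which positions were corrupted, which is what the model is later trained to reconstruct.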