Learning-Deep-Learning

Fast-dLLM v2: Efficient Block-Diffusion LLM

January 2026

tl;dr: Adapt AR into a block diffusion model (hybrid model, blockwise AR + intra-block diffusion), and inference intra-block with Fast-dLLM.

Both v1 and v2 enforce left-to-right blockwise AR dependency. But v2 adapts the training of the model and aligns it better with inference (训推一致).

Note that Fast-dLLM v2 is a hybrid model (block diffusion model) and is NOT a Full Diffusion Model.

V2 is a lightweight adaptation of an AR to a intra-block diffusion model.
- v2 does not modify the pre-trained AR LLM’s architectur
Specialized attention mask of 2Lx2L size.
- Noised sequence is concatenated with clean sequence and thus the 2L length
- Vectorized training via special pattern.

Whether to refresh for decoded blocks in infernece
- v1 refresh the KV cache of decoded block
- v2 inference does NOT update the kV cash after a block is decoded, as the model is decoding blockwise in an AR way
V2 inference
- Inter-block (from block to block), use standard KV cache.
- Intra-block (within a block), V2 also uses the inference techniques for V1. V2 devides each block into subblocks, and the dualcache happens on the subblock level (v2 subblock == v1 block, smallest unit of diffusion-style unmasking).

DualCache for sub-blocks does NOT include the suffix cache for undecoded blocks beyond the current block being decoded.
- Inter-block: Strict left-to-right AR dependency (no access to future blocks).
- Intra-block: Bidirectional attention (access to all tokens within the current block, including masked/suffix tokens in the same block).