
RHO-1: Not All Tokens Are What You Need

January 2026

tl;dr: Selective Language Modeling (SLM) backpropagates the loss only on valuable tokens.

Overall impression

“Rho” denotes selective modeling of tokens with higher information density (ρ).

Rho-1 is a great way to clean up the pretraining dataset.

The paper belongs to the family of ideas that training on high-quality tokens matters a lot. This is again quality > quantity. The results look very promising with much reduced training compute (matching DeepSeekMath with only 3% of the pretraining tokens).
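A minimal PyTorch sketch of the SLM selection step is below. It assumes a frozen reference model scores each token, and uses a hypothetical `keep_ratio` parameter for the fraction of top excess-loss tokens kept; this illustrates the idea, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, ref_logits, labels, keep_ratio=0.6):
    """Selective Language Modeling (SLM) sketch.

    Rank tokens by excess loss (training-model loss minus frozen
    reference-model loss) and backprop only through the top
    `keep_ratio` fraction. `keep_ratio` is a hypothetical stand-in
    for the paper's token-keep percentage.
    """
    # Per-token cross-entropy for the training model, shape (batch * seq_len,)
    train_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none"
    )
    with torch.no_grad():
        # Per-token cross-entropy for the frozen reference model
        ref_loss = F.cross_entropy(
            ref_logits.view(-1, ref_logits.size(-1)), labels.view(-1),
            reduction="none",
        )
        # Excess loss: high for tokens the training model still finds hard
        # but the reference model (trained on high-quality data) finds easy
        excess = train_loss.detach() - ref_loss
        k = max(1, int(keep_ratio * excess.numel()))
        _, keep_idx = torch.topk(excess, k)
        mask = torch.zeros_like(excess)
        mask[keep_idx] = 1.0
    # Average loss over the selected tokens only; unselected tokens
    # contribute zero gradient
    return (train_loss * mask).sum() / mask.sum()
```

The key design choice is that selection is dynamic: scoring happens per batch against the current training model, so the set of "valuable" tokens shifts as training progresses rather than being fixed by a one-time data filter.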

Key ideas

Technical details

Notes