Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning

February 2023

tl;dr: High-quality language data will likely run out by 2026, and low-quality language data between 2030 and 2050. Image data will run out much more slowly, perhaps between 2030 and 2060. In the long run, we need to improve data efficiency to keep ML algorithms going.

Overall impression

The paper is full of interesting statistics on language datasets (web, papers, books, etc.) and inspiring statistical modeling of their growth trends.

The current trend of ever-growing ML models may slow down significantly unless data efficiency is drastically improved or new data sources become available.

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. – Richard Sutton, The Bitter Lesson

Key ideas

Technical details
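
The exhaustion dates in the tl;dr come from comparing how fast training datasets grow with how fast the stock of available data grows. The toy sketch below illustrates that kind of comparison only; it is not the paper's actual model. It assumes both quantities grow exponentially at constant annual rates, and the function name `exhaustion_year` and all numbers are hypothetical placeholders, not figures from the paper.

```python
import math


def exhaustion_year(start_year, dataset_size, dataset_growth,
                    stock_size, stock_growth):
    """Estimate when training dataset size catches up with the data stock.

    Assumes (as a simplification) constant annual growth rates:
        size(t) = size_0 * (1 + growth) ** t
    Returns a fractional year, or math.inf if the datasets never catch up.
    """
    if dataset_size >= stock_size:
        # The stock is already exhausted at the starting point.
        return float(start_year)
    if dataset_growth <= stock_growth:
        # Datasets grow no faster than the stock, so the curves never cross.
        return math.inf
    # Solve dataset_size * (1 + g_d)^t = stock_size * (1 + g_s)^t for t.
    ratio = (1 + dataset_growth) / (1 + stock_growth)
    years = math.log(stock_size / dataset_size) / math.log(ratio)
    return start_year + years


# Hypothetical inputs: datasets of 1e12 tokens growing 50% per year,
# against a stock of 1e13 tokens growing 7% per year.
year = exhaustion_year(2022, 1e12, 0.50, 1e13, 0.07)
print(f"Projected exhaustion year (toy numbers): {year:.1f}")
```

Under this simplification the crossing point depends only on the initial gap and the ratio of the two growth rates, which is why the projected date moves later if dataset growth slows or data efficiency improves.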