A major AI training data set contains millions of examples of personal data

MIT Technology Review - AI
Jul 18, 2025 13:08
Eileen Guo

Summary

New research finds that DataComp CommonPool, a major open-source AI training set for image generation, likely contains millions of images with sensitive personal data such as passports, credit cards, and birth certificates. The finding raises serious privacy concerns and points to the need for stricter data vetting and ethical standards in AI training practices.

Millions of images of passports, credit cards, birth certificates, and other documents containing personally identifiable information are likely included in one of the biggest open-source AI training sets, new research has found. Thousands of images—including identifiable faces—were found in a small subset of DataComp CommonPool, a major AI training set for image generation scraped from…
