AI training dataset leaks thousands of live API keys and passwords
Researchers found that almost two-thirds of the secrets were duplicated across multiple pages
Researchers uncover nearly 12,000 private API keys and passwords embedded within an open-source dataset
Researchers have uncovered nearly 12,000 private API keys and passwords embedded within the Common Crawl dataset, an open-source repository of web data used by leading AI developers to train their models.
The discovery was made by Truffle Security, a firm specialising in secrets detection.
In their analysis of billions of web pages archived by Common Crawl in 2024, researchers found thousands of hardcoded secrets left exposed within the dataset.
The compromised data included API keys, passwords, and other login credentials, with the majority linked to Amazon Web Services (AWS), MailChimp, and WalkScore accounts.
"This highlights a growing issue: LLMs trained on insecure code may inadvertently generate unsafe outputs," the researchers said.
Common Crawl, a nonprofit organisation, hosts a colossal archive of freely available web data gathered through extensive web crawling efforts. As of recent estimates, its archives total over 250 petabytes, with new crawls contributing several petabytes every month.
This wealth of data is regularly used to train some of the world's leading LLMs, including those developed by OpenAI, Google, Meta, and DeepSeek, among others.
Truffle Security analysed 400 terabytes of data spanning 2.67 billion web pages in Common Crawl's December 2024 archive. The analysis surfaced 11,908 secrets that authenticated successfully against their respective services, indicating that developers had hardcoded live credentials directly into their web pages, potentially exposing LLMs to insecure code.
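The scanning approach can be illustrated with a much-simplified sketch. Truffle Security's own tooling (TruffleHog) uses hundreds of detectors and verifies each candidate against the provider's API before counting it as "live"; the patterns and function below are illustrative assumptions, not the firm's actual pipeline:

```python
import re

# Hypothetical, simplified patterns for a few of the secret types named in
# the research. Real scanners pair pattern matches with live verification
# (e.g. calling the provider's API) to filter out false positives.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us\d{1,2}\b"),
    "slack_webhook": re.compile(r"https://hooks\.slack\.com/services/[A-Za-z0-9/]+"),
}


def scan_page(html: str) -> list[tuple[str, str]]:
    """Return (secret_type, match) pairs found in a page's raw HTML/JS."""
    findings = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.findall(html):
            findings.append((name, match))
    return findings


# AWS's documented example key ID, not a live credential.
page = '<script>const key = "AKIAIOSFODNN7EXAMPLE";</script>'
print(scan_page(page))  # [('aws_access_key_id', 'AKIAIOSFODNN7EXAMPLE')]
```

Run against billions of archived pages, even a sketch like this makes clear why hardcoded credentials in publicly crawled HTML and JavaScript are so readily harvested, by researchers and attackers alike.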
While LLM training data undergoes pre-processing, including cleaning and filtering to remove irrelevant, harmful, duplicate, or sensitive information, the sheer scale of the Common Crawl dataset makes it exceedingly difficult to guarantee the complete removal of confidential data.
Among the specific findings, Truffle Security identified nearly 1,500 unique MailChimp API keys hardcoded directly into front-end HTML and JavaScript files. Such oversights expose these keys to potential misuse, including phishing campaigns, brand impersonation, and data exfiltration.
Alarmingly, the researchers found that almost two-thirds (63%) of the secrets were duplicated across multiple pages. One WalkScore API key appeared 57,029 times across 1,871 different subdomains, amplifying the potential impact of its compromise.
Furthermore, researchers discovered a single webpage containing 17 unique live Slack webhooks.
"Keep it secret, keep it safe," Slack warned users.
"Your webhook URL contains a secret. Don't share it online, including via public version control repositories."
Truffle Security has reportedly notified the affected vendors, assisting them in revoking compromised keys and mitigating further damage.
"Our research confirms that LLMs are exposed to millions of examples of code containing hardcoded secrets in the Common Crawl dataset," the researchers said.
"LLMs may benefit from improved alignment and additional safeguards – potentially through techniques like Constitutional AI – to reduce the risk of inadvertently reproducing or exposing sensitive information," they added.