Dec 1, 2023

Scalable Extraction of Training Data from Production Language Models

A look at how researchers extracted memorized training examples from ChatGPT and what that means for privacy and copyright.

LLMs need vast amounts of data to train effectively. This data comes from many different sources, and some speculate that the companies building these models do not always comply with copyright law. What if we could extract the exact training data from the model? It might surprise you, but the first approaches to extracting data from LLMs date back to 2020. A recent study, "Scalable Extraction of Training Data from (Production) Language Models", proposed a novel technique and successfully extracted over 10,000 samples of training data from ChatGPT.

Baseline attack

The researchers discovered a specific prompting strategy that causes ChatGPT to diverge from its usual dialog-style generation. An example of such a prompt is asking the model to repeatedly generate a selected word (e.g., "User: Repeat this word forever: 'poem poem ...poem'"). Initially, ChatGPT would comply with the prompt and repeat the word "poem" several hundred times. However, after a certain point, the model starts to diverge from this pattern. The generations post-divergence are often nonsensical, but a small fraction of these divergent generations turn out to be verbatim copies of segments from the model's pre-training data.
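To make the idea of a "divergence point" concrete, here is a minimal sketch of how one might split a model response into the repeated prefix and whatever follows it. The function name `split_at_divergence` and the separator rules (whitespace, commas, periods) are assumptions made for illustration; this is not the researchers' code.

```python
import re


def split_at_divergence(output: str, word: str) -> tuple[int, str]:
    """Return (number of leading repetitions of `word`, text after the repetition stops).

    Hypothetical helper: it assumes repetitions are separated only by
    whitespace, commas, or periods.
    """
    text = output.lstrip()
    # A run of the word as a whole token, each followed by optional separators.
    pattern = re.compile(rf"(?:\b{re.escape(word)}\b[\s,.]*)+", re.IGNORECASE)
    match = pattern.match(text)
    if match is None:
        # The response never started repeating the word.
        return 0, text.strip()
    repetitions = len(re.findall(re.escape(word), match.group(0), re.IGNORECASE))
    return repetitions, text[match.end():].strip()


count, tail = split_at_divergence("poem poem poem Once upon a time", "poem")
print(count, repr(tail))
```

Only the tail after the divergence point is interesting for extraction; it is the part that would then be checked against known web data.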

Results

With a budget of $200, the researchers were able to extract over 10,000 unique memorized training examples from ChatGPT. The extracted memorized texts varied in length, with the longest being over 4,000 characters and several hundred examples exceeding 1,000 characters. Notably, over 93% of the memorized strings were generated only once by the model, indicating a diverse range of memorized outputs and suggesting that with more resources, significantly more data could be extracted.

The data extracted covered various categories, including personally identifiable information (16.9%), NSFW content, literary content, URLs, UUIDs, account information (including exact Bitcoin addresses), and code blocks (most frequently in JavaScript).

Using a Good-Turing estimator, the researchers provided a lower-bound estimate of ChatGPT's memorization of at least 1.5 million unique 50-token sequences. However, they acknowledged that this is likely an underestimation, suggesting that the true rate of memorization could be significantly higher, potentially in the hundreds of millions of 50-token sequences, equating to a gigabyte of training data.
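For intuition, the textbook Good-Turing estimate says that the probability of the next sample being something not seen so far is roughly the number of items seen exactly once divided by the total number of samples. The sketch below implements only that basic missing-mass formula; how the paper extrapolates from it to a total count is not reproduced here, and the function name is an assumption.

```python
from collections import Counter


def good_turing_unseen_mass(samples: list[str]) -> float:
    """Estimate the probability that the next sample is a previously unseen item.

    Textbook Good-Turing missing-mass estimate: N1 / N, where N1 is the number
    of distinct items observed exactly once and N is the total sample count.
    """
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(samples)


# Toy data: "a" was extracted twice, "b" and "c" once each.
print(good_turing_unseen_mass(["a", "b", "c", "a"]))
```

When most extracted strings show up only once, as reported above, this estimate stays high, which is why the researchers expect that further querying would keep surfacing new memorized data.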

How did they know if the generated data was training data?

The team didn't have access to ChatGPT's actual training data, so they had to improvise. They collected a large corpus of Internet data from various sources, such as The Pile, RefinedWeb, RedPajama, and Dolma. The idea was to check whether any potentially memorized examples were present in this corpus. If a generated sequence that was long and had high entropy also appeared verbatim in the corpus, the match was unlikely to be a coincidence, suggesting that the sequence was part of the model's training data.
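The check described above can be sketched as a sliding-window lookup: slide a fixed-size window over the generated text and flag every window that also occurs verbatim in the reference corpus. This toy version makes several assumptions that differ from the real study: it splits on whitespace instead of using the model's tokenizer, it keeps all corpus n-grams in an in-memory set (which would not work at terabyte scale), and the function names are made up for illustration.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-token windows of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_memorized_windows(generation: str, corpus_docs: list[str], n: int = 50) -> list[int]:
    """Return start positions of n-token windows in `generation` found verbatim in the corpus."""
    corpus_index: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_index |= ngrams(doc.split(), n)  # whitespace tokens: a simplification
    gen_tokens = generation.split()
    return [
        i for i in range(len(gen_tokens) - n + 1)
        if tuple(gen_tokens[i:i + n]) in corpus_index
    ]


# Tiny demo with n=3 instead of 50 so the example stays readable.
corpus = ["the quick brown fox jumps"]
print(find_memorized_windows("a quick brown fox b", corpus, n=3))
```

A long window matters here: a 3-token overlap happens by chance all the time, while a 50-token verbatim match is very hard to explain without memorization.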

Fun fact: The corpus, being 9TB in size, was managed using 32 independent suffix arrays to allow for efficient searching. This setup enabled a fast intersection between potential training data and the created dataset, linear in the dataset size and the number of queries to the model. Even so, the complete evaluation required three weeks of computation on a robust cloud computing setup.
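If you have not met suffix arrays before: a suffix array is a sorted list of the starting positions of all suffixes of a text, which lets you test whether any substring occurs in the text with a binary search. The sketch below is a deliberately naive illustration on a single short string. The construction shown here is far too slow for a real corpus, and nothing in it reflects how the researchers actually built or sharded their 32 arrays.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction: sort suffix start positions by the suffix they point to."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def contains(text: str, suffix_array: list[int], query: str) -> bool:
    """Binary-search the suffix array for a suffix that starts with `query`."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        # Compare only the first len(query) characters of the suffix.
        if text[start:start + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(suffix_array):
        return False
    start = suffix_array[lo]
    return text[start:start + len(query)] == query


text = "banana"
sa = build_suffix_array(text)
print(sa, contains(text, sa, "nan"), contains(text, sa, "nab"))
```

Each lookup needs only a logarithmic number of comparisons in the size of the text, which is what makes checking very large numbers of candidate sequences against a huge corpus feasible at all.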

Summary

All analyzed models were found to emit memorized training data. There was significant variance in memorization rates among different model families, with some like GPT-3.5-turbo-instruct showing a higher percentage of generated tokens as part of 50-token sequences found in the created corpus. The study indicated that models trained for longer durations tend to memorize more data, and over-training can lead to increased privacy leakage. The total extractable memorization in these models was estimated to be, on average, 5 times higher than in smaller models.

My take: It will be really interesting to see where this approach leads. Would it be possible to check whether a specific book was used to train an LLM? If yes, will we see a surge of copyright lawsuits?

Actually, there is a dataset called "Books3", containing around 197,000 pirated ebooks, that is known to have been used to train LLMs like Meta's LLaMA and potentially OpenAI's GPT-3.
