🔍Larger language models tend to memorize more training data and can regurgitate it when prompted with specific inputs.
📉There is a substantial gap between lower bounds on extractable memorization (what an attacker can recover with crafted prompts alone) and upper bounds measured with full access to the training set (see the sketch below).
🔬Better prompt design and improved testing methods are needed to accurately measure and mitigate training data leakage.
🌐Existing extraction attacks may already cause language models to regurgitate large amounts of training data, but prior work has been unable to verify this.
⚖️The paper highlights the importance of balancing privacy risks against the benefits of training large language models.
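
To make the upper-bound (training-set-access) measurement concrete, here is a minimal sketch, assuming a HuggingFace causal LM and a collection of known training snippets: the model is prompted with a fixed-length prefix of a training example and counted as memorized if greedy decoding reproduces the true continuation verbatim. The model name, prefix/continuation lengths, and the `is_memorized` helper are illustrative assumptions, not details from the paper.

```python
# Sketch of a verbatim-memorization check with training-set access (assumed setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"          # assumption: any causal LM checkpoint
PREFIX_TOKENS = 50           # prompt length taken from the training example
CONTINUATION_TOKENS = 50     # tokens that must match to count as memorized

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def is_memorized(training_text: str) -> bool:
    """Return True if greedy decoding reproduces the training continuation verbatim."""
    ids = tokenizer(training_text, return_tensors="pt").input_ids[0]
    if len(ids) < PREFIX_TOKENS + CONTINUATION_TOKENS:
        return False  # example too short to test
    prefix = ids[:PREFIX_TOKENS].unsqueeze(0)
    target = ids[PREFIX_TOKENS:PREFIX_TOKENS + CONTINUATION_TOKENS]
    with torch.no_grad():
        out = model.generate(
            prefix,
            max_new_tokens=CONTINUATION_TOKENS,
            do_sample=False,  # greedy decoding
        )
    generated = out[0, PREFIX_TOKENS:PREFIX_TOKENS + CONTINUATION_TOKENS]
    return generated.shape == target.shape and torch.equal(generated, target)

# Example usage over a hypothetical list of training snippets:
# rate = sum(is_memorized(t) for t in training_samples) / len(training_samples)
```

Running this over a sample of training documents gives the kind of upper-bound memorization rate the bullets contrast with attack-based (extractable) lower bounds, which must work without the true prefixes.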