Contamination Budget: Trade-offs Between Breadth, Depth and Difficulty
Behzad Mehrbakhsh, Fernando Martínez-Plumed, José Hernández-Orallo
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 8195-8203.
https://doi.org/10.24963/ijcai.2025/911
Contamination in large language models (LLMs), and machine learning more broadly, refers to the inclusion of identical, or very similar, examples in both training and test sets. This phenomenon usually translates into better test performance. Here we explore the case where this contamination is introduced intentionally, for purposes that can be malicious (e.g., to obtain better scores in evaluations) or benevolent (e.g., to fix some mistakes). These interventions, usually performed by fine-tuning the model to memorise examples, come with a budget: the size of the fine-tuning dataset. Several trade-offs appear between the breadth of the intervention (how many examples are memorised), its depth (how many repetitions of each example) and the difficulty of the examples. By studying several LLMs and datasets, we observe some monotonic behaviour (more difficult items require more depth to be 'fixed') but also some non-monotonic phenomena (very high depth levels have negative effects on non-contaminated examples). This suggests that trade-offs should be found not only in terms of the budget but also according to the specifics of the model, the task and the difficulty of the items at hand.
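To make the breadth/depth/difficulty trade-off concrete, the following is a minimal sketch of how a contamination budget might be allocated; it is not the authors' actual pipeline, and the function name build_contamination_set and the difficulty-to-depth heuristic are illustrative assumptions. It assumes the budget is counted as the total number of fine-tuning examples, i.e., the sum over memorised items of their repetitions.

```python
from typing import List, Tuple


def build_contamination_set(
    items: List[Tuple[str, float]],   # (benchmark item, estimated difficulty in [0, 1])
    budget: int,                      # total fine-tuning examples allowed (breadth x depth)
    base_depth: int = 1,              # minimum repetitions per memorised item
    depth_per_difficulty: int = 4,    # extra repetitions granted per unit of difficulty
) -> List[str]:
    """Greedily spend a fine-tuning budget on benchmark items.

    Harder items receive more repetitions (depth); breadth is however many
    items still fit within the remaining budget.
    """
    # Contaminate easier items first: they need less depth, so breadth is maximised.
    ranked = sorted(items, key=lambda pair: pair[1])

    dataset: List[str] = []
    for item, difficulty in ranked:
        depth = base_depth + round(depth_per_difficulty * difficulty)
        if len(dataset) + depth > budget:
            break                       # budget exhausted; remaining items stay uncontaminated
        dataset.extend([item] * depth)  # depth = number of repetitions of this item
    return dataset


if __name__ == "__main__":
    bench = [("Q1: easy item", 0.1), ("Q2: medium item", 0.5), ("Q3: hard item", 0.95)]
    ft_set = build_contamination_set(bench, budget=10)
    print(len(ft_set), "fine-tuning examples:", ft_set)
```

Under this sketch, raising depth_per_difficulty shifts the budget from breadth to depth, which mirrors the paper's observation that harder items need more repetitions but that excessive depth can harm performance on non-contaminated items.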
Keywords:
Natural Language Processing: NLP: Resources and evaluation
Machine Learning: ML: Benchmarks
Machine Learning: ML: Evaluation
