Enhancing Distilled Datasets Via Natural Data Mixing

  • Ian Pons (USP)
  • Guilherme B. Stern (USP)
  • Anna H. Reali Costa (USP)
  • Artur Jordao (USP)

Abstract


Dataset distillation has emerged as a promising technique for reducing web-scale datasets to a compact version with only a few samples per class. It distills a large dataset into a small synthetic set that aims to preserve representative information from the original data, offering advantages such as higher training efficiency and data privacy. However, existing techniques fail to fully capture the underlying properties of the original (natural) training samples. Hence, learning solely on distilled images, the standard practice, leaves models with a notable generalization gap. In this work, we propose a simple yet effective mechanism to enhance distilled images. Our method transfers powerful and discriminative characteristics from natural images to distilled samples through a simple mixing process. Extensive experiments on standard benchmarks confirm that our method consistently improves generalization accuracy. Notably, we demonstrate that our approach enables distilled sets with only 10 images per class to match or exceed the performance of state-of-the-art methods trained on 50 images per class, representing a 5× gain in training efficiency. On challenging ImageNet subsets, it increases predictive performance by up to 11.5 percentage points. We also confirm that, compared to plain state-of-the-art dataset distillation methods, our method more faithfully preserves the internal representations learned from full-dataset training. Crucially, our method achieves these improvements without increasing the size of the distilled set, thus preserving the efficiency and privacy advantages inherent to dataset distillation. Moreover, our method enhances robustness to common corruptions, improving predictive performance by an average of 9.05 percentage points, and it also improves accuracy under moderate adversarial attacks. Code is available at: github.com/IanPons/Enhancing-Distilled-Datasets
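
The abstract describes the method only at a high level: each distilled image is enhanced by mixing in content from natural images, without growing the distilled set. The exact mixing rule is not given here, so the sketch below is a minimal, hypothetical illustration assuming a mixup-style convex combination between each distilled image and a same-class natural image. The function name mix_with_natural, the alpha parameter, and the Beta-sampled coefficient are all assumptions on our part; the authors' actual procedure is in the linked repository.

    import torch

    def mix_with_natural(distilled, natural, alpha=0.8):
        """Hypothetical sketch: blend each distilled image with a natural
        image of the same class via a mixup-style convex combination.

        distilled, natural: tensors of shape (N, C, H, W) whose labels
            are aligned index-by-index.
        alpha: Beta-distribution parameter controlling the mix (an
            assumed knob, not taken from the paper).
        """
        # One mixing coefficient per image in the batch.
        lam = torch.distributions.Beta(alpha, alpha).sample((distilled.size(0),))
        lam = lam.view(-1, 1, 1, 1)  # broadcast over channels and pixels
        # Convex combination: each synthetic image absorbs natural-image
        # statistics while the number of distilled samples stays fixed.
        return lam * distilled + (1.0 - lam) * natural

    # Example: ten 32x32 RGB distilled images mixed with matched natural ones.
    distilled = torch.rand(10, 3, 32, 32)
    natural = torch.rand(10, 3, 32, 32)
    mixed = mix_with_natural(distilled, natural)
    assert mixed.shape == distilled.shape  # distilled-set size preserved

Because the output has the same shape as the input, blending of this kind keeps the distilled-set size unchanged, which is consistent with the paper's claim that the efficiency and privacy advantages of dataset distillation are preserved.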

Keywords: Training, Graphics, Data privacy, Accuracy, Codes, Self-supervised learning, Benchmark testing, Robustness, Standards
Published
30/09/2025
PONS, Ian; STERN, Guilherme B.; COSTA, Anna H. Reali; JORDAO, Artur. Enhancing Distilled Datasets Via Natural Data Mixing. In: CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 38., 2025, Salvador/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 140-145.