Realistic Facial Deep Fakes Detection Through Self-Supervised Features Generated by a Self-Distilled Vision Transformer

  • Bruno Rocha Gomes (PUC-Rio)
  • Antonio J. G. Busson (BTG Pactual)
  • José Boaro (PUC-Rio)
  • Sérgio Colcher (PUC-Rio)


Several large-scale datasets and models have emerged to detect deep fake content and help combat its harms. The best-performing models usually combine Vision Transformers with CNN-based architectures. However, the recent emergence of so-called Foundation Models (FMs), deep learning models trained on massive amounts of unlabeled data (usually with self-supervised techniques), has opened a new perspective on many tasks previously addressed with task-specific models. This work investigates how well FMs perform in deep fake detection, especially for realistic facial generation or manipulation. In this context, we investigate a model using DINO, a foundation model based on Vision Transformers (ViT) that produces universal self-supervised features suitable for image-level visual tasks. Our experiments show that this model can improve facial deep fake detection in many scenarios across different baselines. In particular, models trained with self-attention activation maps achieved higher AUC and F1-score than the baseline models for all CNN architectures we used.
Keywords: deep fake detection, self-supervised, vision transformers, deep learning


