Deepfake Detection with GenConViT and DeepfakeBenchmark: A Comparative Study on the Deepspeak Dataset
Abstract
Deepfake videos have evolved rapidly, exploiting advances in Generative Adversarial Networks (GANs) and Transformers to produce persuasive face-swap and lip-sync forgeries. This paper revisits automatic detection through the Generative Convolutional Vision Transformer (GenConViT). After just five epochs of fine-tuning on the recent DeepSpeak challenge (44 h of video spanning 220 identities), GenConViT-AE achieves an Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) of 0.993 [0.990, 0.996] and an accuracy of 93.82% [92.53, 95.04], outperforming the best baseline by approximately one percentage point in AUC. The detector also identifies 98.8% of forged videos while maintaining an 88.9% true-negative rate, confirming balanced sensitivity and specificity. Ablation studies reveal diminishing returns beyond five epochs, and error analysis highlights the persistent challenges posed by falsification techniques. These results position domain-tuned hybrid CNN-Transformer architectures as robust solutions for large-scale forensic pipelines.
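The metrics reported above (AUC, accuracy, true-positive rate, and true-negative rate, with bracketed intervals) can be illustrated with a minimal sketch. The abstract does not state how the intervals were obtained; the percentile bootstrap used here is an assumption, as are the function names, the 0.5 decision threshold, and the toy labels and scores, none of which come from the paper.

```python
import random


def confusion_rates(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy); label 1 = forged, 0 = real."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return tp / pos, tn / neg, (tp + tn) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the fraction of (forged, real) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, scores, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (an assumed method, not the paper's)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if 0 < sum(yt) < n:  # skip resamples that contain only one class
            stats.append(metric(yt, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int(round(alpha / 2 * n_boot))]
    hi = stats[int(round((1 - alpha / 2) * n_boot)) - 1]
    return lo, hi


# Hypothetical toy data: 4 forged and 4 real videos with detector scores.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
preds = [int(s >= 0.5) for s in scores]  # assumed threshold of 0.5

print(confusion_rates(labels, preds))
print(roc_auc(labels, scores))
print(bootstrap_ci(labels, scores, roc_auc))
```

Under these definitions, accuracy is the prevalence-weighted mean of sensitivity and specificity, which is why a detector can combine a high forged-video detection rate with a noticeably lower true-negative rate and still report an accuracy between the two.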
Keywords:
Deepfakes, Computer vision, Accuracy, Forensics, Pipelines, Autoencoders, Detectors, Benchmark testing, Transformers, Forgery
Published
30 September 2025
How to Cite
BATISTA, Matheus Martins; DRUMMOND, Isabela Neves; BATISTA, Bruno Guazzelli. Deepfake Detection with GenConViT and DeepfakeBenchmark: A Comparative Study on the Deepspeak Dataset. In: CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 38., 2025, Salvador/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 289-294.
