Cross-Database in Deepfake Detection Based on a Convolutional Neural Network and Vision Transformer

  • Erikson Eler Ferreira IFES
  • Jefferson Oliveira Andrade IFES
  • Karin Satie Komati IFES


The proliferation of Deepfake techniques has raised concerns due to their potential to generate misleading multimedia content, with ethical, social, and political implications. In response, academia and leading technology companies have collaborated on developing robust detection methods. Convolutional Neural Networks (CNNs) were initially the prominent approach; more recently proposed methods that combine CNN features with Vision Transformers (ViT) have shown improved performance. This research evaluates the generalization capacity of these advanced models by subjecting them to cross-database tests, using datasets different from those used in training. Our analysis reveals that while both models perform well on known datasets, they face challenges related to overfitting when transitioning to new datasets. Consequently, this study underscores the need for further research in Deepfake detection to ensure its adaptability and effectiveness in diverse scenarios.

Keywords: deepfakes, generalization, CNN, ViT, overfitting


FERREIRA, Erikson Eler; ANDRADE, Jefferson Oliveira; KOMATI, Karin Satie. Cross-Database in Deepfake Detection Based on a Convolutional Neural Network and Vision Transformer. In: WORKSHOP DE VISÃO COMPUTACIONAL (WVC), 18., 2023, São Bernardo do Campo/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 60-65. DOI: