ABSTRACT
Several large-scale datasets and models have emerged to detect deepfake content and help combat its harms. The best-performing models usually combine Vision Transformers with CNN-based architectures. However, the recent emergence of so-called Foundation Models (FMs), deep learning models trained on massive amounts of unlabeled data, usually with self-supervised techniques, has opened a new perspective on many tasks previously addressed with purpose-built models. This work investigates how well FMs perform in deepfake detection, particularly in the case of realistic facial synthesis or adulteration. To that end, we investigate a model using DINO, a foundation model based on Vision Transformers (ViT) that produces universal self-supervised features suitable for image-level visual tasks. Our experiments show that this model can improve deepfake facial detection in many scenarios across different baselines. In particular, models trained with self-attention activation maps achieved higher AUC and F1-scores than the baseline models for all CNN architectures we tested.
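The core idea described above, using a ViT's self-attention activation maps as an extra input channel for a CNN classifier, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a single attention head, random toy weights in place of pretrained DINO weights, and a 28×28 patch grid (as a ViT-S/8 would produce on a 224×224 crop). The function and variable names (`cls_attention_map`, `cnn_input`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cls_attention_map(tokens, Wq, Wk, grid=28):
    """Attention of the [CLS] token over the patch tokens, reshaped to a 2-D map."""
    q = tokens @ Wq                                # queries, shape (N+1, d)
    k = tokens @ Wk                                # keys,    shape (N+1, d)
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # full self-attention, (N+1, N+1)
    cls_to_patches = attn[0, 1:]                   # how much CLS attends to each patch
    return cls_to_patches.reshape(grid, grid)

rng = np.random.default_rng(0)
dim, grid = 64, 28
tokens = rng.standard_normal((grid * grid + 1, dim))   # [CLS] + 784 patch tokens
Wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)    # toy projection weights
Wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)

amap = cls_attention_map(tokens, Wq, Wk, grid)

# Stack the attention map with the (resized) RGB face crop as a 4-channel CNN input
rgb = rng.random((3, grid, grid))
cnn_input = np.concatenate([rgb, amap[None]], axis=0)
print(cnn_input.shape)  # (4, 28, 28)
```

In practice one would extract the last-layer attention from a pretrained DINO ViT rather than random weights, and the resulting 4-channel tensor would be fed to any of the CNN baselines (e.g. Xception, EfficientNet) with an adjusted first convolution.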
Index Terms
- Realistic Facial Deep Fakes Detection Through Self-Supervised Features Generated by a Self-Distilled Vision Transformer