Transformers-Based Few-Shot Learning for Scene Classification in Child Sexual Abuse Imagery

Thamiris Coelho; Leo S. F. Ribeiro; João Macedo; Jefersson A. dos Santos; Sandra Avila

doi:10.5753/sibgrapi.est.2024.31638

Thamiris Coelho UNICAMP
Leo S. F. Ribeiro UNICAMP
João Macedo UFMG / Polícia Federal
Jefersson A. dos Santos UFMG / Polícia Federal
Sandra Avila University of Sheffield

Abstract

Sexual abuse affects many children worldwide, with over 36 million reports in the past year. The volume of multimedia content involved exceeds law enforcement's capacity for analysis, which calls for reliable automated classification tools. Deep learning methods are effective, but they require large amounts of data and costly annotations, and access to that data is restricted to law enforcement. This Master's thesis addresses these challenges by using Transformer-based models to classify indoor scenes, the settings in which such content is often found. By relying on few-shot learning, the study reduces the need for extensive annotation: it compares classic few-shot models with Transformer-based models and explores different methods for aggregating feature vectors. The findings show that aggregating vectors by their mean is the most effective method, achieving 73.50 ± 0.09% accuracy with only five annotated samples per class. In an evaluation conducted with the Brazilian Federal Police, the model achieved 63.38 ± 0.09% balanced accuracy on annotated child sexual abuse indoor scenes, indicating the technique's potential to support preliminary screening.
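The two technical ideas in the abstract, mean aggregation of feature vectors in a few-shot setting and balanced accuracy as the evaluation metric, can be illustrated with a small sketch. It assumes that the mean is taken over the support embeddings of each class to form a class prototype, which is then matched against query embeddings by squared Euclidean distance, as in Prototypical Networks. The abstract does not confirm this detail, and the Transformer backbone that would produce the embeddings is omitted. The function names, scene labels, and 2-D vectors are hypothetical and purely illustrative.

```python
from collections import defaultdict

Vector = list[float]


def mean_prototypes(support: list[tuple[Vector, str]]) -> dict[str, Vector]:
    """Aggregate the support embeddings of each class by their element-wise mean.

    Assumption: aggregation is per class over the support set (ProtoNet-style).
    """
    grouped: dict[str, list[Vector]] = defaultdict(list)
    for embedding, label in support:
        grouped[label].append(embedding)
    prototypes: dict[str, Vector] = {}
    for label, vectors in grouped.items():
        dim = len(vectors[0])
        prototypes[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return prototypes


def classify(query: Vector, prototypes: dict[str, Vector]) -> str:
    """Return the class whose prototype is nearest in squared Euclidean distance."""
    return min(
        prototypes,
        key=lambda label: sum((q - p) ** 2 for q, p in zip(query, prototypes[label])),
    )


def balanced_accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Mean of per-class recall, so rare classes weigh as much as frequent ones."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)


if __name__ == "__main__":
    # Toy 2-way, 2-shot episode with hypothetical 2-D embeddings.
    support = [
        ([0.0, 0.0], "bathroom"), ([0.2, 0.0], "bathroom"),
        ([1.0, 1.0], "bedroom"), ([1.0, 0.8], "bedroom"),
    ]
    protos = mean_prototypes(support)
    queries = [[0.1, 0.1], [0.9, 1.0], [0.6, 0.6]]
    labels = ["bathroom", "bedroom", "bathroom"]
    preds = [classify(q, protos) for q in queries]
    print(preds, balanced_accuracy(labels, preds))  # ['bathroom', 'bedroom', 'bedroom'] 0.75
```

In this toy episode, plain accuracy is 2/3, but balanced accuracy is 0.75: "bathroom" has a recall of 1/2 and "bedroom" has a recall of 1/1, and the two recalls are averaged. Averaging per-class recall keeps a frequent class from dominating the score, which matters when, as in real case data, scene categories are unevenly represented.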
