Multimodal person discovery using label propagation over speaking faces graphs

Gabriel Barbosa Fonseca; Zenilton K. G. Patrocínio Jr; Guillaume Gravier; Silvio Jamil F. Guimarães

doi:10.5753/sibgrapi.est.2019.8312

Gabriel Barbosa Fonseca, PUC Minas
Zenilton K. G. Patrocínio Jr, PUC Minas
Guillaume Gravier, CNRS, IRISA
Silvio Jamil F. Guimarães, PUC Minas

Abstract

Indexing large datasets is a task of great importance, since it directly affects the quality of the information that can be retrieved from them. Unfortunately, some datasets are growing so fast that manual indexing becomes infeasible. Automatic indexing techniques can overcome this issue. In this study, an unsupervised technique for multimodal person discovery is proposed, which consists of detecting persons who appear and speak simultaneously in a video and associating names with them. To achieve that, the data is modeled as a graph of speaking faces, and names are extracted via OCR and propagated through the graph based on audiovisual relations between speaking faces. To propagate labels, two graph-based methods are proposed, one based on random walks and the other on a hierarchical approach. To assess the proposed approach, we use two graph clustering baselines and different modality fusion approaches. On the MediaEval MPD 2017 dataset, the proposed label propagation methods outperform all methods from the literature except one, which uses a different approach in the pre-processing step. Even though the Kappa coefficient indicates that the random walk and the hierarchical label propagation produce highly equivalent results, the hierarchical propagation is more than 6 times faster than the random walk under the same configuration.
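
To make the idea of propagating names through a speaking-face graph more concrete, here is a minimal sketch of a generic damped random-walk label propagation in pure Python. It is not the authors' implementation: the abstract does not give the formulation, so the update rule (row-normalized similarities, a damping factor `alpha`, a fixed number of iterations), the function name `propagate_labels`, the toy weights, and the names "alice" and "bob" are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def propagate_labels(
    weights: List[List[float]],
    seeds: Dict[int, str],
    alpha: float = 0.85,
    iterations: int = 50,
) -> List[Optional[str]]:
    """Spread seed names over a weighted graph with a damped random walk."""
    n = len(weights)
    names = sorted(set(seeds.values()))
    col = {name: j for j, name in enumerate(names)}

    # Row-normalize similarities into random-walk transition probabilities.
    trans = []
    for row in weights:
        total = sum(row)
        trans.append([w / total if total > 0 else 0.0 for w in row])

    # y holds the initial evidence: 1.0 where a node is seeded with a name.
    y = [[0.0] * len(names) for _ in range(n)]
    for node, name in seeds.items():
        y[node][col[name]] = 1.0

    f = [row[:] for row in y]
    for _ in range(iterations):
        # Update rule: F <- alpha * P F + (1 - alpha) * Y
        f = [
            [
                alpha * sum(trans[i][k] * f[k][j] for k in range(n))
                + (1 - alpha) * y[i][j]
                for j in range(len(names))
            ]
            for i in range(n)
        ]

    # Each node takes its highest-scoring name; nodes with no score stay unnamed.
    result: List[Optional[str]] = []
    for scores in f:
        best = max(range(len(scores)), key=lambda j: scores[j], default=None)
        named = best is not None and scores[best] > 0
        result.append(names[best] if named else None)
    return result


# Toy graph: nodes 0-2 form one tight group, nodes 3-4 another, node 5 is isolated.
WEIGHTS = [
    [0.0, 0.9, 0.8, 0.1, 0.0, 0.0],
    [0.9, 0.0, 0.7, 0.0, 0.0, 0.0],
    [0.8, 0.7, 0.0, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.9, 0.0],
    [0.0, 0.0, 0.1, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
SEEDS = {0: "alice", 3: "bob"}  # hypothetical names standing in for OCR output
LABELS = propagate_labels(WEIGHTS, SEEDS)
print(LABELS)
```

In this toy setting, each unlabeled node ends up with the name of the seed it is most strongly connected to, while the isolated node receives no name. A real system would derive the edge weights from audiovisual similarity between speaking faces rather than from hand-picked values.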

