Fast Spatial-Temporal Transformer Network

  • Rafael Molossi Escher, FURG
  • Rodrigo Andrade de Bem, FURG
  • Paulo Lilles Jorge Drews, FURG

Abstract

In computer vision, the restoration of missing regions in an image can be tackled with image inpainting techniques. Neural networks that perform inpainting in videos must extract information from neighboring frames to obtain a temporally coherent result. State-of-the-art methods for video inpainting are mainly based on Transformer Networks, which rely on attention mechanisms to handle temporal input data. However, such networks are computationally expensive, requiring considerable computational power for training and testing, which hinders their use on modest computing platforms. In this context, our goal is to reduce the computational complexity of state-of-the-art video inpainting methods, improving performance and facilitating their use on low-end GPUs. Therefore, we introduce the Fast Spatio-Temporal Transformer Network (FastSTTN), an extension of the Spatio-Temporal Transformer Network (STTN) in which the adoption of Reversible Layers reduces memory usage by up to 7 times and execution time by approximately 2.2 times, while maintaining state-of-the-art video inpainting accuracy.
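The abstract attributes the memory savings to Reversible Layers, in the spirit of RevNet and Reformer: because each block's inputs can be recomputed exactly from its outputs, intermediate activations need not be cached for backpropagation. The paper's own implementation is not reproduced here; the following is a minimal PyTorch sketch of that idea, in which ReversibleBlock and the sub-networks f and g (standing in for, e.g., a Transformer layer's attention and feed-forward sub-blocks) are illustrative names, not the authors' code.

import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """RevNet-style reversible block. Inputs are recoverable from
    outputs, so activations can be rebuilt during the backward pass
    instead of being stored, which is the source of the memory
    savings the abstract reports."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f = f  # e.g. attention sub-block (hypothetical)
        self.g = g  # e.g. feed-forward sub-block (hypothetical)

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        # Coupled forward pass: y1 = x1 + F(x2), y2 = x2 + G(y1)
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1: torch.Tensor, y2: torch.Tensor):
        # Exact inversion: recover the inputs from the outputs by
        # undoing the two coupling steps in reverse order.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

# Toy usage: the inverse reconstructs the inputs exactly.
if __name__ == "__main__":
    d = 64
    block = ReversibleBlock(
        nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d)),
        nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d)),
    )
    x1, x2 = torch.randn(8, d), torch.randn(8, d)
    y1, y2 = block(x1, x2)
    r1, r2 = block.inverse(y1, y2)
    print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))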
Keywords: Training, Graphics, Computer vision, Neural networks, Transformers, Image restoration, Data mining, Deep Learning, Video Inpainting, Reformer Networks, Transformer Networks
Published
18/10/2021
How to Cite

ESCHER, Rafael Molossi; BEM, Rodrigo Andrade de; DREWS, Paulo Lilles Jorge. Fast Spatial-Temporal Transformer Network. In: CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 34., 2021, Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021.