Zero and Few-Shot Learning with Modern MLLMs to Filter Empty Images in Camera Trap Data
Abstract
Camera traps are typically equipped with motion or heat sensors and capture images of wild animals with little human interference. An event is recorded when the sensor is triggered, which often results in a huge volume of images, many of them empty. This study addresses the challenge of filtering out empty images, a crucial step for efficient data storage, transmission, and automatic classification. We investigate the use of multimodal large language models (MLLMs) in zero-shot and few-shot approaches for filtering empty images. We analyze whether the visual and textual data integration performed by MLLMs enhances their ability to detect animal presence. Three MLLMs are investigated: CLIP, BLIP, and Gemini. They are also compared to a model specially designed to filter out empty images from camera trap data: a ResNet50-Siamese. In our experiments, we compare the learning approaches across three datasets: Snapshot Serengeti, Caltech, and WCS. Our results indicate that few-shot learning significantly improves the performance of MLLMs, especially BLIP. However, these models face challenges such as high computational demands and sensitivity to environmental variations.
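To illustrate the zero-shot setup the abstract describes, the sketch below shows how empty-image filtering can be posed as image-text matching with CLIP. This is a minimal illustration, not the authors' implementation: the model checkpoint, prompts, and decision threshold are all assumptions.

```python
# Zero-shot empty-image filtering with CLIP (illustrative sketch only;
# prompts and threshold are assumptions, not the paper's exact setup).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical prompts contrasting "animal present" vs. "empty scene".
prompts = [
    "a camera trap photo of a wild animal",
    "a camera trap photo of an empty scene with no animals",
]

def is_empty(image_path: str, threshold: float = 0.5) -> bool:
    """Return True if CLIP assigns more probability to the 'empty' prompt."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, 2)
    probs = logits.softmax(dim=-1)[0]
    return probs[1].item() >= threshold  # index 1 = "empty" prompt

# Example usage: keep only images likely to contain an animal.
# kept = [p for p in image_paths if not is_empty(p)]
```

A few-shot variant would additionally condition the decision on a handful of labeled examples per site, which is where the paper reports the largest gains.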
Keywords:
Training, Visualization, Sensitivity, Filtering, Animals, Computational modeling, Optical wavelength conversion, Cameras, Sensors, Few-shot learning
Published
30/09/2024
How to Cite
ALENCAR, Luiz; CUNHA, Fagner; SANTOS, Eulanda M. dos. Zero and Few-Shot Learning with Modern MLLMs to Filter Empty Images in Camera Trap Data. In: CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 37., 2024, Manaus/AM. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024.