Seeing and hearing egocentric actions: How much can we learn?
Authors: Cartas, Alejandro; Luque, Jordi; Radeva, Petia; Segura, Carlos; Dimiccoli, Mariella
Citation: International Conference on Computer Vision Workshop (2019)
Abstract: Our interaction with the world is an inherently multi-modal experience. However, the understanding of human-to-object interactions has historically been addressed focusing on a single modality. In particular, only a limited number of works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
Description: Paper presented at the International Conference on Computer Vision Workshop (ICCVW), held in Seoul (South Korea) on 27 and 28 October 2019.
Appears in Collections: (IRII) Comunicaciones congresos
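
Only the abstract is available in this record, so the sketch below is an illustrative reading of its description, not the authors' implementation. It assumes TSN-style sparse temporal sampling (one snippet drawn from each of a few equal-length video segments) and late fusion by averaging class scores from three per-modality backbones (audio, spatial, temporal). PyTorch, the backbone interfaces, and the names `sparse_temporal_sample` and `LateFusionActionModel` are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


def sparse_temporal_sample(num_frames: int, num_segments: int) -> torch.Tensor:
    """Pick one frame index from each of `num_segments` equal chunks of the video
    (a common reading of "sparse temporal sampling"; not necessarily the paper's
    exact scheme)."""
    edges = torch.linspace(0, num_frames, num_segments + 1).long()
    return torch.stack([
        torch.randint(int(lo), max(int(hi), int(lo) + 1), (1,)).squeeze()
        for lo, hi in zip(edges[:-1], edges[1:])
    ])


class LateFusionActionModel(nn.Module):
    """Hypothetical late-fusion model: each modality has its own backbone and
    linear classifier; snippet scores are averaged over segments within a stream,
    then the per-stream class scores are averaged across streams (late fusion)."""

    def __init__(self, audio_net: nn.Module, spatial_net: nn.Module,
                 temporal_net: nn.Module, num_classes: int, feat_dim: int = 512):
        super().__init__()
        self.streams = nn.ModuleDict({
            "audio": audio_net, "spatial": spatial_net, "temporal": temporal_net})
        self.heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, num_classes) for name in self.streams})

    def forward(self, inputs: dict) -> torch.Tensor:
        # inputs[name]: (batch, segments, ...); each backbone is assumed to map a
        # flattened (batch * segments, ...) input to (batch * segments, feat_dim).
        stream_scores = []
        for name, net in self.streams.items():
            x = inputs[name]
            b, s = x.shape[:2]
            feats = net(x.flatten(0, 1))               # (b * s, feat_dim)
            scores = self.heads[name](feats)           # (b * s, num_classes)
            stream_scores.append(scores.view(b, s, -1).mean(dim=1))
        return torch.stack(stream_scores).mean(dim=0)  # late fusion by averaging
```

Averaging per-stream class scores is the simplest late-fusion choice; the actual model may weight or combine the streams differently.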