Please use this identifier to cite or link to this item: http://hdl.handle.net/10261/207472


Seeing and hearing egocentric actions: How much can we learn?

Authors: Cartas, Alejandro; Luque, Jordi; Radeva, Petia; Segura, Carlos; Dimiccoli, Mariella
Issue Date: 27-Oct-2019
Citation: International Conference on Computer Vision Workshop (2019)
Abstract: Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed by focusing on a single modality. In particular, only a limited number of works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
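The two ideas named in the abstract, sparse temporal sampling and late fusion of per-stream predictions, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the equal stream weights, the one-frame-per-segment sampling rule, and the example scores are all assumptions made for illustration.

```python
import math
import random


def sparse_segment_indices(num_frames, num_segments, rng=None):
    """Pick one random frame index from each of num_segments equal-length segments.

    Hypothetical helper: the segment count and sampling rule are assumptions,
    not values taken from the paper.
    """
    if num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random(0)
    # Segment boundaries: segment i covers [bounds[i], bounds[i + 1]).
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(stream_logits, weights=None):
    """Fuse streams by a weighted average of their class probabilities.

    stream_logits maps a stream name to its list of class scores.
    Equal weights are an assumption; any non-negative weights can be passed.
    """
    names = sorted(stream_logits)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    num_classes = len(stream_logits[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(stream_logits[name])
        for cls, p in enumerate(probs):
            fused[cls] += weights[name] / total_weight * p
    return fused


# Made-up scores for three verb classes from three streams.
example_logits = {
    "audio": [2.0, 0.5, 0.1],
    "spatial": [1.5, 1.0, 0.2],
    "temporal": [0.3, 2.2, 0.1],
}
frame_indices = sparse_segment_indices(num_frames=90, num_segments=3)
fused_probs = late_fusion(example_logits)
predicted_class = max(range(len(fused_probs)), key=fused_probs.__getitem__)
```

In this sketch each stream is scored independently and only the final class probabilities are combined, which is what distinguishes late fusion from fusing features earlier in the network.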
Description: Paper presented at the International Conference on Computer Vision Workshop (ICCVW), held in Seoul (South Korea) on 27 and 28 October 2019.
Appears in Collections: (IRII) Comunicaciones congresos
Files in This Item:
File: 789964.pdf; Size: 5.08 MB; Format: Unknown

WARNING: Items in Digital.CSIC are protected by copyright, with all rights reserved, unless otherwise indicated.