Deep Multimodal Learning: An Effective Method for Video Classification

Videos have become ubiquitous on the Internet, and video analysis can provide a great deal of information for detecting and recognizing objects, as well as help people understand human actions and interactions with the real world. However, with data at the terabyte scale, effective methods must be applied....


Bibliographic Details
Main Author: Zhao, Tianqi
Format: Report
Language: unknown
Published: arXiv 2018
Subjects:
DML
Online Access: https://dx.doi.org/10.48550/arxiv.1811.12563
https://arxiv.org/abs/1811.12563
id ftdatacite:10.48550/arxiv.1811.12563
record_format openpolar
institution Open Polar
collection DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id ftdatacite
language unknown
topic Computer Vision and Pattern Recognition cs.CV
Multimedia cs.MM
FOS Computer and information sciences
description Videos have become ubiquitous on the Internet, and video analysis can provide a great deal of information for detecting and recognizing objects, as well as help people understand human actions and interactions with the real world. However, with data at the terabyte scale, effective methods must be applied. Recurrent neural network (RNN) architectures have been widely used for sequential learning problems such as language modeling and time-series analysis. In this paper, we propose several RNN variants, such as a stacked bidirectional LSTM/GRU network with an attention mechanism, to categorize large-scale video data. We also explore different multimodal fusion methods. Our model combines visual and audio information at both the video level and the frame level and achieved strong results. Ensemble methods are also applied. Because of its multimodal characteristics, we call this method Deep Multimodal Learning (DML). Our DML-based model was trained on Google Cloud and on our own server, and was tested in a well-known video classification competition held by Google on Kaggle.
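The description mentions two ideas that a small example can make concrete: fusing visual and audio features, and using attention to summarize a sequence of frames. The following is a minimal sketch under stated assumptions, not the author's implementation. It omits the recurrent (LSTM/GRU) layers entirely, fuses modalities by simple per-frame concatenation, and uses a fixed, hypothetical `query` vector where a trained model would learn one. All function names and the toy feature values are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_frames(visual, audio):
    """Early fusion (assumed here): concatenate visual and audio features per frame."""
    return [v + a for v, a in zip(visual, audio)]


def attention_pool(frames, query):
    """Summarize frames as a weighted average.

    Each frame is scored by its dot product with `query`; the scores are
    normalized with softmax and used as averaging weights.
    """
    scores = [sum(x * q for x, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


# Toy frame-level features: 3 frames, 2 visual dims and 1 audio dim each.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5], [0.2], [0.9]]

fused = fuse_frames(visual, audio)   # 3 frames, 3 dims each
query = [1.0, 1.0, 1.0]              # hypothetical; learned in a real model
video_vector, weights = attention_pool(fused, query)
```

In this toy input the third frame has the largest dot product with `query`, so it receives the largest attention weight and dominates `video_vector`. In a full model, the pooled vector would typically feed a classifier, and the frames passed to the pooling step would be RNN outputs rather than raw features.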
format Report
author Zhao, Tianqi
title Deep Multimodal Learning: An Effective Method for Video Classification
publisher arXiv
publishDate 2018
url https://dx.doi.org/10.48550/arxiv.1811.12563
https://arxiv.org/abs/1811.12563
genre DML
genre_facet DML
op_rights arXiv.org perpetual, non-exclusive license
http://arxiv.org/licenses/nonexclusive-distrib/1.0/
op_doi https://doi.org/10.48550/arxiv.1811.12563
_version_ 1766397403984822272