Deep Multimodal Learning: An Effective Method for Video Classification
Videos have become ubiquitous on the Internet, and video analysis can provide a great deal of information for detecting and recognizing objects, as well as help people understand human actions and interactions with the real world. However, with data at the terabyte scale, effective methods are needed....
Main Author: | Zhao, Tianqi |
---|---|
Format: | Report |
Language: | unknown |
Published: | arXiv, 2018 |
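Subjects: | Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Computer and information sciences (FOS) |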
Online Access: | https://dx.doi.org/10.48550/arxiv.1811.12563 https://arxiv.org/abs/1811.12563 |
id |
ftdatacite:10.48550/arxiv.1811.12563 |
---|---|
record_format |
openpolar |
spelling |
ftdatacite:10.48550/arxiv.1811.12563 2023-05-15T16:01:38+02:00 Deep Multimodal Learning: An Effective Method for Video Classification Zhao, Tianqi 2018 https://dx.doi.org/10.48550/arxiv.1811.12563 https://arxiv.org/abs/1811.12563 unknown arXiv arXiv.org perpetual, non-exclusive license http://arxiv.org/licenses/nonexclusive-distrib/1.0/ Computer Vision and Pattern Recognition cs.CV Multimedia cs.MM FOS Computer and information sciences Preprint Article article CreativeWork 2018 ftdatacite https://doi.org/10.48550/arxiv.1811.12563 2022-04-01T08:54:36Z Videos have become ubiquitous on the Internet, and video analysis can provide a great deal of information for detecting and recognizing objects, as well as help people understand human actions and interactions with the real world. However, with data at the terabyte scale, effective methods are needed. Recurrent neural network (RNN) architectures have been widely used on many sequential learning problems, such as language modeling and time-series analysis. In this paper, we propose several RNN variants, such as stacked bidirectional LSTM/GRU networks with an attention mechanism, to categorize large-scale video data. We also explore different multimodal fusion methods. Our model combines visual and audio information at both the video level and the frame level and achieves strong results. Ensemble methods are also applied. Because of its multimodal character, we call this method Deep Multimodal Learning (DML). Our DML-based model was trained on Google Cloud and on our own server, and was tested in a well-known video classification competition hosted by Google on Kaggle. Report DML DataCite Metadata Store (German National Library of Science and Technology) |
institution |
Open Polar |
collection |
DataCite Metadata Store (German National Library of Science and Technology) |
op_collection_id |
ftdatacite |
language |
unknown |
topic |
Computer Vision and Pattern Recognition cs.CV Multimedia cs.MM FOS Computer and information sciences |
topic_facet |
Computer Vision and Pattern Recognition cs.CV Multimedia cs.MM FOS Computer and information sciences |
description |
Videos have become ubiquitous on the Internet, and video analysis can provide a great deal of information for detecting and recognizing objects, as well as help people understand human actions and interactions with the real world. However, with data at the terabyte scale, effective methods are needed. Recurrent neural network (RNN) architectures have been widely used on many sequential learning problems, such as language modeling and time-series analysis. In this paper, we propose several RNN variants, such as stacked bidirectional LSTM/GRU networks with an attention mechanism, to categorize large-scale video data. We also explore different multimodal fusion methods. Our model combines visual and audio information at both the video level and the frame level and achieves strong results. Ensemble methods are also applied. Because of its multimodal character, we call this method Deep Multimodal Learning (DML). Our DML-based model was trained on Google Cloud and on our own server, and was tested in a well-known video classification competition hosted by Google on Kaggle. |
format |
Report |
author |
Zhao, Tianqi |
author_facet |
Zhao, Tianqi |
author_sort |
Zhao, Tianqi |
title |
Deep Multimodal Learning: An Effective Method for Video Classification |
title_short |
Deep Multimodal Learning: An Effective Method for Video Classification |
title_full |
Deep Multimodal Learning: An Effective Method for Video Classification |
title_fullStr |
Deep Multimodal Learning: An Effective Method for Video Classification |
title_full_unstemmed |
Deep Multimodal Learning: An Effective Method for Video Classification |
title_sort |
deep multimodal learning: an effective method for video classification |
publisher |
arXiv |
publishDate |
2018 |
url |
https://dx.doi.org/10.48550/arxiv.1811.12563 https://arxiv.org/abs/1811.12563 |
genre |
DML |
genre_facet |
DML |
op_rights |
arXiv.org perpetual, non-exclusive license http://arxiv.org/licenses/nonexclusive-distrib/1.0/ |
op_doi |
https://doi.org/10.48550/arxiv.1811.12563 |
_version_ |
1766397403984822272 |
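
The description in this record mentions an attention mechanism over frames and the fusion of visual and audio information. The record gives no implementation details, so the following is only a minimal NumPy sketch under stated assumptions: the stacked bidirectional LSTM/GRU encoder is omitted, fusion is done by simple per-frame concatenation, the output is treated as multi-label with a sigmoid, and all names (`attention_pool`, `fuse_and_classify`, `init_params`) and dimensions are hypothetical rather than taken from the paper.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attention_pool(frames, w_att, v_att):
    """Collapse frame-level features (T, D) into one video-level vector (D,).

    Each frame gets a scalar score; the softmax of the scores gives the
    attention weights used for a weighted average of the frames.
    """
    scores = np.tanh(frames @ w_att) @ v_att  # shape (T,)
    weights = softmax(scores)                 # non-negative, sums to 1
    return weights @ frames, weights


def init_params(dim, hidden, n_classes, seed=0):
    """Random parameters for illustration only (assumption, not trained)."""
    rng = np.random.default_rng(seed)
    return {
        "w_att": rng.normal(scale=0.1, size=(dim, hidden)),
        "v_att": rng.normal(scale=0.1, size=hidden),
        "w_out": rng.normal(scale=0.1, size=(dim, n_classes)),
        "b_out": np.zeros(n_classes),
    }


def fuse_and_classify(visual, audio, params):
    """Concatenate per-frame visual and audio features, pool, and classify.

    visual: (T, Dv), audio: (T, Da). Returns per-class probabilities and
    the per-frame attention weights.
    """
    frames = np.concatenate([visual, audio], axis=1)  # (T, Dv + Da)
    pooled, weights = attention_pool(frames, params["w_att"], params["v_att"])
    logits = pooled @ params["w_out"] + params["b_out"]
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid: multi-label assumption
    return probs, weights


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    visual = rng.normal(size=(5, 8))  # 5 frames, 8 visual features (toy sizes)
    audio = rng.normal(size=(5, 4))   # 5 frames, 4 audio features (toy sizes)
    params = init_params(dim=12, hidden=6, n_classes=3)
    probs, weights = fuse_and_classify(visual, audio, params)
    print(probs.shape, weights.shape)
```

In a fuller model, the concatenated frame features would first pass through a recurrent encoder, and the attention would be applied to the encoder's hidden states rather than to the raw features.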