Deep Multimodal Learning: An Effective Method for Video Classification

Videos have become ubiquitous on the Internet, and video analysis can provide a wealth of information for detecting and recognizing objects and can help people understand human actions and interactions with the real world. However, with data at the terabyte scale, effective methods are required....

Bibliographic Details
Main Author:	Zhao, Tianqi
Format:	Report
Language:	unknown
Published:	arXiv 2018
Subjects:	Computer Vision and Pattern Recognition cs.CV Multimedia cs.MM FOS Computer and information sciences DML
Online Access:	https://dx.doi.org/10.48550/arxiv.1811.12563 https://arxiv.org/abs/1811.12563

id	ftdatacite:10.48550/arxiv.1811.12563
record_format	openpolar
spelling	ftdatacite:10.48550/arxiv.1811.12563 2023-05-15T16:01:38+02:00 Deep Multimodal Learning: An Effective Method for Video Classification Zhao, Tianqi 2018 https://dx.doi.org/10.48550/arxiv.1811.12563 https://arxiv.org/abs/1811.12563 unknown arXiv arXiv.org perpetual, non-exclusive license http://arxiv.org/licenses/nonexclusive-distrib/1.0/ Computer Vision and Pattern Recognition cs.CV Multimedia cs.MM FOS Computer and information sciences Preprint Article article CreativeWork 2018 ftdatacite https://doi.org/10.48550/arxiv.1811.12563 2022-04-01T08:54:36Z Videos have become ubiquitous on the Internet, and video analysis can provide a wealth of information for detecting and recognizing objects and can help people understand human actions and interactions with the real world. However, with data at the terabyte scale, effective methods are required. Recurrent neural network (RNN) architectures have been widely used for many sequential learning problems such as language modeling and time-series analysis. In this paper, we propose several RNN variants, such as stacked bidirectional LSTM/GRU networks with an attention mechanism, to categorize large-scale video data. We also explore different multimodal fusion methods. Our model combines visual and audio information at both the video and frame levels and achieved strong results. Ensemble methods are also applied. Because of its multimodal characteristics, we call this method Deep Multimodal Learning (DML). Our DML-based model was trained on Google Cloud and our own server and was tested in a well-known video classification competition on Kaggle held by Google. Report DML DataCite Metadata Store (German National Library of Science and Technology)
institution	Open Polar
collection	DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id	ftdatacite
language	unknown
topic	Computer Vision and Pattern Recognition cs.CV Multimedia cs.MM FOS Computer and information sciences
spellingShingle	Computer Vision and Pattern Recognition cs.CV Multimedia cs.MM FOS Computer and information sciences Zhao, Tianqi Deep Multimodal Learning: An Effective Method for Video Classification
topic_facet	Computer Vision and Pattern Recognition cs.CV Multimedia cs.MM FOS Computer and information sciences
description	Videos have become ubiquitous on the Internet, and video analysis can provide a wealth of information for detecting and recognizing objects and can help people understand human actions and interactions with the real world. However, with data at the terabyte scale, effective methods are required. Recurrent neural network (RNN) architectures have been widely used for many sequential learning problems such as language modeling and time-series analysis. In this paper, we propose several RNN variants, such as stacked bidirectional LSTM/GRU networks with an attention mechanism, to categorize large-scale video data. We also explore different multimodal fusion methods. Our model combines visual and audio information at both the video and frame levels and achieved strong results. Ensemble methods are also applied. Because of its multimodal characteristics, we call this method Deep Multimodal Learning (DML). Our DML-based model was trained on Google Cloud and our own server and was tested in a well-known video classification competition on Kaggle held by Google.
format	Report
author	Zhao, Tianqi
author_facet	Zhao, Tianqi
author_sort	Zhao, Tianqi
title	Deep Multimodal Learning: An Effective Method for Video Classification
title_short	Deep Multimodal Learning: An Effective Method for Video Classification
title_full	Deep Multimodal Learning: An Effective Method for Video Classification
title_fullStr	Deep Multimodal Learning: An Effective Method for Video Classification
title_full_unstemmed	Deep Multimodal Learning: An Effective Method for Video Classification
title_sort	deep multimodal learning: an effective method for video classification
publisher	arXiv
publishDate	2018
url	https://dx.doi.org/10.48550/arxiv.1811.12563 https://arxiv.org/abs/1811.12563
genre	DML
genre_facet	DML
op_rights	arXiv.org perpetual, non-exclusive license http://arxiv.org/licenses/nonexclusive-distrib/1.0/
op_doi	https://doi.org/10.48550/arxiv.1811.12563
_version_	1766397403984822272

Similar Items