Learning Distinct and Representative Styles for Image Captioning

Over the years, state-of-the-art (SoTA) image captioning methods have achieved promising results on some evaluation metrics (e.g., CIDEr). However, recent findings show that the captions generated by these methods tend to be biased toward the "average" caption that only captures the most general mode (a.k.a. language pattern) in the training corpus, i.e., the so-called mode collapse problem. Affected by it, the generated captions are limited in diversity and usually less informative than natural image descriptions made by humans. In this paper, we seek to avoid this problem by proposing a Discrete Mode Learning (DML) paradigm for image captioning. Our innovative idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings", and further use them to control the mode of the generated captions for existing image captioning models. Specifically, the proposed DML optimizes a dual architecture that consists of an image-conditioned discrete variational autoencoder (CdVAE) ...

Comment: NeurIPS 2022
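The abstract above describes a discrete set of learned "mode embeddings" acting as the bottleneck of an image-conditioned discrete variational autoencoder (CdVAE), with the selected embedding steering the style of the generated caption. Since only part of the architecture is quoted in this record, the following is a minimal, hypothetical NumPy sketch of the nearest-neighbour quantization step such a discrete bottleneck typically uses; all sizes, names, and the random stand-in features are assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 8, 16                        # number of modes and embedding dim (assumed)
codebook = rng.normal(size=(K, D))  # stand-in for the learned "mode embeddings"

def quantize(z):
    """Map a continuous encoder output z to its nearest mode embedding,
    as in the bottleneck of a discrete VAE."""
    dists = np.linalg.norm(codebook - z, axis=1)  # distance to each mode
    k = int(np.argmin(dists))                     # discrete mode index
    return k, codebook[k]

# Encoder output for one (image, caption) pair -- random stand-in here.
z = rng.normal(size=D)
mode_id, mode_emb = quantize(z)

# At inference time, the chosen mode embedding would be fed to the caption
# decoder so that different modes yield captions in different styles.
```

Sweeping `mode_id` over all `K` entries is how such a scheme would produce a diverse set of captions for a single image, rather than one "average" caption.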

Bibliographic Details
Main Authors: Chen, Qi; Deng, Chaorui; Wu, Qi
Format: Report
Language: unknown
Published: arXiv 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); FOS: Computer and information sciences
Online Access: https://dx.doi.org/10.48550/arxiv.2209.08231
https://arxiv.org/abs/2209.08231
License: Creative Commons Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/legalcode