Learning Distinct and Representative Styles for Image Captioning

Over the years, state-of-the-art (SoTA) image captioning methods have achieved promising results on some evaluation metrics (e.g., CIDEr). However, recent findings show that the captions generated by these methods tend to be biased toward the "average" caption that only captures the most general mode (a.k.a. language pattern) in the training corpus, i.e., the so-called mode collapse problem. Affected by it, the generated captions are limited in diversity and usually less informative than natural image descriptions made by humans. In this paper, we seek to avoid this problem by proposing a Discrete Mode Learning (DML) paradigm for image captioning. Our innovative idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings", and further use them to control the mode of the generated captions for existing image captioning models. Specifically, the proposed DML optimizes a dual architecture that consists of an image-conditioned discrete variational autoencoder (CdVAE) ...

Comment: NeurIPS 2022
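The abstract above describes a discrete set of learned "mode embeddings" acting as the bottleneck of an image-conditioned discrete variational autoencoder (CdVAE), with the selected embedding steering the style of the generated caption. Since only part of the architecture is quoted in this record, the following is a minimal, hypothetical NumPy sketch of the nearest-neighbour quantization step such a discrete bottleneck typically uses; all sizes, names, and the random stand-in features are assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 8, 16                        # number of modes and embedding dim (assumed)
codebook = rng.normal(size=(K, D))  # stand-in for the learned "mode embeddings"

def quantize(z):
    """Map a continuous encoder output z to its nearest mode embedding,
    as in the bottleneck of a discrete VAE."""
    dists = np.linalg.norm(codebook - z, axis=1)  # distance to each mode
    k = int(np.argmin(dists))                     # discrete mode index
    return k, codebook[k]

# Encoder output for one (image, caption) pair -- random stand-in here.
z = rng.normal(size=D)
mode_id, mode_emb = quantize(z)

# At inference time, the chosen mode embedding would be fed to the caption
# decoder so that different modes yield captions in different styles.
```

Sweeping `mode_id` over all `K` entries is how such a scheme would produce a diverse set of captions for a single image, rather than one "average" caption.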

Bibliographic Details
Main Authors: Chen, Qi; Deng, Chaorui; Wu, Qi
Format: Report
Language: unknown
Published: arXiv 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); FOS: Computer and information sciences
Online Access: https://dx.doi.org/10.48550/arxiv.2209.08231
https://arxiv.org/abs/2209.08231
License: Creative Commons Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/legalcode