InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images

Common Deep Metric Learning (DML) datasets specify only one notion of similarity, e.g., two images in the Cars196 dataset are deemed similar if they show the same car model. We argue that depending on the application, users of image retrieval systems have different and changing similarity notions th...

Bibliographic Details
Main Authors:	Kobs, Konstantin; Steininger, Michael; Hotho, Andreas
Format:	Text
Language:	unknown
Published:	2022
Subjects:	Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence; Computer Science - Information Retrieval; DML
Online Access:	http://arxiv.org/abs/2211.12760

id	ftarxivpreprints:oai:arXiv.org:2211.12760
record_format	openpolar
spelling	ftarxivpreprints:oai:arXiv.org:2211.12760 2023-09-05T13:19:05+02:00 InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images Kobs, Konstantin Steininger, Michael Hotho, Andreas 2022-11-23 http://arxiv.org/abs/2211.12760 unknown http://arxiv.org/abs/2211.12760 Computer Science - Computer Vision and Pattern Recognition Computer Science - Artificial Intelligence Computer Science - Information Retrieval text 2022 ftarxivpreprints 2023-08-16T17:24:24Z Common Deep Metric Learning (DML) datasets specify only one notion of similarity, e.g., two images in the Cars196 dataset are deemed similar if they show the same car model. We argue that depending on the application, users of image retrieval systems have different and changing similarity notions that should be incorporated as easily as possible. Therefore, we present Language-Guided Zero-Shot Deep Metric Learning (LanZ-DML) as a new DML setting in which users control the properties that should be important for image representations without training data by only using natural language. To this end, we propose InDiReCT (Image representations using Dimensionality Reduction on CLIP embedded Texts), a model for LanZ-DML on images that exclusively uses a few text prompts for training. InDiReCT utilizes CLIP as a fixed feature extractor for images and texts and transfers the variation in text prompt embeddings to the image embedding space. Extensive experiments on five datasets and overall thirteen similarity notions show that, despite not seeing any images during training, InDiReCT performs better than strong baselines and approaches the performance of fully-supervised models. An analysis reveals that InDiReCT learns to focus on regions of the image that correlate with the desired similarity notion, which makes it a fast to train and easy to use method to create custom embedding spaces only using natural language. Comment: Accepted to WACV 2023 Text DML ArXiv.org (Cornell University Library)
institution	Open Polar
collection	ArXiv.org (Cornell University Library)
op_collection_id	ftarxivpreprints
language	unknown
topic	Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence; Computer Science - Information Retrieval
spellingShingle	Computer Science - Computer Vision and Pattern Recognition Computer Science - Artificial Intelligence Computer Science - Information Retrieval Kobs, Konstantin Steininger, Michael Hotho, Andreas InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images
topic_facet	Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence; Computer Science - Information Retrieval
description	Common Deep Metric Learning (DML) datasets specify only one notion of similarity, e.g., two images in the Cars196 dataset are deemed similar if they show the same car model. We argue that, depending on the application, users of image retrieval systems have different and changing similarity notions that should be incorporated as easily as possible. Therefore, we present Language-Guided Zero-Shot Deep Metric Learning (LanZ-DML) as a new DML setting in which users control the properties that should be important for image representations without training data, using only natural language. To this end, we propose InDiReCT (Image representations using Dimensionality Reduction on CLIP embedded Texts), a model for LanZ-DML on images that exclusively uses a few text prompts for training. InDiReCT utilizes CLIP as a fixed feature extractor for images and texts and transfers the variation in text prompt embeddings to the image embedding space. Extensive experiments on five datasets and thirteen similarity notions in total show that, despite not seeing any images during training, InDiReCT performs better than strong baselines and approaches the performance of fully supervised models. An analysis reveals that InDiReCT learns to focus on regions of the image that correlate with the desired similarity notion, which makes it a fast-to-train and easy-to-use method for creating custom embedding spaces using only natural language. Comment: Accepted to WACV 2023
format	Text
author	Kobs, Konstantin; Steininger, Michael; Hotho, Andreas
author_facet	Kobs, Konstantin; Steininger, Michael; Hotho, Andreas
author_sort	Kobs, Konstantin
title	InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images
title_short	InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images
title_full	InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images
title_fullStr	InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images
title_full_unstemmed	InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images
title_sort	indirect: language-guided zero-shot deep metric learning for images
publishDate	2022
url	http://arxiv.org/abs/2211.12760
genre	DML
genre_facet	DML
op_relation	http://arxiv.org/abs/2211.12760
_version_	1776199895619731456
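
The description above says that InDiReCT applies dimensionality reduction to CLIP-embedded text prompts and transfers the resulting variation to the image embedding space. The following is a minimal sketch of that general idea only. It assumes plain PCA (via SVD) as a stand-in for the reduction step, since this record does not state the actual training objective, and it uses random vectors in place of real CLIP embeddings. The function names `fit_text_projection` and `embed_images`, the 512-dimensional embedding size, and the 8 output dimensions are all hypothetical choices for illustration.

```python
import numpy as np


def fit_text_projection(text_emb: np.ndarray, n_components: int) -> np.ndarray:
    """Fit a linear projection from text embeddings only (PCA stand-in).

    Assumption: PCA is used here purely for illustration; the record does
    not specify the objective that InDiReCT actually optimises.
    """
    # L2-normalise each prompt embedding, as is common for CLIP features.
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Center the prompts and take the top right-singular vectors, i.e. the
    # directions along which the prompt embeddings vary the most.
    centered = text_emb - text_emb.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components].T  # shape: (embedding_dim, n_components)


def embed_images(image_emb: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Project image embeddings with the text-derived matrix and renormalise."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    reduced = image_emb @ projection
    return reduced / np.linalg.norm(reduced, axis=1, keepdims=True)


# Random placeholders standing in for frozen CLIP outputs (hypothetical data).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(20, 512))   # e.g. 20 prompts describing one similarity notion
image_emb = rng.normal(size=(5, 512))   # e.g. 5 images to be retrieved

W = fit_text_projection(text_emb, n_components=8)
Z = embed_images(image_emb, W)

# Cosine similarities between images in the custom, text-guided space.
similarity = Z @ Z.T
print(similarity.shape)
```

In this sketch no image is used to fit the projection, which mirrors the zero-shot setting described in the abstract: only the prompt embeddings determine which directions of the shared embedding space are kept.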

Similar Items