Large Scale Distributed Distance Metric Learning

In large scale machine learning and data mining problems with high feature dimensionality, the Euclidean distance between data points can be uninformative, and Distance Metric Learning (DML) is often desired to learn a proper similarity measure (using side information such as example data pairs bein...

Bibliographic Details
Main Authors: Xie, Pengtao, Xing, Eric
Format: Report
Language: unknown
Published: arXiv 2014
Subjects:
DML
Online Access: https://dx.doi.org/10.48550/arxiv.1412.5949
https://arxiv.org/abs/1412.5949
id ftdatacite:10.48550/arxiv.1412.5949
record_format openpolar
institution Open Polar
collection DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id ftdatacite
language unknown
topic Machine Learning cs.LG
FOS Computer and information sciences
description In large scale machine learning and data mining problems with high feature dimensionality, the Euclidean distance between data points can be uninformative, and Distance Metric Learning (DML) is often desired to learn a proper similarity measure (using side information such as example data pairs being similar or dissimilar). However, high dimensionality and large volume of pairwise constraints in modern big data can lead to prohibitive computational cost for both the original DML formulation in Xing et al. (2002) and later extensions. In this paper, we present a distributed algorithm for DML, and a large-scale implementation on a parameter server architecture. Our approach builds on a parallelizable reformulation of Xing et al. (2002), and an asynchronous stochastic gradient descent optimization procedure. To our knowledge, this is the first distributed solution to DML, and we show that, on a system with 256 CPU cores, our program is able to complete a DML task on a dataset with 1 million data points, 22-thousand features, and 200 million labeled data pairs, in 15 hours; and the learned metric shows great effectiveness in properly measuring distances.
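The description above mentions a reformulation of DML optimized by stochastic gradient descent over labeled pairs. The record does not give the objective itself, so the following is only a minimal single-process sketch under stated assumptions: the metric is factored as M = L^T L, similar pairs are pulled together, and dissimilar pairs are pushed apart with a hinge at a margin of 1. The function names (`sq_dist`, `sgd_step`, `train`), the toy data, and the hyperparameters are hypothetical, and the asynchronous parameter-server execution described in the abstract is not reproduced here.

```python
import random


def project(L, v):
    """Return L @ v for a k x d matrix L stored as a list of rows."""
    return [sum(l_ij * v_j for l_ij, v_j in zip(row, v)) for row in L]


def sq_dist(L, x, y):
    """Squared learned distance ||L(x - y)||^2 (Mahalanobis with M = L^T L)."""
    diff = [a - b for a, b in zip(x, y)]
    return sum(p * p for p in project(L, diff))


def sgd_step(L, x, y, similar, lr=0.05, margin=1.0):
    """Apply one stochastic gradient step for a single labeled pair, in place."""
    diff = [a - b for a, b in zip(x, y)]
    proj = project(L, diff)
    d2 = sum(p * p for p in proj)
    if similar:
        sign = 1.0       # minimize the distance of similar pairs
    elif d2 < margin:
        sign = -1.0      # assumed hinge: push dissimilar pairs out to the margin
    else:
        return           # hinge inactive, gradient is zero
    # Gradient of ||L diff||^2 with respect to L is 2 * (L diff) diff^T.
    for i, p in enumerate(proj):
        for j, d_j in enumerate(diff):
            L[i][j] -= lr * sign * 2.0 * p * d_j


def train(pairs, dim, k=2, epochs=200, seed=0):
    """Run sequential SGD over (x, y, is_similar) triples and return L."""
    rng = random.Random(seed)
    L = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(k)]
    order = list(pairs)  # copy so the caller's list is not reordered
    for _ in range(epochs):
        rng.shuffle(order)
        for x, y, similar in order:
            sgd_step(L, x, y, similar)
    return L


# Toy data (hypothetical): feature 0 is informative, feature 1 is noise.
# In Euclidean terms the similar pairs are farther apart than the dissimilar ones.
pairs = [
    ([0.0, 0.0], [0.0, 3.0], True),
    ([1.0, 1.0], [1.0, -2.0], True),
    ([0.0, 0.0], [1.0, 0.0], False),
    ([0.0, 2.0], [1.0, 2.0], False),
]

L = train(pairs, dim=2)
print("similar pair:   ", sq_dist(L, [0.0, 0.0], [0.0, 3.0]))
print("dissimilar pair:", sq_dist(L, [0.0, 0.0], [1.0, 0.0]))
```

On this toy data the learned distance should shrink for the similar pair and grow to roughly the margin for the dissimilar pair, reversing the Euclidean ordering. In a distributed setting, each worker would presumably run steps like `sgd_step` on its own shard of pairs against a shared copy of L, but that part is outside this sketch.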
format Report
author Xie, Pengtao
Xing, Eric
title Large Scale Distributed Distance Metric Learning
publisher arXiv
publishDate 2014
url https://dx.doi.org/10.48550/arxiv.1412.5949
https://arxiv.org/abs/1412.5949
genre DML
op_rights arXiv.org perpetual, non-exclusive license
http://arxiv.org/licenses/nonexclusive-distrib/1.0/
op_doi https://doi.org/10.48550/arxiv.1412.5949
_version_ 1766397164298174464