Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models ...

This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models (a set of five models ranging from 22 to 334 million parameters with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved stat...

Bibliographic Details
Main Authors: Merrick, Luke; Xu, Danmei; Nuti, Gaurav; Campos, Daniel
Format: Article in Journal/Newspaper
Language: unknown
Published: arXiv 2024
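Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); FOS Computer and information sciences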
Online Access: https://dx.doi.org/10.48550/arxiv.2405.05374
https://arxiv.org/abs/2405.05374
id ftdatacite:10.48550/arxiv.2405.05374
record_format openpolar
institution Open Polar
collection DataCite
op_collection_id ftdatacite
language unknown
topic Computation and Language cs.CL
Artificial Intelligence cs.AI
Information Retrieval cs.IR
FOS Computer and information sciences
description This report describes the training dataset creation and training recipe behind the arctic-embed family of text embedding models (five models ranging from 22 to 334 million parameters, with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved state-of-the-art retrieval accuracy for its size on the MTEB Retrieval leaderboard, and the largest model, arctic-embed-l, outperformed closed-source embedding models such as Cohere's embed-v3 and OpenAI's text-embed-3-large. In addition to the details of our training recipe, we provide several informative ablation studies, which we believe explain our models' performance. ... : 17 pages, 11 Figures, 9 tables ...
format Article in Journal/Newspaper
author Merrick, Luke
Xu, Danmei
Nuti, Gaurav
Campos, Daniel
title Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models ...
publisher arXiv
publishDate 2024
url https://dx.doi.org/10.48550/arxiv.2405.05374
https://arxiv.org/abs/2405.05374
geographic Arctic
genre Arctic
op_rights arXiv.org perpetual, non-exclusive license
http://arxiv.org/licenses/nonexclusive-distrib/1.0/
op_doi https://doi.org/10.48550/arxiv.2405.05374
_version_ 1809760285880221696