Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models ...
This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models (a set of five models ranging from 22 to 334 million parameters with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved stat...
Main Authors: | Merrick, Luke; Xu, Danmei; Nuti, Gaurav; Campos, Daniel |
---|---|
Format: | Article in Journal/Newspaper |
Language: | unknown |
Published: | arXiv 2024 |
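Subjects: | Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Information Retrieval (cs.IR), FOS: Computer and information sciences |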
Online Access: | https://dx.doi.org/10.48550/arxiv.2405.05374 https://arxiv.org/abs/2405.05374 |
id | ftdatacite:10.48550/arxiv.2405.05374 |
---|---|
record_format | openpolar |
spelling | ftdatacite:10.48550/arxiv.2405.05374 2024-09-09T19:20:10+00:00 Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models ... Merrick, Luke Xu, Danmei Nuti, Gaurav Campos, Daniel 2024 https://dx.doi.org/10.48550/arxiv.2405.05374 https://arxiv.org/abs/2405.05374 unknown arXiv arXiv.org perpetual, non-exclusive license http://arxiv.org/licenses/nonexclusive-distrib/1.0/ Computation and Language cs.CL Artificial Intelligence cs.AI Information Retrieval cs.IR FOS Computer and information sciences Article article Preprint CreativeWork 2024 ftdatacite https://doi.org/10.48550/arxiv.2405.05374 2024-06-17T09:21:52Z This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models (a set of five models ranging from 22 to 334 million parameters with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of its size on the MTEB Retrieval leaderboard, with the largest model, arctic-embed-l, outperforming closed-source embedding models such as Cohere's embed-v3 and OpenAI's text-embed-3-large. In addition to the details of our training recipe, we have provided several informative ablation studies, which we believe are the cause of our model performance. ... : 17 pages, 11 figures, 9 tables ... Article in Journal/Newspaper Arctic DataCite Arctic |
institution | Open Polar |
collection | DataCite |
op_collection_id | ftdatacite |
language | unknown |
topic | Computation and Language cs.CL Artificial Intelligence cs.AI Information Retrieval cs.IR FOS Computer and information sciences |
spellingShingle | Computation and Language cs.CL Artificial Intelligence cs.AI Information Retrieval cs.IR FOS Computer and information sciences Merrick, Luke Xu, Danmei Nuti, Gaurav Campos, Daniel Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models ... |
topic_facet | Computation and Language cs.CL Artificial Intelligence cs.AI Information Retrieval cs.IR FOS Computer and information sciences |
description | This report describes the training dataset creation and recipe behind the family of arctic-embed text embedding models (a set of five models ranging from 22 to 334 million parameters with weights open-sourced under an Apache-2 license). At the time of their release, each model achieved state-of-the-art retrieval accuracy for models of its size on the MTEB Retrieval leaderboard, with the largest model, arctic-embed-l, outperforming closed-source embedding models such as Cohere's embed-v3 and OpenAI's text-embed-3-large. In addition to the details of our training recipe, we have provided several informative ablation studies, which we believe are the cause of our model performance. ... : 17 pages, 11 figures, 9 tables ... |
format | Article in Journal/Newspaper |
author | Merrick, Luke Xu, Danmei Nuti, Gaurav Campos, Daniel |
author_facet | Merrick, Luke Xu, Danmei Nuti, Gaurav Campos, Daniel |
author_sort | Merrick, Luke |
title | Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models ... |
title_sort | arctic-embed: scalable, efficient, and accurate text embedding models ... |
publisher | arXiv |
publishDate | 2024 |
url | https://dx.doi.org/10.48550/arxiv.2405.05374 https://arxiv.org/abs/2405.05374 |
geographic | Arctic |
geographic_facet | Arctic |
genre | Arctic |
genre_facet | Arctic |
op_rights | arXiv.org perpetual, non-exclusive license http://arxiv.org/licenses/nonexclusive-distrib/1.0/ |
op_doi | https://doi.org/10.48550/arxiv.2405.05374 |
_version_ | 1809760285880221696 |