Cross-Modal Fine-Tuning: Align then Refine


Bibliographic Details
Main Authors: Shen, Junhong, Li, Liam, Dery, Lucio M., Staten, Corey, Khodak, Mikhail, Neubig, Graham, Talwalkar, Ameet
Format: Article in Journal/Newspaper
Language: unknown
Published: arXiv 2023
Subjects:
Online Access:https://dx.doi.org/10.48550/arxiv.2302.05738
https://arxiv.org/abs/2302.05738
Description
Summary: Fine-tuning large-scale pretrained models has led to tremendous progress in well-studied modalities such as vision and NLP. However, similar gains have not been observed in many other modalities due to a lack of relevant pretrained models. In this work, we propose ORCA, a general cross-modal fine-tuning framework that extends the applicability of a single large-scale pretrained model to diverse modalities. ORCA adapts to a target task via an align-then-refine workflow: given the target input, ORCA first learns an embedding network that aligns the embedded feature distribution with the pretraining modality. The pretrained model is then fine-tuned on the embedded data to exploit the knowledge shared across modalities. Through extensive experiments, we show that ORCA obtains state-of-the-art results on 3 benchmarks containing over 60 datasets from 12 modalities, outperforming a wide range of hand-designed, AutoML, general-purpose, and task-specific methods. We highlight the importance of data alignment via a series of ablation studies and demonstrate ORCA's utility in data-limited regimes.
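
The align-then-refine workflow described in the summary can be sketched in a few lines of PyTorch. The code below is an illustrative assumption, not the authors' released implementation: the Embedder architecture, the moment-matching alignment loss (a crude stand-in for the distribution-alignment objective the abstract refers to), and the two-stage training loop are simplified placeholders chosen only to make the two stages concrete.

# Minimal sketch of an align-then-refine loop. All names, shapes, and
# hyperparameters here are assumptions for illustration.

import torch
import torch.nn as nn


class Embedder(nn.Module):
    """Maps target-modality inputs into the pretrained model's token space."""

    def __init__(self, in_dim: int, seq_len: int, model_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, seq_len * model_dim)
        self.seq_len, self.model_dim = seq_len, model_dim

    def forward(self, x):                       # x: (batch, in_dim)
        z = self.proj(x)                        # (batch, seq_len * model_dim)
        return z.view(-1, self.seq_len, self.model_dim)


def moment_alignment_loss(target_feats, source_feats):
    """Stand-in alignment objective: match the first two moments of the
    embedded target features to features from the pretraining modality."""
    t = target_feats.reshape(-1, target_feats.size(-1))
    s = source_feats.reshape(-1, source_feats.size(-1))
    mean_gap = (t.mean(0) - s.mean(0)).pow(2).sum()
    var_gap = (t.var(0) - s.var(0)).pow(2).sum()
    return mean_gap + var_gap


def align_then_refine(embedder, backbone, head, target_loader, source_feats,
                      align_epochs=5, refine_epochs=5, lr=1e-4):
    # Stage 1 (align): train only the embedder so the embedded target data
    # resembles the pretraining feature distribution.
    opt = torch.optim.Adam(embedder.parameters(), lr=lr)
    for _ in range(align_epochs):
        for x, _ in target_loader:
            loss = moment_alignment_loss(embedder(x), source_feats)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2 (refine): fine-tune embedder, pretrained backbone, and task
    # head together on the embedded target data. The backbone is assumed to
    # map (batch, seq_len, model_dim) -> (batch, seq_len, model_dim),
    # e.g. a transformer encoder with batch_first=True.
    params = (list(embedder.parameters()) + list(backbone.parameters())
              + list(head.parameters()))
    opt = torch.optim.Adam(params, lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(refine_epochs):
        for x, y in target_loader:
            feats = backbone(embedder(x)).mean(dim=1)   # pool over sequence
            loss = criterion(head(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return embedder, backbone, head

In the paper, the backbone is a large pretrained transformer and the alignment stage measures the distance between the embedded target distribution and the pretraining feature distribution rather than raw feature moments, so the loss above should be read only as a placeholder for that objective.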