EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything

Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component that drives the impressive performance for zero-shot transfer and high versatility is a super large Transformer model trained on the extensive high-quality SA-1B dataset. While beneficial, t...

Bibliographic Details
Main Authors:	Xiong, Yunyang, Varadarajan, Bala, Wu, Lemeng, Xiang, Xiaoyu, Xiao, Fanyi, Zhu, Chenchen, Dai, Xiaoliang, Wang, Dilin, Sun, Fei, Iandola, Forrest, Krishnamoorthi, Raghuraman, Chandra, Vikas
Format:	Text
Language:	unknown
Published:	2023
Subjects:	Computer Science - Computer Vision and Pattern Recognition sami
Online Access:	http://arxiv.org/abs/2312.00863

id	ftarxivpreprints:oai:arXiv.org:2312.00863
record_format	openpolar
spelling	ftarxivpreprints:oai:arXiv.org:2312.00863 2024-01-07T09:46:22+01:00 EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything Xiong, Yunyang Varadarajan, Bala Wu, Lemeng Xiang, Xiaoyu Xiao, Fanyi Zhu, Chenchen Dai, Xiaoliang Wang, Dilin Sun, Fei Iandola, Forrest Krishnamoorthi, Raghuraman Chandra, Vikas 2023-12-01 http://arxiv.org/abs/2312.00863 unknown http://arxiv.org/abs/2312.00863 Computer Science - Computer Vision and Pattern Recognition text 2023 ftarxivpreprints 2023-12-10T02:07:15Z Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component that drives its impressive zero-shot transfer performance and high versatility is a very large Transformer model trained on the extensive, high-quality SA-1B dataset. While beneficial, the huge computation cost of the SAM model has limited its use in wider real-world applications. To address this limitation, we propose EfficientSAMs, lightweight SAM models that exhibit decent performance with largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from the SAM image encoder for effective visual representation learning. Further, we take SAMI-pretrained lightweight image encoders and a mask decoder to build EfficientSAMs, and finetune the models on SA-1B for the segment anything task. We perform evaluations on multiple vision tasks including image classification, object detection, instance segmentation, and semantic object detection, and find that our proposed pretraining method, SAMI, consistently outperforms other masked image pretraining methods. On segment anything tasks such as zero-shot instance segmentation, our EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models. Text sami ArXiv.org (Cornell University Library)
institution	Open Polar
collection	ArXiv.org (Cornell University Library)
op_collection_id	ftarxivpreprints
language	unknown
topic	Computer Science - Computer Vision and Pattern Recognition
spellingShingle	Computer Science - Computer Vision and Pattern Recognition Xiong, Yunyang Varadarajan, Bala Wu, Lemeng Xiang, Xiaoyu Xiao, Fanyi Zhu, Chenchen Dai, Xiaoliang Wang, Dilin Sun, Fei Iandola, Forrest Krishnamoorthi, Raghuraman Chandra, Vikas EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
topic_facet	Computer Science - Computer Vision and Pattern Recognition
description	Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component that drives its impressive zero-shot transfer performance and high versatility is a very large Transformer model trained on the extensive, high-quality SA-1B dataset. While beneficial, the huge computation cost of the SAM model has limited its use in wider real-world applications. To address this limitation, we propose EfficientSAMs, lightweight SAM models that exhibit decent performance with largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from the SAM image encoder for effective visual representation learning. Further, we take SAMI-pretrained lightweight image encoders and a mask decoder to build EfficientSAMs, and finetune the models on SA-1B for the segment anything task. We perform evaluations on multiple vision tasks including image classification, object detection, instance segmentation, and semantic object detection, and find that our proposed pretraining method, SAMI, consistently outperforms other masked image pretraining methods. On segment anything tasks such as zero-shot instance segmentation, our EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models.
format	Text
author	Xiong, Yunyang Varadarajan, Bala Wu, Lemeng Xiang, Xiaoyu Xiao, Fanyi Zhu, Chenchen Dai, Xiaoliang Wang, Dilin Sun, Fei Iandola, Forrest Krishnamoorthi, Raghuraman Chandra, Vikas
author_facet	Xiong, Yunyang Varadarajan, Bala Wu, Lemeng Xiang, Xiaoyu Xiao, Fanyi Zhu, Chenchen Dai, Xiaoliang Wang, Dilin Sun, Fei Iandola, Forrest Krishnamoorthi, Raghuraman Chandra, Vikas
author_sort	Xiong, Yunyang
title	EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
title_short	EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
title_full	EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
title_fullStr	EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
title_full_unstemmed	EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
title_sort	efficientsam: leveraged masked image pretraining for efficient segment anything
publishDate	2023
url	http://arxiv.org/abs/2312.00863
genre	sami
genre_facet	sami
op_relation	http://arxiv.org/abs/2312.00863
_version_	1787428140952322048

Similar Items