Convolutional MLP orthogonal fusion of multiscale features for visual place recognition


Bibliographic Details
Published in: Scientific Reports
Main Authors: Wenjian Gan, Yang Zhou, Xiaofei Hu, Luying Zhao, Gaoshuang Huang, Chenglong Zhang
Format: Article in Journal/Newspaper
Language: English
Published: Nature Portfolio 2024
Subjects: R, Q
Online Access: https://doi.org/10.1038/s41598-024-62749-x
https://doaj.org/article/d944ac2a7b3a41368a30d462471b2592
Description
Summary: Visual place recognition (VPR) requires robust image descriptors that cope with differences in camera viewpoint and drastic changes in the external environment. Utilizing multiscale features improves the robustness of image descriptors; however, existing methods neither exploit the multiscale features generated during feature extraction nor address the feature redundancy that arises when multiscale information is fused to enhance image descriptors. We propose a novel encoding strategy, convolutional multilayer perceptron orthogonal fusion of multiscale features (ConvMLP-OFMS), for VPR. A ConvMLP is used to obtain robust and generalized global image descriptors, and the multiscale features generated during feature extraction are used to enhance these global descriptors so that they cope with changes in the environment and viewpoint. Additionally, an attention mechanism is used to eliminate noise and redundant information. In contrast to traditional methods that fuse features by tensor splicing (concatenation), we introduce matrix orthogonal decomposition to eliminate redundant information. Experiments demonstrated that the proposed architecture outperformed NetVLAD, CosPlace, ConvAP, and other methods. On the Pittsburgh and MSLS datasets, which contain significant viewpoint and illumination variations, our method achieved 92.5% and 86.5% Recall@1, respectively. We also achieved good performance (80.6% and 43.2%) on the SPED and NordLand datasets, respectively, which exhibit more extreme illumination and appearance variations.
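
To illustrate the orthogonal-fusion idea described in the abstract, the following is a minimal Python/PyTorch sketch, not the authors' implementation: it removes the component of a multiscale feature map that is parallel to the global descriptor (i.e., redundant information) and fuses only the orthogonal remainder. The function name orthogonal_fusion, the tensor shapes, and the mean-pooling step are illustrative assumptions.

# Hypothetical sketch of orthogonal feature fusion (not the authors' code).
# Assumes a local multiscale feature map of shape (B, C, H, W) and a global
# descriptor of shape (B, C); the component of each local feature vector that
# is parallel to the global descriptor is removed, so only non-redundant
# (orthogonal) information is fused into the final image descriptor.
import torch
import torch.nn.functional as F

def orthogonal_fusion(local_feat: torch.Tensor, global_desc: torch.Tensor) -> torch.Tensor:
    B, C, H, W = local_feat.shape
    g = global_desc.view(B, C, 1)            # (B, C, 1)
    f = local_feat.view(B, C, H * W)         # (B, C, HW)
    # Projection of every local feature vector onto the global descriptor.
    scale = (g * f).sum(dim=1, keepdim=True) / (g * g).sum(dim=1, keepdim=True).clamp(min=1e-8)
    proj = scale * g                         # parallel (redundant) component, (B, C, HW)
    orth = f - proj                          # orthogonal (non-redundant) component
    # Pool the orthogonal component and concatenate it with the global descriptor.
    orth_pooled = orth.mean(dim=-1)          # (B, C)
    fused = torch.cat([global_desc, orth_pooled], dim=1)
    return F.normalize(fused, p=2, dim=1)    # L2-normalized image descriptor

if __name__ == "__main__":
    local_feat = torch.randn(2, 256, 14, 14)  # multiscale feature map from the backbone
    global_desc = torch.randn(2, 256)         # ConvMLP-style global descriptor
    print(orthogonal_fusion(local_feat, global_desc).shape)  # torch.Size([2, 512])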