AMCF-NET: ADAPTIVE MULTI-SCALE CROSS-MODAL FUSION NETWORK FOR UAV-SATELLITE CROSS-VIEW LOCALIZATION

Authors

  • Van Quan Ngo, Institute of Information Technology and Electronics, Academy of Military Science and Technology
  • Quang Tung Pham, Institute of Information Technology and Electronics, Academy of Military Science and Technology
  • Chi Thanh Nguyen, Institute of Information Technology and Electronics, Academy of Military Science and Technology


Abstract

Cross-view localization between Unmanned Aerial Vehicle (UAV) and satellite imagery
is crucial for autonomous navigation in GPS-denied environments. However, large domain
gaps, including viewpoint discrepancies, scale variations, and appearance differences, pose
significant challenges. In this paper, we propose the Adaptive Multi-scale Cross-modal Fusion
Network (AMCF-Net), a novel approach that effectively addresses these limitations through a
shared backbone architecture and adaptive fusion mechanisms. Unlike previous dual-backbone
approaches that process UAV and satellite images separately, our method employs a unified
FocalNet-Tiny backbone to extract cross-modal features, followed by a Spatially-adaptive
Cross-modal Feature Fusion (AMCF) module that dynamically combines multi-scale similarities
using learned adaptive weights. This shared representation learning enables better cross-modal
alignment and significantly reduces computational overhead. Comprehensive experiments on
the UL14 benchmark demonstrate that AMCF-Net achieves state-of-the-art performance, with a
Relative Distance Score (RDS) of 78.12% and meter-level accuracy of 27.25% at 3 m, 50.16%
at 5 m, 84.37% at 10 m, and 88.51% at 20 m. Ablation studies further validate the
effectiveness of the shared backbone and adaptive fusion mechanism, demonstrating significant
improvements over traditional separate-processing approaches.
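The abstract describes the AMCF module as dynamically combining multi-scale similarities using learned adaptive weights. A minimal sketch of that weighted-fusion idea is shown below; the function names, the softmax weighting, and the NumPy formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_multiscale_fusion(similarity_maps, logits):
    """Fuse per-scale similarity maps with adaptive weights.

    similarity_maps: list of (H, W) arrays, one per scale, assumed
        already upsampled to a common spatial resolution.
    logits: array of S learnable scalars; softmax turns them into
        convex fusion weights, so the fused map stays a similarity map.
    Returns the fused (H, W) map and the weights used.
    """
    weights = softmax(np.asarray(logits, dtype=float))
    fused = sum(w * m for w, m in zip(weights, similarity_maps))
    return fused, weights

# Toy example: three scales producing constant 4x4 similarity maps.
maps = [np.full((4, 4), v) for v in (0.2, 0.5, 0.9)]
fused, w = adaptive_multiscale_fusion(maps, logits=[0.0, 0.0, 0.0])
# Equal logits give uniform weights, so the fused map is the mean of the maps.
```

In training, the logits would be network parameters updated by backpropagation, letting the model emphasize whichever scale is most discriminative for a given UAV-satellite pair.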


Published

2026-01-11

Section

Articles