Bridging semantics and geometry: A decoupled LVLM–SAM framework for reasoning segmentation in optical remote sensing

Abstract

Large Vision–Language Models (LVLMs) hold great promise for advancing optical remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we develop Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only Group Relative Policy Optimization (GRPO) reinforcement learning objective driven strictly by final mask IoU, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions. On the EarthReason dataset, Think2Seg-RS achieves state-of-the-art performance, outperforming leading approaches such as RemoteReasoner and SegEarth-R1 with a test cIoU of 75.60% and gIoU of 73.36%, absolute improvements of 6.47% and 2.40% over the strongest baseline, respectively. Zero-shot evaluations across three referring segmentation benchmarks expose a fundamental divide in task inductive bias between semantic-level grounding, which aggregates all regions matching a conceptual intent, and instance-level tasks that demand discrete object separation. We further find that compact segmenters outperform larger ones under semantic-level supervision by mitigating textural over-segmentation, and that unconstrained negative prompting is unstable in heterogeneous aerial backgrounds. Together, these findings demonstrate that optimizing LVLMs through direct segmentation feedback offers a scalable framework for complex geospatial reasoning, effectively bridging the gap between abstract language understanding and precise pixel-level execution. Our code and model are available at https://github.com/Ricardo-XZ/Think2Seg-RS.
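The reward signal described above (final mask IoU, compared group-relatively across sampled rollouts as in GRPO) can be sketched as follows. This is a minimal illustration with NumPy, assuming binary masks; the function names are illustrative and not taken from the released code.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks (the scalar reward for one rollout)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def group_relative_advantages(rewards, eps: float = 1e-8):
    """GRPO-style advantage: standardize each rollout's reward against
    the mean/std of its group (all rollouts sampled for the same query),
    so no learned value function is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return ((r - r.mean()) / (r.std() + eps)).tolist()
```

In this setup the segmenter stays frozen: only the LVLM prompter is updated, using these advantages to reweight the log-probabilities of the geometric prompts it emitted.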

Publication
ISPRS Journal of Photogrammetry and Remote Sensing, 237: 217-235
Xu Zhang
PhD student (2025-present)

My research interests include deep learning, computer vision, large language models and remote sensing.

Junyao Ge
PhD student (2020-2025), now at Huawei

My research interests include deep learning, computer vision and remote sensing.

Yang Zheng
Assistant Professor

My research interests include human behaviour analysis for intelligent diagnosis of developmental coordination disorder, artificial intelligence, and computer vision.

Kaitai Guo
Associate Professor

My research interests include broad-spectrum substance identification, microwave and infrared imaging, and system simulation and evaluation.

Jimin Liang
Professor of Electronic Engineering

My research interests include artificial intelligence and computer vision.