Scaling Visual Language Models for Remote Sensing with Open Data and LLMs

Introduction
Vision-language models (VLMs) show great promise for interpreting remote sensing (RS) imagery in human-like ways. However, training effective VLMs requires massive amounts of paired image-text data, which remains a significant bottleneck in the RS domain.
Our solution? RSTeller, an automated workflow that generates high-quality RS image-text pairs at scale using:
- Openly available RS imagery from Google Earth Engine
- Semantic data from OpenStreetMap (OSM)
- Large language models (LLMs) to generate rich captions
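Getting the imagery side of these pairs is straightforward with the Earth Engine Python client. The snippet below is a minimal sketch of pulling an RGB NAIP patch for an arbitrary region; the region, date range, and download settings are placeholders rather than the exact parameters used to build RSTeller.

```python
import ee  # pip install earthengine-api

ee.Initialize()  # requires prior `earthengine authenticate`

# Placeholder region of interest (lon/lat box); RSTeller covers the continental US.
roi = ee.Geometry.Rectangle([-122.5, 37.5, -122.4, 37.6])

# NAIP aerial imagery (~0.6 m GSD) hosted on Earth Engine.
naip = (
    ee.ImageCollection("USDA/NAIP/DOQQ")
    .filterBounds(roi)
    .filterDate("2020-01-01", "2022-12-31")
    .select(["R", "G", "B"])
)

# Mosaic the matching tiles and request a download URL for the RGB patch.
patch = naip.mosaic().clip(roi)
url = patch.getDownloadURL({"region": roi, "scale": 0.6, "format": "GEO_TIFF"})
print(url)
```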

Key Contributions
- Automated Workflow: End-to-end pipeline for fetching RS data, generating captions with LLMs, and compiling datasets
- RSTeller Dataset: 1.3M RS images with 2.6M high-quality captions (2x richer semantics than existing datasets)
- Comprehensive Experiments: Validating dataset effectiveness, ablation studies, and scaling laws
Methodology
Data Generation Workflow

Data Collection:
- RS images from NAIP via Google Earth Engine
- Semantic tags from OpenStreetMap
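The semantic side can be gathered by querying OpenStreetMap through the public Overpass API. The sketch below requests tagged ways inside a bounding box; the chosen keys (`landuse`, `highway`, `waterway`) and coordinates are illustrative, and the paper's actual tag selection and filtering are more involved.

```python
import requests  # pip install requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

def fetch_osm_elements(south, west, north, east):
    """Return OSM ways with tags inside a lat/lon bounding box."""
    query = f"""
    [out:json][timeout:60];
    (
      way({south},{west},{north},{east})[landuse];
      way({south},{west},{north},{east})[highway];
      way({south},{west},{north},{east})[waterway];
    );
    out tags center;
    """
    resp = requests.post(OVERPASS_URL, data={"data": query}, timeout=90)
    resp.raise_for_status()
    return resp.json()["elements"]

# Example: a small box in the continental US (placeholder coordinates).
for el in fetch_osm_elements(37.50, -122.50, 37.51, -122.49):
    print(el["type"], el["id"], el.get("tags", {}))
```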
Caption Generation:
- Task 1: Describe “area” elements (land use, buildings)
- Task 2: Describe “non-area” elements (roads, rivers)
- Task 3: Caption augmentation for linguistic diversity
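Each caption task amounts to prompting an LLM with one element's OSM tags and asking for a grounded description. The sketch below assumes an OpenAI-compatible chat endpoint and uses a hypothetical prompt and model name; the paper's prompts, model choice, and post-processing differ.

```python
from openai import OpenAI  # pip install openai; any OpenAI-compatible server works

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment

def caption_from_tags(tags: dict, element_kind: str) -> str:
    """Ask an LLM to describe one OSM element (Task 1: area, Task 2: non-area)."""
    tag_text = ", ".join(f"{k}={v}" for k, v in tags.items())
    prompt = (
        f"You are describing an aerial image patch. It contains a {element_kind} "
        f"element with the OpenStreetMap tags: {tag_text}. Write one fluent sentence "
        f"describing what this element looks like from above. Do not invent details."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content.strip()

# Example usage with a hypothetical tag set.
print(caption_from_tags({"landuse": "farmland", "crop": "wheat"}, "area"))
```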
Quality Control:
- Filter irrelevant OSM tags
- Remove duplicate/hallucinated captions
- Compile into WebDataset format
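Once filtered, image-caption pairs can be packed into WebDataset tar shards with the `webdataset` library, roughly as sketched below; shard naming, shard size, and field names are illustrative.

```python
import io
import webdataset as wds  # pip install webdataset
from PIL import Image

def write_shards(samples, pattern="rsteller-%06d.tar", max_per_shard=10000):
    """samples yields (key, PIL.Image, list_of_captions) tuples."""
    with wds.ShardWriter(pattern, maxcount=max_per_shard) as sink:
        for key, image, captions in samples:
            buf = io.BytesIO()
            image.save(buf, format="JPEG")
            sink.write({
                "__key__": key,              # unique sample id
                "jpg": buf.getvalue(),       # encoded image bytes
                "txt": "\n".join(captions),  # one caption per line
            })

# Example: a single dummy sample (illustrative only).
dummy = Image.new("RGB", (448, 448))
write_shards([("patch_000000", dummy, ["A placeholder caption."])])
```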
Dataset Characteristics

- Scale: 1.3M images, 2.6M captions
- Coverage: continental United States (NAIP imagery), 0.6 m GSD, RGB bands
- Semantic Richness: MTLD score above 100, roughly 2x higher than alternative RS caption datasets (a simplified MTLD sketch follows this list)
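MTLD (Measure of Textual Lexical Diversity) counts how many word-level "factors" a text yields before its running type-token ratio falls below a conventional threshold of 0.72; longer factors mean richer vocabulary. The snippet below is a simplified one-directional sketch of the metric, not the implementation behind the reported scores.

```python
def mtld_forward(tokens, ttr_threshold=0.72):
    """Simplified one-pass MTLD: average factor length before TTR drops below threshold."""
    factors = 0.0
    types, count = set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count < ttr_threshold:
            factors += 1
            types, count = set(), 0
    if count > 0:  # partial factor at the end, weighted by how far TTR has fallen
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - ttr_threshold)
    return len(tokens) / factors if factors > 0 else float("inf")

caption = "a large farm field with a dirt road running along the edge of the field and a small pond near the road"
print(round(mtld_forward(caption.lower().split()), 2))
```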
Key Findings
Dataset Effectiveness
Continual pre-training with RSTeller improves performance across RS tasks. On zero-shot classification and retrieval benchmarks:
- Average accuracy improvements up to 4.89% (ViT-B/32) and 4.07% (ViT-L/14)
- Best results when starting from DataComp pre-trained checkpoints
- Minimal impact on general-domain performance (ImageNet)
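Zero-shot classification here follows the usual CLIP recipe: embed prompted class names and the image, then pick the class with the highest cosine similarity. The sketch below uses `open_clip`; the class list, prompt template, image path, and pretrained tag are placeholders rather than the paper's evaluation setup.

```python
import torch
import open_clip  # pip install open_clip_torch
from PIL import Image

# Placeholder checkpoint tag; the paper continually pre-trains from DataComp weights.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="datacomp_xl_s13b_b90k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

classes = ["farmland", "forest", "residential area", "river", "airport"]  # illustrative
text = tokenizer([f"a satellite photo of {c}" for c in classes])
image = preprocess(Image.open("patch.jpg")).unsqueeze(0)  # hypothetical image file

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(classes[int(probs.argmax())])
```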
Ablation Study

Key takeaways:
- Mixing in common-knowledge data (LAION-10M) prevents catastrophic forgetting
- LLM interpretation boosts performance (vs template captions)
- Caption augmentation (Task 3) provides consistent gains
- Named entities have negligible impact
Scaling Laws
Scaling behavior was evaluated on zero-shot classification and retrieval (a curve-fitting sketch follows this list):
- More domain data → Better performance (negative error slopes)
- ViT-L/14 outperforms ViT-B/32 consistently
- Some benchmarks show initial degradation before improvement
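"Negative error slopes" can be read as a power-law relationship between error and the amount of domain data seen, roughly E(N) = a * N^(-b) + c with b > 0. The sketch below shows how such a curve could be fit with `scipy`; the arrays are placeholder values, not numbers from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_curve(n, a, b, c):
    """Power-law error model: error falls as domain data n grows (b > 0)."""
    return a * np.power(n, -b) + c

# Placeholder measurements (samples seen vs. zero-shot error), NOT results from the paper.
n_samples = np.array([1e5, 3e5, 1e6, 3e6, 1e7])
error = np.array([0.62, 0.58, 0.55, 0.53, 0.52])

(a, b, c), _ = curve_fit(scaling_curve, n_samples, error, p0=[1.0, 0.1, 0.4], maxfev=10000)
print(f"fitted exponent b = {b:.3f} (negative slope on a log-log error plot)")
```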
Practical Recommendations
For researchers building RS VLMs:
- Start with robust checkpoints (DataComp pre-trained)
- Include common knowledge data to prevent forgetting (see the mixing sketch after this list)
- Use LLMs for caption generation (not just templates)
- Scale domain data continuously; initial performance drops may recover with more data
- Larger models (ViT-L/14) generally perform better
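One simple way to follow the second recommendation is to interleave domain samples (RSTeller) with common-knowledge samples (e.g., a LAION subset) at a fixed ratio during continual pre-training. The generator below is a minimal, library-agnostic sketch of that mixing; the ratio and the toy datasets are placeholders.

```python
import random
from itertools import cycle

def mix_streams(domain_iter, common_iter, domain_ratio=0.5, seed=0):
    """Yield samples from two infinite streams, picking the domain stream
    with probability `domain_ratio` (placeholder value, tune per experiment)."""
    rng = random.Random(seed)
    while True:
        yield next(domain_iter) if rng.random() < domain_ratio else next(common_iter)

# Illustrative usage with toy in-memory "datasets" standing in for RSTeller / LAION shards.
domain = cycle([{"src": "rsteller", "id": i} for i in range(3)])
common = cycle([{"src": "laion", "id": i} for i in range(3)])

mixed = mix_streams(domain, common, domain_ratio=0.5)
for _ in range(6):
    print(next(mixed))
```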
Limitations and Future Work
Current limitations:
- Single-element focus in captions
- Residual LLM hallucinations
- Limited geographic coverage (US only)
- Simple CLIP architecture
Future directions:
- Multi-element captioning
- Better hallucination suppression
- Global coverage expansion
- More sophisticated VLM architectures
Conclusion
RSTeller provides:
✅ Automated pipeline for RS multimodal data
✅ High-quality dataset (1.3M images, 2.6M captions)
✅ Proven effectiveness for VLM training
✅ Clear path for future scaling
The dataset and code are publicly available to advance RS VLM research. For further details, please refer to the paper. This work has been published in the ISPRS Journal of Photogrammetry and Remote Sensing:
Junyao Ge, Xu Zhang, Yang Zheng, Kaitai Guo, Jimin Liang*, RSTeller: Scaling up visual language modeling in remote sensing with rich linguistic semantics from openly available data and large language models, ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 226: 146-163. IF: 10.6