
GEOBench-VLM

Benchmarking Vision-Language Models for Geospatial Tasks

1Mohamed bin Zayed University of AI, 2University College London, 3Linköping University, 4IBM Research Europe, UK, 5ServiceNow Research, 6Australian National University
*Equally contributing first authors

We introduce GEOBench-VLM, a comprehensive benchmark designed to evaluate Vision-Language Models (VLMs) on complex geospatial applications. Unlike generic benchmarks, GEOBench-VLM addresses the unique challenges of geospatial data, such as temporal analysis, fine-grained object detection, damage assessment, and spatial reasoning, which are key for applications like disaster management, environmental monitoring, and urban planning. Featuring over 10,000 manually verified tasks across eight categories, including scene understanding, visual grounding, and temporal change analysis, this benchmark offers a rigorous evaluation framework. Through extensive testing of state-of-the-art VLMs, our findings reveal critical gaps in their geospatial performance, with the best models achieving only about 40% accuracy on multiple-choice tasks.

💡 Contributions

  1. GEOBench-VLM Benchmark. We introduce GEOBench-VLM, a benchmark suite designed specifically for evaluating VLMs on geospatial tasks, addressing geospatial data challenges. It covers 8 broad categories and 31 sub-tasks with over 10,000 manually verified questions.

  2. Evaluation of VLMs. We provide a detailed evaluation of ten state-of-the-art VLMs, including generic (open- and closed-source) and task-specific geospatial VLMs, highlighting their capabilities and limitations in handling geospatial tasks.

  3. Analysis of Geospatial Task Performance. We analyze performance across a range of tasks, including scene classification, counting, change detection, relationship prediction, visual grounding, image captioning, segmentation, disaster detection, and temporal analysis, among others, providing key insights into improving VLMs for geospatial applications.
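The evaluation protocol implied above can be sketched as a simple loop: format each question as a multiple-choice prompt, query the model, and compare the prediction against the ground-truth option. The snippet below is a minimal illustration only; the data layout and the `query_model` callable are hypothetical stand-ins, not the benchmark's actual API, and the answer matching is deliberately naive.

```python
# Minimal MCQ-evaluation sketch; the sample format and query_model
# callable are hypothetical, not GEOBench-VLM's actual interface.

def build_prompt(question, options):
    """Format a question and its lettered options into one prompt string."""
    letters = "ABCDE"
    lines = [question] + [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def evaluate(samples, query_model):
    """Return overall accuracy of query_model over a list of MCQ samples."""
    correct = 0
    for s in samples:
        prompt = build_prompt(s["question"], s["options"])
        prediction = query_model(s["image"], prompt)  # e.g. "B" or "B. Solar farm"
        # Naive matching for illustration: prediction starts with the gold letter.
        if prediction.strip().upper().startswith(s["answer"]):
            correct += 1
    return correct / len(samples)
```

In practice the accuracy numbers in the leaderboard below are averaged per task; this sketch only shows the per-sample scoring shape.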

πŸ† Leaderboard


Model Aircraft Type Classification Building Counting Crop Type Classification Disaster Type Classification Fire Risk Assessment General Vehicle Counting Land Use Classification Marine Debris Counting Scene Classification Ship Type Classification Spatial Relation Classification Specific Aircraft Type Counting Specific Vehicle Type Counting Tree Health Assessment Trees Counting Water Bodies Counting Average
Qwen2-VL 0.5900 0.2882 0.2727 0.5507 0.4333 0.2733 0.5829 0.2800 0.8145 0.4757 0.6867 0.3375 0.2366 0.2166 0.2000 0.5529 0.3995
GPT-4o 0.6200 0.3059 0.1818 0.5991 0.2000 0.1600 0.6341 0.2000 0.8627 0.6165 0.7060 0.3667 0.2455 0.2166 0.2706 0.5647 0.3971
LLaVA-OneVision 0.6100 0.3059 0.2727 0.5154 0.2333 0.3000 0.5732 0.2400 0.7976 0.3447 0.6931 0.4000 0.2857 0.2229 0.3412 0.4941 0.3900
GeoChat 0.4500 0.2529 0.2182 0.3921 0.2667 0.1667 0.5146 0.2200 0.8096 0.3592 0.5665 0.2583 0.1920 0.1146 0.2353 0.3882 0.3179
LLaVA-1.5 0.4500 0.2529 0.2727 0.3612 0.2000 0.1400 0.5195 0.2800 0.7084 0.3495 0.4421 0.2208 0.1964 0.1720 0.2471 0.3294 0.3025
SPHINX 0.2400 0.2353 0.2545 0.2952 0.4000 0.1533 0.4415 0.1400 0.7036 0.2512 0.5644 0.1786 0.2411 0.1338 0.2471 0.4706 0.3012
InternVL-2 0.3600 0.2647 0.2182 0.3965 0.1667 0.1467 0.4610 0.3400 0.6964 0.3058 0.4979 0.2458 0.2277 0.2166 0.1882 0.3765 0.3005
LLaVA-NeXT 0.4800 0.2353 0.2000 0.3656 0.3000 0.1467 0.4732 0.2800 0.7108 0.2816 0.3948 0.2125 0.2009 0.1592 0.2353 0.3176 0.2937
RS-LLaVA 0.4200 0.2471 0.2000 0.3568 0.2000 0.1200 0.4171 0.1000 0.7277 0.2379 0.3777 0.2250 0.1473 0.1338 0.2471 0.3412 0.2646
Ferret 0.0800 0.1941 0.1636 0.0573 0.0667 0.1933 0.2049 0.2800 0.2410 0.2126 0.2060 0.1714 0.2232 0.1720 0.2353 0.1176 0.1805
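The ranking above follows each model's average score. As a small sanity-check sketch, a subset of the reported averages can be sorted the same way (the numbers are copied from the table; only four models are included for brevity):

```python
# Rank a subset of models by their reported average leaderboard score.
averages = {
    "Qwen2-VL": 0.3995,
    "GPT-4o": 0.3971,
    "LLaVA-OneVision": 0.3900,
    "Ferret": 0.1805,
}

# Highest average first, matching the leaderboard ordering.
ranking = sorted(averages, key=averages.get, reverse=True)
```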

πŸ—‚οΈ Benchmarks Comparison

The table below compares generic and geospatial-specific benchmarks, highlighting their focus areas, data sources, and evaluation formats. It showcases the diversity of modalities, task types, and annotation approaches across various benchmarks, emphasizing the unique requirements and challenges of geospatial datasets in contrast to generic ones.

Benchmark Domain Modalities Data Sources Answer Type Annotation Type Human Verify Year RS Category
CulturalVQA General O Curated FF M ✗ 2024 N/A
EXAMS-V General O Academic Exams MCQ M - 2024 N/A
M4U General O Academic Exams MCQ M ✓ 2024 N/A
MMMU General O Academic Exams FF, MCQ M ✓ 2024 N/A
MME General O Various Open-Source Yes/No M ✓ 2024 N/A
MMBench General O Various Open-Source MCQ A+M ✗ 2024 N/A
MMStar General O Existing Benchmarks MCQ M ✓ 2024 N/A
LAMM General O, PC Various Open-Source FF A+M ✗ 2023 N/A
SEED-Bench General O, V Various Open-Source MCQ A+M ✓ 2023 N/A
SEED-Bench2 General O, MI, V Various Open-Source MCQ A+M ✓ 2024 N/A
SEED-Bench-H General O, MI, V Public Data / Curated MCQ M ✓ 2024 N/A
RSIEval RS O, PAN DOTA FF M ✓ 2023 6
LHRS-Bench RS O GE + OSM SC M ✓ 2024 4
EarthGPT RS O, IR, SAR DRSD FF, BBox A+M - 2024 5
SkyEyeGPT RS O, V DRSD FF, MCQ A+M ✓ 2024 6
Fit-RSRC RS O DRSD FF, MCQ A+M ✓ 2024 1
Fit-RSFG RS O DRSD FF, MCQ A+M ✗ 2024 5
VRSBench RS O DOTA, DIOR FF, BBox A+M ✓ 2024 3
VLEO-Bench RS O, BT DRSD FF, BBox, MCQ A+M ✓ 2024 6
EarthVQA RS O DRSD FF A+M ✗ 2023 5
RemoteCount RS O DOTA SC A+M ✓ 2024 1
FineGrip RS O MAR20 FF, Seg A+M ✓ 2024 5
GeoChat-Bench RS O DRSD FF, BBox A+M ✓ 2023 6
GEOBench-VLM RS O, MS, SAR, BT, MT DRSD MCQ, BBox, Seg A+M ✓ Ours 8
  • Modalities: O: Optical, PAN: Panchromatic, MS: Multi-spectral, IR: Infrared, SAR: Synthetic Aperture Radar, V: Video, MI: Multi-image, BT: Bi-Temporal, MT: Multi-temporal, PC: Point Cloud
  • Data Sources: DRSD: Diverse Remote Sensing Datasets, OSM: OpenStreetMap, GE: Google Earth, DOTA: Dataset for Object Detection in Aerial Images, DIOR: Dataset for Object Detection in Optical Remote Sensing Images, MAR20: Maritime Dataset 2020
  • Answer Types: MCQ: Multiple Choice Questions, SC: Single Choice, FF: Free-Form, BBox: Bounding Box, Seg: Segmentation Mask
  • Annotation Types: A: Automatic, M: Manual
  • Human Verify: ✓: Yes, ✗: No, -: Not specified

📋 Task Examples

GEOBench-VLM is a comprehensive benchmark for VLMs across numerous geospatial tasks. It evaluates VLMs over eight core task categories, assessing their ability to interpret complex spatial data, classify scenes, identify and localize objects, detect events, generate captions, segment regions, analyze temporal changes, and process non-optical data. Tasks range from classifying landscapes and objects (e.g., land use, crop types, ships, aircraft) to counting, detecting hazards, and assessing disaster impact, testing VLMs on spatial reasoning.
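In the qualitative results later on this page, responses that do not map to any of the offered options are marked "Instruction not followed". A hedged sketch of such a normalization step is below; the exact matching rules used in the benchmark are not specified here, so this is an illustrative assumption, not the actual scoring code.

```python
def extract_choice(response, options):
    """Map a free-form model response to one of the lettered options.

    Returns the option letter, or None when the instruction was not
    followed. The matching rules are assumptions for illustration.
    """
    letters = "ABCDE"[: len(options)]
    text = response.strip()
    # Case 1: the response begins with a bare option letter, e.g. "D" or "D. ...".
    if text[:1].upper() in letters and (len(text) == 1 or text[1] in ".):"):
        return text[:1].upper()
    # Case 2: the response repeats one option's text verbatim.
    for letter, option in zip(letters, options):
        if option.lower() in text.lower():
            return letter
    return None  # scored as "Instruction not followed"
```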

πŸ” Dataset Annotation Pipeline

Our pipeline integrates diverse datasets, automated tools, and manual annotation. Tasks such as scene understanding, object classification, and non-optical analysis are based on classification datasets, while GPT-4o generates unique MCQs with five options: one correct answer, one semantically similar "closest" option, and three plausible alternatives. Spatial relationship tasks rely on manually annotated object pair relationships, ensuring consistency through cross-verification. Caption generation leverages GPT-4o, combining image, object details, and spatial interactions with manual refinement for high precision.
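The five-option MCQ construction described above (one correct answer, one "closest" option, three plausible distractors) can be sketched as follows. The helper name and inputs are illustrative only; in the actual pipeline the closest option and distractors come from GPT-4o rather than hand-written lists.

```python
import random

def build_mcq(question, correct, closest, distractors, rng=None):
    """Assemble a five-option MCQ: the correct answer, one semantically
    'closest' option, and three plausible distractors, in shuffled order.

    Illustrative sketch only; not the benchmark's actual pipeline code.
    """
    assert len(distractors) == 3
    options = [correct, closest] + list(distractors)
    (rng or random).shuffle(options)           # randomize option order
    answer = "ABCDE"[options.index(correct)]   # letter of the correct option
    return {"question": question, "options": options, "answer": answer}
```

Seeding the shuffle (via the optional `rng`) keeps the option order reproducible across runs.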

📊✨ Qualitative Results

Scene Understanding

This figure illustrates model performance on geospatial scene-understanding tasks, highlighting successes in clear contexts and challenges in ambiguous scenes. The results emphasize the importance of contextual reasoning and of disambiguating overlapping visual cues for accurate classification.

What type of facility or structure is depicted in this image?

...
A. Crop field
B. Military facility
C. Debris or rubble
D. Solar farm
E. Toll booth
... Solar farm ✔
... Solar farm ✔
... Solar farm ✔
... Solar farm ✔
... Solar farm ✔
... Solar farm ✔
... Solar farm ✔
... Solar farm ✔
... Solar farm ✔
... Instruction not followed ✘

What is the primary type of land use visible in this aerial image?

...
A. Airport
B. Statue
C. Park
D. Stadium
E. Tower
... Stadium ✔
... Stadium ✔
... Stadium ✔
... Stadium ✔
... Stadium ✔
... Stadium ✔
... Stadium ✔
... Stadium ✔
... Stadium ✔
... Airport ✘

What is the level of fire risk depicted in this image?

...
A. High
B. Low
C. Non-burnable
D. Very low
E. Very High
... Low ✔
... Low ✔
... Low ✔
... Low ✔
... High ✘
... Low ✔
... Low ✔
... Non-burnable ✘
... Low ✔
... High ✘

What type of facility or structure is depicted in this image?

...
A. Single-unit residential
B. Lighthouse
C. Road bridge
D. Interchange
E. Nuclear power plant
... Interchange ✔
... Interchange ✔
... Interchange ✔
... Interchange ✔
... Interchange ✔
... Road bridge ✘
... Interchange ✔
... Interchange ✔
... Road bridge ✘
... Instruction not followed ✘

What is the primary type of scene depicted in this aerial image?

...
A. Nursing home
B. Ferry terminal
C. Harbor
D. Christmas tree farm
E. Tennis court
... Harbor ✘
... Harbor ✘
... Harbor ✘
... Harbor ✘
... Harbor ✘
... Harbor ✘
... Harbor ✘
... Harbor ✘
... Harbor ✘
... Harbor ✘

What is the primary type of scene depicted in this aerial image?

...
A. Ferry terminal
B. Oil well
C. Storage tank
D. Tennis court
E. Wastewater treatment plant
... Wastewater treatment plant ✔
... Wastewater treatment plant ✔
... Wastewater treatment plant ✔
... Storage tank ✘
... Storage tank ✘
... Storage tank ✘
... Storage tank ✘
... Storage tank ✘
... Storage tank ✘
... Instruction not followed ✘

Which crop is primarily cultivated in this area?

...
A. Winter triticale
B. Winter rapeseed
C. Beet
D. Mixed cereal
E. Void label
... Void label ✘
... Mixed cereal ✔
... Winter triticale ✘
... Mixed cereal ✔
... Mixed cereal ✔
... Winter rapeseed ✘
... Mixed cereal ✔
... Winter triticale ✘
... Mixed cereal ✔
... Winter triticale ✘

What type of facility or structure is depicted in this image?

...
A. Crop field
B. Dam
C. Lighthouse
D. Railway bridge
E. Water treatment facility
... Water treatment facility ✔
... Water treatment facility ✔
... Water treatment facility ✔
... Water treatment facility ✔
... Water treatment facility ✔
... Water treatment facility ✔
... Water treatment facility ✔
... Water treatment facility ✔
... Dam ✘
... Lighthouse ✘

Object Classification

The figure highlights model performance on object classification, showing success with familiar objects like the "Atago-class destroyer" and "Small Civil Transport/Utility" aircraft. However, models struggled with rarer objects like the "Murasame-class destroyer" and "Garibaldi aircraft carrier", indicating a need for improvement on less common classes and fine-grained recognition.

What type of ship is visible in this image?

...
A. Murasame-class destroyer
B. Kongo-class destroyer
C. Civil yacht
D. Arleigh Burke-class destroyer
E. Kitty Hawk-class aircraft carrier
... Kitty Hawk-class aircraft carrier ✘
... Kitty Hawk-class aircraft carrier ✘
... Kitty Hawk-class aircraft carrier ✘
... Kitty Hawk-class aircraft carrier ✘
... Kitty Hawk-class aircraft carrier ✘
... Arleigh Burke-class destroyer ✘
... Kitty Hawk-class aircraft carrier ✘
... Arleigh Burke-class destroyer ✘
... Kitty Hawk-class aircraft carrier ✘
... Kitty Hawk-class aircraft carrier ✘

What type of aircraft is visible in this image?

...
A. Military Bomber
B. Medium Civil Transport/Utility
C. Military Trainer
D. Small Civil Transport/Utility
E. Large Civil Transport/Utility
... Small Civil Transport/Utility ✔
... Small Civil Transport/Utility ✔
... Small Civil Transport/Utility ✔
... Small Civil Transport/Utility ✔
... Small Civil Transport/Utility ✔
... Small Civil Transport/Utility ✔
... Small Civil Transport/Utility ✔
... Small Civil Transport/Utility ✔
... Small Civil Transport/Utility ✔
... Instruction not followed ✘

What type of ship is visible in this image?

...
A. Civil yacht
B. INS Vikramaditya carrier
C. Atago-class destroyer
D. Garibaldi aircraft carrier
E. Kitty Hawk-class aircraft carrier
... INS Vikramaditya carrier ✘
... Kitty Hawk-class aircraft carrier ✘
... INS Vikramaditya carrier ✘
... Kitty Hawk-class aircraft carrier ✘
... Kitty Hawk-class aircraft carrier ✘
... INS Vikramaditya carrier ✘
... Kitty Hawk-class aircraft carrier ✘
... Kitty Hawk-class aircraft carrier ✘
... Instruction not followed ✘
... INS Vikramaditya carrier ✘

What type of ship is visible in this image?

...
A. Atago-class destroyer
B. Type 45 destroyer
C. Mega yacht
D. Mega yacht
E. Mistral-class amphibious assault ship
... Atago-class destroyer ✔
... Atago-class destroyer ✔
... Atago-class destroyer ✔
... Atago-class destroyer ✔
... Atago-class destroyer ✔
... Type 45 destroyer ✘
... Atago-class destroyer ✔
... Atago-class destroyer ✔
... Instruction not followed ✘
... Atago-class destroyer ✔

Counting

This figure showcases model performance on counting tasks, where Qwen2-VL, GPT-4o, and LLaVA-OneVision perform better at identifying and counting objects. Other models, such as Ferret, struggled with overestimation, highlighting challenges in object differentiation and spatial reasoning.

How many vehicles are visible in this image?

...
A. 69
B. 52
C. 103
D. 120
E. 86
... 103 ✘
... 103 ✘
... 103 ✘
... 103 ✘
... 69 ✘
... 103 ✘
... 69 ✘
... 69 ✘
... 69 ✘
... 69 ✘

How many water bodies can you identify in this image?

...
A. 3
B. 2
C. 0
D. 4
E. 1
... 1 ✔
... 1 ✔
... 1 ✔
... 3 ✘
... 3 ✘
... 0 ✘
... 1 ✔
... 4 ✘
... 3 ✘
... 2 ✘

How many medium-sized civil transport or utility aircraft are visible in this image?

...
A. 5
B. 2
C. 6
D. 4
E. 3
... 3 ✔
... 3 ✔
... 4 ✘
... 5 ✘
... 5 ✘
... 5 ✘
... 5 ✘
... 5 ✘
... 6 ✘
... 5 ✘

How many large civil transport or utility aircraft can you spot in this image?

...
A. 4
B. 2
C. 3
D. 5
E. 1
... 3 ✘
... 4 ✘
... 5 ✘
... 4 ✘
... 5 ✘
... 3 ✘
... 2 ✔
... 5 ✘
... 3 ✘
... 4 ✘

How many pickup trucks are visible in this image?

...
A. 0
B. 1
C. 4
D. 3
E. 2
... 1 ✔
... 1 ✔
... 1 ✔
... 3 ✘
... 0 ✘
... 4 ✘
... 2 ✘
... 3 ✘
... 0 ✘
... 4 ✘

How many pieces of marine debris are visible in this image?

...
A. 3
B. 4
C. 0
D. 1
E. 2
... 0 ✘
... 0 ✘
... 0 ✘
... 3 ✘
... 3 ✘
... 0 ✘
... 0 ✘
... 3 ✘
... 0 ✘
... 3 ✘

How many trees show light damage in this image?

...
A. 30
B. 18
C. 24
D. 36
E. 42
... 24 ✘
... 24 ✘
... 24 ✘
... 18 ✘
... 18 ✘
... 18 ✘
... 24 ✘
... 18 ✘
... 30 ✔
... 24 ✘

How many buildings can you identify in this image?

...
A. 168
B. 140
C. 84
D. 196
E. 112
... 84 ✘
... 196 ✘
... 196 ✘
... 196 ✘
... 168 ✘
... 168 ✘
... 168 ✘
... 168 ✘
... 168 ✘
... 168 ✘

Event

Model performance on disaster assessment tasks, with success in scenarios like 'fire' and 'flooding' but challenges in ambiguous cases like 'tsunami' and 'seismic activity'. Misclassifications highlight limitations in contextual reasoning and insufficient exposure to overlapping disaster features.

What was the primary trigger for this landslide?

...
A. Soil Erosion
B. Human Activities
C. Snow and Glacier Melting
D. Seismic Activity
E. Precipitation-Related Events
... Precipitation-Related Events ✘
... Seismic Activity ✔
... Precipitation-Related Events ✘
... Precipitation-Related Events ✘
... Soil Erosion ✘
... Human Activities ✘
... Precipitation-Related Events ✘
... Precipitation-Related Events ✘
... Precipitation-Related Events ✘
... Instruction not followed ✘

What type of disaster is responsible for the visible damage in this image?

...
A. earthquake
B. tsunami
C. volcano
D. wind
E. flooding
... flooding ✔
... flooding ✔
... flooding ✔
... flooding ✔
... flooding ✔
... flooding ✔
... flooding ✔
... flooding ✔
... flooding ✔
... Instruction not followed ✘

What type of disaster is responsible for the visible damage in this image?

...
A. volcano
B. flooding
C. tsunami
D. fire
E. earthquake
... fire ✔
... fire ✔
... fire ✔
... fire ✔
... fire ✔
... earthquake ✘
... fire ✔
... fire ✔
... fire ✔
... Instruction not followed ✘

What type of disaster is responsible for the visible damage in this image?

...
A. flooding
B. volcano
C. earthquake
D. wind
E. tsunami
... earthquake ✘
... earthquake ✘
... flooding ✘
... earthquake ✘
... flooding ✘
... flooding ✘
... flooding ✘
... earthquake ✘
... earthquake ✘
... Instruction not followed ✘

Spatial Relation

The figure demonstrates model performance on spatial relationship tasks, with success in close-object scenarios and struggles in cluttered environments with distant objects.

What is the relationship between object in green box and object in red box in this image?

...
A. A small-vehicle is driving by the bridge.
B. A plane is parked near the runway.
C. A helicopter is moving away from the helipad.
D. A helicopter is positioned beside the helipad.
E. A large-vehicle is moving towards the a roundabout.
... C. ✘
... D. ✘
... D. ✘
... D. ✘
... D. ✘
... D. ✘
... A. ✘
... D. ✘
... D. ✘
... A. ✘

What is the relationship between object in green box and object in red box in this image?

...
A. A helicopter is below the airport.
B. A small vehicle is to the left of the large vehicle.
C. A ship is to the right of the a small-vehicle.
D. A large-vehicle is positioned in front of the bridge.
E. A helicopter is above the helipad.
... B. ✘
... E. ✘
... B. ✘
... D. ✘
... D. ✘
... C. ✔
... A. ✘
... B. ✘
... E. ✘
... B. ✘

What is the relationship between object in green box and object in red box in this image?

...
A. A large vehicle is aligned with the bridge.
B. A helicopter is positioned next to the helipad.
C. A overpass leads to the a golffield.
D. A tennis court is beside the basketball court.
E. A runway connects to the airport.
... C. ✔
... C. ✔
... C. ✔
... C. ✔
... C. ✔
... C. ✔
... C. ✔
... C. ✔
... C. ✔
... C. ✔

What is the relationship between object in green box and object in red box in this image?

...
A. An a350 is aligned with the runway.
B. A plane is parked near the runway.
C. A large-vehicle is driving by the storage-tank.
D. A large-vehicle is moving towards the storage-tank.
E. A large-vehicle is moving away from the a roundabout.
... C. ✘
... E. ✔
... B. ✘
... D. ✘
... B. ✘
... B. ✘
... E. ✔
... B. ✘
... D. ✘
... A. ✘

BibTeX


    @misc{danish2024geobenchvlm,
        title={GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks}, 
        author={Muhammad Sohail Danish and Muhammad Akhtar Munir and Syed Roshaan Ali Shah and Kartik Kuckreja and Fahad Shahbaz Khan and Paolo Fraccaro and Alexandre Lacoste and Salman Khan},
        year={2024},
        eprint={2411.19325},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2411.19325}, 
    }
                

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We are thankful to LLaVA and Vicuna for releasing their models and code as open-source contributions.