Model | Aircraft Type Classification | Building Counting | Crop Type Classification | Disaster Type Classification | Fire Risk Assessment | General Vehicle Counting | Land Use Classification | Marine Debris Counting | Scene Classification | Ship Type Classification | Spatial Relation Classification | Specific Aircraft Type Counting | Specific Vehicle Type Counting | Tree Health Assessment | Trees Counting | Water Bodies Counting | Average |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Qwen2-VL | 0.5900 | 0.2882 | 0.2727 | 0.5507 | 0.4333 | 0.2733 | 0.5829 | 0.2800 | 0.8145 | 0.4757 | 0.6867 | 0.3375 | 0.2366 | 0.2166 | 0.2000 | 0.5529 | 0.3995 |
GPT-4o | 0.6200 | 0.3059 | 0.1818 | 0.5991 | 0.2000 | 0.1600 | 0.6341 | 0.2000 | 0.8627 | 0.6165 | 0.7060 | 0.3667 | 0.2455 | 0.2166 | 0.2706 | 0.5647 | 0.3971 |
LLaVA-OneVision | 0.6100 | 0.3059 | 0.2727 | 0.5154 | 0.2333 | 0.3000 | 0.5732 | 0.2400 | 0.7976 | 0.3447 | 0.6931 | 0.4000 | 0.2857 | 0.2229 | 0.3412 | 0.4941 | 0.3900 |
GeoChat | 0.4500 | 0.2529 | 0.2182 | 0.3921 | 0.2667 | 0.1667 | 0.5146 | 0.2200 | 0.8096 | 0.3592 | 0.5665 | 0.2583 | 0.1920 | 0.1146 | 0.2353 | 0.3882 | 0.3179 |
LLaVA-1.5 | 0.4500 | 0.2529 | 0.2727 | 0.3612 | 0.2000 | 0.1400 | 0.5195 | 0.2800 | 0.7084 | 0.3495 | 0.4421 | 0.2208 | 0.1964 | 0.1720 | 0.2471 | 0.3294 | 0.3025 |
SPHINX | 0.2400 | 0.2353 | 0.2545 | 0.2952 | 0.4000 | 0.1533 | 0.4415 | 0.1400 | 0.7036 | 0.2512 | 0.5644 | 0.1786 | 0.2411 | 0.1338 | 0.2471 | 0.4706 | 0.3012 |
InternVL-2 | 0.3600 | 0.2647 | 0.2182 | 0.3965 | 0.1667 | 0.1467 | 0.4610 | 0.3400 | 0.6964 | 0.3058 | 0.4979 | 0.2458 | 0.2277 | 0.2166 | 0.1882 | 0.3765 | 0.3005 |
LLaVA-NeXT | 0.4800 | 0.2353 | 0.2000 | 0.3656 | 0.3000 | 0.1467 | 0.4732 | 0.2800 | 0.7108 | 0.2816 | 0.3948 | 0.2125 | 0.2009 | 0.1592 | 0.2353 | 0.3176 | 0.2937 |
RS-LLaVA | 0.4200 | 0.2471 | 0.2000 | 0.3568 | 0.2000 | 0.1200 | 0.4171 | 0.1000 | 0.7277 | 0.2379 | 0.3777 | 0.2250 | 0.1473 | 0.1338 | 0.2471 | 0.3412 | 0.2646 |
Ferret | 0.0800 | 0.1941 | 0.1636 | 0.0573 | 0.0667 | 0.1933 | 0.2049 | 0.2800 | 0.2410 | 0.2126 | 0.2060 | 0.1714 | 0.2232 | 0.1720 | 0.2353 | 0.1176 | 0.1805 |
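As a side note for readers who want to aggregate these scores differently, the sketch below (not part of the benchmark code) loads a few of the per-task scores into pandas and computes an unweighted macro average per model. The reported Average column is not the simple mean of the sixteen task columns (Qwen2-VL's task scores average to roughly 0.425 rather than 0.3995), so the per-task question counts presumably factor into the published figure.

```python
# Illustrative only (not the benchmark's evaluation code): compute an
# unweighted macro average over a subset of the task columns above.
# Only two models and three tasks are filled in for brevity.
import pandas as pd

scores = pd.DataFrame(
    {
        "Aircraft Type Classification": [0.5900, 0.6200],
        "Scene Classification": [0.8145, 0.8627],
        "Water Bodies Counting": [0.5529, 0.5647],
        # ... remaining task columns from the table above ...
    },
    index=["Qwen2-VL", "GPT-4o"],
)

macro_avg = scores.mean(axis=1)  # unweighted mean across the included tasks
print(macro_avg.sort_values(ascending=False))
```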
The table below compares generic and geospatial-specific benchmarks, highlighting their focus areas, data sources, and evaluation formats. It showcases the diversity of modalities, task types, and annotation approaches across various benchmarks, emphasizing the unique requirements and challenges of geospatial datasets in contrast to generic ones.
Benchmark | Domain | Modalities | Data Sources | Answer Type | Annotation Type | Human Verified | Year | RS Category |
---|---|---|---|---|---|---|---|---|
CulturalVQA | General | O | Curated | FF | M | ✓ | 2024 | N/A |
EXAMS-V | General | O | Academic Exams | MCQ | M | - | 2024 | N/A |
M4U | General | O | Academic Exams | MCQ | M | ✓ | 2024 | N/A |
MMMU | General | O | Academic Exams | FF, MCQ | M | ✓ | 2024 | N/A |
MME | General | O | Various Open-Source | Yes/No | M | ✓ | 2024 | N/A |
MMBench | General | O | Various Open-Source | MCQ | A+M | ✓ | 2024 | N/A |
MMStar | General | O | Existing Benchmarks | MCQ | M | ✓ | 2024 | N/A |
LAMM | General | O, PC | Various Open-Source | FF | A+M | ✓ | 2023 | N/A |
SEED-Bench | General | O, V | Various Open-Source | MCQ | A+M | ✓ | 2023 | N/A |
SEED-Bench2 | General | O, MI, V | Various Open-Source | MCQ | A+M | ✓ | 2024 | N/A |
SEED-Bench-H | General | O, MI, V | Public Data / Curated | MCQ | M | ✓ | 2024 | N/A |
RSIEval | RS | O, PAN | DOTA | FF | M | ✓ | 2023 | 6 |
LHRS-Bench | RS | O | GE + OSM | SC | M | ✓ | 2024 | 4 |
EarthGPT | RS | O, IR, SAR | DRSD | FF, BBox | A+M | - | 2024 | 5 |
SkyEyeGPT | RS | O, V | DRSD | FF, MCQ | A+M | ✓ | 2024 | 6 |
Fit-RSRC | RS | O | DRSD | FF, MCQ | A+M | ✓ | 2024 | 1 |
Fit-RSFG | RS | O | DRSD | FF, MCQ | A+M | ✓ | 2024 | 5 |
VRSBench | RS | O | DOTA, DIOR | FF, BBox | A+M | ✓ | 2024 | 3 |
VLEO-Bench | RS | O, BT | DRSD | FF, BBox, MCQ | A+M | ✓ | 2024 | 6 |
EarthVQA | RS | O | DRSD | FF | A+M | ✓ | 2023 | 5 |
RemoteCount | RS | O | DOTA | SC | A+M | ✓ | 2024 | 1 |
FineGrip | RS | O | MAR20 | FF, Seg | A+M | ✓ | 2024 | 5 |
GeoChat-Bench | RS | O | DRSD | FF, BBox | A+M | ✓ | 2023 | 6 |
GEOBench-VLM | RS | O, MS, SAR, BT, MT | DRSD | MCQ, BBox, Seg | A+M | ✓ | 2024 (Ours) | 8 |
GEOBench-VLM is a comprehensive benchmark for VLMs spanning numerous geospatial tasks. It evaluates VLMs across eight core task categories, assessing their ability to interpret complex spatial data, classify scenes, identify and localize objects, detect events, generate captions, segment regions, analyze temporal changes, and process non-optical data. Tasks range from classifying landscapes and objects (e.g., land use, crop types, ships, aircraft) to counting, hazard detection, and disaster-impact assessment, testing VLMs' spatial reasoning.
Our pipeline integrates diverse datasets, automated tools, and manual annotation. Tasks such as scene understanding, object classification, and non-optical analysis are built from classification datasets, with GPT-4o generating unique MCQs with five options: one correct answer, one semantically similar "closest" option, and three plausible alternatives. Spatial relationship tasks rely on manually annotated object-pair relationships, with consistency ensured through cross-verification. Caption generation leverages GPT-4o, combining image content, object details, and spatial interactions, with manual refinement for high precision.
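As a rough illustration of the MCQ construction described above, the sketch below (not the authors' code) builds a five-option question from a ground-truth classification label: one correct answer, one "closest" distractor, and three further plausible alternatives. Plain string similarity from difflib stands in for the semantic-similarity judgment that GPT-4o provides in the actual pipeline, and the label vocabulary is a made-up example.

```python
# Minimal sketch, not the GEOBench-VLM pipeline: assemble a five-option MCQ
# (one correct answer, one "closest" distractor, three plausible alternatives)
# from a ground-truth label and a label vocabulary. difflib string similarity
# is a stand-in for GPT-4o's semantic-similarity judgment.
import difflib
import random


def build_mcq(correct: str, vocabulary: list[str], seed: int = 0) -> dict:
    """Return shuffled options A-E and the letter of the correct answer."""
    rng = random.Random(seed)
    candidates = [c for c in vocabulary if c != correct]
    # "Closest" distractor: highest string similarity to the correct label.
    closest = max(
        candidates,
        key=lambda c: difflib.SequenceMatcher(None, correct, c).ratio(),
    )
    # Three further plausible alternatives drawn from the remaining labels.
    others = rng.sample([c for c in candidates if c != closest], k=3)
    options = [correct, closest] + others
    rng.shuffle(options)
    letters = "ABCDE"
    return {
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(correct)],
    }


if __name__ == "__main__":
    labels = [
        "Atago-class destroyer",
        "Arleigh Burke-class destroyer",
        "Murasame-class destroyer",
        "Kitty Hawk-class aircraft carrier",
        "Garibaldi aircraft carrier",
        "Type 45 destroyer",
    ]
    print(build_mcq("Atago-class destroyer", labels))
```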
The figure below illustrates model performance on geospatial scene-understanding tasks, highlighting successes in clear contexts and challenges in ambiguous scenes. The results underscore the importance of contextual reasoning and of resolving overlapping visual cues for accurate classification.
*[Qualitative figure omitted: per-sample model predictions for scene-understanding examples, including "Solar farm", "Stadium", fire-risk levels ("Low", "High", "Non-burnable"), "Interchange"/"Road bridge", "Harbor", "Storage tank"/"Wastewater treatment plant", crop types ("Mixed cereal", "Winter triticale", "Winter rapeseed"), "Water treatment facility", and occasional "Instruction not followed" responses.]*
The figure highlights model performance on object classification, showing success with familiar objects like the "Atago-class destroyer" and "Small Civil Transport/Utility" aircraft. However, models struggled with rarer objects like the "Murasame-class destroyer" and the "Garibaldi aircraft carrier", indicating a need for improvement on less common classes and fine-grained recognition.
*[Qualitative figure omitted: per-sample model predictions for ship and aircraft examples, including "Kitty Hawk-class aircraft carrier", "Arleigh Burke-class destroyer", "INS Vikramaditya carrier", "Atago-class destroyer", "Type 45 destroyer", and "Small Civil Transport/Utility".]*
This figure showcases model performance on counting tasks, where Qwen2-VL, GPT-4o, and LLaVA-OneVision identify object counts more reliably. Other models, such as Ferret, struggled with overestimation, highlighting challenges in object differentiation and spatial reasoning.
*[Qualitative figure omitted: per-sample numeric predictions for counting examples, ranging from small counts (0-6) to larger ones such as 18, 24, 69, 103, 168, and 196.]*
Model performance on disaster assessment tasks, with success in scenarios like "fire" and "flooding" but challenges in ambiguous cases like "tsunami" and "seismic activity". Misclassifications highlight limitations in contextual reasoning and insufficient exposure to overlapping disaster features.
*[Qualitative figure omitted: per-sample model predictions for disaster examples, including "Precipitation-Related Events", "Seismic Activity", "Soil Erosion", "Human Activities", "flooding", "fire", and "earthquake".]*
The figure demonstrates model performance on spatial relationship tasks, with success in close-object scenarios and struggles in cluttered environments with distant objects.
*[Qualitative figure omitted: per-sample multiple-choice selections (A-E) for spatial-relation examples.]*
@misc{danish2024geobenchvlm,
title={GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks},
author={Muhammad Sohail Danish and Muhammad Akhtar Munir and Syed Roshaan Ali Shah and Kartik Kuckreja and Fahad Shahbaz Khan and Paolo Fraccaro and Alexandre Lacoste and Salman Khan},
year={2024},
eprint={2411.19325},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.19325},
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We are thankful to LLaVA and Vicuna for releasing their models and code as open-source contributions.