VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

February 2026

tl;dr: A VLM is an important initialization for a VLA, but the vision encoder needs finetuning.

Overall impression

The most surprising finding is that the performance requirements for VLMs in embodied manipulation tasks do not fully align with their VQA capabilities. Contrary to common expectations, VLMs that perform well on general VQA benchmarks are not necessarily better when used in VLAs. Furthermore, the authors find that fine-tuning on most auxiliary Embodied-QA tasks degrades the performance of the resulting VLA.

Qwen-VL series models significantly outperform other VLMs when used as VLA backbones.
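
Since the actionable takeaway is to unfreeze the vision encoder when finetuning a VLM-initialized VLA, here is a minimal sketch of that recipe. The checkpoint name and the `visual` parameter prefix are assumptions based on the HuggingFace Qwen2-VL implementation, not details from the paper; other VLMs name their vision tower differently.

```python
# Minimal sketch: initialize a VLA backbone from a VLM checkpoint and keep
# the vision encoder trainable rather than frozen, per the paper's finding.
# Assumes a HuggingFace-style Qwen2-VL checkpoint; the "visual" substring
# match follows the HF Qwen2-VL module naming and may differ elsewhere.
import torch
from transformers import Qwen2VLForConditionalGeneration


def load_vla_backbone(ckpt="Qwen/Qwen2-VL-2B-Instruct", finetune_vision=True):
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        ckpt, torch_dtype=torch.bfloat16
    )
    for name, param in model.named_parameters():
        if "visual" in name:
            # Leaving the vision tower frozen is a common default in VLA
            # recipes, but per this paper it degrades manipulation performance.
            param.requires_grad = finetune_vision
        else:
            param.requires_grad = True
    return model


backbone = load_vla_backbone(finetune_vision=True)
trainable = sum(p.numel() for p in backbone.parameters() if p.requires_grad)
print(f"Trainable params: {trainable / 1e9:.2f}B")
```

Whether the vision tower should also get a lower learning rate than the language model is a separate design choice this sketch does not cover.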

Key ideas

Technical details

Notes