Our paper titled "Before the First Token: Benchmarking Data Preprocessing in Vision-Language Models" has been accepted to the 6th Workshop on Machine Learning and Systems (EuroMLSys), to be held in Edinburgh, UK, on April 27, 2026. EuroMLSys 2026 is co-located with EuroSys 2026. This work was done by our PhD student Sepideh Zohdi.
The paper presents a comprehensive benchmarking study of the performance bottlenecks in serving systems for vision-language models (VLMs), which take multi-modal inputs (e.g., images or videos) that require complex preprocessing before being fed into the large language model (LLM) for generation.
Here is the abstract of the paper:
As vision-language models (VLMs) become widely used for video understanding, the sheer volume of spatiotemporal data they process presents a critical computational challenge. Current efforts have predominantly focused on accelerating token generation by large language models (LLMs), overlooking the preprocessing required to prepare the input data. In this paper, we systematically benchmark and analyze the performance of the end-to-end VLM pipeline. Through a stage-by-stage latency characterization across diverse real-world datasets, we reveal that data preprocessing, spanning both CPU-bound video decoding and GPU-bound vision encoding, is more frequently the critical performance bottleneck than the actual generation. Moreover, this preprocessing overhead remains significant across varying input characteristics and hardware specifications. Our results underscore the urgent need for holistic, end-to-end performance optimization of VLM pipelines.
Before the First Token: Benchmarking Data Preprocessing in Vision-Language Models
Sepideh Zohdi (Paderborn University), Lin Wang (Paderborn University)