Our paper titled "Breaking the Ice: Analyzing Cold Start Latency in vLLM" has been accepted to the 9th Annual Conference on Machine Learning and Systems (MLSys'26), which will be held in Bellevue, WA, on May 18-22, 2026. MLSys is a flagship conference at the intersection of machine learning and systems. The work was led by our Master's student Huzaifa Shaaban Kabakibo, in collaboration with IBM Research Europe in Zurich.
This paper provides a comprehensive analysis of vLLM startup latency. vLLM has become the de facto large language model (LLM) serving engine for major cloud providers, who rely on autoscaling to adjust resource provisioning for LLM serving and save costs. Autoscaling, in turn, requires LLM instances to be brought up and torn down in a timely fashion. However, instances served with vLLM suffer from high initialization latency, a cost that must be understood in detail before it can be reduced. Our work addresses this important and timely challenge with a systematic characterization of vLLM's startup process.
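For readers who want to observe this cold start first-hand, the sketch below times how long a vLLM engine takes to initialize before it can answer its first request. It is a minimal illustration only, assuming vLLM is installed and the chosen model fits on the local GPU; the model name is just an example.

```python
# Minimal sketch: measure vLLM cold start (engine initialization) time.
# Assumes vLLM is installed and a GPU with enough memory for the model
# is available; "facebook/opt-125m" is only an illustrative model choice.
import time

from vllm import LLM, SamplingParams

start = time.perf_counter()
llm = LLM(model="facebook/opt-125m")  # engine construction: weight loading, memory profiling, etc.
init_time = time.perf_counter() - start

start = time.perf_counter()
llm.generate(["Hello"], SamplingParams(max_tokens=8))  # first request after the cold start
first_request_time = time.perf_counter() - start

print(f"Engine initialization: {init_time:.2f} s")
print(f"First generation:      {first_request_time:.2f} s")
```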
Here is the abstract of the paper:
As scalable inference services become popular, the cold start latency of an inference engine becomes important. Today, vLLM has evolved into the de facto inference engine of choice for many inference workloads. Despite its popularity, its complexity and rapid evolution mean that there has been no systematic study of its engine startup latency. With major architectural innovations landing in the engine (e.g., the V1 API and the introduction of torch.compile), we present in this paper the first detailed performance characterization of vLLM startup latency. We break down the startup process into six foundational steps and demonstrate that this process is predominantly CPU-bound. Each step exhibits consistent and interpretable scaling trends with respect to model- and system-level parameters, enabling fine-grained attribution of latency sources. Building on these insights, we develop a lightweight analytical model that accurately predicts vLLM’s startup latency for a given hardware configuration, providing actionable guidance for serverless scheduling and resource planning in large-scale inference environments.
Breaking the Ice: Analyzing Cold Start Latency in vLLM
Huzaifa Shaaban Kabakibo (Paderborn University), Animesh Trivedi (IBM Research Europe, Zurich), Lin Wang (Paderborn University)
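To give a flavor of what an analytical model of this kind looks like, here is a purely illustrative sketch of an additive per-step latency predictor. The step names, parameters, and cost functions are hypothetical placeholders chosen for illustration; they are not the model developed in the paper.

```python
# Illustrative sketch only: an additive per-step startup-latency model in the
# spirit described in the abstract. All names and cost terms below are
# hypothetical placeholders, not the paper's actual formulation.
from dataclasses import dataclass


@dataclass
class StartupConfig:
    model_size_gb: float        # size of the model weights on disk
    disk_bandwidth_gbps: float  # sustained read bandwidth of the storage tier
    fixed_overhead_s: float     # constant costs (imports, engine setup, compilation)


def predict_startup_latency(cfg: StartupConfig) -> float:
    """Toy additive model: fixed overheads plus a weight-loading term."""
    weight_load_s = cfg.model_size_gb / cfg.disk_bandwidth_gbps
    return cfg.fixed_overhead_s + weight_load_s


# Example: a 14 GB model read at 2 GB/s with ~20 s of fixed overhead.
print(predict_startup_latency(StartupConfig(14.0, 2.0, 20.0)))  # ~27 s
```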