
vLLM V1

V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on GitHub or in the vLLM Slack.

To disable V1, set the environment variable VLLM_USE_V1=0, and please open a GitHub issue sharing the reason!
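For example, in an offline Python workflow the variable can be set before vLLM is imported; a minimal sketch (the model name is just a placeholder):

```python
import os

# Opt out of the V1 engine. Setting this before importing vLLM ensures the
# engine sees the value.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # placeholder model
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```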

Why vLLM V1?

vLLM V1 re-architects the engine to reduce accumulated complexity while preserving the stable, battle-tested components users rely on (such as models, GPU kernels, and supporting utilities). The scheduler, KV cache manager, worker, sampler, and API server now operate within a cohesive framework that is easier to extend and maintain as new capabilities are added.

Specifically, V1 aims to:

  • Provide a simple, modular, and easy-to-hack codebase.
  • Ensure high performance with near-zero CPU overhead.
  • Combine key optimizations into a unified architecture.
  • Require zero configs by enabling features/optimizations by default.

We see significant performance improvements from upgrading to the V1 core engine, particularly for long-context scenarios. Please see the performance benchmark (to be added).

For more details, check out the vLLM V1 blog post vLLM V1: A Major Upgrade to vLLM's Core Architecture (published Jan 27, 2025).

This living user guide outlines a few known important changes and limitations introduced by vLLM V1. The team is actively working to make V1 the default engine for every use case, so this guide will be updated continually as more features gain V1 support.

Current Status

For each item, our progress towards V1 support falls into one of the following states:

  • 🚀 Optimized: Nearly fully optimized, with no further work currently planned.
  • 🟢 Functional: Fully operational, with ongoing optimizations.
  • 🚧 WIP: Under active development.
  • 🟡 Planned: Scheduled for future implementation (some may have open PRs/RFCs).
  • 🟠 Delayed: Temporarily dropped in V1 but planned to be re-introduced later.
  • 🔴 Deprecated: Not planned for V1 unless there is strong demand.

Note

vLLM V1's unified scheduler treats both prompt and output tokens the same way by using a simple dictionary (e.g., {request_id: num_tokens}) to dynamically allocate a fixed token budget per request, enabling features like chunked prefills, prefix caching, and speculative decoding without a strict separation between prefill and decode phases.

The V1 scheduler supports multiple scheduling policies, including First-Come, First-Served (FCFS) and priority-based scheduling (where requests are processed based on assigned priority, with FCFS as a tie-breaker), configurable via the --scheduling-policy argument.
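The following is purely an illustrative sketch, not vLLM's actual scheduler code: it shows how a fixed per-step token budget can be split across requests tracked in a {request_id: num_tokens} dictionary, chunking a long prefill while decodes proceed, with no separate prefill/decode phases.

```python
def allocate_tokens(num_tokens: dict[str, int], token_budget: int) -> dict[str, int]:
    """Greedily split a fixed token budget across requests in FCFS order.

    num_tokens maps request_id -> tokens the request still needs this step,
    regardless of whether they are prompt (prefill) or output (decode) tokens.
    """
    schedule: dict[str, int] = {}
    for request_id, needed in num_tokens.items():
        if token_budget == 0:
            break
        granted = min(needed, token_budget)  # a long prefill may be chunked
        schedule[request_id] = granted
        token_budget -= granted
    return schedule

# Two decoding requests take one token each; the rest of the budget chunks a
# 5000-token prefill, which continues in the next scheduling step.
print(allocate_tokens({"req-1": 1, "req-2": 1, "req-0": 5000}, token_budget=4096))
# {'req-1': 1, 'req-2': 1, 'req-0': 4094}
```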

Hardware

Hardware    Status
NVIDIA      🚀
AMD         🟢
INTEL GPU   🟢
TPU         🟢
CPU         🟢 (x86_64/aarch64), 🟡 (macOS)

Note

More hardware platforms may be supported via hardware plugins; please check the corresponding plugin repositories for more details.

Models

Model Type              Status
Decoder-only Models     🚀 Optimized
Encoder-Decoder Models  🟢 Whisper only
Embedding Models        🟢 Functional
Mamba Models            🟢 (Mamba-2), 🟢 (Mamba-1)
Multimodal Models       🟢 Functional

Tip

This corresponds to the V1 column in our list of supported models.

See below for the status of models that are not yet supported or have more features planned in V1.

Embedding Models

The initial basic support is now functional.

Later, we will consider using a hidden states processor, built on top of the global logits processor, to enable simultaneous generation and embedding using the same engine instance in V1.

Mamba Models

Models that use selective state-space mechanisms instead of standard transformer attention are supported. This includes pure Mamba-2 and Mamba-1 models (e.g., Mamba2ForCausalLM, MambaForCausalLM, FalconMambaForCausalLM).

Hybrid models that combine Mamba-2 and Mamba-1 layers with standard attention layers are also supported (e.g., BambaForCausalLM, Zamba2ForCausalLM, NemotronHForCausalLM, FalconH1ForCausalLM, GraniteMoeHybridForCausalLM, JambaForCausalLM, and Plamo2ForCausalLM).

Hybrid models with mechanisms different from Mamba are also supported (e.g., MiniMaxText01ForCausalLM, MiniMaxM1ForCausalLM, Lfm2ForCausalLM).

Please note that prefix caching is not yet supported for any of the above models.
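As a minimal offline-inference sketch (the checkpoint name below is only a placeholder for any of the architectures listed above), prefix caching can be left disabled explicitly for these models:

```python
from vllm import LLM

# Minimal sketch: the checkpoint is a placeholder for any Mamba or hybrid
# model listed above; prefix caching is disabled since it is not yet
# supported for these architectures.
llm = LLM(
    model="ibm-ai-platform/Bamba-9B",  # placeholder hybrid checkpoint
    enable_prefix_caching=False,
)
print(llm.generate("State-space models differ from attention in that")[0].outputs[0].text)
```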

Encoder-Decoder Models

Whisper is supported. Other models that require cross-attention between a separate encoder and decoder (e.g., BartForConditionalGeneration, MllamaForConditionalGeneration) are not supported.

Features

Feature                                     Status
Prefix Caching                              🚀 Optimized
Chunked Prefill                             🚀 Optimized
LoRA                                        🚀 Optimized
Logprobs Calculation                        🟢 Functional
FP8 KV Cache                                🟢 Functional on Hopper devices (Pull Request #15191)
Spec Decode                                 🚀 Optimized
Prompt Logprobs with Prefix Caching         🟡 Planned (RFC #13414)
Structured Output Alternative Backends      🟢 Functional
Request-level Structured Output Backend     🔴 Deprecated
best_of                                     🔴 Deprecated (RFC #13361)
Per-Request Logits Processors               🔴 Deprecated (RFC #13360)
GPU <> CPU KV Cache Swapping                🔴 Deprecated
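As one example from the table above, the FP8 KV cache can be enabled explicitly via the kv_cache_dtype engine argument; a minimal sketch (placeholder model, and actual FP8 support depends on your device):

```python
from vllm import LLM

# Minimal sketch: enable the FP8 KV cache. The model is a placeholder and
# FP8 KV cache support depends on the hardware (e.g., Hopper devices).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    kv_cache_dtype="fp8",
)
```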

Semantic Changes to Logprobs

vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic changes to consider:

Logprobs Calculation

By default, logprobs in V1 are now returned immediately once computed from the model's raw output (i.e. before applying any logits post-processing such as temperature scaling or penalty adjustments). As a result, the returned logprobs do not reflect the final adjusted probabilities used during sampling.

You can adjust this behavior by setting the --logprobs-mode flag. Four modes are supported: raw_logprobs (default), processed_logprobs, raw_logits, and processed_logits. Raw means the values before applying any logits processors (such as bad-words filtering). Processed means the values after applying all processors, including temperature scaling and top_k/top_p.
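A minimal offline sketch of requesting logprobs (placeholder model); the logprobs_mode engine argument is assumed here to mirror the --logprobs-mode CLI flag:

```python
from vllm import LLM, SamplingParams

# Minimal sketch (placeholder model). `logprobs_mode` is assumed to be the
# engine-argument counterpart of the --logprobs-mode CLI flag.
llm = LLM(model="facebook/opt-125m", logprobs_mode="processed_logprobs")

# Return the log-probabilities of the 5 most likely tokens at each step.
params = SamplingParams(temperature=0.7, logprobs=5)
output = llm.generate("The capital of France is", params)[0]
print(output.outputs[0].logprobs)
```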

Prompt Logprobs with Prefix Caching

Logprobs are not cached. For a request requiring prompt logprobs, the engine ignores the prefix cache and recomputes the prefill of the full prompt to generate the logprobs.
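For reference, prompt logprobs are requested per sampling call; a minimal sketch (placeholder model), keeping in mind that such requests recompute the full prefill even with prefix caching enabled:

```python
from vllm import LLM, SamplingParams

# Minimal sketch (placeholder model). Requesting prompt_logprobs bypasses the
# prefix cache and triggers a full prefill recomputation for this request.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

params = SamplingParams(max_tokens=16, prompt_logprobs=1)
output = llm.generate("vLLM V1 recomputes the prefill for this prompt.", params)[0]
print(output.prompt_logprobs)
```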

Deprecated Features

As part of the major architectural rework in vLLM V1, several legacy features have been deprecated.

Sampling features
  • best_of: This feature has been deprecated due to limited usage. See details at RFC #13361.
  • Per-Request Logits Processors: Previously, users could pass custom processing functions to adjust logits on a per-request basis. In vLLM V1, this feature has been deprecated. Instead, the design is moving toward supporting global logits processors, a feature the team is actively working on for future releases. See details at RFC #13360.
KV Cache features
  • GPU <> CPU KV Cache Swapping: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping to handle request preemptions.
Structured Output features
  • Request-level Structured Output Backend: Deprecated; alternative backends (outlines, guidance) with fallbacks are supported now.
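For illustration, a minimal sketch of using one of the alternative backends (placeholder model; the guided_decoding_backend engine argument and GuidedDecodingParams are assumed to be available in your vLLM version):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Minimal sketch (placeholder model). The backend is assumed to be selected at
# the engine level via `guided_decoding_backend` rather than per request.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", guided_decoding_backend="guidance")

params = SamplingParams(
    max_tokens=16,
    guided_decoding=GuidedDecodingParams(choice=["positive", "negative"]),
)
print(llm.generate("Sentiment of 'great movie!':", params)[0].outputs[0].text)
```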