vllm.attention.backends.abstract
 
 AttentionBackend ¶

  Bases: ABC

Abstract class for attention backends.
  
 abstractmethod staticmethod ¶
 get_builder_cls() -> Type[AttentionMetadataBuilder]

 abstractmethod staticmethod ¶
 get_impl_cls() -> Type[AttentionImpl]

 abstractmethod staticmethod ¶
 get_metadata_cls() -> Type[AttentionMetadata]

 abstractmethod staticmethod ¶
 get_state_cls() -> Type[AttentionState]

 classmethod ¶
 make_metadata(*args, **kwargs) -> AttentionMetadata
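Taken together, these hooks make a backend a factory for its implementation, metadata, state, and builder classes. The sketch below is a minimal, hypothetical example (the names MyBackend and MyMetadata are made up for illustration, and only the metadata hook is filled in); it also shows make_metadata constructing the class returned by get_metadata_cls:

```python
from dataclasses import dataclass
from typing import Optional, Type

import torch

from vllm.attention.backends.abstract import (AttentionBackend,
                                              AttentionMetadata)


@dataclass
class MyMetadata(AttentionMetadata):
    """Hypothetical metadata subclass; real backends add their own fields."""

    @property
    def prefill_metadata(self) -> Optional["MyMetadata"]:
        # A real backend returns a view restricted to the prefill requests.
        return self if self.num_prefills > 0 else None

    @property
    def decode_metadata(self) -> Optional["MyMetadata"]:
        return self if self.num_decode_tokens > 0 else None


class MyBackend(AttentionBackend):
    """Hypothetical backend; only the metadata hook is implemented here."""

    @staticmethod
    def get_metadata_cls() -> Type[AttentionMetadata]:
        return MyMetadata

    # get_impl_cls / get_state_cls / get_builder_cls would return the
    # backend's concrete AttentionImpl / AttentionState /
    # AttentionMetadataBuilder subclasses (omitted in this sketch).


# make_metadata() instantiates the class returned by get_metadata_cls().
md = MyBackend.make_metadata(
    num_prefills=1,
    num_prefill_tokens=8,
    num_decode_tokens=0,
    slot_mapping=torch.zeros(8, dtype=torch.long),
    multi_modal_placeholder_index_maps=None,
    enable_kv_scales_calculation=False,
)
print(md.prefill_metadata is md, md.decode_metadata)  # True None
```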
    
 AttentionImpl ¶

  Bases: ABC, Generic[T]

 abstractmethod ¶
 __init__(
    num_heads: int,
    head_size: int,
    scale: float,
    num_kv_heads: Optional[int] = None,
    alibi_slopes: Optional[List[float]] = None,
    sliding_window: Optional[int] = None,
    kv_cache_dtype: str = "auto",
    logits_soft_cap: Optional[float] = None,
    attn_type: str = DECODER,
    kv_sharing_target_layer_name: Optional[str] = None,
) -> None
  abstractmethod  ¶
 forward(
    layer: AttentionLayer,
    query: Tensor,
    key: Tensor,
    value: Tensor,
    kv_cache: Tensor,
    attn_metadata: T,
    output: Optional[Tensor] = None,
    output_scale: Optional[Tensor] = None,
) -> Tensor
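To make the contract concrete, here is a heavily simplified, hypothetical implementation (MyImpl is an invented name). It ignores kv_cache, request boundaries, causal masking and the output/output_scale buffers, and assumes num_kv_heads == num_heads, so it only illustrates where a real kernel call would go:

```python
from typing import List, Optional

import torch

from vllm.attention.backends.abstract import (AttentionImpl, AttentionLayer,
                                              AttentionMetadata, AttentionType)


class MyImpl(AttentionImpl[AttentionMetadata]):

    def __init__(
        self,
        num_heads: int,
        head_size: int,
        scale: float,
        num_kv_heads: Optional[int] = None,
        alibi_slopes: Optional[List[float]] = None,
        sliding_window: Optional[int] = None,
        kv_cache_dtype: str = "auto",
        logits_soft_cap: Optional[float] = None,
        attn_type: str = AttentionType.DECODER,
        kv_sharing_target_layer_name: Optional[str] = None,
    ) -> None:
        self.num_heads = num_heads
        self.head_size = head_size
        self.scale = scale
        # Grouped-query attention: treat a missing value as plain MHA.
        self.num_kv_heads = num_kv_heads or num_heads

    def forward(
        self,
        layer: AttentionLayer,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: torch.Tensor,
        attn_metadata: AttentionMetadata,
        output: Optional[torch.Tensor] = None,
        output_scale: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # Sketch only: a single non-causal SDPA call over the flattened
        # token dimension instead of a paged-KV-cache kernel.
        assert self.num_kv_heads == self.num_heads, "sketch assumes MHA"
        q = query.view(-1, self.num_heads, self.head_size).transpose(0, 1)
        k = key.view(-1, self.num_kv_heads, self.head_size).transpose(0, 1)
        v = value.view(-1, self.num_kv_heads, self.head_size).transpose(0, 1)
        out = torch.nn.functional.scaled_dot_product_attention(
            q, k, v, scale=self.scale)
        return out.transpose(0, 1).reshape(-1, self.num_heads * self.head_size)
```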
  
 fused_output_quant_supported(
    dtype: dtype, static: bool, group_shape: GroupShape
)
Does this attention implementation support fused output quantization? This is used by the AttnFusionPass to only fuse output quantization onto implementations that support it.

TODO(luka) merge parameters into QuantDescriptor

:param dtype: quantized dtype
:param static: static or dynamic quantization
:param group_shape: quant group shape
:return: is fusion supported for this type of quantization
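For illustration, an implementation that can emit statically quantized fp8 output might override it roughly like this (a hedged sketch; MyQuantAwareImpl is a made-up name, the remaining abstract methods are omitted, and group_shape handling is skipped):

```python
import torch

from vllm.attention.backends.abstract import AttentionImpl


class MyQuantAwareImpl(AttentionImpl):  # other abstract methods omitted

    def fused_output_quant_supported(self, dtype, static, group_shape):
        # Claim support only for static fp8 output quantization; a real
        # implementation would also inspect group_shape.
        return dtype == torch.float8_e4m3fn and static
```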
  
 AttentionLayer ¶

  Bases: Protocol
 AttentionMetadata dataclass ¶

Attention metadata for prefill and decode batched together.
  abstractmethod property  ¶
 decode_metadata: Optional[AttentionMetadata]
Return the attention metadata that's required to run decode attention.
   abstractmethod property  ¶
 prefill_metadata: Optional[AttentionMetadata]
Return the attention metadata that's required to run prefill attention.
 
 __init__(
    num_prefills: int,
    num_prefill_tokens: int,
    num_decode_tokens: int,
    slot_mapping: Tensor,
    multi_modal_placeholder_index_maps: Optional[
        Dict[str, IndexMap]
    ],
    enable_kv_scales_calculation: bool,
) -> None
 
 asdict_zerocopy ¶

Similar to dataclasses.asdict, but avoids deepcopying.
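The prefill_metadata and decode_metadata properties exist so a caller can split a batched-together metadata object back into its prefill and decode halves. A minimal consumer-side sketch (the kernel calls are placeholders):

```python
from vllm.attention.backends.abstract import AttentionMetadata


def run_attention(metadata: AttentionMetadata) -> None:
    prefill_meta = metadata.prefill_metadata
    if prefill_meta is not None:
        pass  # the prefill attention kernel would consume prefill_meta here

    decode_meta = metadata.decode_metadata
    if decode_meta is not None:
        pass  # the decode attention kernel would consume decode_meta here
```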
  
 AttentionMetadataBuilder ¶

Abstract class for attention metadata builders.
  abstractmethod  ¶
 __init__(
    input_builder: ModelRunnerInputBuilderBase,
) -> None
Create the builder, remember some configuration and parameters.
 
 AttentionState ¶

  Bases: ABC, Generic[T]

Holds attention backend-specific objects reused during the lifetime of the model runner.
  abstractmethod  ¶
 __init__(runner: ModelRunnerBase)
 abstractmethod  ¶
 begin_forward(model_input: ModelRunnerInputBase) -> None
 abstractmethod  ¶
 get_graph_input_buffers(
    attn_metadata: T,
    is_encoder_decoder_model: bool = False,
) -> Dict[str, Any]
Get attention-specific input buffers for CUDA graph capture.
 abstractmethod  ¶
 graph_capture_get_metadata_for_batch(
    batch_size: int, is_encoder_decoder_model: bool = False
) -> T
Get attention metadata for CUDA graph capture of batch_size.
 abstractmethod  ¶
 graph_clone(batch_size: int) -> AttentionState[T]
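A hypothetical skeleton covering only the methods documented above (MyState is an invented name, the model-runner import path is assumed, and the abstract base may define further hooks that are omitted here):

```python
from typing import Any, Dict, TYPE_CHECKING

from vllm.attention.backends.abstract import AttentionMetadata, AttentionState

if TYPE_CHECKING:
    # Import path assumed; used only for type checking.
    from vllm.worker.model_runner_base import (ModelRunnerBase,
                                               ModelRunnerInputBase)


class MyState(AttentionState[AttentionMetadata]):
    """Hypothetical state object; the CUDA-graph hooks are left as stubs."""

    def __init__(self, runner: "ModelRunnerBase"):
        # Keep a handle to the runner; real states also allocate persistent
        # buffers reused across CUDA graph replays.
        self.runner = runner

    def begin_forward(self, model_input: "ModelRunnerInputBase") -> None:
        # Called at the start of each forward pass; nothing to do in this sketch.
        pass

    def graph_clone(self, batch_size: int) -> "MyState":
        return MyState(self.runner)

    def graph_capture_get_metadata_for_batch(
            self, batch_size: int,
            is_encoder_decoder_model: bool = False) -> AttentionMetadata:
        raise NotImplementedError("sketch: no persistent capture buffers")

    def get_graph_input_buffers(
            self, attn_metadata: AttentionMetadata,
            is_encoder_decoder_model: bool = False) -> Dict[str, Any]:
        return {"slot_mapping": attn_metadata.slot_mapping}
```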
 
 AttentionType ¶

Attention type. Use string to be compatible with torch.compile.
  
 MLAAttentionImpl ¶

  Bases: AttentionImpl[T], Generic[T]
  abstractmethod  ¶
 forward(
    layer: AttentionLayer,
    hidden_states_or_cq: Tensor,
    kv_c_normed: Tensor,
    k_pe: Tensor,
    kv_cache: Tensor,
    attn_metadata: T,
    output: Optional[Tensor] = None,
    output_scale: Optional[Tensor] = None,
) -> Tensor
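Compared to AttentionImpl.forward, the MLA variant takes the latent (compressed) KV tensors instead of full key/value tensors. A signature-only sketch (MyMLAImpl is a made-up name and the body is a placeholder, not a real MLA kernel):

```python
from typing import Optional

import torch

from vllm.attention.backends.abstract import (AttentionLayer,
                                              AttentionMetadata,
                                              MLAAttentionImpl)


class MyMLAImpl(MLAAttentionImpl[AttentionMetadata]):

    def forward(
        self,
        layer: AttentionLayer,
        hidden_states_or_cq: torch.Tensor,  # hidden states or compressed query
        kv_c_normed: torch.Tensor,          # normalized compressed KV latent
        k_pe: torch.Tensor,                 # decoupled rotary key component
        kv_cache: torch.Tensor,
        attn_metadata: AttentionMetadata,
        output: Optional[torch.Tensor] = None,
        output_scale: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        raise NotImplementedError("sketch only: a real MLA kernel goes here")
```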