vllm.model_executor.layers.mamba.mamba2_metadata
Mamba2Metadata dataclass
chunk_offsets instance-attribute
 chunk_offsets: Tensor
With the continuous batching layout of x in vLLM, two supporting tensors (batch_ptr, token_chunk_offset_ptr) are used to enable a Triton program to handle a request in parallel. BLOCK_M is the number of tokens handled by one Triton program; it can be customized for different hardware.

nums_dict: tracks the data associated with a given value of BLOCK_M
    BLOCK_M: number of tokens handled by a Triton program
    cu_seqlen: total tokens per batch (used as a flag to update the other data at each new input)
    batch_ptr: tracks the batch id handled by the Triton program
    token_chunk_offset_ptr: tracks the token group_idx handled by the Triton program (the Triton implementation of causal_conv1d parallelizes over three axes: the feature axis, the batch axis, and the sequence axis)
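To make the mapping concrete, here is a small illustrative sketch (not the vLLM kernel-launch code itself) that derives batch_ptr and token_chunk_offset_ptr from a hypothetical query_start_loc for one value of BLOCK_M:

```python
import torch

# Hypothetical cumulative token offsets for 3 requests of 5, 2, and 9 tokens
# packed contiguously in x (continuous batching layout).
query_start_loc = torch.tensor([0, 5, 7, 16])
BLOCK_M = 4  # tokens handled by one Triton program

batch_ids, token_chunk_offsets = [], []
for batch_id in range(len(query_start_loc) - 1):
    seqlen = int(query_start_loc[batch_id + 1] - query_start_loc[batch_id])
    # number of BLOCK_M-sized token groups needed to cover this request
    n_groups = (seqlen + BLOCK_M - 1) // BLOCK_M
    for group_idx in range(n_groups):
        batch_ids.append(batch_id)             # which request this program handles
        token_chunk_offsets.append(group_idx)  # which token group within it

batch_ptr = torch.tensor(batch_ids)                         # [0, 0, 1, 2, 2, 2]
token_chunk_offset_ptr = torch.tensor(token_chunk_offsets)  # [0, 1, 0, 0, 1, 2]
```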
 __init__(
    has_initial_states: Tensor,
    prep_initial_states: bool,
    chunk_size: int,
    seq_idx: Tensor,
    chunk_indices: Tensor,
    chunk_offsets: Tensor,
    nums_dict: Optional[dict] = None,
    cu_seqlen: Optional[int] = None,
    batch_ptr: Optional[Tensor] = None,
    token_chunk_offset_ptr: Optional[Tensor] = None,
) -> None
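A minimal, hypothetical construction with dummy tensors; in practice instances come from prepare_mamba2_metadata() below rather than being built by hand, and the shapes used here are illustrative assumptions:

```python
import torch
from vllm.model_executor.layers.mamba.mamba2_metadata import Mamba2Metadata

num_tokens, chunk_size = 16, 8
metadata = Mamba2Metadata(
    has_initial_states=torch.zeros(2, dtype=torch.bool),  # per-request flags
    prep_initial_states=False,  # True when a cached SSM state must be loaded
    chunk_size=chunk_size,
    seq_idx=torch.zeros(1, num_tokens, dtype=torch.int32),  # request id per token
    chunk_indices=torch.arange(num_tokens // chunk_size, dtype=torch.int32),
    chunk_offsets=torch.zeros(num_tokens // chunk_size, dtype=torch.int32),
)
# nums_dict, cu_seqlen, batch_ptr, and token_chunk_offset_ptr stay None until
# update_metadata() fills them in for the current input.
```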
 
 get_platform_metadata_classes() -> tuple[
    type[AttentionMetadata], ...
]
Returns the appropriate metadata classes for the current platform.
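A sketch of typical usage, assuming `attn_metadata` comes from the current forward context; isinstance() accepts the returned tuple of types directly:

```python
from vllm.model_executor.layers.mamba.mamba2_metadata import (
    get_platform_metadata_classes)

# Validate that the current backend's metadata is one the Mamba2 path supports.
supported_types = get_platform_metadata_classes()
if not isinstance(attn_metadata, supported_types):
    raise ValueError(
        f"Unsupported attention metadata type: {type(attn_metadata)}")
```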
  
 prepare_mamba2_metadata(
    chunk_size: int,
    attn_metadata: AttentionMetadata,
    mamba2_metadata: Optional[Mamba2Metadata] = None,
) -> Mamba2Metadata
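A usage sketch, assuming `attn_metadata` was already built by the model runner for the current batch and that 256 is the model's Mamba2 chunk size (both assumptions here):

```python
from vllm.model_executor.layers.mamba.mamba2_metadata import (
    prepare_mamba2_metadata)

mamba2_metadata = prepare_mamba2_metadata(
    chunk_size=256,
    attn_metadata=attn_metadata,
)
# The optional mamba2_metadata argument presumably lets later calls reuse an
# existing instance instead of allocating and recomputing a fresh one.
```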
  
 update_metadata(
    x: Tensor,
    query_start_loc: Tensor,
    mamba2_metadata: Union[
        Mamba2Metadata, Mamba2AttentionMetadata
    ],
)
This is triggered upon handling a new input at the first layer; it refreshes the supporting data (cu_seqlen, batch_ptr, token_chunk_offset_ptr) described above for the new batch.
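A sketch with hypothetical shapes, continuing from the prepare_mamba2_metadata() example above: x is the continuously batched input, assumed here to be of shape (dim, total_tokens), and query_start_loc holds the cumulative per-request token offsets described in the class docstring.

```python
import torch

x = torch.randn(64, 16)                       # dim=64, total_tokens=16
query_start_loc = torch.tensor([0, 5, 7, 16])  # 3 requests: 5, 2, 9 tokens

# Updates the metadata's fields in place for the new input.
update_metadata(x, query_start_loc, mamba2_metadata)
# mamba2_metadata.cu_seqlen and the per-BLOCK_M entries in nums_dict
# (batch_ptr, token_chunk_offset_ptr) now reflect the new batch.
```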