vllm.model_executor.layers.quantization.kernels.scaled_mm
Modules:
| Name | Description | 
|---|---|
| ScaledMMLinearKernel | Base ScaledMMLinearKernel interface. | 
| aiter | ROCm AITER kernel backend. | 
| cutlass | CUTLASS kernel backend. | 
| triton | Triton kernel backend. | 
| xla | TPU/XLA kernel backend. | 
`_POSSIBLE_KERNELS` module-attribute ¶

```python
_POSSIBLE_KERNELS: dict[
    PlatformEnum, list[type[ScaledMMLinearKernel]]
] = {
    CPU: [CutlassScaledMMLinearKernel],
    CUDA: [CutlassScaledMMLinearKernel],
    ROCM: [
        AiterScaledMMLinearKernel,
        TritonScaledMMLinearKernel,
    ],
    TPU: [XLAScaledMMLinearKernel],
}
```
 
```python
choose_scaled_mm_linear_kernel(
    config: ScaledMMLinearLayerConfig,
    compute_capability: Optional[int] = None,
) -> type[ScaledMMLinearKernel]
```
Choose a ScaledMMLinearKernel that can implement the given config for the given compute capability. Attempts to choose the best-performing kernel.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| config | ScaledMMLinearLayerConfig | Description of the linear layer to be implemented. | required | 
| compute_capability | Optional[int] | The compute capability of the target device; if None, the current platform's device capability is used. | None | 
Raises:
| Type | Description | 
|---|---|
| ValueError | If no kernel can implement the given config. | 
Returns:
| Type | Description | 
|---|---|
| type[ScaledMMLinearKernel] | Chosen kernel. |
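A rough sketch of the first-match selection this function performs: walk the platform's ordered candidate list and return the first kernel that accepts the config, raising ValueError if none does. The mock classes and the `can_implement` return shape (an ok/reason pair) are illustrative assumptions, not vLLM's actual implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MockConfig:  # stand-in for ScaledMMLinearLayerConfig
    is_static_input_scheme: bool

class MockAiterKernel:
    @classmethod
    def can_implement(cls, config) -> tuple[bool, Optional[str]]:
        # Hypothetical restriction, for illustration only.
        if not config.is_static_input_scheme:
            return False, "requires a static input scheme"
        return True, None

class MockTritonKernel:
    @classmethod
    def can_implement(cls, config) -> tuple[bool, Optional[str]]:
        return True, None

def choose_kernel(candidates, config):
    """Return the first candidate that can implement the config."""
    failures = []
    for kernel in candidates:
        ok, reason = kernel.can_implement(config)
        if ok:
            return kernel
        failures.append(f"{kernel.__name__}: {reason}")
    raise ValueError(
        "No kernel can implement the given config: " + "; ".join(failures))

# The Aiter mock rejects dynamic scales, so selection falls back to Triton.
chosen = choose_kernel([MockAiterKernel, MockTritonKernel],
                       MockConfig(is_static_input_scheme=False))
```

The ordered-list-plus-predicate design keeps platform preference and kernel capability concerns separate: adding a new backend only means implementing `can_implement` and inserting the class at the right position in the platform's list.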