vllm.model_executor.kernels.linear ¶
This module re-exports linear kernel implementations to provide a stable import interface during an ongoing reorganization. Upcoming PRs will remove the scaled_mm and mixed_precision subdirectories and reorganize kernels by provider (aiter, cutlass, flashinfer, etc.) rather than by precision type. Centralizing exports here minimizes the import updates needed in other modules when the internal structure changes. If you add a new kernel selector or kernel implementation, add it to this `__init__.py` to maintain import stability.
Modules:
| Name | Description |
|---|---|
| scaled_mm | |
AiterInt8ScaledMMLinearKernel ¶
Bases: CutlassInt8ScaledMMLinearKernel
Source code in vllm/model_executor/kernels/linear/scaled_mm/aiter.py
apply_weights ¶
AiterInt8ScaledMMLinearKernel implements a fused version of output = torch.mm((scale_a * a), (scale_b * b)).to(out_dtype), where scale_a * a and scale_b * b are computed using numpy-style broadcasting. Currently, only per-tensor-per-tensor GEMM and per-token-per-channel GEMM are supported, both through the AITER w8a8 scaled GEMM. AiterInt8ScaledMMLinearKernel does not support AITER block-scaled GEMM or mixed-precision GEMM.
Source code in vllm/model_executor/kernels/linear/scaled_mm/aiter.py
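The formula above can be illustrated with a small unfused reference. This is a minimal sketch in plain Python, not AITER or vLLM code: the function name `scaled_mm_reference` and the list-based inputs are assumptions made for illustration. It covers the two supported scale layouts, a single scale per tensor, or one scale per row of `a` (per token) and one per column of `b` (per channel).

```python
# Illustrative reference for output = (scale_a * a) @ (scale_b * b).
# All names are hypothetical; a real kernel fuses these steps on the GPU.

def scaled_mm_reference(a, b, scale_a, scale_b):
    """Multiply an M x K matrix `a` by a K x N matrix `b` with scales applied.

    scale_a: a float (per tensor) or a list of M floats (per token / row).
    scale_b: a float (per tensor) or a list of N floats (per channel / column).
    """
    m, k, n = len(a), len(b), len(b[0])
    # Broadcast scalar scales so both layouts share one code path.
    row_scale = scale_a if isinstance(scale_a, list) else [scale_a] * m
    col_scale = scale_b if isinstance(scale_b, list) else [scale_b] * n

    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # Accumulate the integer dot product first, then apply both scales.
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            out[i][j] = row_scale[i] * col_scale[j] * acc
    return out


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

per_tensor = scaled_mm_reference(a, b, 0.5, 2.0)
per_token_per_channel = scaled_mm_reference(a, b, [1.0, 0.5], [2.0, 1.0])
```

Because a per-row scale on `a` and a per-column scale on `b` factor out of the dot product, the sketch applies them once to the accumulated result instead of scaling each operand first. Block-scaled layouts do not factor this way, which is one plausible reason they need a different kernel.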
choose_mp_linear_kernel ¶
choose_mp_linear_kernel(
config: MPLinearLayerConfig,
compute_capability: int | None = None,
) -> type[MPLinearKernel]
Choose an MPLinearKernel that can implement the given config for the given compute capability. Attempts to choose the best kernel in terms of performance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | MPLinearLayerConfig | Description of the linear layer to be implemented. | required |
| compute_capability | Optional[int] | The compute capability of the target device, if None uses | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If no kernel can implement the given config. |
Returns:
| Type | Description |
|---|---|
| type[MPLinearKernel] | Chosen kernel. |
Source code in vllm/model_executor/kernels/linear/__init__.py
choose_scaled_mm_linear_kernel ¶
choose_scaled_mm_linear_kernel(
config: _KernelConfigT,
possible_kernels: dict[
PlatformEnum, list[type[_KernelT]]
],
compute_capability: int | None = None,
force_kernel: type[_KernelT] | None = None,
) -> type[_KernelT]
Choose a _KernelT that can implement the given config for the given compute capability. Attempts to choose the best kernel in terms of performance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | _KernelConfigT | Description of the linear layer to be implemented. | required |
| possible_kernels | dict[PlatformEnum, list[type[_KernelT]]] | A dictionary mapping each platform to its list of possible kernels. | required |
| compute_capability | Optional[int] | The compute capability of the target device, if None uses | None |
| force_kernel | Optional[type[_KernelT]] | An optional forced kernel that overrides possible_kernels if it can be implemented. If None, only the possible kernels are tried. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If no kernel can implement the given config. |
Returns:
| Name | Type | Description |
|---|---|---|
| _KernelT | type[_KernelT] | Chosen kernel. |
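
The selection behavior documented above (try candidates in order, skip those that cannot run, raise ValueError if none fit) can be sketched as follows. This is a hedged illustration, not the vLLM implementation: `Platform`, `choose_kernel`, the explicit `platform` argument, the `min_capability` attribute, the `can_implement` method, and the dict-based config are all hypothetical stand-ins introduced for the example.

```python
from enum import Enum, auto


class Platform(Enum):  # hypothetical stand-in for PlatformEnum
    CUDA = auto()
    ROCM = auto()


class Kernel:
    """Hypothetical kernel base class used only in this sketch."""
    min_capability = 0

    @classmethod
    def can_implement(cls, config):
        return True, None


class FastKernel(Kernel):
    min_capability = 90

    @classmethod
    def can_implement(cls, config):
        if config.get("block_scaled"):
            return False, "block scaling is not supported"
        return True, None


class FallbackKernel(Kernel):
    min_capability = 70


def choose_kernel(config, possible_kernels, platform,
                  compute_capability=None, force_kernel=None):
    """Return the first candidate that supports the device and the config."""
    candidates = list(possible_kernels.get(platform, []))
    if force_kernel is not None:
        # Assumption: a forced kernel is tried first, then the usual list.
        candidates.insert(0, force_kernel)

    failures = []
    for kernel in candidates:
        if (compute_capability is not None
                and kernel.min_capability > compute_capability):
            failures.append(f"{kernel.__name__}: needs capability "
                            f"{kernel.min_capability}")
            continue
        ok, reason = kernel.can_implement(config)
        if ok:
            return kernel
        failures.append(f"{kernel.__name__}: {reason}")

    raise ValueError("No kernel can implement the given config: "
                     + "; ".join(failures))


# Candidates are listed best-first, so list order encodes preference.
POSSIBLE = {Platform.CUDA: [FastKernel, FallbackKernel]}
```

Collecting the rejection reasons before raising keeps the ValueError actionable, since the caller sees why each candidate was skipped rather than only that selection failed.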