vllm.multimodal.audio ¶
AudioResampler ¶
Resample audio data to a target sample rate.
Source code in vllm/multimodal/audio.py
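The source does not show `AudioResampler`'s constructor or method signatures, but the resampling idea it implements can be sketched with plain numpy linear interpolation. This is a minimal illustrative sketch only; the real class in `vllm/multimodal/audio.py` may use a higher-quality resampler (e.g. polyphase filtering), and `resample_linear` is a hypothetical name.

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample 1D audio via linear interpolation (illustrative sketch only)."""
    if orig_sr == target_sr:
        return audio
    duration = audio.shape[-1] / orig_sr
    n_out = int(round(duration * target_sr))
    # Output sample positions expressed on the input timeline.
    t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
    t_in = np.arange(audio.shape[-1]) / orig_sr
    return np.interp(t_out, t_in, audio)

audio = np.zeros(48000)          # 1 second of silence at 48 kHz
out = resample_linear(audio, 48000, 16000)
print(out.shape)                 # (16000,)
```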
AudioSpec dataclass ¶
Specification for target audio format.
This dataclass defines the expected audio format for a model's feature extractor. It is used to normalize audio data before processing.
Attributes:
| Name | Type | Description |
|---|---|---|
| target_channels | int \| None | Number of output channels. None means passthrough (no normalization). 1 = mono, 2 = stereo, etc. |
| channel_reduction | ChannelReduction | Method to reduce channels when the input has more channels than the target. Only used when reducing channels. |
Source code in vllm/multimodal/audio.py
ChannelReduction ¶
Method to reduce multi-channel audio to target channels.
Source code in vllm/multimodal/audio.py
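The source does not list the enum's members, but "reducing multi-channel audio" typically means averaging channels or keeping one of them. A minimal sketch, assuming two common strategies; the actual member names in `vllm/multimodal/audio.py` may differ, and `reduce_channels` is a hypothetical helper:

```python
import numpy as np

def reduce_channels(audio: np.ndarray, method: str = "mean") -> np.ndarray:
    """Reduce (channels, time) audio to mono, illustrating two strategies."""
    if method == "mean":
        return audio.mean(axis=0)   # average across channels
    if method == "first":
        return audio[0]             # keep only the first channel
    raise ValueError(f"unknown method: {method}")

stereo = np.stack([np.ones(4), np.zeros(4)])     # shape (2, 4)
print(reduce_channels(stereo, "mean"))           # [0.5 0.5 0.5 0.5]
print(reduce_channels(stereo, "first"))          # [1. 1. 1. 1.]
```

Averaging preserves energy from all channels, while picking the first channel is cheaper and avoids phase-cancellation artifacts between channels.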
find_split_point ¶
Find the best point to split audio by looking for silence or low amplitude.
Searches for the quietest region within a specified range by calculating RMS energy in sliding windows.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| wav | ndarray | Audio array. Can be 1D or multi-dimensional. | required |
| start_idx | int | Start index of search region (inclusive). | required |
| end_idx | int | End index of search region (exclusive). | required |
| min_energy_window | int | Window size in samples for energy calculation. | required |
Returns:
| Type | Description |
|---|---|
| int | Index of the quietest point within the search region. This is the recommended split point to minimize audio artifacts. |
Example

```python
>>> audio = np.random.randn(32000)
>>> # Insert quiet region
>>> audio[16000:17600] = 0.01
>>> split_idx = find_split_point(
...     wav=audio,
...     start_idx=0,
...     end_idx=32000,
...     min_energy_window=1600,
... )
>>> 16000 <= split_idx <= 17600
True
```
Source code in vllm/multimodal/audio.py
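The sliding-window RMS search described above can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions (non-overlapping windows, split point at the window center), not the exact vLLM source; `find_split_point_sketch` is a hypothetical name.

```python
import numpy as np

def find_split_point_sketch(wav, start_idx, end_idx, min_energy_window):
    """Sliding-window RMS search for the quietest split point (sketch)."""
    best_idx, best_energy = start_idx, float("inf")
    for i in range(start_idx, end_idx - min_energy_window, min_energy_window):
        window = wav[..., i : i + min_energy_window]
        rms = np.sqrt(np.mean(window.astype(np.float64) ** 2))
        if rms < best_energy:
            # Split in the middle of the quietest window seen so far.
            best_energy, best_idx = rms, i + min_energy_window // 2
    return best_idx

rng = np.random.default_rng(0)
audio = rng.standard_normal(32000)
audio[16000:17600] = 0.01        # insert a quiet region
split = find_split_point_sketch(audio, 0, 32000, 1600)
print(16000 <= split <= 17600)   # True
```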
normalize_audio ¶
Normalize audio to the specified format.
This function handles channel reduction for multi-channel audio, supporting both numpy arrays and torch tensors.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| audio | NDArray[floating] \| Tensor | Input audio data. Can be: a 1D array/tensor (time,), already mono; a 2D array/tensor (channels, time), the standard format from torchaudio; or a 2D array/tensor (time, channels), the format from soundfile (auto-detected and transposed if time > channels). | required |
| spec | AudioSpec | AudioSpec defining the target format. | required |
Returns:
| Type | Description |
|---|---|
| NDArray[floating] \| Tensor | Normalized audio in the same type as the input (numpy or torch). For mono output (target_channels=1), returns a 1D array/tensor. |
Raises:
| Type | Description |
|---|---|
| ValueError | If audio has unsupported dimensions or channel expansion is requested (e.g., mono to stereo). |
Source code in vllm/multimodal/audio.py
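The rules documented above (passthrough, layout auto-detection, channel reduction, no expansion) can be sketched as follows. This is a numpy-only sketch assuming mean reduction; the real function also accepts torch tensors and takes the reduction method from `AudioSpec`, and `normalize_audio_sketch` is a hypothetical name.

```python
import numpy as np

def normalize_audio_sketch(audio: np.ndarray, target_channels):
    """Sketch of the normalization rules described above (numpy-only)."""
    if target_channels is None or audio.ndim == 1:
        return audio                              # passthrough / already mono
    if audio.ndim != 2:
        raise ValueError("unsupported dimensions")
    # soundfile yields (time, channels); detect and transpose to (channels, time).
    if audio.shape[0] > audio.shape[1]:
        audio = audio.T
    channels = audio.shape[0]
    if channels < target_channels:
        raise ValueError("channel expansion (e.g. mono to stereo) not supported")
    if target_channels == 1:
        return audio.mean(axis=0)                 # mono output is 1D
    return audio[:target_channels]

stereo = np.ones((2, 16000))
mono = normalize_audio_sketch(stereo, 1)
print(mono.shape)    # (16000,)
```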
split_audio ¶
split_audio(
audio_data: ndarray,
sample_rate: int,
max_clip_duration_s: float,
overlap_duration_s: float,
min_energy_window_size: int,
) -> list[ndarray]
Split audio into chunks with intelligent split points.
Splits long audio into smaller chunks at low-energy regions to minimize cutting through speech. Uses overlapping windows to find quiet moments for splitting.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| audio_data | ndarray | Audio array to split. Can be 1D (mono) or multi-dimensional. Splits along the last dimension (time axis). | required |
| sample_rate | int | Sample rate of the audio in Hz. | required |
| max_clip_duration_s | float | Maximum duration of each chunk in seconds. | required |
| overlap_duration_s | float | Overlap duration in seconds between consecutive chunks. Used to search for optimal split points. | required |
| min_energy_window_size | int | Window size in samples for finding low-energy regions. | required |
Returns:
| Type | Description |
|---|---|
| list[ndarray] | List of audio chunks. Each chunk is a numpy array with the same shape as the input except for the last (time) dimension. |
Example
```python
>>> audio = np.random.randn(1040000)  # 65 seconds at 16kHz
>>> chunks = split_audio(
...     audio_data=audio,
...     sample_rate=16000,
...     max_clip_duration_s=30.0,
...     overlap_duration_s=1.0,
...     min_energy_window_size=1600,
... )
>>> len(chunks)
3
```
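The chunking arithmetic behind that example can be sketched with a fixed-stride loop. This sketch omits the low-energy split-point search (it always cuts at the maximum length) but reproduces the overlap bookkeeping; `split_audio_sketch` is a hypothetical name, not the vLLM implementation.

```python
import numpy as np

def split_audio_sketch(audio_data, sample_rate, max_clip_duration_s,
                       overlap_duration_s):
    """Fixed-stride chunking along the time axis (split-point search omitted)."""
    max_samples = int(max_clip_duration_s * sample_rate)
    overlap = int(overlap_duration_s * sample_rate)
    total = audio_data.shape[-1]
    chunks, start = [], 0
    while start < total:
        end = min(start + max_samples, total)
        chunks.append(audio_data[..., start:end])
        if end == total:
            break
        # Step back by the overlap so the real code can search this region
        # for a quiet split point.
        start = end - overlap
    return chunks

audio = np.zeros(1_040_000)        # 65 seconds at 16 kHz
chunks = split_audio_sketch(audio, 16000, 30.0, 1.0)
print(len(chunks))                 # 3
```

With a 30 s maximum and 1 s overlap the effective stride is 29 s, so 65 s of audio yields three chunks, matching the doctest above.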