SileroVAD Speech Segment Detection Operator

About 793 wordsAbout 3 min

2025-10-14

📘 Overview

SileroVADGenerator is a Voice Activity Detection (VAD) operator used to identify segments of speech activity within an audio signal.

`init`

def __init__(self, 
    repo_or_dir: str = "snakers4/silero-vad", 
    source: str = "github",
    device: Union[str, List[str]] = "cuda",
    num_workers: int = 1,
    threshold: float = 0.5,
    use_min_cut: bool = False,
    sampling_rate: int = 16000,
    min_speech_duration_s: float = 0.25,
    max_speech_duration_s: float = float('inf'),
    min_silence_duration_s: float = 0.1,
    speech_pad_s: float = 0.03,
    return_seconds: bool = False,
    **kwargs,
)

`init` Parameters

Parameter	Type	Default	Description
`repo_or_dir`	`str`	`snakers4/silero-vad`	The repository or local directory of the model.
`source`	`str`	`github`	The source of the model, either `github` or `local`.
`device`	`Union[str, List[str]]`	`cuda`	The device(s) to run the model on. Can be a single device string (e.g., "cuda" or "cpu") or a list of device strings (e.g., `["cuda:0", "cuda:1"]`).
`num_workers`	`int`	`1`	The number of parallel workers (models) to initialize. These are distributed across the devices specified in the device parameter. If `num_workers` exceeds the number of available devices, multiple models will be initialized per device. For example, with `device=["cuda:0", "cuda:1"]` and `num_workers=4`, two models will run on `cuda:0` and two on `cuda:1`.
`threshold`	`float`	`0.5`	The VAD threshold used to determine whether a segment is speech.
`sampling_rate`	`int`	`16000`	The audio sampling rate, must be 16000.
`min_speech_duration_s`	`float`	`0.25`	Minimum duration (in seconds) for a speech segment. Segments shorter than this are discarded.
`max_speech_duration_s`	`float`	`float('inf')`	Maximum duration (in seconds) for a speech segment. Longer segments are truncated. If exceeded, the model first attempts to cut at a sufficiently long silent period. If none exists: `use_min_cut=True`: cut at the frame with the lowest probability in the latter half. `use_min_cut=False`: hard cut at the boundary.
`min_silence_duration_s`	`float`	`0.1`	Minimum silence duration (in seconds). Silences shorter than this are merged.
`speech_pad_s`	`float`	`0.03`	Padding (in seconds) added to both sides of detected speech segments. Helps avoid overly tight segmentation between short silences.
`return_seconds`	`bool`	`False`	If True, the output start and end are expressed in seconds; otherwise, in sample indices.

kwargs: additional VAD parameters

Parameter	Type	Default	Description
`time_resolution`	`int`	`1`	The time resolution (in seconds) for timestamp rounding when `return_seconds=True`.
`neg_threshold`	`Optional[float]`	`None`	Negative threshold for VAD. If `None`, set to `max(threshold - 0.15, 0.01)`. Lower values reduce jitter but increase sensitivity.
`min_silence_at_max_speech`	`float`	`0.098`	Maximum silence duration allowed (in seconds) when a speech region exceeds `max_speech_duration_s`. The algorithm will attempt normal silent cuts first, then fall back to min-cut if needed.
`use_max_poss_sil_at_max_speech`	`bool`	`True`	When the `max_speech_duration_s` limit is reached and multiple candidate silences exist, whether to select the longest one as the cut point.

Note: The Silero VAD model loads weights from the GitHub repository, not from the Hugging Face hub.

`run`

def run(
    self,
    storage: DataFlowStorage,
    input_audio_key: str = "audio",
    output_answer_key: str = "timestamps",           
)

Executes the main logic of the operator. It reads the input DataFrame from storage, calls the Silero VAD model to generate speech segment timestamps, and writes the results back to storage.

Parameters

Parameter	Type	Default	Description
`storage`	`DataFlowStorage`	Required	The data storage instance used for reading input and writing output.
`input_audio_key`	`str`	`audio`	The key name for the column containing paths to audio files.
`output_answer_key`	`str`	`timestamps`	The key name for the column where the list of detected speech segment timestamps will be stored.

🧠 Example Usage

from dataflow.utils.storage import FileStorage
from dataflow.operators.core_audio import SileroVADGenerator

class SileroVADGeneratorEval:
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="../example_data/audio_voice_activity_detection_pipeline/sample_data.jsonl",
            cache_path="./cache",
            file_name_prefix="silero_vad",
            cache_type="jsonl",
        )

        self.silero_vad_generator = SileroVADGenerator(
            repo_or_dir="snakers4/silero-vad",
            source="github",
            device=['cuda:0'],
            num_workers=1,
            threshold=0.5,
            sampling_rate=16000,
            min_speech_duration_s=0.25,
            max_speech_duration_s=float('inf'),
            min_silence_duration_s=0.1,
            speech_pad_s=0.03,
            return_seconds=True,
            # The following are kwargs parameterss
            time_resolution=1,
            neg_threshold=None,
            min_silence_at_max_speech=98,
            use_max_poss_sil_at_max_speech=True,
        )
    
    def forward(self):
        self.silero_vad_generator.run(
            storage=self.storage.step(),
            input_audio_key='audio',
            output_answer_key='timestamps',
        )

    
if __name__ == "__main__":
    pipline = SileroVADGeneratorEval()
    pipline.forward()

🧾 Default Output Format

Field	Type	Description
`timestamps`	`list[dict]`	A list of speech segment timestamps. Each element is a dictionary containing `start` and `end` keys, representing the start and end times (in seconds) of the detected speech segment.

Example Input:

{"audio": ["../example_data/audio_voice_activity_detection_pipeline/test.wav"], "conversation": [{"from": "human", "value": "" }]}

Example Output:

{"audio":["..\/example_data\/audio_voice_activity_detection_pipeline\/test.wav"],"conversation":[{"from":"human","value":""}],"timestamps":[{"start":0.0,"end":2.0},{"start":2.7,"end":4.7},{"start":5.0,"end":6.9},{"start":9.3,"end":13.3},{"start":13.5,"end":15.1},{"start":15.3,"end":15.9},{"start":16.3,"end":17.9},{"start":18.4,"end":19.6},{"start":20.3,"end":32.6},{"start":32.7,"end":35.6},{"start":35.7,"end":37.6},{"start":38.0,"end":38.9},{"start":39.9,"end":43.3},{"start":43.6,"end":44.6},{"start":45.0,"end":46.8},{"start":48.8,"end":50.0},{"start":51.1,"end":54.2},{"start":54.5,"end":57.4},{"start":57.5,"end":59.6}]}

generate

eval

filter

refine

generate

eval

filter

generate

eval

filter

generaterow

refine

SileroVAD Speech Segment Detection Operator

📘 Overview

`init`

`init` Parameters

`run`

🧠 Example Usage

🧾 Default Output Format

SileroVAD Speech Segment Detection Operator

📘 Overview

__init__

init Parameters

run

🧠 Example Usage

🧾 Default Output Format

`init`

`init` Parameters

`run`