CTCForcedAlignmentFilter
About 1269 wordsAbout 4 min
2025-10-14
📘-概述
CTCForcedAlignmentFilter is a filtering operator that filters data based on CTC forced-alignment scores from speech recognition results.
__init__
def __init__(
self,
model_path: str = "MahmoudAshraf/mms-300m-1130-forced-aligner",
device: Union[str, List[str]] = "cuda",
num_workers: int = 1,
sampling_rate: int = 16000,
language: str = "en",
micro_batch_size: int = 16,
chinese_to_pinyin: bool = False,
retain_word_level_alignment: bool = True,
romanize: bool = True,
threshold: float = 0.8,
threshold_mode: str = "min",
)init Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
model_path | str | MahmoudAshraf/mms-300m-1130-forced-aligner | The model identifier or path for the forced aligner used during evaluation. |
device | Union[str, List[str]] | cuda | The device on which the model runs. Options: cuda or cpu. You can also specify a list of devices, such as ["cuda:0", "cuda:1"], to initialize multiple models on multiple GPUs. |
num_workers | int | 1 | Degree of operator parallelism. Initializes num_workers model instances and assigns them to the devices specified by device. If num_workers exceeds the number of devices, multiple models will be initialized per device for concurrent execution. For example, with device=["cuda:0", "cuda:1"] and num_workers=4, two models will run on cuda:0 and two on cuda:1. |
sampling_rate | int | 16000 | Audio sampling rate, default 16000. |
language | str | en | Audio language, default en. |
micro_batch_size | int | 16 | For long audio, the model splits it into multiple chunks. micro_batch_size specifies the chunk batch size per inference, default 16. |
chinese_to_pinyin | bool | False | Whether to convert Chinese characters to Pinyin, default False. |
romanize | bool | True | Whether to romanize characters, default True. |
threshold | float | 0.8 | Alignment score threshold, default 0.8. |
threshold_mode | str | min | How to apply the threshold: min (filter by the minimum alignment score within a span window) or mean (filter by the average alignment score within a span window). Samples with scores ≥ threshold are kept. Default is min. |
run
def run(
self,
storage: DataFlowStorage,
input_audio_key: str = "audio",
input_conversation_key: str = "conversation",
)Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
storage | DataFlowStorage | Required | The data storage instance used to hold input and output data. |
input_audio_key | str | audio | The key name for audio data in the input data, default is audio. |
input_conversation_key | str | conversation | The key name for conversation data in the input data, default is conversation. |
output_answer_key | str | forced_alignment_results | The key name of the alignment results field in the retained output data. Default is forced_alignment_results. |
🧠 Example Usage
from dataflow.utils.storage import FileStorage
from dataflow.operators.core_audio import CTCForcedAlignmentFilter
from dataflow.wrapper import BatchWrapper
class testCTCForcedAlignmentFilter:
def __init__(self):
self.storage = FileStorage(
# See audio_asr_pipeline step 2 to step 3 for an example
first_entry_file_name="/path/to/your/cache/audio_asr_pipeline/audio_asr_pipeline_step2.jsonl",
cache_path="./cache",
file_name_prefix="forced_alignment_filter",
cache_type="jsonl",
)
self.filter = CTCForcedAlignmentFilter(
model_path="MahmoudAshraf/mms-300m-1130-forced-aligner",
device=["cuda:0"],
num_workers=1,
sampling_rate=16000,
language="en",
micro_batch_size=16,
chinese_to_pinyin=False,
romanize=True,
threshold=0.8,
threshold_mode="mean"
)
def forward(self):
self.filter.run(
storage=self.storage.step(),
input_audio_key='audio',
input_conversation_key='conversation',
)
self.filter.close()
if __name__ == "__main__":
pipline = testCTCForcedAlignFilter()
pipline.forward()🧾 Default Output Format
| Field | Type | Description |
|---|---|---|
forced_alignment_results | dict | Forced-alignment results where spans represent frame-level character alignment quality and word_timestamps represent word-level timestamp alignment. |
error | Optional[str] | If an error occurs during alignment score computation, the error message is stored here. If no error occurs, error is null. |
Samples that meet the alignment score threshold are retained. If all data are filtered out, the message All data has been filtered out! is printed.
Example Input:
{"audio":["..\/example_data\/audio_asr_pipeline\/test.wav"],"conversation":[{"from":"human","value":""}],"transcript":"and says how do i get to dublin and the answer that comes back is well i would not start from here sonny that is to say much of political philosophy develops theories that take no account of where we actually are and how the theories that people argue about in the journals and in the literature actually could be implemented in the world if it"}Example Output: Retained:
{"audio":["..\/example_data\/audio_asr_pipeline\/test.wav"],"conversation":[{"from":"human","value":""}],"transcript":"and says how do i get to dublin and the answer that comes back is well i would not start from here sonny that is to say much of political philosophy develops theories that take no account of where we actually are and how the theories that people argue about in the journals and in the literature actually could be implemented in the world if it","forced_alignment_results":{"alignment":[{"start":0.063,"end":0.147,"text":"and","score":0.9554548212},{"start":0.273,"end":0.462,"text":"says","score":0.9719064832},{"start":0.609,"end":0.735,"text":"how","score":0.9212982873},{"start":0.798,"end":0.84,"text":"do","score":0.9939799858},{"start":1.029,"end":1.029,"text":"i","score":null},{"start":1.113,"end":1.26,"text":"get","score":0.9985639263},{"start":1.365,"end":1.428,"text":"to","score":0.9945560943},{"start":1.554,"end":1.974,"text":"dublin","score":0.9609149893},{"start":2.856,"end":2.94,"text":"and","score":0.9309501759},{"start":2.982,"end":3.045,"text":"the","score":0.7141059392},{"start":3.192,"end":3.465,"text":"answer","score":0.5938632981},{"start":3.507,"end":3.633,"text":"that","score":0.9633214426},{"start":3.717,"end":4.011,"text":"comes","score":0.9843271526},{"start":4.116,"end":4.389,"text":"back","score":0.9842618417},{"start":4.515,"end":4.662,"text":"is","score":0.9815290374},{"start":5.25,"end":5.376,"text":"well","score":0.047969851},{"start":5.502,"end":5.502,"text":"i","score":null},{"start":5.544,"end":5.67,"text":"would","score":0.8428627272},{"start":5.754,"end":5.817,"text":"not","score":0.123845133},{"start":5.88,"end":6.153,"text":"start","score":0.9789600127},{"start":6.216,"end":6.363,"text":"from","score":0.9000720539},{"start":6.468,"end":6.657,"text":"here","score":0.9283110266},{"start":6.783,"end":7.035,"text":"sonny","score":0.8839239278},{"start":9.807,"end":9.975,"text":"that","score":0.7547208776},{"start":10.038,"end":10.122,"text":"is","score":0.8797863669},{"start":10.185,"end":10.248,"text":"to","score":0.8244834454},{"start":10.353,"end":10.542,"text":"say","score":0.9471999446},{"start":11.025,"end":11.34,"text":"much","score":0.9940719048},{"start":11.634,"end":11.802,"text":"of","score":0.9950778359},{"start":11.991,"end":12.621,"text":"political","score":0.9989232361},{"start":12.81,"end":13.629,"text":"philosophy","score":0.9465096714},{"start":14.217,"end":14.805,"text":"develops","score":0.9432990222},{"start":15.057,"end":15.666,"text":"theories","score":0.9267864129},{"start":17.136,"end":17.304,"text":"that","score":0.8086037475},{"start":17.43,"end":17.682,"text":"take","score":0.9565847912},{"start":17.829,"end":17.913,"text":"no","score":0.956001711},{"start":18.081,"end":18.648,"text":"account","score":0.9546385136},{"start":19.425,"end":19.656,"text":"of","score":0.8420175488},{"start":21.42,"end":21.567,"text":"where","score":0.7551332315},{"start":21.63,"end":21.693,"text":"we","score":0.9166198867},{"start":21.903,"end":22.323,"text":"actually","score":0.9312994611},{"start":22.512,"end":22.701,"text":"are","score":0.9616599245},{"start":22.89,"end":22.974,"text":"and","score":0.4025359219},{"start":23.079,"end":23.31,"text":"how","score":0.9633893459},{"start":23.436,"end":23.499,"text":"the","score":0.7716538814},{"start":23.625,"end":24.045,"text":"theories","score":0.9761697651},{"start":24.15,"end":24.36,"text":"that","score":0.9068021914},{"start":24.486,"end":24.78,"text":"people","score":0.9219708612},{"start":24.948,"end":25.2,"text":"argue","score":0.9620480049},{"start":25.242,"end":25.515,"text":"about","score":0.9651158228},{"start":25.641,"end":25.704,"text":"in","score":0.9931364561},{"start":25.767,"end":25.83,"text":"the","score":0.8166649179},{"start":25.956,"end":26.439,"text":"journals","score":0.9695284503},{"start":26.544,"end":26.607,"text":"and","score":0.9435737354},{"start":26.67,"end":26.712,"text":"in","score":0.778872343},{"start":26.754,"end":26.796,"text":"the","score":0.8787819404},{"start":26.88,"end":27.384,"text":"literature","score":0.928246194},{"start":27.804,"end":28.077,"text":"actually","score":0.9179609355},{"start":28.119,"end":28.266,"text":"could","score":0.8717020111},{"start":28.329,"end":28.392,"text":"be","score":0.9910494216},{"start":28.602,"end":29.169,"text":"implemented","score":0.9847475907},{"start":29.232,"end":29.274,"text":"in","score":0.9814222521},{"start":29.337,"end":29.379,"text":"the","score":0.8807633297},{"start":29.442,"end":29.736,"text":"world","score":0.9051810523},{"start":30.156,"end":30.24,"text":"if","score":0.7553217096},{"start":30.45,"end":30.471,"text":"it","score":0.0156467184}],"error":null}}If everything is filtered out, the following string will be printed:
All data has been filtered out!
