# Background
Over the past months, I have been working on Automatic Speech Recognition (ASR) pipelines, with a particular focus on Dutch speech across different dialects.
My baseline stack was:
- WhisperX
- Whisper large-v3
- Speaker diarization via pyannote
- Word-level alignment using WhisperX’s alignment pipeline
This setup already works well, but I noticed a recurring issue: performance degrades on regional accents and harsher pronunciation patterns.
This became especially visible when testing Dutch dialects, in particular Limburgish speech, which is known for its harsher pronunciation (it was also the hardest for me to understand while learning Dutch).
# The Challenge
WhisperX is built to be modular, but in practice:
- The ASR backend is coupled to Whisper-style outputs
- Alignment depends on assumptions about tokenization and timestamps
- Swapping the ASR model is not plug-and-play
At the same time, newer models such as Qwen3-ASR showed strong benchmark results, i.e. a lower Word Error Rate (WER), across multilingual datasets (source).
However, there was no direct way to use Qwen3-ASR inside WhisperX while preserving its alignment and diarization pipeline.
My weekend goal was to implement this addition to WhisperX, integrating two models:
- Qwen3-ASR-1.7B for transcription
- Qwen3-ForcedAligner-0.6B for alignment
…while maintaining full compatibility with:
- Word-level timestamps
- Segment structure
- Downstream pipelines (RAG, analytics, QA systems)
# Solution

## 1. Pluggable ASR Backend

I introduced a new backend option:

```
--asr_backend qwen3
```
## 2. Qwen3-ASR Integration

Qwen3-ASR was integrated as a drop-in transcription engine:

- Model: Qwen/Qwen3-ASR-1.7B (the largest model offered in this ASR family)
- Supports multilingual input
Pipeline adaptation:
- Audio → Qwen3-ASR transcription
- Convert output → WhisperX segment format
- Pass segments → alignment stage
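The convert-output step above can be sketched as a small adapter. The input shape below (a list of dicts with `start`/`end`/`text`) is an assumption about what a Qwen3-ASR wrapper returns, not the actual PR code:

```python
from typing import Dict, List

def to_whisperx_segments(raw: List[Dict], language: str = "nl") -> Dict:
    """Adapt raw (start, end, text) hypotheses into the
    {'segments': [...], 'language': ...} dict the alignment stage consumes."""
    segments = []
    for item in raw:
        text = item["text"].strip()
        if not text:
            continue  # drop empty hypotheses so alignment has something to anchor
        segments.append({
            "start": float(item["start"]),
            "end": float(item["end"]),
            "text": text,
        })
    return {"segments": segments, "language": language}
```

Once the output matches this shape, the rest of the WhisperX pipeline (alignment, diarization, merging) runs unchanged.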
## 3. Forced Alignment Upgrade
WhisperX relies on its own alignment approach.
I added support for:
- Qwen3-ForcedAligner-0.6B
This improves:
- Word-level timestamp precision
- Segment boundary consistency
- Long-audio stability
## 4. Alignment Pipeline Adaptation

To make this work end-to-end, I:

- Adjusted how tokens are mapped to audio frames
- Ensured compatibility with:
  - diarization outputs
  - segment merging logic
- Preserved WhisperX’s expected JSON output format
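The token-to-frame mapping can be sketched as follows. The 20 ms frame shift is an assumed value (a common hop size for CTC-style aligners), not necessarily what Qwen3-ForcedAligner uses internally:

```python
# Assumed frame shift of the aligner's output grid; 20 ms is a common choice
# for CTC-style aligners, but this is an illustrative value.
FRAME_SHIFT = 0.02  # seconds per frame

def frames_to_word_timestamps(words, frame_spans, offset=0.0):
    """words: list of str; frame_spans: list of (start_frame, end_frame) pairs.
    offset shifts timestamps into the coordinate system of the full recording.
    Returns word dicts in the shape WhisperX's JSON output expects."""
    out = []
    for word, (f0, f1) in zip(words, frame_spans):
        out.append({
            "word": word,
            "start": round(offset + f0 * FRAME_SHIFT, 3),
            "end": round(offset + f1 * FRAME_SHIFT, 3),
        })
    return out
```

The `offset` parameter is what keeps per-segment aligner output consistent with diarization turns, which are expressed in whole-recording time.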
# Benchmark Performance (Official)
| Dataset | Whisper-large-v3 | Qwen3-ASR-1.7B |
|---|---|---|
| MLS | 8.62 | 8.55 |
| CommonVoice | 10.77 | 9.18 |
| MLC-SLM | 15.68 | 12.74 |
| Fleurs | 5.27 | 4.90 |
Lower WER is better.
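For context, WER is the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free on a match)
            )
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

For example, `wer("de kat zit", "de kat zat")` is one substitution over three reference words, i.e. 0.333…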
Additional improvements:

- Language Identification Accuracy:
  - Qwen3-ASR: 97.9%
  - Whisper large-v3: 94.1%
## Alignment Accuracy
| Metric | Qwen3 Align | WhisperX Align |
|---|---|---|
| MFA Raw Error | 42.9 ms | 133.2 ms |
| 300s Concatenated Audio Error | 52.9 ms | 2708.4 ms |
# My Local Findings (Dutch Dialects)
Beyond benchmarks, I tested this setup on:
- Standard Dutch
- Mixed-accent speech
- Limburg dialect
Observations:

- Qwen3-ASR produced more stable and coherent transcripts
- Fewer hallucinations under harsh pronunciation
- Better handling of:
  - phonetic compression
  - regional vowel shifts
- Alignment stayed consistent even on difficult segments
# Example Usage

```bash
whisperx audio.wav \
  --asr_backend qwen3 \
  --model Qwen/Qwen3-ASR-1.7B \
  --align_model Qwen/Qwen3-ForcedAligner-0.6B
```
# Contribution
This work has been open-sourced as a Pull Request: