# Background
Over the past months, I have been working on Automatic Speech Recognition (ASR) pipelines, with a particular focus on Dutch speech across different dialects.
My baseline stack was:
- WhisperX
- Whisper large-v3
- Speaker diarization via pyannote
- Word-level alignment using WhisperX’s alignment pipeline
This setup already works well, but I noticed a recurring issue: performance degrades on regional accents and harsher pronunciation patterns.
This became especially visible when testing Dutch dialects, in particular Limburgish speech, which is known for its harsher pronunciation (it was also the hardest for me to understand while learning Dutch).
# The Challenge
WhisperX is built to be modular, but in practice:
- The ASR backend is coupled to Whisper-style outputs
- Alignment depends on assumptions about tokenization and timestamps
- Swapping the ASR model is not plug-and-play
At the same time, newer models such as Qwen3-ASR showed strong benchmark results, i.e. a lower Word Error Rate (WER), across multilingual datasets (source).
However, there was no direct way to use Qwen3-ASR inside WhisperX while preserving its alignment and diarization pipeline.
My weekend goal was to implement this addition to WhisperX, integrating two models:
- Qwen3-ASR-1.7B for transcription
- Qwen3-ForcedAligner-0.6B for alignment
…while maintaining full compatibility with:
- Word-level timestamps
- Segment structure
- Downstream pipelines (RAG, analytics, QA systems)
# Solution

## 1. Pluggable ASR Backend

I introduced a new backend option:

```
--asr_backend qwen3
```
## 2. Qwen3-ASR Integration

Qwen3-ASR was integrated as a drop-in transcription engine:

- Model: Qwen/Qwen3-ASR-1.7B (the largest model offered in this ASR family)
- Supports multilingual input
Pipeline adaptation:
- Audio → Qwen3-ASR transcription
- Convert output → WhisperX segment format
- Pass segments → alignment stage
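The convert-output step above can be sketched as a small adapter. The input shape below (a list of dicts with `start`/`end`/`text`) is an assumption about what a Qwen3-ASR wrapper returns, not the actual PR code:

```python
from typing import Dict, List

def to_whisperx_segments(raw: List[Dict], language: str = "nl") -> Dict:
    """Adapt raw (start, end, text) hypotheses into the
    {'segments': [...], 'language': ...} dict the alignment stage consumes."""
    segments = []
    for item in raw:
        text = item["text"].strip()
        if not text:
            continue  # drop empty hypotheses so alignment has something to anchor
        segments.append({
            "start": float(item["start"]),
            "end": float(item["end"]),
            "text": text,
        })
    return {"segments": segments, "language": language}
```

Once the output matches this shape, the rest of the WhisperX pipeline (alignment, diarization, merging) runs unchanged.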
## 3. Forced Alignment Upgrade
WhisperX relies on its own alignment approach.
I added support for:
- Qwen3-ForcedAligner-0.6B
This improves:
- Word-level timestamp precision
- Segment boundary consistency
- Long-audio stability
## 4. Alignment Pipeline Adaptation

To make this work end-to-end, I:

- Adjusted how tokens are mapped to audio frames
- Ensured compatibility with:
  - diarization outputs
  - segment merging logic
- Preserved WhisperX’s expected JSON output format
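The token-to-frame mapping can be sketched as follows. The 20 ms frame shift is an assumed value (a common hop size for CTC-style aligners), not necessarily what Qwen3-ForcedAligner uses internally:

```python
# Assumed frame shift of the aligner's output grid; 20 ms is a common choice
# for CTC-style aligners, but this is an illustrative value.
FRAME_SHIFT = 0.02  # seconds per frame

def frames_to_word_timestamps(words, frame_spans, offset=0.0):
    """words: list of str; frame_spans: list of (start_frame, end_frame) pairs.
    offset shifts timestamps into the coordinate system of the full recording.
    Returns word dicts in the shape WhisperX's JSON output expects."""
    out = []
    for word, (f0, f1) in zip(words, frame_spans):
        out.append({
            "word": word,
            "start": round(offset + f0 * FRAME_SHIFT, 3),
            "end": round(offset + f1 * FRAME_SHIFT, 3),
        })
    return out
```

The `offset` parameter is what keeps per-segment aligner output consistent with diarization turns, which are expressed in whole-recording time.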
# Benchmark Performance (Official)
| Dataset | Whisper-large-v3 | Qwen3-ASR-1.7B |
|---|---|---|
| MLS | 8.62 | 8.55 |
| CommonVoice | 10.77 | 9.18 |
| MLC-SLM | 15.68 | 12.74 |
| Fleurs | 5.27 | 4.90 |
Lower WER is better.
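For context, WER is the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. A minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free on a match)
            )
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

For example, `wer("de kat zit", "de kat zat")` is one substitution over three reference words, i.e. 0.333…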
Additional improvements:

- Language Identification Accuracy:
  - Qwen3-ASR: 97.9%
  - Whisper large-v3: 94.1%
## Alignment Accuracy
| Metric | Qwen3 Align | WhisperX Align |
|---|---|---|
| MFA Raw Error | 42.9 ms | 133.2 ms |
| 300s Concatenated Audio Error | 52.9 ms | 2708.4 ms |
# My Local Findings (Dutch Dialects)
Beyond benchmarks, I tested this setup on:
- Standard Dutch
- Mixed-accent speech
- Limburg dialect
Observations:

- Qwen3-ASR produced more stable and coherent transcripts
- Fewer hallucinations under harsh pronunciation
- Better handling of:
  - phonetic compression
  - regional vowel shifts
- Alignment stayed consistent even on difficult segments
# Example Usage

```bash
whisperx audio.wav \
  --asr_backend qwen3 \
  --model Qwen/Qwen3-ASR-1.7B \
  --align_model Qwen/Qwen3-ForcedAligner-0.6B
```
# Contribution
This work has been open-sourced as a Pull Request: