
Extending WhisperX with Qwen3-ASR + Qwen3 ForcedAligner

How I implemented Qwen3-ASR + Qwen3 ForcedAligner in the official WhisperX repository.


[Figure: WhisperX + Qwen3-ASR result on the WhisperX audio sample]

#Background

Over the past months, I have been working on Automatic Speech Recognition (ASR) pipelines, with a particular focus on Dutch speech across different dialects.

My baseline stack was:

  • WhisperX
  • Whisper large-v3
  • Speaker diarization via pyannote
  • Word-level alignment using WhisperX’s alignment pipeline

This setup already works well, but I noticed a recurring issue: performance degradation on regional accents and harsher pronunciation patterns.

This became visible when testing Dutch dialects, especially Limburgish speech, which is known for its harsher accent. (It was the hardest dialect for me to understand while learning Dutch.)


#The Challenge

WhisperX is built to be modular, but in practice:

  • The ASR backend is coupled to Whisper-style outputs
  • Alignment depends on assumptions about tokenization and timestamps
  • Swapping the ASR model is not plug-and-play
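
To make that coupling concrete, WhisperX's downstream stages expect Whisper-style segments, roughly of the shape below. This is a sketch of the structure, not the exact schema, which may differ between WhisperX versions:

```python
# Sketch of the segment structure WhisperX's alignment and diarization
# stages consume; any replacement ASR backend must produce this shape.
segment = {
    "start": 0.0,              # segment start time in seconds
    "end": 2.4,                # segment end time in seconds
    "text": "hallo allemaal",  # raw transcript text for the segment
}

# After word-level alignment, each segment additionally carries words:
aligned_segment = {
    **segment,
    "words": [
        {"word": "hallo", "start": 0.0, "end": 0.6, "score": 0.98},
        {"word": "allemaal", "start": 0.7, "end": 1.4, "score": 0.95},
    ],
}
```

Any backend that emits something else (different field names, token-level rather than word-level units, no timestamps) breaks the alignment and diarization stages further down.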

At the same time, newer models such as Qwen3-ASR showed strong benchmark performance, i.e. a lower Word Error Rate (WER) across multilingual datasets.

However, there was no direct way to use Qwen3-ASR inside WhisperX while preserving its alignment and diarization pipeline.

My goal for the weekend was to bring two new models into WhisperX:

  • Qwen3-ASR-1.7B for transcription
  • Qwen3-ForcedAligner-0.6B for alignment

…while maintaining full compatibility with:

  • Word-level timestamps
  • Segment structure
  • Downstream pipelines (RAG, analytics, QA systems)

#Solution

#1. Pluggable ASR Backend

I introduced a new backend option:

--asr_backend qwen3
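
Under the hood this kind of flag maps naturally onto a small backend registry. The sketch below is illustrative; the names (`ASR_BACKENDS`, `load_asr_backend`) are mine, not the actual WhisperX internals:

```python
# Hypothetical backend registry: maps --asr_backend values to loader
# functions that return objects exposing a common transcribe() interface.
ASR_BACKENDS = {}

def register_backend(name):
    """Decorator that registers a loader function under a backend name."""
    def wrap(loader):
        ASR_BACKENDS[name] = loader
        return loader
    return wrap

@register_backend("whisper")
def load_whisper(model_name):
    return f"whisper backend ({model_name})"  # placeholder for the real loader

@register_backend("qwen3")
def load_qwen3(model_name):
    return f"qwen3 backend ({model_name})"    # placeholder for the real loader

def load_asr_backend(name, model_name):
    """Resolve an --asr_backend value to a loaded ASR engine."""
    try:
        return ASR_BACKENDS[name](model_name)
    except KeyError:
        raise ValueError(f"unknown --asr_backend: {name}") from None
```

The benefit of the registry pattern is that adding a future backend only requires registering one new loader, not touching the dispatch logic.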

#2. Qwen3-ASR Integration

Qwen3-ASR was integrated as a drop-in transcription engine:

  • Model: Qwen/Qwen3-ASR-1.7B (the largest model currently offered in this ASR family)
  • Supports multilingual input

Pipeline adaptation:

  1. Audio → Qwen3-ASR transcription
  2. Convert output → WhisperX segment format
  3. Pass segments → alignment stage
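
Step 2 can be sketched as a small adapter. The raw Qwen3-ASR chunk format shown here (a `text` plus a `(start, end)` timestamp tuple) is an assumption for illustration:

```python
def qwen3_to_whisperx_segments(qwen3_output):
    """Convert a (hypothetical) list of Qwen3-ASR chunks, each carrying a
    text and a (start, end) tuple in seconds, into WhisperX-style segments."""
    segments = []
    for chunk in qwen3_output:
        start, end = chunk["timestamp"]
        segments.append({
            "start": float(start),
            "end": float(end),
            "text": chunk["text"].strip(),  # normalize stray whitespace
        })
    return segments

# Example
raw = [{"text": " hallo allemaal ", "timestamp": (0.0, 2.4)}]
print(qwen3_to_whisperx_segments(raw))
```

Once the output is in this shape, the rest of the WhisperX pipeline (alignment, diarization, JSON export) does not need to know which backend produced it.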

#3. Forced Alignment Upgrade

WhisperX relies on its own alignment approach.

I added support for:

  • Qwen3-ForcedAligner-0.6B

This improves:

  • Word-level timestamp precision
  • Segment boundary consistency
  • Long-audio stability

#4. Alignment Pipeline Adaptation

To make this work end-to-end:

  • Adjusted how tokens are mapped to audio frames

  • Ensured compatibility with:

    • diarization outputs
    • segment merging logic
  • Preserved WhisperX’s expected JSON output format
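
For diarization compatibility, the aligned words still need speaker labels. A minimal overlap-based assignment looks like this; it mirrors the general approach, not WhisperX's exact code:

```python
def assign_speakers(words, turns):
    """Label each aligned word with the speaker whose turn overlaps it most.

    words: [{"word": str, "start": float, "end": float}, ...]
    turns: [{"speaker": str, "start": float, "end": float}, ...]
           (speaker turns as produced by a diarization stage)
    """
    for w in words:
        best, best_ov = None, 0.0
        for t in turns:
            # Overlap between the word interval and the speaker turn.
            ov = min(w["end"], t["end"]) - max(w["start"], t["start"])
            if ov > best_ov:
                best, best_ov = t["speaker"], ov
        w["speaker"] = best  # None if no turn overlaps the word
    return words

# Example
words = [{"word": "hallo", "start": 0.1, "end": 0.5},
         {"word": "dag", "start": 2.0, "end": 2.3}]
turns = [{"speaker": "SPEAKER_00", "start": 0.0, "end": 1.0},
         {"speaker": "SPEAKER_01", "start": 1.5, "end": 3.0}]
assign_speakers(words, turns)
```

Because the assignment only depends on word timestamps, it works identically whether those timestamps came from WhisperX's aligner or from Qwen3-ForcedAligner.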


#Benchmark Performance (Official)

Dataset        Whisper-large-v3   Qwen3-ASR-1.7B
MLS            8.62               8.55
CommonVoice    10.77              9.18
MLC-SLM        15.68              12.74
Fleurs         5.27               4.90
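
For context, the relative WER reduction implied by these numbers:

```python
# WER pairs (Whisper-large-v3, Qwen3-ASR-1.7B) from the table above.
wer = {
    "MLS": (8.62, 8.55),
    "CommonVoice": (10.77, 9.18),
    "MLC-SLM": (15.68, 12.74),
    "Fleurs": (5.27, 4.90),
}
for name, (w, q) in wer.items():
    print(f"{name}: {100 * (w - q) / w:.1f}% relative WER reduction")
```

The gains are small on MLS but substantial on CommonVoice and MLC-SLM, which matches the accented-speech behavior I saw locally.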

Lower WER is better.

Additional improvements:

  • Language Identification Accuracy

    • Qwen3-ASR: 97.9%
    • Whisper large-v3: 94.1%

#Alignment Accuracy

Metric                          Qwen3 Align   WhisperX Align
MFA Raw Error                   42.9 ms       133.2 ms
300 s Concatenated Audio Error  52.9 ms       2708.4 ms

#My Local Findings (Dutch Dialects)

Beyond benchmarks, I tested this setup on:

  • Standard Dutch
  • Mixed-accent speech
  • Limburg dialect

Observations:

  • Qwen3-ASR produced more stable and coherent transcripts

  • Fewer hallucinations under harsh pronunciation

  • Better handling of:

    • phonetic compression
    • regional vowel shifts
  • Alignment stayed consistent even on difficult segments

[Figure: Comparison between Whisper-large-v3 and Qwen3-ASR]

#Example Usage

whisperx audio.wav \
  --asr_backend qwen3 \
  --model Qwen/Qwen3-ASR-1.7B \
  --align_model Qwen/Qwen3-ForcedAligner-0.6B

#Contribution

This work has been open-sourced as a Pull Request: