How to Benchmark ASR Models with Word Error Rate (WER)

Evaluating ASR models means more than running a script: reliable Word Error Rate (WER) benchmarking requires solid dataset choices, consistent text preprocessing, and actionable analysis. These steps make it possible to compare models fairly and tune them for real-world use, and clear benchmarking protocols enable reproducibility and meaningful cross-model comparisons.

1. Pick the Right Public Corpora

  • Start with reputable datasets such as Common Voice, OpenSLR (LibriSpeech), GlobalPhone, MLS, TED-LIUM, and TIMIT; a loading sketch follows this list.
  • Match your target language and scenario; supplement with real domain audio as needed.
  • Check dataset licenses and annotations for suitability with your application.
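
If you pull corpora programmatically, a minimal sketch like the one below can help, assuming the Hugging Face datasets library is installed; the dataset id, config, and split names are assumptions to verify on the hub before relying on them.

```python
# Minimal sketch: load a public ASR test split for benchmarking.
# Assumes the Hugging Face `datasets` library; the dataset id, config,
# and split names below are assumptions to verify on the hub.
from datasets import load_dataset

# LibriSpeech "clean" test split (English read speech)
librispeech_test = load_dataset("librispeech_asr", "clean", split="test")

sample = librispeech_test[0]
print(sample["text"])           # reference transcript
print(sample["audio"]["path"])  # path to the decoded audio file
```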

2. Build Domain-Specific Test Sets

  • Add real-world samples from your use case (e.g., medical dictation, call center calls).
  • Include terms and acoustic conditions that match your application.
  • Use human-in-the-loop (HITL) review to verify transcript accuracy.
  • Ensure transcripts follow consistent guidelines (e.g., treatment of punctuation, speaker labels); a quick validation sketch follows this list.
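
As one illustration of enforcing transcript guidelines, the hedged sketch below flags common annotation problems before an entry enters the test set; the file name, field names, and the specific checks are hypothetical and should be adapted to your own guidelines.

```python
import json
import re

# Hypothetical guideline checks: non-empty text, no leftover speaker labels,
# no unresolved annotation tags such as [inaudible] or [crosstalk].
SPEAKER_LABEL = re.compile(r"^\s*[A-Za-z]+\s*:")   # e.g. "Agent:" at the start
UNRESOLVED_TAG = re.compile(r"\[(inaudible|crosstalk)\]", re.IGNORECASE)

def validate_entry(entry: dict) -> list:
    """Return a list of guideline violations for one test-set entry."""
    problems = []
    text = entry.get("text", "").strip()
    if not text:
        problems.append("empty transcript")
    if SPEAKER_LABEL.match(text):
        problems.append("speaker label left in transcript")
    if UNRESOLVED_TAG.search(text):
        problems.append("unresolved annotation tag")
    return problems

# Assumed file: one JSON object per line with at least a "text" field.
with open("domain_test_set.jsonl") as f:
    for line_no, line in enumerate(f, start=1):
        issues = validate_entry(json.loads(line))
        if issues:
            print(f"line {line_no}: {', '.join(issues)}")
```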

3. Standardize Preprocessing

  • Use the same normalization and tokenization for both reference and output.
  • Standard tools: NLTK, spaCy, or NeMo’s utilities.
  • Remove inconsistencies that artificially inflate WER.
  • Examples include lowercasing, removing non-speech tags, and consistent handling of numbers; see the normalization sketch after this list.
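
Here is a minimal normalization sketch applied identically to references and hypotheses; the non-speech tag patterns and example string are illustrative, and spelled-out versus digit numbers ("forty two" vs. "42") still need a single agreed policy on top of this.

```python
import re
import string

# Non-speech annotations that often appear in references but never in model
# output, e.g. [noise] or <laughter>; the patterns here are illustrative.
NON_SPEECH = re.compile(r"\[[^\]]*\]|<[^>]*>")

def normalize(text: str) -> str:
    """Apply the exact same normalization to references and hypotheses."""
    text = text.lower()
    text = NON_SPEECH.sub(" ", text)                          # drop non-speech tags
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()                  # collapse whitespace
    return text

print(normalize("[noise] The total was, uh, $42."))  # -> "the total was uh 42"
```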

4. Pick Your WER Tools

  • jiwer: Python library with fast, flexible batch evaluation and a CLI (normalize text yourself for consistent results).
  • Command-line tools: simple for quick, one-off checks.
  • Best practice: preprocess both references and hypotheses before running your WER tool; see the sketch after this list.
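
A short sketch with jiwer, assuming both sides were already normalized (for example with the function above); the sentences are made-up examples, and the error-count attributes follow the jiwer 3.x API.

```python
import jiwer

# Already-normalized reference/hypothesis pairs (see the normalization
# sketch above); the sentences here are made-up examples.
references = [
    "the total was uh forty two",
    "please schedule a follow up appointment",
]
hypotheses = [
    "the total was forty two",
    "please schedule a follow appointment",
]

# Corpus-level WER across all pairs.
print(f"WER: {jiwer.wer(references, hypotheses):.3f}")

# Error-type breakdown (attribute names follow jiwer 3.x).
out = jiwer.process_words(references, hypotheses)
print(f"S={out.substitutions} D={out.deletions} I={out.insertions} hits={out.hits}")
```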

5. Visualize and Compare Results

  • Use the NeMo Speech Data Explorer to compare multiple models and languages.
  • Input: a JSON manifest with audio paths, reference transcripts, and predicted transcripts (a manifest-building sketch follows this list).
  • Get interactive tables for error analysis.
  • Track error types (substitutions, insertions, deletions) to identify model weaknesses.
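
A hedged sketch of writing such a manifest in Python follows; the field names (audio_filepath, duration, text, pred_text) and the Speech Data Explorer launch path are assumptions based on common NeMo conventions, so check them against your NeMo release.

```python
import json

# One JSON object per line: audio path, duration, reference ("text"), and the
# model's prediction ("pred_text"); field names assumed from NeMo conventions.
records = [
    {
        "audio_filepath": "audio/call_001.wav",
        "duration": 7.4,
        "text": "please schedule a follow up appointment",
        "pred_text": "please schedule a follow appointment",
    },
]

with open("eval_manifest.json", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Then point Speech Data Explorer at the manifest; the script path may differ
# by NeMo release:
#   python NeMo/tools/speech_data_explorer/data_explorer.py eval_manifest.json
```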

Common Pitfalls

  • Mismatched preprocessing between references and predictions.
  • Ignoring speaker turns in multi-speaker audio.
  • Focusing only on WER; add human review for major errors.
  • Comparing WER scores across different datasets or evaluation protocols without context.

Step-by-Step Checklist

  1. Define your test scope (domain, language)
  2. Collect or source appropriate audio/data
  3. Normalize transcripts and predictions
  4. Compute WER with standardized tools
  5. Visualize and review results for actionable improvements
  6. Log evaluation configs and preprocessing routines for traceability

Key Points

  • Use domain-matched and language-appropriate corpora
  • Standardize preprocessing before WER calculation
  • Automate comparisons with tools like jiwer and NeMo SDE
  • Review both numerical and human-readable outputs
  • Document all benchmarking steps and data choices

Take Action

  • Add real user scenarios to your benchmarks
  • Normalize transcripts and predictions consistently to avoid WER inflation
  • Plug your results into NeMo SDE for fast model comparisons
  • Review outputs for domain-specific errors and keep iterating
  • Share test protocols and details when reporting WER, so others can reproduce or interpret your results accurately.

Check out two open-source, portable NVIDIA models, NVIDIA Canary-Qwen-2.5B and Parakeet-TDT-0.6B-V2, which reflect the openness philosophy of Nemotron with open datasets, weights, and recipes. They recently topped the Artificial Analysis (AA) ASR leaderboard with record-low WER. ➡️ Speech to Text (ASR) Providers Leaderboard & Comparison | Artificial Analysis