Whisper Fine-Tune vs Commercial APIs

Local Fine-Tunes Beat Commercial STT Services

Executive Summary

This evaluation compared transcription accuracy across five fine-tuned Whisper models and three commercial STT APIs (OpenAI Whisper, Assembly, Gladia). All models were tested on identical audio with verified ground truth transcription.

Key Finding

The fine-tuned Whisper Large V3 Turbo achieved the best performance with an accuracy of 94.16%, beating the best commercial API (Assembly at 92.70%) and all other services tested including OpenAI's Whisper API.

Transcription Accuracy Comparison

Higher is better - percentage of words transcribed correctly

Model Rankings

Rank Model Type Accuracy WER
1 Whisper Large V3 Turbo (Fine-Tune) Local 94.16% 5.84%
2 Assembly API Commercial 92.70% 7.30%
3 Gladia API Commercial 91.97% 8.03%
4 Whisper Small (Fine-Tune) Local 91.24% 8.76%
5 Whisper (OpenAI API) Commercial 91.24% 8.76%
Below 90% Accuracy Threshold
6 Whisper Base (Fine-Tune) Local 85.40% 14.60%
7 Whisper Tiny (Fine-Tune) Local 85.40% 14.60%

Fine-Tuned Whisper vs Commercial APIs

Comparing local fine-tuned models against commercial Whisper services

Error Breakdown by Type

Distribution of substitutions, deletions, and insertions

Information Preserved (WIP)

Higher is better - measures semantic accuracy

Detailed Metrics

1. Whisper Large V3 Turbo (Fine-Tune) - WINNER

Accuracy
94.16%
Hits
130/137
Substitutions
7
Deletions
0
Insertions
1
Info Preserved
89.39%

Analysis: Excellent performance with zero deletions (no lost content). Clear winner for production use, beating all commercial APIs.

2. Whisper Small (Fine-Tune)

Accuracy
91.24%
Hits
127/137
Substitutions
9
Deletions
1
Insertions
2
Info Preserved
85.31%

Analysis: Strong performance from a smaller model. Matches OpenAI's Whisper API while running locally. Good choice for real-time applications where speed matters.

2. Assembly API - BEST COMMERCIAL

Accuracy
92.70%
Hits
129/137
Substitutions
8
Deletions
0
Insertions
2
Info Preserved
87.39%

Analysis: Best commercial API, but still beaten by fine-tuned Large V3 Turbo (92.70% → 94.16% accuracy).

3. Gladia API

Accuracy
91.97%
Hits
128/137
Substitutions
9
Deletions
0
Insertions
2
Info Preserved
86.04%

Analysis: Third overall, second-best commercial API. Competitive performance.

Key Conclusions

1. Local Fine-Tunes Beat Commercial Whisper APIs

The fine-tuned Whisper Large V3 Turbo achieved 94.16% accuracy, beating the best commercial API (Assembly at 92.70%). This demonstrates that targeted fine-tuning can outperform premium commercial services on the same base model.

2. Cost & Privacy Advantages

Running local fine-tuned models eliminates per-minute API costs and keeps sensitive audio data on-premises. The performance advantage makes this even more compelling.

3. Commercial APIs Are Competitive

All three commercial APIs (Assembly 92.70%, Gladia 91.97%, OpenAI Whisper 91.24%) delivered production-ready performance. They're viable alternatives when local inference isn't feasible.

4. Production Recommendations

  • Best Performance: Whisper Large V3 Turbo (Fine-Tune) - 94.16% accuracy for local deployment
  • Best Commercial: Assembly API - 92.70% accuracy if cloud is required
  • Balanced Local: Whisper Small (Fine-Tune) - 91.24% accuracy, matches OpenAI with faster inference
  • OpenAI Alternative: Fine-tuned Small matches OpenAI's API while running locally

Local vs Commercial Performance

Model Type Accuracy vs Best Commercial
Large V3 Turbo (Fine-Tune) Local 94.16% +1.46% better
Assembly API Commercial 92.70% baseline (best commercial)
Gladia API Commercial 91.97% -0.73% worse
Whisper Small (Fine-Tune) Local 91.24% -1.46% worse
Whisper (OpenAI API) Commercial 91.24% -1.46% worse

Methodology

Test Audio Sample

Listen to the audio sample used for this evaluation:

Ground Truth Transcription

Reference transcription used to evaluate all models

I once wandered through a coastal town that smelled like sea salt and fresh bread. The locals said the tide wrote stories on the sand—short tales at low tide, epics when the moon grew bold. Every morning the boardwalk baker pulled loaves out of a brick oven, tapping the crusts so they sang a hollow, golden note. Kids would line up for the first slice, steam fogging their glasses while gulls staged slow-motion dives overhead. The best part was the lighthouse keeper, who claimed he could forecast the weather by listening to the bells on distant fishing boats. If the chimes sounded playful, the day would be calm; if they rang flat, storms stampeded in. I never learned whether his method worked, but I liked believing in a town where music, bread, and tides kept time together.

Word count: 137 words