Local Fine-Tunes Beat Commercial STT Services
This evaluation compared transcription accuracy across four fine-tuned Whisper models and three commercial STT APIs (OpenAI Whisper, Assembly, Gladia). All models were tested on identical audio with a verified ground-truth transcription.
The fine-tuned Whisper Large V3 Turbo achieved the best performance at 94.16% accuracy, ahead of the best commercial API (Assembly at 92.70%) and every other service tested, including OpenAI's own Whisper API.
*Higher is better: percentage of words transcribed correctly.*
| Rank | Model | Type | Accuracy | WER |
|---|---|---|---|---|
| 1 | Whisper Large V3 Turbo (Fine-Tune) | Local | 94.16% | 5.84% |
| 2 | Assembly API | Commercial | 92.70% | 7.30% |
| 3 | Gladia API | Commercial | 91.97% | 8.03% |
| 4 | Whisper Small (Fine-Tune) | Local | 91.24% | 8.76% |
| 5 | Whisper (OpenAI API) | Commercial | 91.24% | 8.76% |
| | *Below 90% accuracy threshold* | | | |
| 6 | Whisper Base (Fine-Tune) | Local | 85.40% | 14.60% |
| 7 | Whisper Tiny (Fine-Tune) | Local | 85.40% | 14.60% |
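For context on the WER column above: WER counts substitutions, deletions, and insertions against the ground-truth word sequence, and the accuracy shown is 100% minus WER. The sketch below illustrates that scoring with a standard Levenshtein alignment, assuming simple whitespace tokenization; it is illustrative only, not the harness used for this evaluation, and the example strings are toy inputs.

```python
def wer_counts(reference: str, hypothesis: str):
    """Return (substitutions, deletions, insertions) between two word sequences."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    # Backtrack to classify each edit operation.
    i, j = len(ref), len(hyp)
    subs = dels = ins = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            if ref[i - 1] != hyp[j - 1]:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins

def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / number of reference words."""
    s, d, i = wer_counts(reference, hypothesis)
    return (s + d + i) / max(1, len(reference.split()))

ref = "the tide wrote stories on the sand"
hyp = "the tide wrote story on sand"
print(wer_counts(ref, hyp))          # (1, 1, 0)
print(f"WER: {wer(ref, hyp):.2%}")   # WER: 28.57%
```

On this toy pair the alignment finds one substitution ("stories" vs. "story") and one deletion (a dropped "the"), giving a WER of 2/7. Zero deletions, as noted for the winning model below, means no reference words were dropped.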
*Chart: local fine-tuned models vs. commercial Whisper services.*
*Chart: distribution of substitutions, deletions, and insertions.*
*Chart: semantic accuracy (higher is better).*
Analysis (Whisper Large V3 Turbo, Fine-Tune): Excellent performance with zero deletions (no lost content). The clear winner for production use, beating all commercial APIs.
Analysis (Whisper Small, Fine-Tune): Strong performance from a smaller model, matching OpenAI's Whisper API while running locally. A good choice for real-time applications where speed matters.
Analysis (Assembly API): Best commercial API, but still behind the fine-tuned Large V3 Turbo (92.70% vs. 94.16% accuracy).
Analysis (Gladia API): Third overall and second-best commercial API, with competitive performance.
The fine-tuned Whisper Large V3 Turbo achieved 94.16% accuracy, beating the best commercial API (Assembly at 92.70%). This demonstrates that targeted fine-tuning of an open model can outperform premium commercial services, including OpenAI's own hosted version of the same base model.
Running local fine-tuned models eliminates per-minute API costs and keeps sensitive audio data on-premises. The performance advantage makes this even more compelling.
All three commercial APIs (Assembly 92.70%, Gladia 91.97%, OpenAI Whisper 91.24%) delivered production-ready performance. They're viable alternatives when local inference isn't feasible.
| Model | Type | Accuracy | vs Best Commercial (pct. points) |
|---|---|---|---|
| Large V3 Turbo (Fine-Tune) | Local | 94.16% | +1.46 |
| Assembly API | Commercial | 92.70% | baseline (best commercial) |
| Gladia API | Commercial | 91.97% | -0.73 |
| Whisper Small (Fine-Tune) | Local | 91.24% | -1.46 |
| Whisper (OpenAI API) | Commercial | 91.24% | -1.46 |
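The gaps in this table are absolute percentage points; on error rates, the same gap reads as a relative improvement. A quick check using the WER figures from the results table above:

```python
# Relative WER reduction of the best fine-tune vs. the best commercial API,
# using the WER values reported in the results table.
best_local_wer = 5.84       # Whisper Large V3 Turbo (Fine-Tune)
best_commercial_wer = 7.30  # Assembly API
relative_reduction = (best_commercial_wer - best_local_wer) / best_commercial_wer
print(f"Relative WER reduction: {relative_reduction:.1%}")  # Relative WER reduction: 20.0%
```

In other words, the fine-tune removes about one in five of the errors the best commercial API makes.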
*Audio: the sample recording used for this evaluation (player not included here).*
Reference transcription used to evaluate all models:
I once wandered through a coastal town that smelled like sea salt and fresh bread. The locals said the tide wrote stories on the sand—short tales at low tide, epics when the moon grew bold. Every morning the boardwalk baker pulled loaves out of a brick oven, tapping the crusts so they sang a hollow, golden note. Kids would line up for the first slice, steam fogging their glasses while gulls staged slow-motion dives overhead. The best part was the lighthouse keeper, who claimed he could forecast the weather by listening to the bells on distant fishing boats. If the chimes sounded playful, the day would be calm; if they rang flat, storms stampeded in. I never learned whether his method worked, but I liked believing in a town where music, bread, and tides kept time together.
Word count: 137 words