Local Fine-Tunes Beat Commercial STT Services
This evaluation compared transcription accuracy across four fine-tuned Whisper models and three commercial STT APIs (OpenAI Whisper, Assembly, Gladia). All models were tested on identical audio with a verified ground-truth transcription.
The fine-tuned Whisper Large V3 Turbo achieved the best performance at 94.16% accuracy, ahead of the best commercial API (Assembly at 92.70%) and every other service tested, including OpenAI's own Whisper API.
*Higher is better: percentage of words transcribed correctly.*
| Rank | Model | Type | Accuracy | WER |
|---|---|---|---|---|
| 1 | Whisper Large V3 Turbo (Fine-Tune) | Local | 94.16% | 5.84% |
| 2 | Assembly API | Commercial | 92.70% | 7.30% |
| 3 | Gladia API | Commercial | 91.97% | 8.03% |
| 4 | Whisper Small (Fine-Tune) | Local | 91.24% | 8.76% |
| 5 | Whisper (OpenAI API) | Commercial | 91.24% | 8.76% |
| | *Below 90% accuracy threshold* | | | |
| 6 | Whisper Base (Fine-Tune) | Local | 85.40% | 14.60% |
| 7 | Whisper Tiny (Fine-Tune) | Local | 85.40% | 14.60% |
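For context on the WER column above: WER counts substitutions, deletions, and insertions against the ground-truth word sequence, and the accuracy shown is 100% minus WER. The sketch below illustrates that scoring with a standard Levenshtein alignment, assuming simple whitespace tokenization; it is illustrative only, not the harness used for this evaluation, and the example strings are toy inputs.

```python
def wer_counts(reference: str, hypothesis: str):
    """Return (substitutions, deletions, insertions) between two word sequences."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    # Backtrack to classify each edit operation.
    i, j = len(ref), len(hyp)
    subs = dels = ins = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            if ref[i - 1] != hyp[j - 1]:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins

def wer(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / number of reference words."""
    s, d, i = wer_counts(reference, hypothesis)
    return (s + d + i) / max(1, len(reference.split()))

ref = "the tide wrote stories on the sand"
hyp = "the tide wrote story on sand"
print(wer_counts(ref, hyp))          # (1, 1, 0)
print(f"WER: {wer(ref, hyp):.2%}")   # WER: 28.57%
```

On this toy pair the alignment finds one substitution ("stories" vs. "story") and one deletion (a dropped "the"), giving a WER of 2/7. Zero deletions, as noted for the winning model below, means no reference words were dropped.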
*Chart: local fine-tuned models vs. commercial Whisper services.*
*Chart: distribution of substitutions, deletions, and insertions.*
*Chart: semantic accuracy (higher is better).*
Analysis (Whisper Large V3 Turbo, Fine-Tune): Excellent performance with zero deletions (no lost content). The clear winner for production use, beating all commercial APIs.
Analysis (Whisper Small, Fine-Tune): Strong performance from a smaller model, matching OpenAI's Whisper API while running locally. A good choice for real-time applications where speed matters.
Analysis (Assembly API): Best commercial API, but still behind the fine-tuned Large V3 Turbo (92.70% vs. 94.16% accuracy).
Analysis (Gladia API): Third overall and second-best commercial API, with competitive performance.
The fine-tuned Whisper Large V3 Turbo achieved 94.16% accuracy, beating the best commercial API (Assembly at 92.70%). This demonstrates that targeted fine-tuning of an open model can outperform premium commercial services, including OpenAI's own hosted version of the same base model.
Running local fine-tuned models eliminates per-minute API costs and keeps sensitive audio data on-premises. The performance advantage makes this even more compelling.
All three commercial APIs (Assembly 92.70%, Gladia 91.97%, OpenAI Whisper 91.24%) delivered production-ready performance. They're viable alternatives when local inference isn't feasible.
| Model | Type | Accuracy | vs Best Commercial (pct. points) |
|---|---|---|---|
| Large V3 Turbo (Fine-Tune) | Local | 94.16% | +1.46 |
| Assembly API | Commercial | 92.70% | baseline (best commercial) |
| Gladia API | Commercial | 91.97% | -0.73 |
| Whisper Small (Fine-Tune) | Local | 91.24% | -1.46 |
| Whisper (OpenAI API) | Commercial | 91.24% | -1.46 |
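The gaps in this table are absolute percentage points; on error rates, the same gap reads as a relative improvement. A quick check using the WER figures from the results table above:

```python
# Relative WER reduction of the best fine-tune vs. the best commercial API,
# using the WER values reported in the results table.
best_local_wer = 5.84       # Whisper Large V3 Turbo (Fine-Tune)
best_commercial_wer = 7.30  # Assembly API
relative_reduction = (best_commercial_wer - best_local_wer) / best_commercial_wer
print(f"Relative WER reduction: {relative_reduction:.1%}")  # Relative WER reduction: 20.0%
```

In other words, the fine-tune removes about one in five of the errors the best commercial API makes.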
*Audio: the sample recording used for this evaluation (player not included here).*
Reference transcription used to evaluate all models:
I once wandered through a coastal town that smelled like sea salt and fresh bread. The locals said the tide wrote stories on the sand—short tales at low tide, epics when the moon grew bold. Every morning the boardwalk baker pulled loaves out of a brick oven, tapping the crusts so they sang a hollow, golden note. Kids would line up for the first slice, steam fogging their glasses while gulls staged slow-motion dives overhead. The best part was the lighthouse keeper, who claimed he could forecast the weather by listening to the bells on distant fishing boats. If the chimes sounded playful, the day would be calm; if they rang flat, storms stampeded in. I never learned whether his method worked, but I liked believing in a town where music, bread, and tides kept time together.
Word count: 137 words