
Whisper Small Personal

This model is a fine-tuned version of openai/whisper-small, trained on personal speech recordings collected with Mimic Record Studio. It is designed for automatic speech recognition (ASR) and reaches a Word Error Rate (WER) of roughly 7.9% (0.0794) on the validation split of the custom dataset.

Model Details

  • Model Type: Whisper Small
  • Training Dataset: Personal recordings using Mimic Record Studio
  • Framework: PyTorch
  • Language: Primarily fine-tuned on [insert language(s)]
  • Batch Size: 16 (gradient accumulation steps: 1)
  • Learning Rate: 1e-5
  • Mixed Precision: FP16
  • Evaluation Strategy: Steps (every 1000 steps)
  • WER on Validation Set: 0.079441 (≈ 7.9%), best checkpoint at step 2000

Hyperparameters

  • Max Training Steps: 4000
  • Warmup Steps: 500
  • Gradient Checkpointing: Enabled
  • Evaluation: Performed every 1000 steps
  • Logging: Every 25 steps to TensorBoard
  • Metric for Best Model: Word Error Rate (WER)
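
For reference, a Seq2SeqTrainingArguments configuration consistent with the values listed in this card might look like the sketch below. The output_dir is a placeholder, per_device_eval_batch_size and generation_max_length mirror the Training Procedure section, and argument names can differ slightly between transformers releases:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-personal",  # placeholder output directory
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    eval_strategy="steps",            # "evaluation_strategy" on older transformers releases
    eval_steps=1000,
    save_steps=1000,
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,          # lower WER is better
    push_to_hub=True,
)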

Usage

You can use this model for ASR tasks by loading it directly from the Hugging Face Model Hub:

import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("luluw/whisper-small-personal")
model = WhisperForConditionalGeneration.from_pretrained("luluw/whisper-small-personal")

# Example inference: load the audio at the 16 kHz sampling rate Whisper expects
audio, sr = librosa.load("path_to_audio_file.wav", sr=16000)

# Convert the waveform to log-mel input features and generate a transcription
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features
generated_ids = model.generate(input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
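
For quick transcription, the same checkpoint can also be used through the high-level pipeline API. This is a minimal sketch; the audio path is a placeholder:

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="luluw/whisper-small-personal")
print(asr("path_to_audio_file.wav")["text"])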

Training Procedure

The model was trained using the following setup:

  • Training Batch Size: 16
  • Gradient Accumulation: 1
  • Evaluation Batch Size: 8
  • Max Generation Length: 225 tokens
  • Learning Rate: 1e-5
  • Mixed Precision: FP16
  • Optimizer: AdamW

The best model was selected based on the Word Error Rate (WER), and the final model was pushed to the Hugging Face Model Hub.
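
The run itself can be assembled with Seq2SeqTrainer, as in the minimal sketch below. Here training_args refers to the configuration sketched under Hyperparameters, and train_dataset, eval_dataset, data_collator, and compute_metrics are placeholders that have to be prepared from the personal recordings:

from transformers import Seq2SeqTrainer, WhisperForConditionalGeneration, WhisperProcessor

# Fine-tuning starts from the openai/whisper-small base checkpoint
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,               # Seq2SeqTrainingArguments sketched above
    train_dataset=train_dataset,      # placeholder: preprocessed Mimic Record Studio recordings
    eval_dataset=eval_dataset,        # placeholder: held-out validation split
    data_collator=data_collator,      # placeholder: padding collator for Whisper features/labels
    compute_metrics=compute_metrics,  # placeholder: decodes predictions and returns {"wer": ...}
    tokenizer=processor.feature_extractor,  # "processing_class" on newer transformers releases
)

trainer.train()        # AdamW is the Trainer default optimizer
trainer.push_to_hub()  # publishes the best checkpoint to the Hugging Face Hub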

Results

Step | Training Loss | Validation Loss | WER
500  | 0.089500      | 0.229938        | 0.101203
1000 | 0.005200      | 0.215078        | 0.087757
1500 | 0.000400      | 0.222333        | 0.080502
2000 | 0.000200      | 0.226987        | 0.079441
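
The WER values are fractions of word-level errors, so 0.079441 corresponds to roughly 7.9%. As a minimal sketch, WER can be computed with the evaluate library on toy strings:

import evaluate

wer_metric = evaluate.load("wer")

# Toy example: one substituted word ("jumps" vs. "jumped") out of nine reference words
predictions = ["the quick brown fox jumps over the lazy dog"]
references = ["the quick brown fox jumped over the lazy dog"]

print(wer_metric.compute(predictions=predictions, references=references))  # 0.111...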

Model Format

The checkpoint is published in Safetensors format and contains roughly 242M parameters stored as F32 tensors.