Hey r/MachineLearning!

At Hugging Face, we've worked hard over the last few months to create a powerful but fast distilled version of Whisper. We're excited to share our work with you now!

Distil-Whisper is 6x faster than Whisper-large-v2 and performs within 1% WER of it on out-of-distribution datasets. On long-form audio, we even achieve better results thanks to a reduction in hallucinations.

For more information, please have a look:

- GitHub page: https://github.com/huggingface/distil-whisper/tree/main

- Paper: https://github.com/huggingface/distil-whisper/blob/main/Distil_Whisper.pdf

Quick summary:

  1. Distillation process

We've kept the whole encoder but reduced the decoder to just 2 layers. Encoding takes O(1) forward passes, while decoding takes O(N), one pass per generated token, so the decoder is all that matters for speed. The encoder is frozen during distillation while the entire decoder is fine-tuned, using both a KL loss against the teacher's distributions and a next-word prediction loss on pseudo-labels.
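
Conceptually, the objective combines the two terms above. Here's a minimal PyTorch sketch, assuming the KL term is computed between the student's and teacher's next-token distributions and the cross-entropy term against the teacher's pseudo-labels; the weights and temperature are illustrative, not the paper's exact values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, pseudo_label_ids,
                      kl_weight=1.0, ce_weight=1.0, temperature=2.0):
    # student_logits, teacher_logits: (batch, seq_len, vocab_size)
    # pseudo_label_ids: (batch, seq_len), the teacher's pseudo-label token ids

    # KL divergence between softened student and teacher next-token distributions
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Next-word prediction (cross-entropy) against the pseudo-labels
    ce = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        pseudo_label_ids.reshape(-1),
    )
    return kl_weight * kl + ce_weight * ce
```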

  2. Data

We use 20,000h of open-source audio from 9 diverse datasets. A WER filter is used to discard low-quality training data.
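
As a rough illustration of the WER filter, here's a minimal sketch using the jiwer library; the idea is to compare the teacher's pseudo-label against the ground-truth transcript and drop examples whose WER exceeds a threshold. The 10% threshold and the field names are illustrative, not necessarily the paper's exact setup.

```python
from jiwer import wer  # pip install jiwer

def keep_example(reference: str, pseudo_label: str, max_wer: float = 0.10) -> bool:
    """Keep an example only if the teacher's pseudo-label stays close to the reference transcript."""
    return wer(reference, pseudo_label) <= max_wer

# Hypothetical examples with a ground-truth transcript and a teacher pseudo-label
examples = [
    {"text": "the cat sat on the mat", "pseudo_label": "the cat sat on the mat"},
    {"text": "speech recognition is fun", "pseudo_label": "speech recognition was fun today"},
]
filtered = [ex for ex in examples if keep_example(ex["text"], ex["pseudo_label"])]
print(len(filtered))  # only the first example passes the filter
```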

  3. Results

We've evaluated the model only on out-of-distribution datasets: it is within 1% WER of Whisper-large-v2 on short-form evals (CHiME-4, Earnings-22, FLEURS, SPGISpeech). On long-form evals (Earnings, Meanwhile, Rev 16) we beat Whisper-large-v2 thanks to a reduction in hallucinations.

  4. Robust to noise

Distil-Whisper is very robust to noise (similar to its teacher). We credit this to keeping the original encoder frozen during training.

  5. Pushing for max inference speed

Distil-Whisper is 6x faster than Whisper on both short-form and long-form audio. In addition, we employ Flash Attention and chunked decoding, which help us achieve a real-time factor of 0.01!

  6. Checkpoints?!

Checkpoints will be released this Thursday and will be directly integrated into Transformers (see the usage sketch below). All checkpoints will be licensed under MIT.
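
For those wanting to try it once the checkpoints land, here's a minimal usage sketch with the standard Transformers ASR pipeline; the repo name `distil-whisper/distil-large-v2` is my assumption about where the checkpoint will live (check the GitHub page above for the official names), and `chunk_length_s` is what enables the chunked long-form decoding mentioned above.

```python
import torch
from transformers import pipeline

# Assumed Hub repo name; replace with the official checkpoint once released.
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0",  # or "cpu"
)

# chunk_length_s splits long audio into windows that are decoded and then
# stitched back together (the chunked decoding mentioned above).
result = asr("audio.mp3", chunk_length_s=15)
print(result["text"])
```

Flash Attention can additionally be layered on top when it's installed; the exact argument depends on your Transformers version.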

[–] blackkettle@alien.top 1 points 1 year ago

Very cool. Is this still compatible with faster-whisper / WhisperX, and if not, how does the speed-up compare? WhisperX is already way more than 6x faster using the same v2 models (and still multilingual).

If we can stack these improvements - awesome!

Will you release your fine-tuning framework so others can fine-tune on their own data?

Does the improvement itself depend on large-scale data, or can we achieve positive results with smaller fine-tuning sets via your approach here?