Machine Learning

1 readers

1 users here now

Community Rules:

Be nice. No offensive behavior, insults or attacks: we encourage a diverse community in which members feel safe and have a voice.
Make your post clear and comprehensive: posts that lack insight or effort will be removed. (ex: questions which are easily googled)
Beginner or career related questions go elsewhere. This community is focused in discussion of research and new projects that advance the state-of-the-art.
Limit self-promotion. Comments and posts should be first and foremost about topics of interest to ML observers and practitioners. Limited self-promotion is tolerated, but the sub is not here as merely a source for free advertisement. Such posts will be removed at the discretion of the mods.

founded 2 years ago

MODERATORS

communick@academy.garden

[R] Announcing Distil-Whisper - 6x faster than Whisper-large-v2 and performs within 1% WER on out-of-distribution (alien.top)

submitted 2 years ago by pvp239@alien.top to c/machinelearning@academy.garden

5 comments fedilink hide all child comments

Hey r/MachineLearning!

At Hugging Face, we've worked hard the last months to create a powerful, but fast distilled version of Whisper. We're excited to share our work with you now!

Distil-Whisper is 6x faster than Whisper-large-v2 and performs within 1% WER on out-of-distribution datasets. On long-form audio, we even achieve better results thanks to a reduction in hallucinations.

For more information, please have a look:

- GitHub page: https://github.com/huggingface/distil-whisper/tree/main

- Paper: https://github.com/huggingface/distil-whisper/blob/main/Distil_Whisper.pdf

Quick summary:

Distillation Process

We've kept the whole encoder, but reduced the decoder to just 2 layers. Encoding takes O(1) forward passes, decoding takes O(N). To improve speed, all that matters is the decoder! The encoder is frozen during distillation while we fine-tune all of the decoder. Both KL loss and pseudo-labeling next word prediction is used.

Data

We use 20,000h of open-sourced audio data coming from 9 diverse audio datasets. A WER-filter is used to make sure low-quality training data is thrown out.

Results

We've evaluated the model only on out-of-distribution datasets and are only 1% worse than Whisper-large-v2 on short-form evals (CHiME-4, Earnings-22, FLEURS, SPGISpeech). On long-form evals (Earnings, Meanwhile, Rev 16) we beat Whisper-large-v2 thanks to a reduction in hallucinations.

Robust to noise

Distil-Whisper is very robust to noise (similar to its teacher). We credit this to keeping the original encoder frozen during training.

Pushing for max inference time

Distil-Whisper is 6x faster than Whisper on both short-form and long-form audio. In addition, we employ Flash Attention and chunked decoding which helps us achieve a real-time factor of 0.01!

Checkpoints?!

Checkpoints will be released this Thursday and will be directly integrated into Transformers. All checkpoints will be licensed under MIT.

you are viewing a single comment's thread
view the rest of the comments

[–] Gurrako@alien.top 1 points 2 years ago

The ideas in the paper seem to be using a very similar concept from a NMT paper called Deep Encoders, Shallow Encoders. Surprised to see no citation in the related work.