Data Hoarder

We are digital librarians. Among us are represented the various reasons to keep data -- legal requirements, competitive requirements, uncertainty of permanence of cloud services, distaste for transmitting your data externally (e.g. government or corporate espionage), cultural and familial archivists, internet collapse preppers, and people who do it themselves so they're sure it's done right. Everyone has their reasons for curating the data they have decided to keep (either forever or For A Damn Long Time (tm) ). Along the way we have sought out like-minded individuals to exchange strategies, war stories, and cautionary tales of failures.

Bulk Creation of Transcripts from YouTube Playlists with Whisper (github.com)

submitted 2 years ago by dicklesworth@alien.top to c/datahoarder@selfhosted.forum

I know there are various tools that are supposed to make this easy, but I couldn't find anything that did everything I wanted, so I made this today for fun. The web-based offerings all take forever and seem flaky, and you need to process one video at a time, with no control over the transcription settings. In contrast, my script lets you convert a whole playlist in bulk with full control over everything.

It's truly easy to use: you can clone the repo, install into a venv, and be generating a folder full of high-quality transcript text files in under 5 minutes. All you need to do is supply the URL of a YouTube playlist or an individual video, and the tool does the rest automatically. It uses faster-whisper with a high beam_size, so it's a bit slower than you might expect, but this does result in higher accuracy. The best way to use it is to take an existing playlist, or create a new one on YouTube, start the script, and come back the next morning to all your finished transcripts. It attempts to "upgrade" the output of Whisper by taking all the transcript segments, gluing them together, and then splitting them back into sentences (it uses spaCy for this, or a simpler regex-based function). You end up with a single text file containing the full transcript for each video in the playlist, with a sensible file name based on the title of the video.
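
To give a feel for the "glue and re-split" step described above, here is a minimal sketch of the simpler regex-based route. It is not the code from the repo: the function names (glue_segments, split_sentences, safe_filename) and the exact regex are made up for illustration, and the spaCy path is not shown.

```python
import re

# Hypothetical helper names; the repo's own functions may look different.

def glue_segments(segments):
    """Join raw segment texts into one string separated by single spaces."""
    return " ".join(s.strip() for s in segments if s.strip())

# Split after ., ! or ? when followed by whitespace and a capital letter or quote.
# This is a deliberately simple rule and will mis-split abbreviations like "Dr. Smith".
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'])')

def split_sentences(text):
    """Collapse whitespace, then split the text into sentences."""
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in _SENTENCE_END.split(text) if s]

def safe_filename(title, max_len=100):
    """Turn a video title into a file name: drop punctuation, underscore the spaces."""
    cleaned = re.sub(r"[^\w\s-]", "", title).strip()
    cleaned = re.sub(r"\s+", "_", cleaned)
    return (cleaned[:max_len] or "untitled") + ".txt"

if __name__ == "__main__":
    segments = ["Hello there. This is", "a test of the", "system. Does it work?"]
    for sentence in split_sentences(glue_segments(segments)):
        print(sentence)
    print(safe_filename("My Video: Part 1/2!"))
```

The point of gluing first is that Whisper segments often break mid-sentence, so splitting on the joined text gives cleaner lines than keeping the original segment boundaries.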

If you have CUDA installed, it will try to use it, but as with all things CUDA, it's annoyingly fragile and picky, so don't be surprised if you get a CUDA error even if you know for a fact CUDA is installed on your system. If you're looking for reliability, disable CUDA. But if you need to transcribe a LOT of videos, it does go much, much faster on a GPU.
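
One common way to cope with that fragility is to try the GPU first and fall back to the CPU if loading fails. The sketch below is an illustration of that pattern, not what the script itself does: load_model_with_fallback and make_faster_whisper are hypothetical names, and the float16/int8 compute types are just reasonable guesses. Note that some CUDA problems only show up during transcription rather than at load time, which this sketch would not catch.

```python
def load_model_with_fallback(make_model, prefer_gpu=True):
    """Try CUDA first (if requested), then CPU. Returns (model, device_used).

    make_model is any callable taking (device, compute_type).
    """
    attempts = []
    if prefer_gpu:
        attempts.append(("cuda", "float16"))
    attempts.append(("cpu", "int8"))

    last_error = None
    for device, compute_type in attempts:
        try:
            return make_model(device, compute_type), device
        except Exception as exc:  # CUDA failures surface as various exception types
            last_error = exc
            print(f"Loading on {device} failed: {exc}")
    raise RuntimeError("Could not load the model on any device") from last_error


def make_faster_whisper(model_name):
    """Build a factory for faster-whisper models (imported lazily)."""
    def factory(device, compute_type):
        from faster_whisper import WhisperModel
        return WhisperModel(model_name, device=device, compute_type=compute_type)
    return factory


# Example use (needs faster-whisper installed and downloads the model):
# model, device = load_model_with_fallback(make_faster_whisper("base.en"))
```

Passing prefer_gpu=False skips the CUDA attempt entirely, which matches the "disable CUDA if you want reliability" advice.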

Even if you don't have a GPU, if you have a powerful machine with a lot of RAM and cores, this script will fully saturate them and can download and process multiple videos at the same time. The default settings are pretty good for that situation. But if you have a slower machine, you might want to use a smaller Whisper model (like base.en or even tiny.en) and dial down the beam_size to 2.
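
As a rough illustration of the trade-off, the sketch below bundles the model name, beam_size, and number of parallel workers into one settings object and runs a batch through a thread pool. Everything here is an assumption for illustration: the Settings class, the FAST values (the post does not state the script's actual defaults), the single worker for slow machines, and the choice of a thread pool. Only base.en and beam_size 2 come from the post.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    model_name: str
    beam_size: int
    max_workers: int


# Placeholder values for a powerful machine; NOT the script's real defaults.
FAST = Settings(model_name="large-v2", beam_size=10, max_workers=4)
# Model and beam_size suggested in the post for slower machines; max_workers is a guess.
SLOW = Settings(model_name="base.en", beam_size=2, max_workers=1)


def transcribe_file(path, settings):
    """Transcribe one audio file on the CPU with faster-whisper (imported lazily).

    Loading the model per file keeps the sketch short; real code would load it once.
    """
    from faster_whisper import WhisperModel
    model = WhisperModel(settings.model_name, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, beam_size=settings.beam_size)
    return " ".join(seg.text.strip() for seg in segments)


def run_batch(paths, settings, worker=transcribe_file):
    """Process several files concurrently; returns {path: transcript}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=settings.max_workers) as pool:
        results = pool.map(lambda p: worker(p, settings), paths)
        return dict(zip(paths, results))


# Example use (needs faster-whisper and local audio files):
# transcripts = run_batch(["talk1.mp3", "talk2.mp3"], SLOW)
```

The worker argument exists so the batching logic can be tried out with a stand-in function before pointing it at real audio.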

[–] dicklesworth@alien.top 1 points 2 years ago

No, it’s using Whisper for the transcripts, and Whisper doesn’t currently support speaker diarisation. Even if I tried to add that using one of the projects that claim to have it, you would still need to manually label which speaker is which. Since my goal was to get something totally automated that could just grind through a huge playlist, it didn’t seem worth it.