Machine Learning

[D] Need advice on training neural network to trigger “laughing” for comedy skit audio files (alien.top)

submitted 2 years ago by Practical-Flamingo25@alien.top to c/machinelearning@academy.garden

I have a collection of audio files from comedy skits, and I’m looking to train a neural network to autonomously decide when to trigger a “laughing” sound effect. The catch? I want to avoid manually setting cue points for laughter. Instead, I’m aiming for the neural network to determine the right moments to insert laughter, based on the content of the skit.

farmingvillein@alien.top 1 points 2 years ago

It sounds stupid and reductionist, but I'd start by running speech-to-text and then passing a small number of examples through GPT-3.5-turbo and GPT-4, asking each to annotate where a laugh track should be added.

Good chance that it'll do a pretty decent job, with some careful prompting.

Then, depending on your cost requirements, you can try collecting some labels and fine-tuning a model like Mistral (which you could also just try upfront).
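
A minimal sketch of the transcribe-then-prompt step, assuming the speech-to-text output is already a list of segments with `start`, `end`, and `text` fields. The prompt wording, the JSON reply format, and the helper names `build_prompt` and `parse_laugh_indices` are all hypothetical; the actual model call is left out because it depends on the provider you use.

```python
import json

def build_prompt(segments):
    """Number each transcript segment so the model can refer to it by index."""
    lines = [f"{i}: {seg['text']}" for i, seg in enumerate(segments)]
    return (
        "Below is a comedy skit transcript, one numbered line per segment.\n"
        "Return a JSON list of the line numbers after which a laugh track "
        "should play.\n\n" + "\n".join(lines)
    )

def parse_laugh_indices(reply, n_segments):
    """Pull valid, de-duplicated line numbers out of a free-text model reply."""
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        raw = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return []
    # Keep only plain integers that point at an existing segment.
    return sorted({i for i in raw if type(i) is int and 0 <= i < n_segments})

# Hypothetical transcript segments (times in seconds).
segments = [
    {"start": 0.0, "end": 2.0, "text": "I told my wife she was drawing her eyebrows too high."},
    {"start": 2.0, "end": 3.5, "text": "She looked surprised."},
]

prompt = build_prompt(segments)       # send this to the model of your choice
reply = "Laughs after: [1]"           # made-up reply, for illustration only
laugh_after = parse_laugh_indices(reply, len(segments))
```

The parsed indices can double as labels if you later fine-tune a smaller model on the same task.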

saintshing@alien.top 1 points 2 years ago

Does this need to work in real time, or does your model have access to the entire sequence, so that it can use context from both before and after the current time point?
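
To make the distinction concrete, here is a rough sketch of which frames a model could look at when deciding about time step `t`. The helper `context_window` and its `past`/`future` parameters are hypothetical names for illustration; `future=0` corresponds to the real-time case.

```python
def context_window(frames, t, past, future=0):
    """Return the frames visible when deciding about time step t.

    future=0 is the real-time (causal) case; future>0 assumes the whole
    recording is available offline.
    """
    lo = max(0, t - past)
    hi = min(len(frames), t + future + 1)
    return frames[lo:hi]

frames = list(range(10))                                   # stand-in for audio feature frames
causal = context_window(frames, 5, past=3)                 # [2, 3, 4, 5]
offline = context_window(frames, 5, past=3, future=2)      # [2, 3, 4, 5, 6, 7]
```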

Be careful about leakage when you preprocess the training data: if you remove the laughter and leave a silent interval in its place, the silence itself tells the model where the laughter was.
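
One way to avoid leaving that gap, sketched under the assumption that you already know the laughter spans as sample indices: cut the spans out entirely and record where each cut landed on the shortened timeline. The helper name `splice_out_laughter` is hypothetical, and a hard cut can still leave an audible discontinuity, so treat this as a starting point rather than a full fix.

```python
def splice_out_laughter(samples, laugh_spans):
    """Drop laughter spans and close the gaps.

    laugh_spans: sorted, non-overlapping (start, end) sample indices,
    with end exclusive. Returns (clean_samples, cue_indices), where each
    cue index is the position in clean_samples at which laughter began.
    """
    clean, cues = [], []
    cursor = 0
    for start, end in laugh_spans:
        clean.extend(samples[cursor:start])   # keep audio before the laugh
        cues.append(len(clean))               # label on the new timeline
        cursor = end                          # skip the laugh itself
    clean.extend(samples[cursor:])            # keep the tail
    return clean, cues

samples = list(range(10))                     # stand-in for audio samples
clean, cues = splice_out_laughter(samples, [(2, 4), (7, 9)])
# clean == [0, 1, 4, 5, 6, 9], cues == [2, 5]
```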

The text-based approach may work, but it may not give you precise timing.
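
If the speech-to-text step can emit word-level timestamps, a text annotation can be mapped back to a time, though only as precisely as those timestamps allow. A small sketch, assuming a hypothetical `words` list with `word`, `start`, and `end` fields and an arbitrary `delay` before the laugh starts:

```python
def find_cue_time(words, phrase, delay=0.25):
    """Return end time of the phrase's last word plus delay, or None if not found."""
    strip = ".,!?\"'"
    tokens = [w.lower().strip(strip) for w in phrase.split()]
    spoken = [w["word"].lower().strip(strip) for w in words]
    n = len(tokens)
    if n == 0:
        return None
    for i in range(len(spoken) - n + 1):
        if spoken[i:i + n] == tokens:
            return words[i + n - 1]["end"] + delay
    return None

# Hypothetical word-level timestamps (seconds).
words = [
    {"word": "She", "start": 2.0, "end": 2.25},
    {"word": "looked", "start": 2.25, "end": 2.75},
    {"word": "surprised.", "start": 2.75, "end": 3.5},
]
cue = find_cue_time(words, "looked surprised")   # 3.5 + 0.25 = 3.75
```

The 0.25-second delay is a guess; real audience reactions lag the punchline by varying amounts, so it would need tuning against labeled examples.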