Sounds stupid and reductionist, but I'd start with doing speech-to-text and then run a small # of examples through 3.5-turbo & GPT-4, asking it to annotate where a laugh track should be added.
Good chance that it'll do a pretty decent job, with some careful prompting.
Then, based on cost requirements, you can try collecting some labels and fine-tuning a model like Mistral (which you could also just try upfront as well).