Does this run in real time, or does your model have access to the entire sequence, so that it can use context from both before and after the current time point?
You have to be careful about leakage when you preprocess the training data: if you remove the laughter but leave a silent time interval in its place, the model may learn to treat that silence as a cue.
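To make the leakage point concrete, here is a minimal sketch, assuming the audio is a 1-D NumPy array and the laughter intervals are known as (start, end) pairs in seconds. The function names `zero_out` and `splice_out` are hypothetical, made up for this illustration. The first leaves a silent gap of exactly the removed length; the second cuts the interval and joins what remains.

```python
import numpy as np

def zero_out(audio, sr, intervals):
    """Silence each (start_s, end_s) interval; total length is unchanged."""
    out = audio.copy()
    for start_s, end_s in intervals:
        out[int(start_s * sr):int(end_s * sr)] = 0.0
    return out

def splice_out(audio, sr, intervals):
    """Cut each (start_s, end_s) interval and join the rest, leaving no gap."""
    keep = np.ones(len(audio), dtype=bool)
    for start_s, end_s in intervals:
        keep[int(start_s * sr):int(end_s * sr)] = False
    return audio[keep]

# Toy signal: 5 seconds at 10 samples per second, no zero-valued samples.
sr = 10
audio = np.arange(1, 51, dtype=float)
intervals = [(1.0, 2.0)]  # assumed laughter annotation

silenced = zero_out(audio, sr, intervals)
spliced = splice_out(audio, sr, intervals)
```

Splicing is not free of artifacts either: the join can produce an audible discontinuity, and every timestamp after the cut shifts, so any remaining labels would need to be adjusted accordingly.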
The text-based approach may work, but it may not give you precise timing.
https://github.com/m-bain/whisperX/issues/569