Hi,
Wanted to start off with a big thanks for porting whisper to C++. It has been very useful for integration on iOS.
I wanted to open this issue to see if you had any thoughts or suggestions on how to address an issue we're seeing in one of our sample audio files.
Roughly 56 seconds into the audio file, one of the speakers says 'What?' after a longish pause (6-7 seconds) since the previous word.
Running it through whisper.cpp:
[49.9 --> 50.47] | What| (Confidence: 0.83800673)
Running it through openai/whisper:
>>> stab_segments[-3]
{'id': 26, 'seek': 2840, 'start': 49.4, 'end': 50.4, 'text': ' What?', 'tokens': [708, 30], 'temperature': 0.0, 'avg_logprob': -0.36360436898690685, 'compression_ratio': 1.6474820143884892, 'no_speech_prob': 0.00023565757146570832}
Running it through jianfch/stable-ts:
>>> results['segments'][-2]['whole_word_timestamps']
[{'word': ' What?', 'timestamp': 56.8799991607666}]
So it looks like stable-ts has made some changes that properly detect the timing of "What?". It appears stable-ts applies some silence detection that is likely helping in this scenario.
So my question is: is there anything I can do on the input side to help improve these scenarios? Is the only solution adjusting the core logic in the library itself? Are some of the improvements in stable-ts planned to be added to this repo as well?
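For context on what I mean by "the input side": one idea would be to locate speech regions myself with a naive energy-based silence detector before (or after) transcription, and use those to sanity-check or offset the reported timestamps. A rough sketch of that idea — the frame size and RMS threshold here are illustrative guesses, not anything from whisper.cpp:

```python
import math

def rms(frame):
    # Root-mean-square energy of one frame of float samples.
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def speech_regions(samples, sample_rate, frame_ms=20, threshold=0.02):
    """Return (start_sec, end_sec) regions whose frame RMS exceeds threshold.

    Very naive: fixed-size non-overlapping frames, hard threshold,
    no hangover/smoothing. Threshold is an assumption tuned per recording.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    regions, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        loud = rms(samples[i:i + frame_len]) >= threshold
        t = i / sample_rate
        if loud and start is None:
            start = t                      # speech begins
        elif not loud and start is not None:
            regions.append((start, t))     # speech ends
            start = None
    if start is not None:
        regions.append((start, len(samples) / sample_rate))
    return regions

# Synthetic demo: 1 s silence, 0.5 s of "speech" (440 Hz tone), 1 s silence.
sr = 16000
samples = [0.0] * sr
samples += [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr // 2)]
samples += [0.0] * sr
print(speech_regions(samples, sr))  # → [(1.0, 1.5)]
```

If a segment's reported start falls well outside any detected speech region (as with the 49.9 s vs ~56.9 s discrepancy above), that would be a signal to snap it to the nearest region — essentially what I understand the stable-ts silence handling does internally.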
I have attached the sample audio file. (GitHub didn't like the wav directly so I zipped it up).
Thanks again.