Improving timestamps for words #270

@akatkov7

Hi,

Wanted to start off with a big thanks for porting whisper to C++. It has been very useful for integration on iOS.

I wanted to open this issue to see if you had any thoughts or suggestions on how to address an issue we're seeing in one of our sample audio files.

Roughly 56 seconds into the audio file, one of the speakers says 'What?' after a pause of about 6-7 seconds following the previous word.

Running it through whisper.cpp:

[49.9 --> 50.47] | What| (Confidence: 0.83800673)

Running it through openai/whisper:

>>> stab_segments[-3]
{'id': 26, 'seek': 2840, 'start': 49.4, 'end': 50.4, 'text': ' What?', 'tokens': [708, 30], 'temperature': 0.0, 'avg_logprob': -0.36360436898690685, 'compression_ratio': 1.6474820143884892, 'no_speech_prob': 0.00023565757146570832}

Running it through jianfch/stable-ts:

>>> results['segments'][-2]['whole_word_timestamps']
[{'word': ' What?', 'timestamp': 56.8799991607666}]

So it looks like stable-ts has made some changes that let it properly detect the timing of "What?". It appears to include some silence detection, which is likely what helps in this scenario.
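To illustrate the kind of silence-based adjustment I have in mind, here is a minimal sketch. It is not how stable-ts actually does it (I have not confirmed its method); the function name `next_speech_onset`, the 20 ms frame size, and the RMS threshold are all assumptions for the example. The idea is simply: if a predicted word start lands in a silent stretch, push it forward to the next frame whose energy rises above the threshold.

```python
import math


def next_speech_onset(samples, sample_rate, start_s, frame_ms=20, threshold=0.01):
    """Return a start time (seconds) moved forward to the next non-silent frame.

    Hypothetical helper, not part of whisper.cpp, openai/whisper, or stable-ts.
    `samples` is a sequence of floats in [-1.0, 1.0] (mono PCM).
    If the frame containing `start_s` is already above `threshold`,
    `start_s` is returned unchanged. If no later frame is loud enough,
    `start_s` is also returned unchanged.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    first_frame = int(start_s * sample_rate) // frame_len
    n_frames = len(samples) // frame_len

    for f in range(first_frame, n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        # Root-mean-square energy of this frame.
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms >= threshold:
            # Never move the timestamp backwards.
            return max(start_s, f * frame_len / sample_rate)
    return start_s


def make_test_signal(sample_rate=1000):
    """Synthetic 10 s clip: tone for 0-3 s, silence for 3-7 s, tone for 7-10 s."""
    signal = []
    for n in range(10 * sample_rate):
        t = n / sample_rate
        loud = t < 3.0 or t >= 7.0
        signal.append(0.5 * math.sin(2 * math.pi * 50 * t) if loud else 0.0)
    return signal


if __name__ == "__main__":
    sr = 1000
    audio = make_test_signal(sr)
    # A predicted start of 4.0 s falls inside the silent gap.
    print(next_speech_onset(audio, sr, 4.0))
```

In the synthetic clip, a predicted start inside the silent gap gets moved to where the tone resumes. On real audio the threshold would need tuning against the noise floor, and a fixed RMS threshold is a much cruder detector than a proper VAD.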

So my question is: is there anything I can do on the input side to improve these scenarios? Or is the only solution to adjust the core logic in the library itself? Are some of the improvements in stable-ts scheduled to be added to this repo as well?

I have attached the sample audio file. (GitHub didn't like the wav directly so I zipped it up).

bad_caption_timing.wav.zip

Thanks again.

Labels: enhancement