Improving timestamps for words

Hi,

Wanted to start off with a big thanks for porting whisper to C++. It has been very useful for integration on iOS.

I wanted to open this issue to see if you had any thoughts or suggestions on how to address an issue we're seeing in one of our sample audio files.

Roughly 56 seconds into the audio file, one of the speakers says 'What?' after a longish pause (6-7 seconds) following the previous word.

Running it through whisper.cpp:
```
[49.9 --> 50.47] | What| (Confidence: 0.83800673)
```

Running it through openai/whisper:
```
>>> stab_segments[-3]
{'id': 26, 'seek': 2840, 'start': 49.4, 'end': 50.4, 'text': ' What?', 'tokens': [708, 30], 'temperature': 0.0, 'avg_logprob': -0.36360436898690685, 'compression_ratio': 1.6474820143884892, 'no_speech_prob': 0.00023565757146570832}
```

Running it through jianfch/stable-ts:
```
>>> results['segments'][-2]['whole_word_timestamps']
[{'word': ' What?', 'timestamp': 56.8799991607666}]
```

So it looks like stable-ts has made some changes that properly detect the timing of "What?" (about 56.9 s, versus roughly 49-50 s from both whisper.cpp and openai/whisper). stable-ts appears to have some silence detection, which is likely helping in this scenario.
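
To make the silence-detection idea concrete, here is a minimal sketch of what I have in mind. It is not how stable-ts actually does it (I haven't traced its implementation); it's just a simple energy-threshold heuristic. The function names, the 20 ms frame size, and the 0.01 threshold are all assumptions chosen for illustration. It takes mono samples as floats in [-1, 1] and, if a word's reported span falls entirely in silence, shifts it forward to the next frame that has energy.

```python
import math

def frame_rms(samples, frame_len):
    """RMS energy of consecutive, non-overlapping frames of frame_len samples."""
    energies = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return energies

def snap_to_speech(start, end, samples, sample_rate,
                   threshold=0.01, frame_sec=0.02):
    """Hypothetical helper: if [start, end] (seconds) contains no frame above
    `threshold`, move the span forward to the next frame that does.
    Returns the span unchanged when it already overlaps audible audio
    or when no later audible frame exists."""
    frame_len = max(1, round(sample_rate * frame_sec))
    energies = frame_rms(samples, frame_len)

    first = int(start * sample_rate) // frame_len
    last = min(len(energies), math.ceil(end * sample_rate / frame_len))

    # Span already overlaps something audible: leave it alone.
    if any(e >= threshold for e in energies[first:last]):
        return start, end

    # Otherwise look forward for the first audible frame.
    for idx in range(last, len(energies)):
        if energies[idx] >= threshold:
            new_start = idx * frame_len / sample_rate
            return new_start, new_start + (end - start)
    return start, end

# Synthetic check: 5 s of silence, then 1 s of constant-amplitude "speech".
rate = 100
audio = [0.0] * (5 * rate) + [0.5] * rate
shifted = snap_to_speech(1.0, 1.5, audio, rate, frame_sec=0.1)
print(shifted)
```

On real audio the threshold would need tuning against the noise floor, and a constant threshold will not separate speech from loud non-speech sounds, so treat this as a starting point for post-processing the reported timestamps rather than a fix.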

So my question is: is there anything I can do on the input side to help improve these scenarios? Is the only solution to adjust the core logic in the library itself? Are any of the improvements in stable-ts scheduled to be added to this repo as well?

I have attached the sample audio file. (GitHub didn't like the wav directly so I zipped it up).

[bad_caption_timing.wav.zip](https://github.com/ggerganov/whisper.cpp/files/10220652/bad_caption_timing.wav.zip)

Thanks again.
