
Bad timestamp prediction with some finetuned Whisper models #173

Open
lumpidu opened this issue Feb 25, 2024 · 9 comments

Comments

@lumpidu

lumpidu commented Feb 25, 2024

I see the following incorrect transcriptions when running my tests with the fine-tuned model language-and-voice-lab/whisper-large-icelandic-62640-steps-967h:

"segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 30.0,
      "text": "afbrot og refsjábyrgð eitt efnisyfirlit ...",
      "tokens": [...],
      "temperature": 0.0,
      "avg_logprob": -0.024239512714179786,
      "compression_ratio": 1.8644067796610169,
      "no_speech_prob": 8.609986252849922e-05,
      "confidence": 0.988,
      "words": [ ... ]
     ....
    },
    {
      "id": 1,
      "seek": 3000,
      "start": 30.0,
      "end": 31.58,
      "text": "ilög á grundvelli þjóðréttarsamninga tuttugu og tvö þrjú íslensk refsilög og áhrif mannréttindareglna...",
      "tokens": [ ... ],
      "confidence": 0.031,
     ...
    },
    {
      "id": 2,
      "seek": 6000,
      "start": 59.74,
      "end": 60.8,
      "text": "fsiréttar í fræðikerfi lögfræðinnar tuttugu og sjö fjögur grundvallarhugtökin afbrot og refsing tuttugu og sjö...",
      "tokens": [ ... ],
      "confidence": 0.011,
     ...
    },
...
]

Take a look at the start/end segment data:

  • these don't line up for segments 1 and 2, i.e. the start of id 2 doesn't begin at the end of segment 1
  • segment id 1's end - start is 1.58 seconds, but in fact the segment length is 29.74 seconds, as can be seen from segment 2's start time of 59.74
  • segment id 2's end - start is 1.06 seconds, but in fact the segment length is 29.76 seconds, as segment 3 starts at 89.5
  • the transcribed first words of segment 1 (ilög) and segment 2 (fsiréttar) aren't correct, because these segments start in the middle of a spoken word
  • the confidence of all segments that have wrong start/end timings is very low; in the above case it's << 0.1. For all non-problematic segments it's often close to 1.0, e.g. 0.986
  • BUT most transcriptions of the segments are actually correct (apart from the first and last words)

There is no warning on stderr/stdout about non-aligning segments or low confidence values of the transcripts. There is also no way any ASR system can generate correct first or last words if segments start or end in the middle of a spoken word. Therefore I suggest using a less naive approach, either via VAD or via overlapping segments. It's not clear to me which of these approaches has already been implemented by whisper_timestamped.

Originally posted by @lumpidu in #64 (comment)

Here is an audio file (wav converted to webm at the highest possible quality) that can be used to reproduce the error:

demo1_ice.webm

I have tried several different approaches: default values, the default values as stated for whisper, and with or without VAD (silero-4.0). The best results were with VAD turned on.

@Jeronymous
Member

Thank you @lumpidu

Can you please also give the options you use to get the transcription with the bad (too short) second segment?

If I just run

whisper_timestamped iceland.webm --model language-and-voice-lab/whisper-large-icelandic-62640-steps-967h

I get this, which seems more correct:
[screenshot of the transcription output]

@Jeronymous
Member

I could see some problems with the option --accurate.

And here is my guess:
That model was finetuned only on segments of less than 30 seconds, without predicting the timestamp of the end of each segment.
That's why each text segment is quite long.
So actually, with such finetuning, Whisper models lose their ability to predict timestamps.
And you will have problems transcribing long-form audio (more than 30 seconds) with such a model.

The only thing you can do to alleviate the impact with whisper-timestamped is to use the option --recompute_all_timestamps True (if you are using the CLI; in Python code it's whisper_timestamped.transcribe(..., trust_whisper_timestamps=False)).
What this option does is simply ignore the timestamps predicted by the Whisper model (which seem to be quite bad with such a finetuned model).
Can you please try that option, @lumpidu, and tell us if it solves the issue for you?
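
For reference, a minimal sketch of that call in Python, assuming whisper_timestamped can load this HuggingFace checkpoint by name (the file and model names are taken from this thread):

import whisper_timestamped as whisper

audio = whisper.load_audio("demo1_ice.webm")
model = whisper.load_model("language-and-voice-lab/whisper-large-icelandic-62640-steps-967h")

# trust_whisper_timestamps=False tells whisper_timestamped to recompute all
# timestamps instead of relying on the (here unreliable) predicted ones.
result = whisper.transcribe(model, audio, language="is", trust_whisper_timestamps=False)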

There will still be the issue that some parts of the audio might be either repeated or missing in your transcription when transcribing audio of more than 30 seconds with such a model.
The solution I see is to use a VAD to cut the audio into pieces of at most 30 seconds.
(This is not what the VAD option of whisper-timestamped does: that one just removes silent parts to avoid Whisper hallucinations on them.)
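
Such chunking is not built into whisper-timestamped; here is a hedged sketch of what it could look like, using Silero VAD via torch.hub (the greedy 30-second merge strategy is an assumption for illustration, not the library's method):

import torch
import whisper_timestamped as whisper

SAMPLE_RATE = 16000  # whisper's load_audio resamples to 16 kHz

# Silero's published torch.hub entry point
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]

audio = whisper.load_audio("demo1_ice.webm")
speech = get_speech_timestamps(torch.from_numpy(audio), vad_model, sampling_rate=SAMPLE_RATE)

# Greedily merge speech regions into chunks of at most 30 seconds
# (a single speech region longer than 30 s would still need further splitting)
max_len = 30 * SAMPLE_RATE
chunks = []
for seg in speech:
    if chunks and seg["end"] - chunks[-1][0] <= max_len:
        chunks[-1][1] = seg["end"]
    else:
        chunks.append([seg["start"], seg["end"]])

# Transcribe each chunk separately and shift segment timestamps back to the
# full audio (word-level timestamps would need the same offset).
model = whisper.load_model("language-and-voice-lab/whisper-large-icelandic-62640-steps-967h")
all_segments = []
for start, end in chunks:
    result = whisper.transcribe(model, audio[start:end], language="is")
    for s in result["segments"]:
        s["start"] += start / SAMPLE_RATE
        s["end"] += start / SAMPLE_RATE
        all_segments.append(s)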

@Jeronymous Jeronymous changed the title Incorrect start/end segment times Bad timestamp prediction with some finetuned Whisper models Feb 26, 2024
@Jeronymous
Member

Another thing you could try is the regular model, --model large-v3 --language is, instead of the finetuned one.
Maybe the transcription won't be as accurate in some places, but I guess you won't have those alignment issues.
And you will see that the text segments are much shorter (a few seconds), corresponding more closely to what one would see in subtitles.

@lumpidu
Author

lumpidu commented Feb 26, 2024

For the above output I used beam_size=5, best_of=5, vad='silero:v4.0', temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0). These correspond to the --accurate option, right?

The best result was with no options at all, i.e. transcribe('language-and-voice-lab/whisper-large-icelandic-62640-steps-967h', "demo1_ice.webm", language="is"). "Best results" means: there were no obvious segment metadata issues.

I will rerun with your proposals.

@lumpidu
Author

lumpidu commented Feb 26, 2024

I just ran with whisper-large-v3. This splits the audio into much smaller segments and does a complete reverse normalization, which I actually don't want. Is there a way to prevent the reverse normalization and just get normalized text?

Could you elaborate on what exactly would be needed for fine-tuning models to predict better timestamps?

@Jeronymous
Member

Jeronymous commented Feb 26, 2024

Have you tried that with the finetuned model? whisper_timestamped.transcribe(..., trust_whisper_timestamps=False)

Concerning text normalization, you mean that there are digits instead of numbers written with letters, upper-case letters, and punctuation marks?
Apart from normalizing the text yourself as you want it, I don't see another option.
Removing upper-case letters and punctuation marks is easy.
Converting digits to letters can be done with, for instance, num2words (https://pypi.org/project/num2words/).
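
A minimal sketch of such a normalization step, assuming num2words supports your language code (check the project page for the supported list; "is" for Icelandic would need to be verified):

import re
from num2words import num2words

def normalize(text, lang="is"):
    text = text.lower()
    # spell out digit sequences as cardinals, e.g. "22" -> words
    # (ordinals like "22." would need num2words(..., to="ordinal"))
    text = re.sub(r"\d+", lambda m: num2words(int(m.group()), lang=lang), text)
    # drop punctuation, keeping letters (incl. non-ASCII), digits and spaces
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Afbrot og refsiábyrgð, 22. kafli!"))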

Concerning fine-tuning, models should be finetuned to predict timestamps at the end of each segment.
Most people finetune Whisper models to predict only the transcription of small segments, without predicting the start/end timestamps, which makes Whisper lose its ability to be applied to audio of more than 30 seconds.
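
To make that concrete, here is a hedged sketch using openai-whisper's tokenizer of what a timestamp-supervised training target looks like compared to a text-only target (the segment text and the 5.12 s end time are made up for illustration):

from whisper.tokenizer import get_tokenizer

tok = get_tokenizer(multilingual=True, language="is", task="transcribe")

# Timestamp tokens occupy ids >= tok.timestamp_begin, in steps of 0.02 s
def timestamp_token(seconds):
    return tok.timestamp_begin + round(seconds / 0.02)

text_tokens = tok.encode(" afbrot og refsing")  # hypothetical segment text

# Text-only target, as in many fine-tuning recipes: timestamp ability is lost
target_no_ts = list(tok.sot_sequence_including_notimestamps) + text_tokens + [tok.eot]

# Timestamp-supervised target: the model keeps learning segment boundaries
target_ts = list(tok.sot_sequence) + [timestamp_token(0.0)] + text_tokens + [timestamp_token(5.12), tok.eot]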

@Jeronymous
Member

@lumpidu have you tried the option trust_whisper_timestamps=False (in Python, or --recompute_all_timestamps True in the CLI) with the finetuned model?

@MohammedMehdiTBER

MohammedMehdiTBER commented Mar 19, 2024

--recompute_all_timestamps True

I am using the large-v3 model but the subtitles are still a little out of sync (somewhat faster than they should be, so I had to pause the video to read the text). Can --recompute_all_timestamps True help to fix this issue?
Here are my settings:
!whisper_timestamped "/content/Kudüs Fatihi Selahaddin Eyyubi 17. Bölüm @trt1 [sjYA9p06bwc].mkv" --model large-v3 --output_dir "/content/drive/MyDrive/SDAB17" --task translate --language Turkish --accurate --compression_ratio_threshold 1 --compute_confidence True --punctuations_with_words True --vad True --vad silero --detect_disfluencies True --device cuda --threads 2 --verbose True

I also found that some voices are not being detected by the script, maybe because of the VAD?

@Jeronymous
Copy link
Member

Jeronymous commented Mar 20, 2024

can --recompute_all_timestamps True help to fix this issue?

Maybe... It cannot hurt to try and see

I also found that some voices are not being detected by the script maybe because of vad?

Indeed, silero is a statistical model (a neural net) with some weird behaviour at times.

You can try --vad auditok
(and you should use the --vad argument only once in your command line: only the last option value you specify will be effective).

Also, you can use the --plot option to inspect the VAD result on your signal, to see if that's the problem.
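
Putting those suggestions together, an illustrative (untested) reworking of the command above, keeping a single --vad option and trimming nothing else, could be:

whisper_timestamped "/content/Kudüs Fatihi Selahaddin Eyyubi 17. Bölüm @trt1 [sjYA9p06bwc].mkv" --model large-v3 --output_dir "/content/drive/MyDrive/SDAB17" --task translate --language Turkish --accurate --recompute_all_timestamps True --compute_confidence True --punctuations_with_words True --vad auditok --plot --detect_disfluencies True --device cuda --threads 2 --verbose True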
