Equivalent of transformer's chunk_length_s in whisper.cpp #2165

digikar99 · 2024-05-18T06:41:40Z

Hello, thank you very much for whisper.cpp!

While trying out a fine-tuned model with whisper.cpp (specifically, whisper-hindi-large-v2), I noticed poor performance on a particular audio file compared to using Hugging Face directly. After a bit of debugging, I was able to narrow it down to the Hugging Face inference example given in the repository, which uses chunk_length_s=30. If this option is removed from their pipeline, the performance is as poor as with whisper.cpp.

I wonder whether an equivalent of chunk_length_s is already implemented in whisper.cpp. Here is the implementation in the huggingface/transformers repository. If so, which parameters should I be using?

More specific questions:

Is there anything that controls the "stride_length" of the processing?
I think I understand max-len, but in what situations is the max-context useful?

My current idea is to run whisper.cpp multiple times with appropriate "offset-t" and "duration" values, collect the outputs, and finally run a find_longest_common_sequence over them.
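A minimal sketch of that idea, under several assumptions: the window and overlap sizes are arbitrary, the whisper.cpp binary, model, and audio paths are placeholders, and merge_overlapping is a simplified exact-match stand-in for transformers' find_longest_common_sequence (which works on token ids and tolerates mismatches). The helper names chunk_windows, whisper_cpp_args, and merge_overlapping are hypothetical and not part of either project.

```python
from typing import List, Tuple


def chunk_windows(total_ms: int, chunk_ms: int = 30_000,
                  overlap_ms: int = 5_000) -> List[Tuple[int, int]]:
    """Return (offset_ms, duration_ms) windows; consecutive windows share overlap_ms."""
    if not 0 <= overlap_ms < chunk_ms:
        raise ValueError("overlap_ms must be in [0, chunk_ms)")
    step = chunk_ms - overlap_ms
    windows = []
    offset = 0
    while offset < total_ms:
        duration = min(chunk_ms, total_ms - offset)
        windows.append((offset, duration))
        if offset + duration >= total_ms:
            break  # this window already reaches the end of the audio
        offset += step
    return windows


def whisper_cpp_args(binary: str, model: str, wav: str,
                     offset_ms: int, duration_ms: int) -> List[str]:
    """Build a whisper.cpp command line for one window.

    -ot / --offset-t and -d / --duration take milliseconds; -nt disables
    timestamps. The binary, model, and wav paths are placeholders.
    """
    return [binary, "-m", model, "-f", wav,
            "-ot", str(offset_ms), "-d", str(duration_ms), "-nt"]


def merge_overlapping(left: List[str], right: List[str],
                      max_overlap: int = 50) -> List[str]:
    """Join two word lists, dropping the longest suffix of `left` that
    exactly matches a prefix of `right` (simplified, exact-match only)."""
    limit = min(len(left), len(right), max_overlap)
    for n in range(limit, 0, -1):
        if left[-n:] == right[:n]:
            return left + right[n:]
    return left + right  # no shared words found: plain concatenation
```

Each argument list could be passed to subprocess.run to obtain one transcript per window, and the transcripts (split into words) folded together left to right with merge_overlapping. Exact matching will miss overlaps where the two windows transcribe the shared audio slightly differently, so a fuzzy comparison may be needed in practice.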

For this particular audio, it seems it is the presence of multiple speakers that is confusing whisper. So, diarizing or clustering the audio and processing each speaker/cluster individually might be a better idea than doing all this.

This is the inference code given in the repository.

import torch
from transformers import pipeline

# path to the audio file to be transcribed
audio = "/path/to/audio.format"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

transcribe = pipeline(task="automatic-speech-recognition", model="vasista22/whisper-hindi-large-v2", chunk_length_s=30, device=device)
transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="hi", task="transcribe")

print('Transcription: ', transcribe(audio)["text"])

Output with chunk_length_s=30:

Transcription: हैलो हैलो मैं बोल रहा हूँ आपकी क्वीरी जो सॉल्व नहीं हुई थी फर्स्ट आपका मेल मिला था हमारी तरफ से तीन मेल भी गए थे उसका रिस्पॉन्स नहीं अच्छा मैं ये बता रहा था मैडम इसमें ज्यादा प्रॉब्लम होगी नहीं तो आज मैने स्पेशली अपने ब्रांच मैनेजर से कहा मैने कहा निधि का करो या उनका फिर वो फंड में हाँ तो मैं अभी क्या करूँ मैडम मैं आपकी कॉल ट्रांसफर कर रहा हूँ अभी दो एजेंट �्य और भोलू प्रसाद दोनों खाली है न तो इनमें से मैडम को कोई भी समझा देगा भोलू प्रसाद को ट्रांसफर कर देता हूँ मैं आपकी कॉल नहीं नहीं वेट वेट वेट हाँ एकलव्य सर आई थिंक अपने एकलव्य किसी का नाम या उनको ट्रांसफर कर दीजिये आई डोंट नो

Output without chunk_length_s=30:

Transcription: हैलो हैलो मैं बोल रहा हूँ आपकी क्वीरी जो सॉल्व नहीं हुई थी फर्स्ट आपका मेल मिला था हमारी तरफ से तीन मेल भी गए थे उसका रिस्पॉन्स नहीं अभी

ggerganov · 2024-05-19T08:48:57Z

There is currently no option to set the chunk size, but using --no-timestamps would be equivalent to a chunk size of 30s. Try adding this flag and see if it helps.

digikar99 · 2024-05-19T09:42:19Z

Thanks for getting back!

Nope, --no-timestamps does not help; it produces the same output.

I tried out whisperX today. For this particular audio, it worked amazingly well! Interestingly, merely using faster-whisper (a dependency of whisperX) alone did not help.
