Replies: 1 comment
Hi @theshypig,
You could extract the BERT embeddings separately, save them to a file, and load them during training via https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/asr1/asr.sh#L753
You do not need to modify it.
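For example, a minimal extraction script could look like this (a sketch assuming HuggingFace `transformers` and a Kaldi-style `text` file; the one-`.npy`-per-utterance layout plus an scp index is just one convenient convention, and all paths are placeholders):

```python
# Sketch: dump one .npy of BERT last-hidden-state vectors per utterance,
# plus an scp-style index file mapping uttid -> path.
from pathlib import Path

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

out_dir = Path("dump/bert_emb")
out_dir.mkdir(parents=True, exist_ok=True)

# Kaldi-style text file: each line is "uttid transcript ..."
with open("data/train/text") as fin, open(out_dir / "emb.scp", "w") as scp:
    for line in fin:
        uttid, text = line.rstrip("\n").split(maxsplit=1)
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            # last_hidden_state: (1, num_tokens, hidden_size)
            emb = bert(**enc).last_hidden_state.squeeze(0).numpy()
        path = out_dir / f"{uttid}.npy"
        np.save(path, emb)
        scp.write(f"{uttid} {path}\n")
```

If I remember correctly, an scp file like this can then be declared as an extra training input with the `npy` data type (e.g., via `--train_data_path_and_name_and_type`), but please double-check against the current ESPnet2 options.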
---
Dear ESPnet Team,
I hope this message finds you well. I am writing to inquire about the possibility of implementing cross-modal knowledge transfer in CTC-based ASR models using a BERT pre-trained model within the ESPnet framework.
From my observations, the current ESPnet framework does not appear to support this out of the box. I noticed in a previous issue that someone had raised a similar question and was told that knowledge distillation is implemented in the TTS component; however, that does not fit my needs.
Upon reviewing the TTS implementation, I noticed that it requires two GPUs to perform distillation, which seems rather resource-intensive. My current idea is to run the BERT pre-trained model over the ASR transcripts, save its last-hidden-layer representations, load them during training, and then perform the cross-modal knowledge transfer.
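As a sketch of the kind of transfer objective I have in mind (all names below are hypothetical, not existing ESPnet code):

```python
# Hypothetical transfer loss: pull a pooled acoustic representation toward
# the pooled BERT representation, added on top of the usual CTC loss.
import torch
import torch.nn.functional as F

def transfer_loss(
    enc_out: torch.Tensor,    # (B, T, D_enc) encoder output frames
    enc_lens: torch.Tensor,   # (B,) valid frame counts
    bert_emb: torch.Tensor,   # (B, L, D_bert) saved BERT hidden states
    bert_lens: torch.Tensor,  # (B,) valid token counts
    proj: torch.nn.Linear,    # maps D_enc -> D_bert
) -> torch.Tensor:
    # Mean-pool over valid positions; this sidesteps frame/token alignment
    # at the cost of utterance-level (rather than token-level) supervision.
    t = torch.arange(enc_out.size(1), device=enc_out.device)
    l = torch.arange(bert_emb.size(1), device=bert_emb.device)
    acoustic = (enc_out * (t[None, :] < enc_lens[:, None]).unsqueeze(-1)).sum(1) / enc_lens[:, None]
    semantic = (bert_emb * (l[None, :] < bert_lens[:, None]).unsqueeze(-1)).sum(1) / bert_lens[:, None]
    # 1 - cosine similarity as the transfer objective; MSE would also work.
    return (1.0 - F.cosine_similarity(proj(acoustic), semantic, dim=-1)).mean()

# total_loss = ctc_loss + kd_weight * transfer_loss(...)  # kd_weight tunable
```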
During this process, I have two areas of uncertainty where I would greatly appreciate your advice:
First, if I want to introduce the saved BERT embeddings during training, which parts should I modify (e.g., `iterable_dataset.py`, `dataset.py`, the preprocessor, the trainer's batch fetching, or the training loss)?
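To make this first question concrete, here is roughly how I imagine reading the saved embeddings back when batches are built (`NpyScpReader` is my guess at the relevant ESPnet2 helper; the fallback does the same with plain NumPy):

```python
# Sketch: read per-utterance embeddings back, keyed by utterance id.
import numpy as np

try:
    from espnet2.fileio.npy_scp import NpyScpReader
    emb_reader = NpyScpReader("dump/bert_emb/emb.scp")
except ImportError:
    # Fallback: treat emb.scp as a plain "uttid path" mapping.
    with open("dump/bert_emb/emb.scp") as f:
        emb_reader = {u: np.load(p) for u, p in (line.split() for line in f)}

emb = emb_reader["utt0001"]  # (num_tokens, hidden_size) for one utterance
```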
Second, what is the purpose of `aux_ctc_tasks` during model training, and do I need to modify it?

I look forward to your valuable insights and suggestions. Thank you for your time and consideration.
Best regards,
ck