Finetuning multi-speaker model? #3733
Unanswered
suckrowPierre
asked this question in General Q&A
Replies: 2 comments 4 replies
-
Maybe your inference call is wrong. You should double-check the speaker_id argument you pass to the model at inference time.
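For example, with the Coqui TTS Python API the speaker is passed explicitly at inference time. A minimal sketch, assuming a fine-tuned checkpoint directory and a speaker name as placeholders:

```python
from TTS.api import TTS

# Placeholder paths: point these at your own fine-tuned run.
tts = TTS(
    model_path="/path/to/your/checkpoint_dir",
    config_path="/path/to/your/checkpoint_dir/config.json",
).to("cuda")

# "speaker" must match a speaker_name the model was actually trained with.
tts.tts_to_file(
    text="Hello from the fine-tuned voice.",
    speaker="my_new_speaker",
    language="en",
    file_path="out.wav",
)
```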
3 replies
-
So how do I finetune a multi-speaker model with multiple new speakers?
1 reply
-
I tried finetuning the XTTS-v2 multi-speaker model, but I am not sure I did it the correct way. I created a train and eval dataset with the structure audio_file|text|speaker_name.
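For illustration only, a metadata file with that structure would look something like this (file names, texts, and the speaker label are placeholders, not my actual data):

```
wavs/clip_0001.wav|This is the first transcribed sentence.|new_speaker_one
wavs/clip_0002.wav|This is another transcribed sentence.|new_speaker_one
```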
For the speaker_name I used the name of my new speaker. After training completed, I loaded the model in the demo Gradio UI as shown in this video: https://www.youtube.com/watch?v=8tpDiiouGxc. I can now generate audio in the voice of the new speaker, but I still need to provide a reference audio, so I wonder whether it really trained on my data or just skipped everything because of the new speaker_name.
I also can't list the speaker IDs with --list_speaker_idxs, or run the model with the tts command. When I try, I get:
NotADirectoryError: [Errno 20] Not a directory: '/home/coqui/jens/model/run/training/GPT_XTTS_FT-May-11-2024_08+18AM-0000000/best_model.pth/model.pth'
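For context, the kind of tts command I mean looks roughly like this (paths and names are placeholders, and I am not certain these are the right flags for a fine-tuned XTTS checkpoint):

```
tts --text "Some test sentence." \
    --model_path /path/to/GPT_XTTS_FT-run/ \
    --config_path /path/to/GPT_XTTS_FT-run/config.json \
    --speaker_idx my_new_speaker \
    --language_idx en \
    --out_path out.wav
```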
What I want is a multi-speaker model with some new speakers.
Can this be done with finetuning?
Or do I need to train a multi-speaker model from scratch with my new speakers plus the data used for XTTS-v2?
Any help would be greatly appreciated. Currently I am a bit lost and can't find any concrete examples for this.