Fine-tuning model on custom dataset results in audio generations with high treble #3746
Unanswered
chinmay-choudhary asked this question in General Q&A
I am fine-tuning the XTTS v2 model on a custom dataset that I have formatted into the LJSpeech data format. The dataset contains audio clips and transcripts of speakers speaking English with Indian accents. I am currently training with 10K data samples. The problem I am having is that when I listen to the audio generated from my test sentences in TensorBoard, all of it seems to have a lot of treble. I will upload one of the audio samples below, where I had to reduce the treble by -10 dB in Audacity to make it sound better. Is there any way to control this, or any advice on the type of data I should use for training to avoid this issue?
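For context on the dataset side, here is a minimal sketch of the LJSpeech layout I am describing: a `metadata.csv` with pipe-separated `id|transcript|normalized transcript` lines, and the matching clips under `wavs/`. The directory name, speaker IDs, and sentences below are made-up examples, not my actual data:

```python
from pathlib import Path

# Hypothetical sample entries: (clip id without extension, raw transcript).
samples = [
    ("spk1_0001", "Hello, how are you?"),
    ("spk1_0002", "The weather is nice today."),
]

def write_ljspeech_metadata(root: Path, samples):
    """Write metadata.csv in the LJSpeech layout:
    <id>|<transcript>|<normalized transcript>, one line per clip,
    with audio expected at <root>/wavs/<id>.wav."""
    (root / "wavs").mkdir(parents=True, exist_ok=True)
    with open(root / "metadata.csv", "w", encoding="utf-8") as f:
        for clip_id, text in samples:
            # Using the raw text for both transcript columns; swap in a
            # properly normalized version for the third field if you have one.
            f.write(f"{clip_id}|{text}|{text}\n")

write_ljspeech_metadata(Path("my_dataset"), samples)
```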
Original English Audio
english.mov
Treble Reduced Audio
english_treble_reduced.wav.mov
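In case it is useful to others, the treble cut I did by hand in Audacity can be scripted as a post-processing step (this only masks the symptom; it is not a fix for whatever the model learned). Below is a sketch of a standard high-shelf biquad using the RBJ Audio EQ Cookbook formulas with shelf slope S = 1; the 4 kHz corner frequency and -10 dB gain are my assumptions, not values taken from Audacity's preset:

```python
import numpy as np
from scipy.signal import lfilter

def high_shelf(x, fs, f0=4000.0, gain_db=-10.0):
    """Apply an RBJ Audio EQ Cookbook high-shelf biquad (S = 1).
    gain_db < 0 attenuates content above roughly f0, similar in
    spirit to Audacity's Bass and Treble 'Treble' slider."""
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    cosw0, sinw0 = np.cos(w0), np.sin(w0)
    alpha = sinw0 / 2.0 * np.sqrt(2.0)  # shelf slope S = 1
    b0 = A * ((A + 1) + (A - 1) * cosw0 + 2 * np.sqrt(A) * alpha)
    b1 = -2 * A * ((A - 1) + (A + 1) * cosw0)
    b2 = A * ((A + 1) + (A - 1) * cosw0 - 2 * np.sqrt(A) * alpha)
    a0 = (A + 1) - (A - 1) * cosw0 + 2 * np.sqrt(A) * alpha
    a1 = 2 * ((A - 1) - (A + 1) * cosw0)
    a2 = (A + 1) - (A - 1) * cosw0 - 2 * np.sqrt(A) * alpha
    # Normalize by a0 and filter the signal.
    return lfilter([b0 / a0, b1 / a0, b2 / a0], [1.0, a1 / a0, a2 / a0], x)
```

Load the generated wav (e.g. with `scipy.io.wavfile`), pass the samples and sample rate through `high_shelf`, and write the result back out; low frequencies are left essentially untouched while content above the corner is cut by about the requested gain.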