yl4579 / StyleTTS2
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS synthesis on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.
Paper: https://arxiv.org/abs/2306.07691
Audio samples: https://styletts2.github.io/
DDP does not work for train_second.py (I have tried everything I could to fix this but had no success, so if you are willing to help, please see #7).
Installation:
git clone https://github.com/yl4579/StyleTTS2.git
cd StyleTTS2
pip install -r requirements.txt
On Windows add:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 -U
Also install phonemizer and espeak if you want to run the demo:
pip install phonemizer
sudo apt-get install espeak-ng
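To confirm that PyTorch and the phonemizer/espeak-ng setup work before training, you can run a short sanity check; this is only a sketch for verification and the test sentence is arbitrary:
import torch
from phonemizer import phonemize
# CUDA should be available if the GPU build of PyTorch was installed correctly
print("CUDA available:", torch.cuda.is_available())
# phonemize a test sentence with the espeak backend that the demo relies on
print(phonemize("StyleTTS 2 converts text to speech.", language="en-us", backend="espeak", preserve_punctuation=True, with_stress=True))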
First stage training:
accelerate launch train_first.py --config_path ./Configs/config.yml
Second stage training (DDP version not working, so the current version uses DP, again see #7 if you want to help):
python train_second.py --config_path ./Configs/config.yml
You can run both consecutively, and it will train both the first and second stages. The model will be saved in the format "epoch_1st_%05d.pth" and "epoch_2nd_%05d.pth". Checkpoints and Tensorboard logs will be saved at log_dir.
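If you want to check what a saved checkpoint contains (for example, which epoch it came from), the sketch below loads one with plain PyTorch; the path is only an example, and the exact key layout inside the file may differ between stages:
import torch
# example path; substitute your own log_dir and epoch number (on newer PyTorch you may need weights_only=False)
ckpt = torch.load("Models/LJSpeech/epoch_2nd_00100.pth", map_location="cpu")
# print the top-level keys that the training script stored
print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))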
The data list format needs to be filename.wav|transcription|speaker; see val_list.txt as an example. The speaker labels are needed for multi-speaker models because we need to sample reference audio for style diffusion model training.
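As a quick sanity check on your own data lists, the sketch below verifies that every line has exactly three |-separated fields; the path Data/val_list.txt is an assumption, so adjust it to wherever your list lives:
# report any malformed lines in a filename.wav|transcription|speaker data list
with open("Data/val_list.txt", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        parts = line.rstrip("\n").split("|")
        if len(parts) != 3:
            print(f"line {i} is malformed: {line!r}")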
In config.yml, there are a few important configurations to take care of (a short verification sketch follows this list):
OOD_data: The path for out-of-distribution texts for SLM adversarial training. The format should be text|anything.
min_length: Minimum length of OOD texts for training. This is to make sure the synthesized speech has a minimum length.
max_len: Maximum length of audio for training. The unit is frames. Since the default hop size is 300, one frame is approximately 300 / 24000 (0.0125) seconds. Lower this if you encounter out-of-memory issues.
multispeaker: Set to true if you want to train a multispeaker model. This is needed because the architecture of the denoiser is different for single-speaker and multispeaker models.
batch_percentage: This is to make sure there are no out-of-memory (OOM) issues during SLM adversarial training. If you encounter an OOM problem, please set a lower number for this.
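A quick way to verify these settings before launching training is to load config.yml and print them. The sketch below does not assume where each key is nested inside the file, it simply searches the loaded dictionary; pyyaml is assumed to be available, since the training scripts read the same file:
import yaml
# load the training configuration
with open("Configs/config.yml") as f:
    config = yaml.safe_load(f)
# search the nested dict for a key, wherever the config places it
def find(d, key):
    if isinstance(d, dict):
        if key in d:
            return d[key]
        for v in d.values():
            found = find(v, key)
            if found is not None:
                return found
    return None
for key in ("OOD_data", "min_length", "max_len", "multispeaker", "batch_percentage"):
    print(key, "=", find(config, key))
# with the default hop size of 300 at 24 kHz, max_len frames correspond to max_len * 300 / 24000 seconds
print("max_len in seconds:", find(config, "max_len") * 300 / 24000)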
In the Utils folder, there are three pre-trained models: the text aligner (ASR folder), the pitch extractor (JDC folder), and the pre-trained PL-BERT model (PLBERT folder).
If you run into out-of-memory issues during training, you can also lower the batch_size or max_len. You may refer to issue #10 for more information.
Finetuning:
The finetuning script is modified from train_second.py and uses DP, as DDP does not work for train_second.py. Please see the note about #7 above if you are willing to help with this problem.
python train_finetune.py --config_path ./Configs/config_ft.yml
Please make sure you have the LibriTTS checkpoint downloaded and unzipped under the folder. The default configuration config_ft.yml finetunes on LJSpeech with 1 hour of speech data (around 1k samples) for 50 epochs. This took about 4 hours to finish on four NVidia A100 GPUs. The quality is slightly worse (similar to NaturalSpeech on LJSpeech) than the LJSpeech model trained from scratch with 24 hours of speech data, which took around 2.5 days to finish on four A100 GPUs.
Please refer to Inference_LJSpeech.ipynb (single-speaker) and Inference_LibriTTS.ipynb (multi-speaker) for details. For LibriTTS, you will also need to download reference_audio.zip and unzip it under the demo folder before running the demo.
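If you downloaded reference_audio.zip manually, unpacking it from Python is a short standard-library call; the target directory name simply follows the instruction above, so adjust it if your demo folder is named differently:
import zipfile
# unpack the reference audio needed by the LibriTTS demo into the demo folder
with zipfile.ZipFile("reference_audio.zip") as z:
    z.extractall("demo")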
The pretrained StyleTTS 2 on LJSpeech corpus in 24 kHz can be downloaded at https://huggingface.co/yl4579/StyleTTS2-LJSpeech/tree/main.
The pretrained StyleTTS 2 model on LibriTTS can be downloaded at https://huggingface.co/yl4579/StyleTTS2-LibriTTS/tree/main.
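If you prefer to fetch the checkpoints from a script rather than the browser, huggingface_hub can download a whole model repository; this is only a sketch, and the files land in the local Hugging Face cache by default:
from huggingface_hub import snapshot_download
# download the full LJSpeech model repository into the local Hugging Face cache
local_dir = snapshot_download(repo_id="yl4579/StyleTTS2-LJSpeech")
print("LJSpeech model files downloaded to:", local_dir)
# the LibriTTS model can be fetched the same way
snapshot_download(repo_id="yl4579/StyleTTS2-LibriTTS")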
Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have permission to use the voice you synthesize. That is, you agree to use only voices whose speakers grant permission to have their voice cloned, either directly or by license, before making synthesized voices public; if you do not have permission to use a voice, you must publicly announce that the voice is synthesized.