Plachtaa / seed-vc
zero-shot voice conversion & singing voice conversion, with real-time support
The currently released models support zero-shot voice conversion 🔊, zero-shot real-time voice conversion 🗣️ and zero-shot singing voice conversion 🎶. Without any training, they can clone a voice given a reference speech of 1~30 seconds.
We support further fine-tuning on custom data to improve performance on a specific speaker or speakers, with extremely low data requirements (minimum 1 utterance per speaker) and extremely fast training (minimum 100 steps, about 2 minutes on a T4)!
Real-time voice conversion is supported, with an algorithm delay of ~300ms and a device-side delay of ~100ms, making it suitable for online meetings, gaming and live streaming.
To find a list of demos and comparisons with previous voice conversion models, please visit our demo page 🌐 and Evaluation 📊.
We are continuously improving model quality and adding more features.
See EVAL.md for objective evaluation results and comparisons with other baselines.
Python 3.10 is suggested, on Windows, Mac M series (Apple Silicon) or Linux.

Windows and Linux:
```bash
pip install -r requirements.txt
```

Mac M series:
```bash
pip install -r requirements-mac.txt
```
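If you prefer an isolated environment, here is a minimal setup sketch; the environment name `seed-vc-env` is an arbitrary choice, not part of the project:

```bash
# optional: create and activate a clean Python 3.10 environment first
python3.10 -m venv seed-vc-env
source seed-vc-env/bin/activate   # on Windows: seed-vc-env\Scripts\activate
pip install -r requirements.txt   # use requirements-mac.txt on Mac M series
```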
We have released 3 models for different purposes:
Version | Name | Purpose | Sampling Rate | Content Encoder | Vocoder | Hidden Dim | N Layers | Params | Remarks |
---|---|---|---|---|---|---|---|---|---|
v1.0 | seed-uvit-tat-xlsr-tiny (🤗📄) | Voice Conversion (VC) | 22050 | XLSR-large | HIFT | 384 | 9 | 25M | suitable for real-time voice conversion |
v1.0 | seed-uvit-whisper-small-wavenet (🤗📄) | Voice Conversion (VC) | 22050 | Whisper-small | BigVGAN | 512 | 13 | 98M | suitable for offline voice conversion |
v1.0 | seed-uvit-whisper-base (🤗📄) | Singing Voice Conversion (SVC) | 44100 | Whisper-small | BigVGAN | 768 | 17 | 200M | strong zero-shot performance, singing voice conversion |
Checkpoints of the latest model release will be downloaded automatically when inference is run for the first time.
If you are unable to access huggingface for network reasons, try using a mirror by adding HF_ENDPOINT=https://hf-mirror.com before every command.
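For example (the audio file names here are placeholders):

```bash
# route huggingface downloads through the mirror for this one command
HF_ENDPOINT=https://hf-mirror.com python inference.py --source source.wav --target reference.wav --output ./out
```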
Command line inference:
```bash
python inference.py --source <source-wav>
--target <reference-wav>
--output <output-dir>
--diffusion-steps 25 # recommended 30~50 for singing voice conversion
--length-adjust 1.0
--inference-cfg-rate 0.7
--f0-condition False # set to True for singing voice conversion
--auto-f0-adjust False # set to True to auto adjust source pitch to target pitch level, normally not used in singing voice conversion
--semi-tone-shift 0 # pitch shift in semitones for singing voice conversion
--checkpoint <path-to-checkpoint>
--config <path-to-config>
--fp16 True
```
where:
- `source` is the path to the speech file to be converted to the reference voice
- `target` is the path to the speech file used as the voice reference
- `output` is the path to the output directory
- `diffusion-steps` is the number of diffusion steps to use; default is 25, use 30-50 for best quality, 4-10 for fastest inference
- `length-adjust` is the length adjustment factor; default is 1.0, set <1.0 to speed up speech, >1.0 to slow it down
- `inference-cfg-rate` has a subtle effect on the output; default is 0.7
- `f0-condition` is the flag to condition the pitch of the output on the pitch of the source audio; default is False, set to True for singing voice conversion
- `auto-f0-adjust` is the flag to auto adjust the source pitch to the target pitch level; default is False, normally not used in singing voice conversion
- `semi-tone-shift` is the pitch shift in semitones for singing voice conversion; default is 0
- `checkpoint` is the path to the model checkpoint if you have trained or fine-tuned your own model; leave blank to auto-download the default model from huggingface (`seed-uvit-whisper-small-wavenet` if `f0-condition` is `False`, else `seed-uvit-whisper-base`)
- `config` is the path to the model config if you have trained or fine-tuned your own model; leave blank to auto-download the default config from huggingface
- `fp16` is the flag to use float16 inference; default is True

A concrete singing voice conversion invocation is sketched below.
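For example, a singing voice conversion run using only the default (auto-downloaded) SVC model might look like this; the file names are placeholders, and the flag values simply follow the recommendations above:

```bash
# SVC example: pitch-conditioned, more diffusion steps for best quality;
# omitting --checkpoint/--config auto-downloads seed-uvit-whisper-base
python inference.py --source vocals.wav \
  --target singer_reference.wav \
  --output ./converted \
  --diffusion-steps 50 \
  --length-adjust 1.0 \
  --inference-cfg-rate 0.7 \
  --f0-condition True \
  --semi-tone-shift 0 \
  --fp16 True
```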
Voice Conversion Web UI:
```bash
python app_vc.py --checkpoint <path-to-checkpoint> --config <path-to-config> --fp16 True
```
- `checkpoint` is the path to the model checkpoint if you have trained or fine-tuned your own model; leave blank to auto-download the default model from huggingface (`seed-uvit-whisper-small-wavenet`)
- `config` is the path to the model config if you have trained or fine-tuned your own model; leave blank to auto-download the default config from huggingface

Then open the browser and go to http://localhost:7860/ to use the web interface.
Singing Voice Conversion Web UI:
```bash
python app_svc.py --checkpoint <path-to-checkpoint> --config <path-to-config> --fp16 True
```
- `checkpoint` is the path to the model checkpoint if you have trained or fine-tuned your own model; leave blank to auto-download the default model from huggingface (`seed-uvit-whisper-base`)
- `config` is the path to the model config if you have trained or fine-tuned your own model; leave blank to auto-download the default config from huggingface
Integrated Web UI:
```bash
python app.py
```
This will only load pretrained models for zero-shot inference. To use custom checkpoints, please run `app_vc.py` or `app_svc.py` as described above.
Real-time voice conversion GUI:
```bash
python real-time-gui.py --checkpoint <path-to-checkpoint> --config <path-to-config>
```
- `checkpoint` is the path to the model checkpoint if you have trained or fine-tuned your own model; leave blank to auto-download the default model from huggingface (`seed-uvit-tat-xlsr-tiny`)
- `config` is the path to the model config if you have trained or fine-tuned your own model; leave blank to auto-download the default config from huggingface
Important: It is strongly recommended to use a GPU for real-time voice conversion. Some performance testing has been done on an NVIDIA RTX 3060 Laptop GPU; results and recommended parameter settings are listed below:
Model Configuration | Diffusion Steps | Inference CFG Rate | Max Prompt Length | Block Time (s) | Crossfade Length (s) | Extra context (left) (s) | Extra context (right) (s) | Latency (ms) | Inference Time per Chunk (ms) |
---|---|---|---|---|---|---|---|---|---|
seed-uvit-xlsr-tiny | 10 | 0.7 | 3.0 | 0.18 | 0.04 | 2.5 | 0.02 | 430 | 150 |
You can adjust the parameters in the GUI according to your own device performance; the voice conversion stream should work well as long as Inference Time is less than Block Time.
Note that inference speed may drop if you are running other GPU-intensive tasks (e.g. gaming, watching videos).
Explanations for real-time voice conversion GUI parameters:
- `Diffusion Steps` is the number of diffusion steps to use; in the real-time case it is usually set to 4~10 for fastest inference
- `Inference CFG Rate` has a subtle effect on the output; default is 0.7, setting it to 0.0 gains about a 1.5x speed-up
- `Max Prompt Length` is the maximum length of the prompt audio; a low value can speed up inference, but may reduce similarity to the prompt speech
- `Block Time` is the time length of each audio chunk for inference; the higher the value, the higher the latency. Note that this value must be greater than the inference time per block; set it according to your hardware conditions
- `Crossfade Length` is the time length of the crossfade between audio chunks; normally it does not need to be changed
- `Extra context (left)` is the time length of extra history context for inference; the higher the value, the higher the inference time, but it can increase stability
- `Extra context (right)` is the time length of extra future context for inference; the higher the value, the higher the inference time and latency, but it can increase stability

The algorithm delay is approximately calculated as Block Time * 2 + Extra context (right); device-side delay is usually ~100ms. The overall delay is the sum of the two. For example, with the settings in the table above, the algorithm delay is 0.18 × 2 + 0.02 = 0.38 s.
You may wish to use VB-CABLE to route audio from the GUI output stream to a virtual microphone.
(GUI and audio chunking logic are modified from RVC, thanks for their brilliant implementation!)
Fine-tuning on custom data allows the model to clone someone's voice more accurately. It largely improves speaker similarity on particular speakers, but may slightly increase WER.
A Colab Tutorial is here for you to follow.
To prepare your own dataset, put all audio files in one folder; the files should be in one of the following formats: `.wav`, `.flac`, `.mp3`, `.m4a`, `.opus`, `.ogg`.
Choose a model config file from `configs/presets/` for fine-tuning, or create your own to train from scratch:
- `./configs/presets/config_dit_mel_seed_uvit_xlsr_tiny.yml` for real-time voice conversion
- `./configs/presets/config_dit_mel_seed_uvit_whisper_small_wavenet.yml` for offline voice conversion
- `./configs/presets/config_dit_mel_seed_uvit_whisper_base_f0_44k.yml` for singing voice conversion

Then run:
```bash
python train.py \
  --config <path-to-config> \
  --dataset-dir <path-to-data> \
  --run-name <run-name> \
  --batch-size 2 \
  --max-steps 1000 \
  --max-epochs 1000 \
  --save-every 500 \
  --num-workers 0
```
where:
- `config` is the path to the model config; choose one of the above for fine-tuning or create your own for training from scratch
- `dataset-dir` is the path to the dataset directory, which should be a folder containing all the audio files
- `run-name` is the name of the run, used to save the model checkpoints and logs
- `batch-size` is the batch size for training; choose depending on your GPU memory
- `max-steps` is the maximum number of steps to train; choose depending on your dataset size and training time budget
- `max-epochs` is the maximum number of epochs to train; choose depending on your dataset size and training time budget
- `save-every` is the number of steps between model checkpoint saves
- `num-workers` is the number of workers for data loading; set to 0 for Windows

If training accidentally stops, you can resume it by running the same command again; training will continue from the last checkpoint. (Make sure the `run-name` and `config` arguments are the same so that the latest checkpoint can be found.) A concrete fine-tuning invocation is sketched below.
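For example, a minimal fine-tune on a small single-speaker dataset could look like the following; the dataset path and run name are placeholders, and the config is the offline VC preset from the list above:

```bash
# hypothetical fine-tune of the offline VC preset on a single custom speaker;
# ./data/my_speaker and my-speaker-ft are placeholders
python train.py \
  --config ./configs/presets/config_dit_mel_seed_uvit_whisper_small_wavenet.yml \
  --dataset-dir ./data/my_speaker \
  --run-name my-speaker-ft \
  --batch-size 2 \
  --max-steps 1000 \
  --max-epochs 1000 \
  --save-every 500 \
  --num-workers 0
```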
After training, you can use the trained model for inference by specifying the path to the checkpoint and config file. They are saved under `./runs/<run-name>/`, with the checkpoint named `ft_model.pth` and the config file sharing the same name as the training config file; see the example below.
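Continuing the hypothetical run above, inference with the fine-tuned model could look like this (the source and target file names are placeholders):

```bash
# use the fine-tuned checkpoint and the config saved alongside it
python inference.py --source source.wav \
  --target target.wav \
  --output ./out \
  --checkpoint ./runs/my-speaker-ft/ft_model.pth \
  --config ./runs/my-speaker-ft/config_dit_mel_seed_uvit_whisper_small_wavenet.yml
```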
Known issue: `real-time-gui.py` might raise the error `ModuleNotFoundError: No module named '_tkinter'`; in this case, a new Python version with Tkinter support should be installed. Refer to This Guide on Stack Overflow for an explanation of the problem and a detailed fix.