[Image: a spectrogram in orange and purple on a black background.]

There are a bunch of different ways to do voice cloning, and they all come with their own ups and downs. This tutorial will show you some of those ways, and teach you a little bit about how voice cloning even works.

Intro to synthetic voices

Different types of synthetic voices

There are basically two types of synthetic voice software: text-to-speech (TTS) and speech-to-speech (STS), the latter also known as voice-to-voice (V2V) or voice conversion (VC).

Text-to-speech (TTS)

Text-to-speech is the most widespread. It is what makes Siri speak, what allows articles to be read aloud automatically, and it is an essential tool for people with impaired vision. To use it, you input text, and the software converts it to speech.

This is the area in which most synthetic voice research has been done. It has a long history that goes back to at least the middle of the 20th century. Before machine learning entered the arena, it was done using a method called concatenative synthesis, which worked by dividing sounds into small linguistic units and then putting them together through handwritten code. Now, with machine learning, we can let the computer divide and put them together for us, which saves us a whole lotta time. For more info on the history of speech synthesis, the Wikipedia article on the topic gives an excellent introduction: https://en.wikipedia.org/wiki/Speech_synthesis.

Training a machine learning model to do text-to-speech requires two types of data: recordings of speech and transcriptions of those recordings. The machine learning model then finds the connections between the transcriptions and recordings itself. Training a TTS model from scratch usually requires at least one hour of voice data, but preferably somewhere between 8 and 24 hours. Using platforms like ElevenLabs, we need much less data than this and do not require transcriptions. We will get back to this later, including why it comes with its own downsides.

Speech-to-speech (STS)

This type of synthetic speech is much newer and has really only become a thing thanks to machine learning. It is also sometimes known as voice-to-voice (V2V) or voice conversion (VC). With STS software, you input a voice recording, which is then transformed to sound like the voice the model was trained on.

Training an STS model only requires a dataset of voice recordings. The amount of data required is significantly less than for TTS. In some cases, you need no more than 1 minute. Some newer frameworks also claim to work with only 10 seconds, but this should definitely be taken with a grain of salt.

Hybrid forms

It is also possible to combine the two models in one flow in an attempt to get the benefits from both.

While they do not say so themselves, ElevenLabs (one of the most popular platforms for voice synthesis) seems to use a hybrid form, since they only require 1 minute of data for both TTS and STS.

Hybrid forms work by first producing the text-to-speech output with a generalized voice model, and then transforming that to sound like your desired voice with a custom STS model.
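To make the flow concrete, here is a toy Python sketch of the hybrid idea. Both functions are placeholders invented for illustration; a real pipeline would call an actual TTS engine and a trained STS model.

```python
# A toy sketch of the hybrid flow. Both functions are placeholders:
# a real pipeline would call an actual TTS engine and a trained STS model.

def generic_tts(text: str) -> str:
    """Stand-in for a general-purpose TTS engine with a stock voice."""
    return f"<generic voice saying: {text}>"

def custom_sts(audio: str) -> str:
    """Stand-in for a custom STS model converting any voice into yours."""
    return audio.replace("generic voice", "cloned voice")

def hybrid_clone(text: str) -> str:
    generated = generic_tts(text)   # step 1: synthesize with a generalized model
    return custom_sts(generated)    # step 2: convert to the target voice

print(hybrid_clone("Hello there"))  # → <cloned voice saying: Hello there>
```

The key design point is the composition: the TTS stage only has to be trained once on lots of generic data, while the STS stage can be trained on your small personal dataset.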

The easy way to do voice cloning

If you just want to get started doing voice cloning, you can use a commercial platform like ElevenLabs or resemble.ai. They are easy to use and work even with very little data. These platforms also don’t require you to train separate TTS and STS models.

There are a couple of downsides to doing it the easy way. One is the lack of control over what’s happening. These platforms do a lot of work behind the scenes to make them as easy as possible to use, but this also means that they can be unpredictable.

In one test, we found that ElevenLabs might alter the pitch drastically if you change the text input in certain ways. In our case, we got a significantly lower pitch simply by exchanging the words “he said slowly” with “he said very, very slowly”. When we tried “he said rapidly”, the pitch increased slightly. You can hear our experiments on our SoundCloud: https://soundcloud.com/vocal-imaginaries/sets/pacing-with-elevenlabs.

Another downside is that you cannot run it locally on a Raspberry Pi or similar, which means that you need to integrate with the service’s API. This both costs money and makes you reliant on a stable internet connection.
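As an illustration of what such an integration can look like, here is a hedged Python sketch of a call to ElevenLabs’ text-to-speech endpoint. The endpoint path and fields follow ElevenLabs’ public REST API at the time of writing, but double-check their API reference before relying on it; the API key and voice ID are placeholders.

```python
API_KEY = "your-elevenlabs-api-key"   # placeholder, found in your account settings
VOICE_ID = "your-cloned-voice-id"     # placeholder, the ID of your cloned voice

def build_tts_request(voice_id: str, text: str) -> tuple[str, dict, dict]:
    """Assemble the URL, headers and JSON payload for a text-to-speech call."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    headers = {"xi-api-key": API_KEY, "Content-Type": "application/json"}
    payload = {"text": text}
    return url, headers, payload

url, headers, payload = build_tts_request(VOICE_ID, "Hello from my cloned voice.")

# Sending the request (requires `pip install requests` and a real API key);
# the response body is the generated audio:
# import requests
# response = requests.post(url, headers=headers, json=payload, timeout=60)
# response.raise_for_status()
# with open("output.mp3", "wb") as f:
#     f.write(response.content)
```

Note that every call like this leaves your device, which is exactly the dependency on connectivity and billing described above.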

The difficult (but powerful) way to do voice cloning

If you want more control of what’s happening, you can train your own models from scratch by using open source software in Google Colab.

When training your own models with the following open source approaches, you need to decide whether you want to train a TTS or STS model. We provide instructions for both here.

To train your own model and build a dataset, you need the following:

  1. A Google account for gaining access to Google Colab
  2. A Google Drive with at least a couple of GBs free
  3. Audacity (or similar) for editing your voice recording dataset
  4. For TTS, you also need a text editor for editing your transcription

Open source TTS

For text-to-speech, I recommend training a model based on the Piper framework. It is built to run on Raspberry Pi, so it’s very flexible, but it can still achieve a high-fidelity voice.

Mateo Cedillo has kindly created a series of Google Colab notebooks for training, running and exporting a Piper model.

To get started, you need to begin with the training notebook: https://colab.research.google.com/github/rmcpantoja/piper/blob/master/notebooks/piper_multilingual_training_notebook.ipynb

Before training, you need to gather your dataset. This requires two parts:

  1. At least 30 minutes of voice recordings, but the more you have, the better.
  2. Transcriptions of what’s being said in those voice recordings.

In the following, we guide you through how to format those datasets.

If you need a script to read from, Microsoft has some open source ones in many languages here: https://github.com/Azure-Samples/Cognitive-Speech-TTS/tree/master/CustomVoice/script. Using a prewritten script makes the transcription process a lot easier.

If you don’t want to record your own data, but want to use some public domain data instead, I recommend looking at the audiobook website LibriVox. The transcriptions for the books can also often be found on Project Gutenberg or similar, which makes the transcription process easier.

Voice recordings

The voice data needs to be split into smaller files with a length of between 2 and 10 seconds.

The files need to be in WAV format: 16-bit, mono, and 16000 Hz or 22050 Hz.
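If you want to verify your exported files programmatically, a small script using Python’s built-in wave module can check the requirements above (the 2–10 second length requirement is included too):

```python
import wave

def check_wav(path: str) -> list[str]:
    """Return a list of problems with a WAV file (empty list = meets the requirements)."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append(f"not mono ({wf.getnchannels()} channels)")
        if wf.getsampwidth() != 2:  # sample width is in bytes, 2 bytes = 16-bit
            problems.append(f"not 16-bit ({wf.getsampwidth() * 8}-bit)")
        if wf.getframerate() not in (16000, 22050):
            problems.append(f"unexpected sample rate ({wf.getframerate()} Hz)")
        duration = wf.getnframes() / wf.getframerate()
        if not 2 <= duration <= 10:
            problems.append(f"duration {duration:.1f} s outside the 2-10 s range")
    return problems

# Demo: a 3-second silent clip that meets all the requirements.
with wave.open("demo.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(22050)
    wf.writeframes(b"\x00\x00" * 22050 * 3)

print(check_wav("demo.wav"))  # → []
```

Running check_wav over your whole wavs folder before uploading can save you a failed training run later.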

In Audacity, you can easily split up a large file into smaller bits based on the silent parts. This YouTube tutorial shows how to do that: https://www.youtube.com/watch?v=-gwLQobwj98

It might require a bit of fiddling with the settings to get it exactly right, but it is one of the easiest ways to do it. If you can’t get the settings perfect, you can supplement the automatic process by editing the labels manually.

When exporting, double-check that the export settings match the format requirements above (WAV, 16-bit, mono, and 16000 Hz or 22050 Hz). The names of the files don’t matter, because you will enter them in the transcription file, but I recommend just using numbers, e.g. 01.wav, 02.wav etc.

Once you have all your files, you need to compress them to a .zip file. To do this properly, you should select all the files and compress them. If you instead put the files into a folder and compress the folder, the Colab notebook will not be able to recognize the files. When you have zipped the file, name it wavs.zip. When unzipping the file, you should get a folder called wavs with all your files inside. If you get a folder with another folder inside it, you have compressed the files the wrong way.
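If you prefer to build the archive programmatically instead of using your OS’s compress function, Python’s zipfile module can do it. This sketch puts every .wav file at the top level of the archive, which matches the “select all the files and compress them” approach described above:

```python
import os
import zipfile

def make_wavs_zip(source_dir: str, out_path: str = "wavs.zip") -> list[str]:
    """Zip every .wav file in source_dir at the top level of the archive,
    the same as selecting the files themselves and compressing them
    (not their parent folder). Returns the archive's file list."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(source_dir)):
            if name.endswith(".wav"):
                # arcname=name keeps the file at the archive root
                zf.write(os.path.join(source_dir, name), arcname=name)
    return zipfile.ZipFile(out_path).namelist()
```

For example, make_wavs_zip("wavs") should return a flat list like ["01.wav", "02.wav", ...]; if you ever see a folder prefix inside a folder prefix, the archive is nested one level too deep.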

Once this is done, you can move onto the transcription process.


Transcriptions

Your audio recordings need to be transcribed, so the machine learning model can learn the connections between text and audio. To do this, we have to create a .txt file called list.txt. For each file in your audio dataset, you need a line that follows the LJSpeech format. The LJSpeech format looks like this:

filepath|transcription

The filepath in your case should include the wavs folder. This is an example of what that can look like:

wavs/01.wav|Chapter one of Eight Girls and a Dog.
wavs/02.wav|This is a librivox recording. All librivox recordings are in the public domain.
wavs/03.wav|For more information or to volunteer, please visit librivox dot org.

Piper also supports building voice models with multiple speakers inside, a.k.a. multispeaker mode. This is a feature that makes it easier to switch between different voices when using the model, since you only have to load one model instead of multiple. Use cases for this include having multiple languages or different modes of expression such as happy, sad, angry, etc. When training a multispeaker model, the transcription format looks a bit different:

filepath|speaker_name|transcription

In the example above, it would be something like this:

wavs/01.wav|speaker1|Chapter one of Eight Girls and a Dog.
wavs/02.wav|speaker2|This is a librivox recording. All librivox recordings are in the public domain.
wavs/03.wav|speaker1|For more information or to volunteer, please visit librivox dot org.
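Since formatting mistakes in list.txt are easy to make and hard to spot, a small validator can catch them before you start training. This sketch checks the two pipe-separated formats shown above (pass multispeaker=True for the three-field format):

```python
def validate_list(path: str = "list.txt", multispeaker: bool = False) -> list[str]:
    """Check each line of the transcription file against the expected
    pipe-separated format. Returns a list of error messages (empty = OK)."""
    expected = 3 if multispeaker else 2
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            parts = line.split("|")
            if len(parts) != expected:
                errors.append(f"line {lineno}: expected {expected} fields, got {len(parts)}")
            elif not parts[0].endswith(".wav"):
                errors.append(f"line {lineno}: first field should be a .wav filepath")
            elif not parts[-1].strip():
                errors.append(f"line {lineno}: empty transcription")
    return errors
```

Run validate_list() right before zipping everything up; an empty return value means every line has the right shape.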

Transcription can be quite a tedious process, but there are various ways to make it easier:

  1. You can use transcription software to help you out. This usually costs money, but might be worth the trouble for you.
  2. You can autogenerate the list of filepaths using terminal commands to get started. There are many ways to do this, and it varies from OS to OS, so I suggest finding your way on the Internet.
  3. If you are using a Librivox recording, you can try to find a .txt format version of the book as a starting point. For example on Project Gutenberg.
  4. If you are using a Microsoft TTS script, you can use that as a starting point. It will most likely still have to be edited to match how your files have been split and named.
  5. You can use the autotranscription feature in the training notebook. This is not recommended as the only solution, but might be a nice way to get started.
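For option 2 above, here is a small Python alternative to terminal commands. It writes a skeleton list.txt with one “filepath|” line per recording, ready for you to fill in the transcriptions:

```python
import os

def make_skeleton(wav_dir: str = "wavs", out_path: str = "list.txt") -> int:
    """Write one '<wav_dir>/<file>|' line per recording, ready to be
    filled in with transcriptions. Returns the number of lines written."""
    names = sorted(n for n in os.listdir(wav_dir) if n.endswith(".wav"))
    with open(out_path, "w", encoding="utf-8") as f:
        for name in names:
            f.write(f"{wav_dir}/{name}|\n")
    return len(names)
```

If you named your files 01.wav, 02.wav etc. as suggested earlier, the skeleton lines come out in the same order you recorded them, which makes transcribing a lot less error-prone.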

If you are using a text editor like VSCode, I suggest learning the keyboard shortcuts for editing multiple lines at the same time.

Once your transcription is ready, you are ready for the next step: training.


Training

The first step in training your voice model is to upload the dataset to Google Drive. You should upload both the wavs.zip and list.txt files.

Once this is done, you can open the Google Colab training notebook: https://colab.research.google.com/github/rmcpantoja/piper/blob/master/notebooks/piper_multilingual_training_notebook.ipynb

Follow the instructions in the notebook, and be sure to:

  1. Change the zip_path to match where you put the dataset.
  2. Make sure that the sample_rate field matches that of your audio files.
  3. Change model_name to something nice and descriptive.
  4. Consider changing the output_path to control where the notebook saves the voice model.

Aside from changing these things, I recommend using all the default training settings.

At the end of the notebook, the training process will start. This process needs to run for at least 8 hours. If the Colab shuts down before that, you should change the action field to “Continue training” and run through the notebook again.

Testing and using the model

If you want to use or test the model, Mateo Cedillo also has you covered. He has created another notebook for this purpose: https://colab.research.google.com/github/rmcpantoja/piper/blob/master/notebooks/piper_inference_(ckpt).ipynb

In this notebook, you have to fill out two fields: model_url_or_id and config_url_or_id.

Inside your voice model folder in Google Drive, you should see a file called config.json. You need to change its sharing settings so that anyone with the link can view it. The link to the file should be entered into the config_url_or_id field.

Inside your voice model folder, you should also see a folder called lightning_logs. Inside that, you will see one or more folders called version_<number>, each containing a folder called checkpoints. This contains your voice model as a .ckpt file. The .ckpt file also needs its sharing settings changed in the same way as the config file, and its link goes into the model_url_or_id field.
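Since both fields accept either a URL or a file ID, it can be handy to pull the ID out of a Drive sharing link. This small helper assumes the standard https://drive.google.com/file/d/<ID>/view?usp=sharing link format:

```python
import re

def drive_file_id(share_link: str) -> str:
    """Extract the file ID from a standard Google Drive sharing link,
    i.e. https://drive.google.com/file/d/<ID>/view?usp=sharing."""
    match = re.search(r"/file/d/([^/]+)", share_link)
    if not match:
        raise ValueError(f"no file ID found in: {share_link}")
    return match.group(1)

print(drive_file_id("https://drive.google.com/file/d/1AbCdEfGh/view?usp=sharing"))
# → 1AbCdEfGh
```

Pasting the full sharing link usually works too, so treat this as a convenience rather than a requirement.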

Once you have those fields filled out, you can run the inference, which is the process of generating speech from text with the model. When running the inference, the notebook will show you a text-to-speech interface as well as some sliders to fiddle with. If you trained a multispeaker model, you should also be able to choose between your speakers in the interface. The default settings for the sliders are pretty stable, but it’s also a lot of fun to mess with them to see what happens. In some cases, this might crash the Colab instance because the voice model has trouble figuring out what to do. If that happens, you can just restart the inference notebook from the beginning.

Using the model on Raspberry Pi

Once you have your Piper model, you can also use it on a Raspberry Pi. To do this, you have to export the model to the ONNX format. You can do that with this Colab Notebook: https://colab.research.google.com/github/rmcpantoja/piper/blob/master/notebooks/piper_model_exporter.ipynb

When you have successfully exported the ONNX model, you can then run it via the TTS software Piper, which is compatible with Raspberry Pi: https://github.com/rhasspy/piper?tab=readme-ov-file
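To sketch what running the exported model can look like from a Python script on the Pi, here is a hedged example that builds a piper command line. The flags follow the piper README at the time of writing; run piper --help on your own installation to confirm, and note that the model filename is a placeholder.

```python
MODEL = "my_voice.onnx"  # placeholder; keep my_voice.onnx.json next to it

def build_piper_command(model_path: str, out_path: str = "out.wav") -> list[str]:
    # Flags as shown in the piper README at the time of writing;
    # run `piper --help` on your installation to confirm.
    return ["piper", "--model", model_path, "--output_file", out_path]

cmd = build_piper_command(MODEL)
print(" ".join(cmd))  # → piper --model my_voice.onnx --output_file out.wav

# On the Pi itself, pipe the text into the command:
# import subprocess
# subprocess.run(cmd, input="Hello from my cloned voice.".encode(), check=True)
```

Because piper reads text from standard input, the same pattern works from a shell pipeline or from any program that can spawn a subprocess.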

Open source STS

Creating an open source STS voice model is much easier than TTS, since you only need the audio data itself. I recommend using the so-vits-svc framework through this Colab notebook, maintained by justinjohn0306 from the synthetic voice website FakeYou: https://colab.research.google.com/github/justinjohn0306/so-vits-svc-fork/blob/main/notebooks/so_vits_svc_fork_4_0.ipynb

The notebook explains much of the process itself, so I won’t go into much detail here.


So, that’s it! Those are some of the ways you can train your own voice model, each with its own ups and downs.

If you have any questions or need help, you can always reach out to me on hello@ada-ada-ada.art, and I’ll see what I can do.