A few weeks ago, I read about how easy it has become to clone the voices of ordinary people, pop stars, or politicians. From my point of view, one of the better uses was generating songs that outdid the pop stars' originals. Less favorable uses included scams targeting the elderly, such as the so-called grandchild trick.
This piqued my curiosity, and I wanted to find out how easy it has become to clone any voice. And yes, it has really become simple to clone your own voice on your own computer. In this article, I am again using one of the freely available Text-to-Speech (TTS) models from Huggingface. In this case, it is XTTS-v2 (https://huggingface.co/coqui/XTTS-v2).
In this article, I want to describe very simply how an interested person can run this model on their computer.
This article aims to equip readers with the knowledge and tools to explore and experiment with TTS models on their own. By breaking the technical complexities down into actionable steps, it demystifies the process of integrating TTS models into personal projects or professional workflows.
Whether you are a computer scientist, a developer, or simply someone fascinated by the potential of AI, this guide endeavors to provide valuable insights into harnessing the power of AI models on your personal computer.
Introduction to Text-to-Speech and XTTS-v2
Text-to-Speech (TTS) models are technologies that convert written text into audible speech. These models find a wide range of applications, such as assisting visually impaired individuals, powering interactive voice response systems, and enhancing the usability of devices and software applications.
Basics of TTS Technology:
- Text Analysis: The input text is first transformed into an internal form that includes phonetic representations. This step may also involve text normalization, where abbreviations, numbers, and special symbols are converted into words.
- Phonemic Transcription: The text is converted into phonemes, which are the basic sound units of a language.
- Prosody Modeling: This involves modeling the emphasis and intonation of speech to make the speech output sound more natural. This includes the modulation of rhythm, pitch, and speaking speed.
- Speech Synthesis: In this final step, the phonemes and prosodic information are used to generate the actual speech output. This can be done either by digitally processing recorded human voices (concatenative synthesis) or by using generative models that build speech from basic elements (parametric synthesis).
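To make the text-analysis step more concrete, here is a minimal, purely illustrative sketch of text normalization. The abbreviation table and the digit rule are invented for this example; a real TTS front end uses far more extensive dictionaries and rules:

import re

# Tiny, invented abbreviation table; real front ends use much larger dictionaries
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

def normalize(text):
    # Expand known abbreviations into full words
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    # Spell out single digits (toy rule covering 0-9 only)
    digits = ["zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine"]
    return re.sub(r"\b(\d)\b", lambda m: digits[int(m.group(1))], text)

print(normalize("Dr. Smith lives at 5 Baker St."))
# -> Doctor Smith lives at five Baker Street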
TTS systems increasingly utilize machine learning and neural networks to produce more natural and adaptable speech outputs.
Advances in AI and machine learning have made it possible to clone voices and incorporate emotions and specific styles into the speech output, making interactions with voice-driven assistants and other systems much more pleasant and human-like.
The applications of TTS are diverse and continually expanding as new technologies and improved algorithms are developed, enabling ever more realistic and flexible speech synthesis.
Specific Features of XTTS-v2: This model offers voice cloning capabilities with just a six-second audio clip and supports multilingual speech generation across 17 languages including English, Spanish, French, and more. It has made significant improvements over its predecessor, XTTS-v1, in terms of speaker conditioning, stability, prosody, and overall audio quality. XTTS-v2 also enables cross-language voice cloning and emotional style transfer, enhancing the diversity and applicability of its synthesized speech outputs. Please observe the licensing terms of the model.
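By the way, you do not have to use the low-level model API shown later in this article to try these features. The Coqui TTS package also ships a high-level interface that downloads the model on first use (you may be asked to accept the model license). A minimal sketch of cross-language cloning, assuming a short reference clip of the voice is stored at ./samples/test.wav (the same path used later in this article):

from TTS.api import TTS

# Load XTTS-v2 via the high-level Coqui TTS API (downloads the model on first use)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice from the reference clip and speak French text with it
tts.tts_to_file(
    text="Bonjour, ceci est un test de clonage de voix.",
    speaker_wav="./samples/test.wav",
    language="fr",
    file_path="output_fr.wav",
)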
Setting Up Your Environment
Before you start experimenting with XTTS-v2, you need to ensure your computer is properly configured with the necessary tools and files. Here’s a step-by-step guide to get you set up:
Prerequisites
To set up your environment for voice cloning using XTTS-v2, you’ll need a few basic items and software installations:
- WAV Clip of Your Voice: Obtain a short audio recording of your voice. You can create this clip using any standard audio recording software. For example, Mac users can record their voice with QuickTime Player; Apple's support pages include a guide to recording with QuickTime.
- Compiler Installation:
- Mac: Install GCC, which is required for some of the Python packages. Use Homebrew by running the command: brew install gcc.
- Python Installation:
- Ensure you have Python 3.9 installed on your computer. You can download it from python.org or use a package manager like Homebrew on macOS (brew install python@3.9).
- Git Large File Storage (LFS):
- Install Git LFS, which is necessary for downloading the large model files (on macOS: brew install git-lfs). Then enable it by running: git lfs install.
- Repository Clone:
- Clone the XTTS-v2 repository from Huggingface to your local machine using: git clone https://huggingface.co/coqui/XTTS-v2.
- Python Dependencies:
- pip install soundfile
- pip install numpy
- pip install TTS (the Coqui TTS package)
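To confirm that the installation worked, you can run a small check script (the file name check_env.py is just a suggestion) that imports the three packages:

import importlib

# Try to import each required package and report its version if available
for name in ("soundfile", "numpy", "TTS"):
    try:
        module = importlib.import_module(name)
        print(f"{name}: OK ({getattr(module, '__version__', 'version unknown')})")
    except ImportError as error:
        print(f"{name}: MISSING ({error})")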
Setup Instructions
- Prepare the Voice Sample:
- Store your WAV file in the cloned repository so that it can be found at ./samples/test.wav. This file will be used as the reference voice for cloning; a short script for checking the clip follows after this list.
- Environment Configuration:
- Open your terminal or command prompt.
- Navigate to the XTTS-v2 repository folder.
- Ensure all dependencies are installed and environment variables are set (if needed).
- Implement the code from the next section in a file inside the XTTS-v2 repository folder.
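Since XTTS-v2 clones a voice from roughly six seconds of audio, it is worth checking the reference clip before running the model. A minimal sketch, assuming the sample is stored at ./samples/test.wav:

import soundfile as sf

# Read the reference clip and report its duration and sample rate
data, sample_rate = sf.read("./samples/test.wav")
duration = len(data) / sample_rate
print(f"Duration: {duration:.1f} s, sample rate: {sample_rate} Hz")

if duration < 6:
    print("Warning: XTTS-v2 voice cloning works best with at least ~6 seconds of audio.")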
Implementing the Code
Create a Python script named yourfile.py. This script handles loading the model, splitting the text input into parts, and synthesizing the speech output. Here’s the code you need to include:
import soundfile as sf
import numpy as np
import re
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
# Loading the configuration from JSON and initializing the model from its checkpoint
print("Loading model:")
config = XttsConfig()
config.load_json("./config.json") # Load model configuration from JSON
model = Xtts.init_from_config(config) # Initialize the model with the loaded configuration
model.load_checkpoint(config, checkpoint_dir=".", eval=True) # Load model weights from checkpoint for evaluation
model.cpu() # Ensure the model runs on CPU
# Function to split the text into smaller parts based on punctuation marks
def split_text(text, max_length=253):
    sentences = re.split(r'(?<=[.!?])\s', text)  # Split the text at sentence-ending punctuation
    parts = []
    current_part = ""
    for sentence in sentences:
        if len(current_part) + len(sentence) + 1 < max_length:
            current_part += " " + sentence  # Add the sentence to the current part if it still fits
        else:
            if current_part:
                parts.append(current_part)  # Store the current part and start a new one
            current_part = sentence
    if current_part:
        parts.append(current_part)  # Add the last part if not empty
    return parts
# Splitting the initial long text into manageable parts
prompt = """
BBHT Solutions, based in Romania, was founded in 2019 and focuses on software testing, test automation, software development, and M/TEXT development.
"""
prompt_parts = split_text(prompt) # Apply text splitting function to the prompt
# Synthesizing audio data for each text part
audio_data_combined = np.array([])
for part_index, part in enumerate(prompt_parts):
    print(f"Generating text section {part_index + 1}: {part}")
    outputs = model.synthesize(part, config, speaker_wav="./samples/test.wav", gpt_cond_len=3, language="en")  # Generate audio from text
    audio_data = outputs['wav']
    # Ensure the audio data is a numpy array and adjust dimensions if necessary
    if isinstance(audio_data, np.ndarray):
        if len(audio_data.shape) == 1:
            audio_data = np.expand_dims(audio_data, axis=-1)  # Make sure audio data has correct shape
    else:
        print("The audio data is not a numpy array. Current type:", type(audio_data))
        continue
    audio_data_combined = np.concatenate((audio_data_combined, audio_data.flatten()))  # Combine audio from all parts
# Saving the final combined audio data to a WAV file
# XTTS-v2 generates audio at config.audio.output_sample_rate (24 kHz), so write with that rate
sf.write('output.wav', audio_data_combined, config.audio.output_sample_rate)
Executing the Program
python3 yourfile.py
Loading model:
Generating text section 1:
BBHT Solutions, based in Romania, was founded in 2019 and focuses on software testing, test automation, software development, and M/TEXT development.
The result is in the file: output.wav
Performance Insights on Test System
The compute time for generating audio files can vary significantly based on system specifications. For our tests:
Test System Specifications:
Model Name: Mac Studio
Chip: Apple M1 Max
Total Number of CPU-Cores: 10 (8 performance and 2 efficiency)
Total Number of GPU-Cores: 24
Memory: 32 GB
Compute Time of the Example:
Approx. 40 seconds
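If you want to compare the compute time on your own machine, you can wrap the synthesis loop from yourfile.py in a simple timer, for example:

import time

start = time.perf_counter()
# ... run the synthesis loop from yourfile.py here ...
elapsed = time.perf_counter() - start
print(f"Synthesis took {elapsed:.1f} seconds")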