# Text to speech (TTS)

Source

* [Intro to text-to-speech models](https://huggingface.co/learn/audio-course/en/chapter6/pre-trained_models)

Dependencies

```bash
pip install librosa soundfile speechbrain torchaudio transformers
```
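Before running the rest of the page, it can help to confirm that the packages are importable. The snippet below is a minimal sketch using only the standard library; `REQUIRED` and `missing_packages` are illustrative names, and the list assumes that `transformers` is needed as well because the setup code imports it.

```python
from importlib.util import find_spec

# Import names used on this page (assumption: transformers is included
# because the setup code imports SpeechT5 classes from it).
REQUIRED = ["librosa", "soundfile", "speechbrain", "torchaudio", "transformers"]


def missing_packages(names):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```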

### Setting up

```python
# SpeechBrain 1.0 deprecated `speechbrain.pretrained` in favor of `speechbrain.inference`
from speechbrain.inference import EncoderClassifier
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

```

**1. Load the processor and the text-to-speech model**

```python
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts") # tokenizes the input text, like a tokenizer
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts") # predicts a mel-spectrogram from the token IDs and a speaker embedding
```

**2. Load the speaker embedding model (optional)**

This model encodes a waveform into an x-vector, a fixed-length speaker embedding that is widely used in speech models.\
It is only needed when your dataset does not already include x-vectors and you have to compute them from the audio files.

```python
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb", savedir="pretrained_models/spkrec-xvect-voxceleb")
```


**3. Load the vocoder**

This model converts spectrograms into waveforms. Specifically, the loaded HiFi-GAN vocoder takes 80-bin mel-spectrograms and reconstructs the audio signal from them.

<pre class="language-python"><code class="lang-python"><strong># loading the vocoder model 
</strong><strong>vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
</strong></code></pre>
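To get a feel for how an 80-bin mel-spectrogram relates to the waveform the vocoder produces, the sketch below works through the frame-to-sample arithmetic. It assumes the default SpeechT5 HiFi-GAN configuration (four upsampling stages of factor 4 and a 16 kHz output rate); these constants are assumptions, so check `vocoder.config` for the values of the checkpoint you actually load. The function names are illustrative.

```python
import math

# Assumed defaults of the SpeechT5 HiFi-GAN vocoder; verify against vocoder.config.
UPSAMPLE_RATES = (4, 4, 4, 4)
SAMPLING_RATE = 16_000


def samples_per_frame(upsample_rates=UPSAMPLE_RATES):
    """Number of waveform samples generated for each spectrogram frame."""
    return math.prod(upsample_rates)


def waveform_length(n_frames, upsample_rates=UPSAMPLE_RATES):
    """Approximate number of samples produced for n_frames spectrogram frames."""
    return n_frames * samples_per_frame(upsample_rates)


def duration_seconds(n_frames, sampling_rate=SAMPLING_RATE):
    """Approximate duration of the generated audio in seconds."""
    return waveform_length(n_frames) / sampling_rate


if __name__ == "__main__":
    frames = 100
    print(f"{frames} frames x {samples_per_frame()} samples/frame "
          f"= {waveform_length(frames)} samples, about {duration_seconds(frames):.2f} s")
```

Under these assumptions, each spectrogram frame corresponds to 256 samples, or 16 ms of audio.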

### Example

```python
# Load the pkgs

import torchaudio
import torchaudio.transforms as T
from IPython.display import Audio
```

{% code overflow="wrap" %}

```python
# Load a reference .wav file; its x-vector is the conditioning feature that tells the TTS model which voice to use

sound_path = "01-00.04.75_00.07.46.wav"

# Load your .wav file
signal, fs = torchaudio.load(sound_path)

print(f"Shape of signal: {signal.shape} with resample {fs}Hz. ")

if signal.size(0) == 2:
    print("Converting the signal to process as mono channel waveform")
    signal = signal.mean(dim=0, keepdim=True)

if fs != 16000:
    print(f"Resampling from {fs}Hz to 16000Hz")
    resampler = T.Resample(orig_freq=fs, new_freq=16000)
    signal = resampler(signal)
    fs = 16000  # Update fs to the new sample rate

# Extract the x-vector with the speaker encoder.
# NOTE: the source file is stereo (2 channels); it was averaged into a mono waveform above.
embedding = classifier.encode_batch(signal)

# To get numpy vector
# xvector = embedding.squeeze().detach().cpu().numpy()
print(f"Embedding shape: {embedding.shape}")
```

{% endcode %}

{% code title="Output" overflow="wrap" %}

```text
Shape of signal: torch.Size([2, 119511]) with resample 44100Hz. 
Converting the signal to process as mono channel waveform
Resampling from 44100Hz to 16000Hz
Embedding shape: torch.Size([1, 1, 512])
```

{% endcode %}

The x-vector speaker encoder expects mono audio sampled at 16,000 Hz, so the stereo signal is first averaged into a single channel and then resampled from 44,100 Hz to 16,000 Hz. The encoder returns a speaker embedding of shape `torch.Size([1, 1, 512])`, which conditions SpeechT5 on the reference speaker's voice during synthesis.
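The numbers in the output above can be checked by hand. The following is a minimal, standard-library sketch of the arithmetic behind the two preprocessing steps; the helper names are illustrative, and the rounded-up length is an assumption about how the resampler sizes its output.

```python
def duration_seconds(num_samples, sample_rate):
    """Length of a clip in seconds."""
    return num_samples / sample_rate


def resampled_length(num_samples, orig_rate, new_rate):
    """Expected sample count after resampling, rounded up (assumed behavior)."""
    return (num_samples * new_rate + orig_rate - 1) // orig_rate


def downmix_to_mono(channels):
    """Average equally long channels sample by sample, like signal.mean(dim=0)."""
    return [sum(frame) / len(frame) for frame in zip(*channels)]


if __name__ == "__main__":
    n, fs = 119511, 44100  # values from the printed signal shape
    print(f"Clip duration: {duration_seconds(n, fs):.2f} s")
    print(f"Samples at 16 kHz: {resampled_length(n, fs, 16000)}")
    print(f"Mono downmix of a tiny stereo clip: {downmix_to_mono([[1.0, 0.0], [0.0, 1.0]])}")
```

The clip is about 2.71 seconds long, so at 16 kHz it shrinks from 119,511 to 43,360 samples per channel while keeping the same duration.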

{% code overflow="wrap" %}

```python
# Text to synthesize
inputs = processor(text="The aroma of fresh coffee filled the room, making it the perfect start to the day. He paused at the edge of the lake, staring at the still water, reflecting the clear blue sky above.", return_tensors="pt")
# Run the model
speech = model.generate_speech(inputs["input_ids"], embedding.squeeze(0), vocoder=vocoder)
```

{% endcode %}

<pre class="language-python"><code class="lang-python"><strong># to play in a notebook (SpeechT5 generates 16 kHz audio)
</strong><strong>Audio(speech, rate=16000)
</strong># to save the audio file with torchaudio (expects a [channels, time] tensor)
torchaudio.save("output.wav", speech.unsqueeze(0), 16000)
</code></pre>
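If no torchaudio backend is available for saving, the waveform can also be written with Python's built-in `wave` module. This is a hedged alternative rather than the approach used above: `float_to_pcm16` and `write_wav` are hypothetical helpers, and the sketch assumes mono float samples in the range [-1, 1] at 16 kHz.

```python
import io
import struct
import wave


def float_to_pcm16(samples):
    """Clip floats to [-1, 1] and scale them to signed 16-bit integers."""
    return [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]


def write_wav(target, samples, sample_rate=16_000):
    """Write mono float samples as 16-bit PCM to a path or file-like object."""
    pcm = float_to_pcm16(samples)
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


if __name__ == "__main__":
    buffer = io.BytesIO()
    write_wav(buffer, [0.0, 0.5, -0.5, 0.25])
    print(f"Wrote {len(buffer.getvalue())} bytes of WAV data")
```

With the generated tensor from above, a call such as `write_wav("output.wav", speech.tolist())` would be one way to use it.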

