
Text to speech (TTS)

A demo of running TTS models with Python libraries from Hugging Face.


Dependencies

    librosa 
    soundfile 
    speechbrain
    torchaudio

Setting up

from speechbrain.inference import EncoderClassifier # speechbrain.pretrained was deprecated in SpeechBrain 1.0
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

1. Load the Processor and Feature Extraction model

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts") # used like a tokenizer
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts") # predicts speech features (a mel spectrogram) from the tokenized text
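
For instance, the processor turns raw text into the token IDs the model consumes (a brief illustration; the input sentence is arbitrary):

inputs = processor(text="Hello world", return_tensors="pt") # token IDs with shape (1, sequence_length)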

2. Load the Speech Embedding model (Optional)

This model encodes WAV audio files into x-vectors, a popular speaker-embedding representation for speech models. This step is optional: load it only if your dataset does not already provide speaker embeddings in x-vector form and you need to convert the audio yourself (see the worked example at the end of this page).

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb", savedir="pretrained_models/spkrec-xvect-voxceleb")

3. Load a Spectrogram Vocoder

This model converts spectrograms into waveforms. Specifically, the loaded vocoder operates on 80-bin mel spectrograms to reconstruct the audio signal.
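
For example, assuming the standard microsoft/speecht5_hifigan checkpoint that pairs with SpeechT5 (this is why SpeechT5HifiGan is imported above):

vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan") # HiFi-GAN vocoder: 80-bin mel spectrogram -> waveform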

Example

To ensure compatibility with the SpeechT5 model, the audio is first converted to a mono channel and resampled from 44,100 Hz to 16,000 Hz, the input rate the model expects. The processed audio is then embedded into a speaker representation of shape torch.Size([1, 1, 512]), suitable for downstream tasks like speech synthesis or recognition.
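
A minimal end-to-end sketch of that pipeline, reusing the processor, model, classifier, and vocoder loaded above. The file names and the input sentence are placeholders; librosa handles the mono conversion and the resampling to 16,000 Hz in one call:

import librosa
import soundfile as sf
import torch

# load the reference audio as mono and resample it (e.g. from 44,100 Hz) to 16,000 Hz
waveform, sr = librosa.load("speaker.wav", sr=16000, mono=True) # "speaker.wav" is a placeholder

# embed the waveform into a speaker x-vector; encode_batch expects a (batch, time) tensor
with torch.no_grad():
    embedding = classifier.encode_batch(torch.tensor(waveform).unsqueeze(0)) # torch.Size([1, 1, 512])
    embedding = torch.nn.functional.normalize(embedding, dim=2)
speaker_embedding = embedding.squeeze(0) # (1, 512), the shape generate_speech expects

# tokenize the text and synthesize a waveform conditioned on the speaker embedding
inputs = processor(text="Hello, this is a test.", return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

# write the 16 kHz result to disk
sf.write("output.wav", speech.numpy(), samplerate=16000)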
