hf_text2speech

class baf.nlp.text2speech.hf_text2speech.HFText2Speech(agent, model_name, language=None)

Bases: Text2Speech

A Hugging Face Text2Speech.

It loads a Hugging Face text-to-speech model to perform the Text2Speech task.

Parameters:
  • agent (Agent) – The agent instance.

  • model_name (str) – The Hugging Face model name.

  • language (str, optional) – Language code.

_model_name

The Hugging Face model name

Type:

str

_tts

The Transformers text-to-speech pipeline

_tokenizer

The VITS tokenizer. Also supports MMS-TTS.

_model

The complete VITS model

text2speech(text, return_tensor='pt')

Synthesize a text into its corresponding audio speech signal.

Parameters:
  • text (str) – The text to synthesize.

  • return_tensor (str, optional) – Property for the HFText2Speech agent component. If set, tensors are returned instead of lists of Python integers. Acceptable values are:

      • 'tf' – Return TensorFlow tf.constant objects.

      • 'pt' – Return PyTorch torch.Tensor objects.

      • 'np' – Return NumPy np.ndarray objects.

    Property details:

      • name: nlp.text2speech.hf.rt

      • type: str

      • default value: pt

Returns:

The synthesized speech as a dictionary containing 2 keys:

  • audio (np.ndarray) – The generated audio waveform as a NumPy array with dimensions (nb_channels, audio_length), where nb_channels is the number of audio channels (usually 1 for mono) and audio_length is the number of samples in the audio.

  • sampling_rate (int) – The sampling rate, i.e. how many samples correspond to one second of audio.

Return type:

dict
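
The returned dictionary can be consumed without any framework-specific code. The following is a minimal sketch, not part of the library: the helper names speech_to_wav_bytes and duration_seconds are hypothetical, and it assumes that the audio samples are floats in the range [-1.0, 1.0], which this page does not state. It encodes a result of the documented shape as 16-bit PCM WAV bytes using only the Python standard library.

```python
import io
import struct
import wave


def speech_to_wav_bytes(speech):
    """Encode a text2speech()-style result dict as 16-bit PCM WAV bytes.

    Hypothetical helper. Assumes speech["audio"] is indexable as
    (nb_channels, audio_length) and holds floats in [-1.0, 1.0];
    the value range is an assumption, not documented behaviour.
    """
    channels = [[float(sample) for sample in channel] for channel in speech["audio"]]
    frames = bytearray()
    # WAV stores samples interleaved by frame: ch0[0], ch1[0], ch0[1], ...
    for frame in zip(*channels):
        for sample in frame:
            clipped = max(-1.0, min(1.0, sample))  # guard against overflow
            frames += struct.pack("<h", int(clipped * 32767))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(len(channels))
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(int(speech["sampling_rate"]))
        wav_file.writeframes(bytes(frames))
    return buffer.getvalue()


def duration_seconds(speech):
    """Length of the audio in seconds: samples per channel / sampling rate."""
    return len(speech["audio"][0]) / speech["sampling_rate"]


# Stand-in for a real result; in practice this would come from text2speech().
example = {"audio": [[0.0, 0.5, -0.5, 1.0]], "sampling_rate": 16000}
wav_bytes = speech_to_wav_bytes(example)
```

Because the helpers only iterate over the channels, they accept a NumPy array of shape (nb_channels, audio_length) as well as the nested list used in the stand-in. The resulting bytes can be written to a .wav file or sent to a client.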