OuteTTS-0.1-350M is a novel text-to-speech synthesis model built on the LLaMa architecture that leverages pure language modeling without external adapters or complex pipelines. It demonstrates that high-quality speech synthesis is achievable through a straightforward approach using crafted prompts and audio tokens.
More details about the model can be found in the original repo.
In this tutorial, we consider how to run the OuteTTS pipeline using OpenVINO.
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
import requests
from pathlib import Path
utility_files = ["skip_kernel_extension.py", "cmd_helper.py", "notebook_utils.py", "pip_helper.py"]
base_utility_url = "https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/"
for utility_file in utility_files:
    if not Path(utility_file).exists():
        r = requests.get(base_utility_url + utility_file)
        with Path(utility_file).open("w") as f:
            f.write(r.text)
helper_files = ["gradio_helper.py", "ov_outetts_helper.py"]
base_helper_url = "https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/outetts-text-to-speech/"
for helper_file in helper_files:
    if not Path(helper_file).exists():
        r = requests.get(base_helper_url + helper_file)
        with Path(helper_file).open("w") as f:
            f.write(r.text)
%load_ext skip_kernel_extension
import platform
from pip_helper import pip_install
pip_install(
    "-q",
    "torch>=2.1",
    "torchaudio",
    "einops",
    "transformers>=4.46.1",
    "loguru",
    "inflect",
    "pesq",
    "torchcrepe",
    "natsort",
    "polars",
    "uroman",
    "mecab-python3",
    "openai-whisper>=20240930",
    "unidic-lite",
    "--extra-index-url",
    "https://download.pytorch.org/whl/cpu",
)
pip_install(
    "-q",
    "gradio>=4.19",
    "openvino>=2024.4.0",
    "tqdm",
    "pyyaml",
    "librosa",
    "soundfile",
    "nncf",
)
pip_install("-q", "git+https://github.com/huggingface/optimum-intel.git", "--extra-index-url", "https://download.pytorch.org/whl/cpu")
if platform.system() == "Darwin":
    pip_install("-q", "numpy<2.0.0")
# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry
from notebook_utils import collect_telemetry
collect_telemetry("outetts-text-to-speech.ipynb")
from cmd_helper import clone_repo
repo_path = clone_repo("https://github.com/edwko/OuteTTS.git", revision="0.3.2")
%pip install -q {repo_path} --extra-index-url https://download.pytorch.org/whl/cpu
OpenVINO supports PyTorch models via conversion to the OpenVINO Intermediate Representation (IR) format. For convenience, we will use the OpenVINO integration with Hugging Face Optimum. 🤗 Optimum Intel is the interface between the 🤗 Transformers and Diffusers libraries and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures.
Among other use cases, Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format, and run inference using OpenVINO Runtime. optimum-cli provides a command-line interface for model conversion and optimization.
General command format:
optimum-cli export openvino --model <model_id_or_path> --task <task> <output_dir>
where task is the task to export the model for; if not specified, the task will be auto-inferred based on the model. You can find a mapping between tasks and model classes in the Optimum TaskManager documentation. Additionally, you can specify weight compression using the --weight-format argument with one of the following options: fp32, fp16, int8, and int4. For int8 and int4, NNCF will be used for weight compression. More details about model export are provided in the Optimum Intel documentation.
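For illustration, a command requesting int8 weight compression for this model could look like the line below (the output directory name is just an example; in this tutorial the export is performed later with the default precision):
optimum-cli export openvino --model OuteAI/OuteTTS-0.1-350M --task text-generation-with-past --weight-format int8 OuteTTS-0.1-350M-ov-int8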
As OuteTTS utilizes a pure language modeling approach, the model conversion process remains the same as for LLaMa-family models converted for text generation.
from cmd_helper import optimum_cli
from pathlib import Path
model_id = "OuteAI/OuteTTS-0.1-350M"
model_dir = Path(model_id.split("/")[-1] + "-ov")
if not model_dir.exists():
    optimum_cli(model_id, model_dir, additional_args={"task": "text-generation-with-past"})
OpenVINO integration with Optimum Intel provides a ready-to-use API for model inference that can be smoothly integrated with transformers-based solutions. For loading the model, we will use the OVModelForCausalLM class, which has an interface compatible with the Transformers LLaMa implementation. A model is loaded with the from_pretrained method. It accepts a path to the model directory or a model_id from the Hugging Face hub (if the model has not been converted to OpenVINO format, conversion will be triggered automatically). Additionally, we can provide an inference device, a quantization config (if the model has not been quantized yet), and a device-specific OpenVINO Runtime configuration. More details about model inference with Optimum Intel can be found in the documentation. We will use OVModelForCausalLM as a replacement for the original AutoModelForCausalLM in InterfaceHF.
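For reference, a minimal sketch of loading the converted model directly with Optimum Intel is shown below; the InterfaceOV helper used in this notebook wraps an equivalent call, and the "CPU" device string here is only an example:
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

# model_dir was produced by the export step above
ov_model = OVModelForCausalLM.from_pretrained(model_dir, device="CPU")
tokenizer = AutoTokenizer.from_pretrained(model_dir)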
from notebook_utils import device_widget
device = device_widget(exclude=["NPU"])
device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
from ov_outetts_helper import InterfaceOV, OVHFModel # noqa: F401
# Uncomment these lines to see pipeline details
# ??InterfaceOV
# ??OVHFModel
interface = InterfaceOV(model_dir, device.value)
making attention of type 'vanilla' with 768 in_channels
Now let's see the model in action. Given input text, the generate method of the interface returns a tensor that represents the output audio with random speaker characteristics.
tts_output = interface.generate(text="Hello, I'm working!", temperature=0.1, repetition_penalty=1.1, max_length=4096)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:2 for open-end generation. The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
import IPython.display as ipd
def play(data, rate=None):
    ipd.display(ipd.Audio(data, rate=rate))
play(tts_output.audio[0].numpy(), rate=tts_output.sr)
Additionally, we can specify a reference voice for generation by providing reference audio and its transcript. interface.create_speaker processes the reference audio and text into a set of features used to describe the voice.
from notebook_utils import download_file
ref_audio_url = "https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/2.wav"
file_path = Path("2.wav")
if not file_path.exists():
    file_path = download_file(ref_audio_url)
play(file_path)
speaker = interface.create_speaker(file_path, "Hello, I can speak pretty well, but sometimes I make some mistakes.")
# Save the speaker to a file
interface.save_speaker(speaker, "speaker.pkl")
# Load the speaker from a file
speaker = interface.load_speaker("speaker.pkl")
# Generate TTS with the custom voice
cloned_output = interface.generate(
    text="This is a cloned voice speaking",
    speaker=speaker,
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
play(cloned_output.audio[0].numpy(), rate=cloned_output.sr)
NNCF enables post-training quantization by adding the quantization layers into the model graph and then using a subset of the training dataset to initialize the parameters of these additional quantization layers. The framework is designed so that modifications to your original training code are minor.
The optimization process contains the following steps:

1. Prepare a calibration dataset for quantization.
2. Run nncf.quantize to obtain a quantized model.
3. Save the INT8 model.

Note: Quantization is a time- and memory-consuming operation. Running the quantization code below may take some time.
from notebook_utils import quantization_widget
to_quantize = quantization_widget()
to_quantize
Checkbox(value=True, description='Quantization')
The first step is to prepare the calibration dataset for quantization. We will utilize the filtered LibriTTS-R dataset, as it was used to train the original model.
%%skip not $to_quantize.value
from datasets import load_dataset
libritts = load_dataset("parler-tts/libritts_r_filtered", "clean", split="test.clean", streaming=True)
Below we run quantization, calling nncf.quantize on the OpenVINO IR model and the collected calibration dataset.
%%skip not $to_quantize.value
import nncf
from functools import partial
import numpy as np
def transform_fn(item, interface):
    # Build the completion prompt for the normalized text and tokenize it
    text_normalized = item["text_normalized"]
    prompt = interface.prompt_processor.get_completion_prompt(text_normalized, interface.language, None)
    encoded = interface.prompt_processor.tokenizer(prompt, return_tensors="np")
    input_ids = encoded["input_ids"]
    attention_mask = encoded["attention_mask"]
    inputs = {"input_ids": input_ids, "attention_mask": attention_mask}
    # Derive position_ids from the attention mask, as expected by the exported model
    position_ids = np.cumsum(attention_mask, axis=1) - 1
    position_ids[attention_mask == 0] = 1
    inputs["position_ids"] = position_ids
    # beam_idx is required by the stateful model; identity mapping is enough for calibration
    batch_size = input_ids.shape[0]
    inputs["beam_idx"] = np.arange(batch_size, dtype=int)
    return inputs
hf_model = OVHFModel(model_dir, device.value).model
dataset = nncf.Dataset(libritts, partial(transform_fn, interface=interface))
quantized_model = nncf.quantize(
    hf_model.model,
    dataset,
    preset=nncf.QuantizationPreset.MIXED,
    model_type=nncf.ModelType.TRANSFORMER,
    ignored_scope=nncf.IgnoredScope(
        patterns=[
            # We need to use ignored scope for this pattern to generate the most efficient model
            "__module.model.layers.*.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention"
        ]
    ),
)
hf_model.model = quantized_model
model_dir_quantized = Path(f"{model_dir}_quantized")
hf_model.save_pretrained(model_dir_quantized)
interface.prompt_processor.tokenizer.save_pretrained(model_dir_quantized)
INFO:nncf:20 ignored nodes were found by patterns in the NNCFGraph
To verify the quality of the quantized model, we will generate outputs from the same texts and speaker used for the non-quantized model. First, we save the quantized model and recreate the pipelines for validation. Then we generate the outputs and compare them with the previously obtained ones.
%%skip not $to_quantize.value
interface_quantized = InterfaceOV(model_dir_quantized, device.value)
making attention of type 'vanilla' with 768 in_channels
%%skip not $to_quantize.value
tts_output_quantized = interface_quantized.generate(text="Hello, I'm working!", temperature=0.1, repetition_penalty=1.1, max_length=4096)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
%%skip not $to_quantize.value
# Non-quantized model output:
play(tts_output.audio[0].numpy(), rate=tts_output.sr)
%%skip not $to_quantize.value
# Quantized model output:
play(tts_output_quantized.audio[0].numpy(), rate=tts_output_quantized.sr)
%%skip not $to_quantize.value
speaker_quantized = interface_quantized.load_speaker("speaker.pkl")
cloned_output_quantized = interface_quantized.generate(
    text="This is a cloned voice speaking",
    speaker=speaker_quantized,
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
%%skip not $to_quantize.value
# Non-quantized model output:
play(cloned_output.audio[0].numpy(), rate=cloned_output.sr)
%%skip not $to_quantize.value
# Quantized model output:
play(cloned_output_quantized.audio[0].numpy(), rate=cloned_output_quantized.sr)
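If you want to keep the generated samples for listening outside the notebook, they can be written to disk with soundfile, which was installed earlier. This is a small sketch with example file names; it assumes the quantized outputs above have been generated:
import soundfile as sf

sf.write("tts_output_fp.wav", tts_output.audio[0].numpy(), tts_output.sr)
sf.write("tts_output_int8.wav", tts_output_quantized.audio[0].numpy(), tts_output_quantized.sr)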
%%skip not $to_quantize.value
import time
import tqdm
def calculate_inference_time(interface, dataset, limit):
    inference_time = []
    for i, item in tqdm.tqdm(enumerate(dataset), total=limit):
        if i > limit:
            break
        start = time.perf_counter()
        _ = interface.generate(
            text=item["text_normalized"],
            max_length=256,
            additional_gen_config={
                "pad_token_id": interface.prompt_processor.tokenizer.eos_token_id
            },
        )
        end = time.perf_counter()
        delta = end - start
        inference_time.append(delta)
    return np.median(inference_time)
interface = InterfaceOV(model_dir, device.value)
limit = 25
fp_inference_time = calculate_inference_time(interface, libritts, limit)
print(f"Original model generate time: {fp_inference_time}")
interface_quantized = InterfaceOV(model_dir_quantized, device.value)
int_inference_time = calculate_inference_time(interface_quantized, libritts, limit)
print(f"Quantized model generate time: {int_inference_time}")
making attention of type 'vanilla' with 768 in_channels
26it [02:05, 4.82s/it]
Original model generate time: 5.085656422015745
making attention of type 'vanilla' with 768 in_channels
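To summarize the benchmark, the two medians can be reduced to a single speedup factor. This is a small sketch for an additional %%skip not $to_quantize.value cell; it assumes both measurements above have completed:
print(f"Performance speedup: {fp_inference_time / int_inference_time:.2f}x")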
import ipywidgets as widgets
quantized_model_present = model_dir_quantized.exists()
use_quantized_model = widgets.Checkbox(
    value=True if quantized_model_present else False,
    description="Use quantized model",
    disabled=False,
)
use_quantized_model
from gradio_helper import make_demo
if use_quantized_model.value:
    demo_interface = InterfaceOV(model_dir_quantized, device.value)
else:
    demo_interface = InterfaceOV(model_dir, device.value)
demo = make_demo(demo_interface)
try:
    demo.launch(debug=True)
except Exception:
    demo.launch(share=True, debug=True)