OuteTTS-0.1-350M is a novel text-to-speech synthesis model built on the LLaMa architecture that leverages pure language modeling without external adapters or complex pipelines. It demonstrates that high-quality speech synthesis is achievable through a straightforward approach using crafted prompts and audio tokens.
More details about the model can be found in the original repo.
In this tutorial, we consider how to run the OuteTTS pipeline using OpenVINO.
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
import requests
from pathlib import Path
utility_files = ["skip_kernel_extension.py", "cmd_helper.py", "notebook_utils.py", "pip_helper.py"]
base_utility_url = "https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/"
for utility_file in utility_files:
    if not Path(utility_file).exists():
        r = requests.get(base_utility_url + utility_file)
        with Path(utility_file).open("w") as f:
            f.write(r.text)
helper_files = ["gradio_helper.py", "ov_outetts_helper.py"]
base_helper_url = "https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/outetts-text-to-speech/"
for helper_file in helper_files:
    if not Path(helper_file).exists():
        r = requests.get(base_helper_url + helper_file)
        with Path(helper_file).open("w") as f:
            f.write(r.text)
%load_ext skip_kernel_extension
import platform
from pip_helper import pip_install
pip_install(
    "-q",
    "torch>=2.1",
    "torchaudio",
    "einops",
    "transformers>=4.46.1",
    "loguru",
    "inflect",
    "pesq",
    "torchcrepe",
    "natsort",
    "polars",
    "uroman",
    "mecab-python3",
    "openai-whisper>=20240930",
    "unidic-lite",
    "--extra-index-url",
    "https://download.pytorch.org/whl/cpu",
)
pip_install(
    "-q",
    "gradio>=4.19",
    "openvino>=2024.4.0",
    "tqdm",
    "pyyaml",
    "librosa",
    "soundfile",
    "nncf",
)
pip_install("-q", "git+https://github.com/huggingface/optimum-intel.git", "--extra-index-url", "https://download.pytorch.org/whl/cpu")
if platform.system() == "Darwin":
    pip_install("-q", "numpy<2.0.0")
# Read more about telemetry collection at https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-telemetry
from notebook_utils import collect_telemetry
collect_telemetry("outetts-text-to-speech.ipynb")
from cmd_helper import clone_repo
repo_path = clone_repo("https://github.com/edwko/OuteTTS.git", revision="0.3.2")
%pip install -q {repo_path} --extra-index-url https://download.pytorch.org/whl/cpu
OpenVINO supports PyTorch models via conversion to the OpenVINO Intermediate Representation (IR) format. For convenience, we will use the OpenVINO integration with Hugging Face Optimum. 🤗 Optimum Intel is the interface between the 🤗 Transformers and Diffusers libraries and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures.
Among other use cases, Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format, and run inference using OpenVINO Runtime. optimum-cli provides a command-line interface for model conversion and optimization.
General command format:
optimum-cli export openvino --model <model_id_or_path> --task <task> <output_dir>
where task is the task to export the model for; if not specified, the task will be auto-inferred based on the model. You can find a mapping between tasks and model classes in the Optimum TaskManager documentation. Additionally, you can specify weight compression using the --weight-format argument with one of the following options: fp32, fp16, int8, and int4. For int8 and int4, NNCF will be used for weight compression. More details about model export are provided in the Optimum Intel documentation.
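For illustration, a command requesting int8 weight compression for this model could look like the line below (the output directory name is just an example; in this tutorial the export is performed later with the default precision):
optimum-cli export openvino --model OuteAI/OuteTTS-0.1-350M --task text-generation-with-past --weight-format int8 OuteTTS-0.1-350M-ov-int8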
As OuteTTS utilizes a pure language modeling approach, the model conversion process remains the same as for LLaMa-family models converted for text generation.
from cmd_helper import optimum_cli
from pathlib import Path
model_id = "OuteAI/OuteTTS-0.1-350M"
model_dir = Path(model_id.split("/")[-1] + "-ov")
if not model_dir.exists():
    optimum_cli(model_id, model_dir, additional_args={"task": "text-generation-with-past"})
OpenVINO integration with Optimum Intel provides a ready-to-use API for model inference that can be smoothly integrated with transformers-based solutions. For loading the model, we will use the OVModelForCausalLM class, which has an interface compatible with the Transformers LLaMa implementation. A model is loaded with the from_pretrained method. It accepts a path to the model directory or a model_id from the Hugging Face hub (if the model has not been converted to OpenVINO format, conversion will be triggered automatically). Additionally, we can provide an inference device, a quantization config (if the model has not been quantized yet), and a device-specific OpenVINO Runtime configuration. More details about model inference with Optimum Intel can be found in the documentation. We will use OVModelForCausalLM as a replacement for the original AutoModelForCausalLM in InterfaceHF.
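For reference, a minimal sketch of loading the converted model directly with Optimum Intel is shown below; the InterfaceOV helper used in this notebook wraps an equivalent call, and the "CPU" device string here is only an example:
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

# model_dir was produced by the export step above
ov_model = OVModelForCausalLM.from_pretrained(model_dir, device="CPU")
tokenizer = AutoTokenizer.from_pretrained(model_dir)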
from notebook_utils import device_widget
device = device_widget(exclude=["NPU"])
device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
from ov_outetts_helper import InterfaceOV, OVHFModel # noqa: F401
# Uncomment these lines to see pipeline details
# ??InterfaceOV
# ??OVHFModel
interface = InterfaceOV(model_dir, device.value)
making attention of type 'vanilla' with 768 in_channels
Now let's see the model in action. Given input text, the generate method of the interface returns a tensor that represents the output audio with random speaker characteristics.
tts_output = interface.generate(text="Hello, I'm working!", temperature=0.1, repetition_penalty=1.1, max_length=4096)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:2 for open-end generation. The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
import IPython.display as ipd
def play(data, rate=None):
    ipd.display(ipd.Audio(data, rate=rate))
play(tts_output.audio[0].numpy(), rate=tts_output.sr)
Additionally, we can specify a reference voice for generation by providing reference audio and its transcript. interface.create_speaker processes the reference audio and text into a set of features used to describe the voice.
from notebook_utils import download_file
ref_audio_url = "https://huggingface.co/OuteAI/OuteTTS-0.1-350M/resolve/main/samples/2.wav"
file_path = Path("2.wav")
if not file_path.exists():
    file_path = download_file(ref_audio_url)
play(file_path)
speaker = interface.create_speaker(file_path, "Hello, I can speak pretty well, but sometimes I make some mistakes.")
# Save the speaker to a file
interface.save_speaker(speaker, "speaker.pkl")
# Load the speaker from a file
speaker = interface.load_speaker("speaker.pkl")
# Generate TTS with the custom voice
cloned_output = interface.generate(
    text="This is a cloned voice speaking",
    speaker=speaker,
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
play(cloned_output.audio[0].numpy(), rate=cloned_output.sr)
NNCF enables post-training quantization by adding the quantization layers into the model graph and then using a subset of the training dataset to initialize the parameters of these additional quantization layers. The framework is designed so that modifications to your original training code are minor.
The optimization process contains the following steps:

1. Prepare a calibration dataset for quantization.
2. Run nncf.quantize to obtain a quantized model.
3. Save the INT8 model.

Note: Quantization is a time- and memory-consuming operation. Running the quantization code below may take some time.
from notebook_utils import quantization_widget
to_quantize = quantization_widget()
to_quantize
Checkbox(value=True, description='Quantization')
The first step is to prepare the calibration dataset for quantization. We will utilize the filtered LibriTTS-R dataset, as it was used to train the original model.
%%skip not $to_quantize.value
from datasets import load_dataset
libritts = load_dataset("parler-tts/libritts_r_filtered", "clean", split="test.clean", streaming=True)
Below we run quantization, calling nncf.quantize on the OpenVINO IR model and the collected calibration dataset.
%%skip not $to_quantize.value
import nncf
from functools import partial
import numpy as np
def transform_fn(item, interface):
    # Build the completion prompt for the normalized text and tokenize it
    text_normalized = item["text_normalized"]
    prompt = interface.prompt_processor.get_completion_prompt(text_normalized, interface.language, None)
    encoded = interface.prompt_processor.tokenizer(prompt, return_tensors="np")
    input_ids = encoded["input_ids"]
    attention_mask = encoded["attention_mask"]
    inputs = {"input_ids": input_ids, "attention_mask": attention_mask}
    # Derive position_ids from the attention mask, as expected by the exported model
    position_ids = np.cumsum(attention_mask, axis=1) - 1
    position_ids[attention_mask == 0] = 1
    inputs["position_ids"] = position_ids
    # beam_idx is required by the stateful model; identity mapping is enough for calibration
    batch_size = input_ids.shape[0]
    inputs["beam_idx"] = np.arange(batch_size, dtype=int)
    return inputs
hf_model = OVHFModel(model_dir, device.value).model
dataset = nncf.Dataset(libritts, partial(transform_fn, interface=interface))
quantized_model = nncf.quantize(
    hf_model.model,
    dataset,
    preset=nncf.QuantizationPreset.MIXED,
    model_type=nncf.ModelType.TRANSFORMER,
    ignored_scope=nncf.IgnoredScope(
        patterns=[
            # We need to use ignored scope for this pattern to generate the most efficient model
            "__module.model.layers.*.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention"
        ]
    ),
)
hf_model.model = quantized_model
model_dir_quantized = Path(f"{model_dir}_quantized")
hf_model.save_pretrained(model_dir_quantized)
interface.prompt_processor.tokenizer.save_pretrained(model_dir_quantized)
INFO:nncf:20 ignored nodes were found by patterns in the NNCFGraph
To verify the quality of the quantized model, we will generate outputs from the same texts and speaker used for the non-quantized model. First, we save the quantized model and recreate the pipelines for validation. Then we generate the outputs and compare them with the previously obtained ones.
%%skip not $to_quantize.value
interface_quantized = InterfaceOV(model_dir_quantized, device.value)
making attention of type 'vanilla' with 768 in_channels
%%skip not $to_quantize.value
tts_output_quantized = interface_quantized.generate(text="Hello, I'm working!", temperature=0.1, repetition_penalty=1.1, max_length=4096)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
%%skip not $to_quantize.value
# Non-quantized model output:
play(tts_output.audio[0].numpy(), rate=tts_output.sr)
%%skip not $to_quantize.value
# Quantized model output:
play(tts_output_quantized.audio[0].numpy(), rate=tts_output_quantized.sr)
%%skip not $to_quantize.value
speaker_quantized = interface_quantized.load_speaker("speaker.pkl")
cloned_output_quantized = interface_quantized.generate(
    text="This is a cloned voice speaking",
    speaker=speaker_quantized,
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
%%skip not $to_quantize.value
# Non-quantized model output:
play(cloned_output.audio[0].numpy(), rate=cloned_output.sr)
%%skip not $to_quantize.value
# Quantized model output:
play(cloned_output_quantized.audio[0].numpy(), rate=cloned_output_quantized.sr)
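If you want to keep the generated samples for listening outside the notebook, they can be written to disk with soundfile, which was installed earlier. This is a small sketch with example file names; it assumes the quantized outputs above have been generated:
import soundfile as sf

sf.write("tts_output_fp.wav", tts_output.audio[0].numpy(), tts_output.sr)
sf.write("tts_output_int8.wav", tts_output_quantized.audio[0].numpy(), tts_output_quantized.sr)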
%%skip not $to_quantize.value
import time
import tqdm
def calculate_inference_time(interface, dataset, limit):
    inference_time = []
    for i, item in tqdm.tqdm(enumerate(dataset), total=limit):
        if i > limit:
            break
        start = time.perf_counter()
        _ = interface.generate(
            text=item["text_normalized"],
            max_length=256,
            additional_gen_config={
                "pad_token_id": interface.prompt_processor.tokenizer.eos_token_id
            },
        )
        end = time.perf_counter()
        delta = end - start
        inference_time.append(delta)
    return np.median(inference_time)
interface = InterfaceOV(model_dir, device.value)
limit = 25
fp_inference_time = calculate_inference_time(interface, libritts, limit)
print(f"Original model generate time: {fp_inference_time}")
interface_quantized = InterfaceOV(model_dir_quantized, device.value)
int_inference_time = calculate_inference_time(interface_quantized, libritts, limit)
print(f"Quantized model generate time: {int_inference_time}")
making attention of type 'vanilla' with 768 in_channels
26it [02:05, 4.82s/it]
Original model generate time: 5.085656422015745
making attention of type 'vanilla' with 768 in_channels
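To summarize the benchmark, the two medians can be reduced to a single speedup factor. This is a small sketch for an additional %%skip not $to_quantize.value cell; it assumes both measurements above have completed:
print(f"Performance speedup: {fp_inference_time / int_inference_time:.2f}x")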
import ipywidgets as widgets
quantized_model_present = model_dir_quantized.exists()
use_quantized_model = widgets.Checkbox(
    value=True if quantized_model_present else False,
    description="Use quantized model",
    disabled=False,
)
use_quantized_model
from gradio_helper import make_demo
if use_quantized_model.value:
    demo_interface = InterfaceOV(model_dir_quantized, device.value)
else:
    demo_interface = InterfaceOV(model_dir, device.value)
demo = make_demo(demo_interface)
try:
    demo.launch(debug=True)
except Exception:
    demo.launch(share=True, debug=True)