Notebook

Notebook 4: TTS Workflow¶

We have the exact podcast transcripts ready now to generate our audio for the Podcast.

In this notebook, we will learn how to generate Audio using both suno/bark and parler-tts/parler-tts-mini-v1 models first.

After that, we will use the output from Notebook 3 to generate our complete podcast

Note: Please feel free to extend this notebook with newer models. The above two were chosen after some tests using a sample prompt.

⚠️ Warning: This notebook likes have transformers version to be 4.43.3 or earlier so we will downgrade our environment to make sure things run smoothly

Credit: This Colab was used for starter code

We can install these packages for speedups

In [1]:

#!pip3 install optimum
#!pip install -U flash-attn --no-build-isolation
#!pip install transformers==4.43.3

Let's import the necessary frameworks

In [2]:

from IPython.display import Audio
import IPython.display as ipd
from tqdm import tqdm

In [3]:

from transformers import BarkModel, AutoProcessor, AutoTokenizer
import torch
import json
import numpy as np
from parler_tts import ParlerTTSForConditionalGeneration

Flash attention 2 is not installed

Testing the Audio Generation¶

Let's try generating audio using both the models to understand how they work.

Note the subtle differences in prompting:

Parler: Takes in a description prompt that can be used to set the speaker profile and generation speeds
Suno: Takes in expression words like [sigh], [laughs] etc. You can find more notes on the experiments that were run for this notebook in the TTS_Notes.md file to learn more.

Please set device = "cuda" below if you're using a single GPU node.

Parler Model¶

Let's try using the Parler Model first and generate a short segment with speaker Laura's voice

In [4]:

# Set up device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# Define text and description
text_prompt = """
Exactly! And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
"""
description = """
Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.
"""
# Tokenize inputs
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(text_prompt, return_tensors="pt").input_ids.to(device)

# Generate audio
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()

# Play audio in notebook
ipd.Audio(audio_arr, rate=model.config.sampling_rate)

config.json:   0%|          | 0.00/6.93k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.51G [00:00<?, ?B/s]

/opt/conda/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
Config of the text_encoder: <class 'transformers.models.t5.modeling_t5.T5EncoderModel'> is overwritten by shared text_encoder config: T5Config {
  "_name_or_path": "google/flan-t5-large",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 2816,
  "d_kv": 64,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 24,
  "num_heads": 16,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "transformers_version": "4.46.1",
  "use_cache": true,
  "vocab_size": 32128
}

Config of the audio_encoder: <class 'parler_tts.dac_wrapper.modeling_dac.DACModel'> is overwritten by shared audio_encoder config: DACConfig {
  "_name_or_path": "parler-tts/dac_44khZ_8kbps",
  "architectures": [
    "DACModel"
  ],
  "codebook_size": 1024,
  "frame_rate": 86,
  "latent_dim": 1024,
  "model_bitrate": 8,
  "model_type": "dac_on_the_hub",
  "num_codebooks": 9,
  "sampling_rate": 44100,
  "torch_dtype": "float32",
  "transformers_version": "4.46.1"
}

Config of the decoder: <class 'parler_tts.modeling_parler_tts.ParlerTTSForCausalLM'> is overwritten by shared decoder config: ParlerTTSDecoderConfig {
  "_name_or_path": "/fsx/yoach/tmp/artefacts/parler-tts-mini/decoder",
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_cross_attention": true,
  "architectures": [
    "ParlerTTSForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1025,
  "codebook_weights": null,
  "cross_attention_implementation_strategy": null,
  "dropout": 0.1,
  "eos_token_id": 1024,
  "ffn_dim": 4096,
  "hidden_size": 1024,
  "initializer_factor": 0.02,
  "is_decoder": true,
  "layerdrop": 0.0,
  "max_position_embeddings": 4096,
  "model_type": "parler_tts_decoder",
  "num_attention_heads": 16,
  "num_codebooks": 9,
  "num_cross_attention_key_value_heads": 16,
  "num_hidden_layers": 24,
  "num_key_value_heads": 16,
  "pad_token_id": 1024,
  "rope_embeddings": false,
  "rope_theta": 10000.0,
  "scale_embedding": false,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.46.1",
  "use_cache": true,
  "use_fused_lm_heads": false,
  "vocab_size": 1088
}

generation_config.json:   0%|          | 0.00/265 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/20.8k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

Out[4]:

Bark Model¶

Amazing, let's try the same with bark now:

We will set the voice_preset to our favorite speaker
This time we can include expression prompts inside our generation prompt
Note you can CAPTILISE words to make the model emphasise on these
You can add hyphens to make the model pause on certain words

In [5]:

voice_preset = "v2/en_speaker_6"
sampling_rate = 24000

In [6]:

device = "cuda"

processor = AutoProcessor.from_pretrained("suno/bark")

#model =  model.to_bettertransformer()
#model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16, attn_implementation="flash_attention_2").to(device)
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to(device)#.to_bettertransformer()

tokenizer_config.json:   0%|          | 0.00/353 [00:00<?, ?B/s]

speaker_embeddings_path.json:   0%|          | 0.00/61.1k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.92M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/8.81k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/4.49G [00:00<?, ?B/s]

/opt/conda/lib/python3.11/site-packages/transformers/models/encodec/modeling_encodec.py:124: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.register_buffer("padding_total", torch.tensor(kernel_size - stride, dtype=torch.int64), persistent=False)

generation_config.json:   0%|          | 0.00/4.91k [00:00<?, ?B/s]

In [7]:

text_prompt = """
Exactly! [sigh] And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.
"""
inputs = processor(text_prompt, voice_preset=voice_preset).to(device)

speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8)
Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)

en_speaker_6_semantic_prompt.npy:   0%|          | 0.00/2.60k [00:00<?, ?B/s]

en_speaker_6_coarse_prompt.npy:   0%|          | 0.00/7.55k [00:00<?, ?B/s]

en_speaker_6_fine_prompt.npy:   0%|          | 0.00/15.0k [00:00<?, ?B/s]

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.

Out[7]:

Bringing it together: Making the Podcast¶

Okay now that we understand everything-we can now use the complete pipeline to generate the entire podcast

Let's load in our pickle file from earlier and proceed:

In [8]:

import pickle

with open('./resources/podcast_ready_data.pkl', 'rb') as file:
    PODCAST_TEXT = pickle.load(file)

Let's define load in the bark model and set it's hyper-parameters for discussions

In [9]:

bark_processor = AutoProcessor.from_pretrained("suno/bark")
bark_model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to("cuda")
bark_sampling_rate = 24000

Now for the Parler model:

In [10]:

parler_model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to("cuda")
parler_tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

Config of the text_encoder: <class 'transformers.models.t5.modeling_t5.T5EncoderModel'> is overwritten by shared text_encoder config: T5Config {
  "_name_or_path": "google/flan-t5-large",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "classifier_dropout": 0.0,
  "d_ff": 2816,
  "d_kv": 64,
  "d_model": 1024,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 24,
  "num_heads": 16,
  "num_layers": 24,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "transformers_version": "4.46.1",
  "use_cache": true,
  "vocab_size": 32128
}

Config of the audio_encoder: <class 'parler_tts.dac_wrapper.modeling_dac.DACModel'> is overwritten by shared audio_encoder config: DACConfig {
  "_name_or_path": "parler-tts/dac_44khZ_8kbps",
  "architectures": [
    "DACModel"
  ],
  "codebook_size": 1024,
  "frame_rate": 86,
  "latent_dim": 1024,
  "model_bitrate": 8,
  "model_type": "dac_on_the_hub",
  "num_codebooks": 9,
  "sampling_rate": 44100,
  "torch_dtype": "float32",
  "transformers_version": "4.46.1"
}

Config of the decoder: <class 'parler_tts.modeling_parler_tts.ParlerTTSForCausalLM'> is overwritten by shared decoder config: ParlerTTSDecoderConfig {
  "_name_or_path": "/fsx/yoach/tmp/artefacts/parler-tts-mini/decoder",
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_cross_attention": true,
  "architectures": [
    "ParlerTTSForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1025,
  "codebook_weights": null,
  "cross_attention_implementation_strategy": null,
  "dropout": 0.1,
  "eos_token_id": 1024,
  "ffn_dim": 4096,
  "hidden_size": 1024,
  "initializer_factor": 0.02,
  "is_decoder": true,
  "layerdrop": 0.0,
  "max_position_embeddings": 4096,
  "model_type": "parler_tts_decoder",
  "num_attention_heads": 16,
  "num_codebooks": 9,
  "num_cross_attention_key_value_heads": 16,
  "num_hidden_layers": 24,
  "num_key_value_heads": 16,
  "pad_token_id": 1024,
  "rope_embeddings": false,
  "rope_theta": 10000.0,
  "scale_embedding": false,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.46.1",
  "use_cache": true,
  "use_fused_lm_heads": false,
  "vocab_size": 1088
}

In [11]:

speaker1_description = """
Laura's voice is expressive and dramatic in delivery, speaking at a moderately fast pace with a very close recording that almost has no background noise.
"""
speaker2_description = """
Toms's voice is smooth and suave in delivery, speaking at a slow, methodic pace with a very close recording that almost has no background noise.
"""

We will concatenate the generated segments of audio and also their respective sampling rates since we will require this to generate the final audio

In [12]:

generated_segments = []
sampling_rates = []  # We'll need to keep track of sampling rates for each segment

In [13]:

device="cuda"

Function generate text for speaker 1

In [14]:

def generate_speaker1_audio(text):
    """Generate audio using ParlerTTS for Speaker 1"""
    input_ids = parler_tokenizer(speaker1_description, return_tensors="pt").input_ids.to(device)
    prompt_input_ids = parler_tokenizer(text, return_tensors="pt").input_ids.to(device)
    generation = parler_model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio_arr = generation.cpu().numpy().squeeze()
    return audio_arr, parler_model.config.sampling_rate

Function to generate text for speaker 2

In [15]:

# def generate_speaker2_audio(text):
#     """Generate audio using ParlerTTS for Speaker 2"""
#     input_ids = parler_tokenizer(speaker2_description, return_tensors="pt").input_ids.to(device)
#     prompt_input_ids = parler_tokenizer(text, return_tensors="pt").input_ids.to(device)
#     generation = parler_model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
#     audio_arr = generation.cpu().numpy().squeeze()
#     return audio_arr, parler_model.config.sampling_rate
def generate_speaker2_audio(text):
    """Generate audio using Bark for Speaker 2"""
    inputs = bark_processor(text, voice_preset="v2/en_speaker_6").to(device)
    speech_output = bark_model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)
    audio_arr = speech_output[0].cpu().numpy()
    return audio_arr, bark_sampling_rate

Helper function to convert the numpy output from the models into audio

In [16]:

import io
from scipy.io import wavfile
from pydub import AudioSegment

def numpy_to_audio_segment(audio_arr, sampling_rate):
    """Convert numpy array to AudioSegment"""
    # Convert to 16-bit PCM
    audio_int16 = (audio_arr * 32767).astype(np.int16)
    
    # Create WAV file in memory
    byte_io = io.BytesIO()
    
    wavfile.write(byte_io, sampling_rate, audio_int16)
    byte_io.seek(0)
    
    # Convert to AudioSegment
    return AudioSegment.from_wav(byte_io)

In [19]:

PODCAST_TEXT[:-1]

Out[19]:

'[\n ("Speaker 1", "Welcome to \'The Think Tank\'! I\'m Rachel, and I\'m thrilled to have my co-host, Alex, joining me on this thought-provoking journey. We\'re diving into the fascinating world of artificial intelligence, and I\'m excited to share with you the latest advancements and insights in this field. Alex, let\'s start with the basics. What\'s your take on AI?"),\n ("Speaker 2", "I think AI is like the ultimate game-changer. We\'re already using chatbots to book our flights and hotels. But with the rise of natural language processing, AI is becoming increasingly sophisticated. It\'s like having a super-smart personal assistant that can learn and adapt to our needs. Umm, like Alexa or Siri?"),\n ("Speaker 1", "That\'s right, Alex. And I love the analogy of a personal assistant. You know, I was talking to a friend who\'s a tech entrepreneur, and he said that AI is like the ultimate enabler. It\'s allowing companies to automate tasks that were previously done by humans, freeing up resources for more strategic and creative work. Hmm, it\'s amazing to think about how much more efficient our daily lives could be with AI."),\n ("Speaker 2", "Hmmm, right... Yeah, that makes sense. I mean, we\'ve seen companies like Netflix and Amazon using AI to personalize their services. It\'s like having a super-smart algorithm that knows exactly what you want before you even ask for it. [laughs] I mean, have you ever ordered something on Amazon and gotten a personalized recommendation? That\'s some crazy AI magic!"),\n ("Speaker 1", "Exactly! And that\'s what I love about AI. It\'s not just about automating tasks, it\'s about creating new experiences and opportunities. I was talking to a colleague who\'s a data scientist, and he said that AI is going to revolutionize the way we approach healthcare. We\'ll be able to analyze medical data in ways that were previously impossible, leading to breakthroughs in disease diagnosis and treatment. Umm, it\'s a pretty exciting space."),\n ("Speaker 2", "Umm, excuse me, Rachel. I think you\'re glossing over the potential risks of AI. I mean, we\'ve seen movies like \'The Terminator\' and \'Ex Machina\' that showcase the darker side of AI. What about the job displacement concerns? We can\'t just create machines that do everything better than humans without thinking about the consequences. Sigh, we have to consider the social implications."),\n ("Speaker 1", "That\'s a great point, Alex. And I think that\'s where the conversation gets really interesting. I mean, we\'re not just talking about replacing humans with machines, we\'re talking about augmenting human capabilities. And that\'s where the potential for real-world impact comes in. I was talking to a startup founder who\'s working on an AI-powered platform that\'s helping small businesses manage their finances. It\'s amazing to see how AI can be used to level the playing field for entrepreneurs. Hmm, I think that\'s a really important message."),\n ("Speaker 2", "So, Rachel, can you give us an example of how AI is being used to solve real-world problems? Like, what\'s the story behind that startup you mentioned? I mean, how does it actually work? Is it like machine learning or deep learning? [laughs] I\'m so curious!"),\n ("Speaker 1", "Well, actually, the story goes like this: Meet Maria, a single mom who owns a small bakery in a low-income neighborhood. She was struggling to manage her finances, juggling multiple clients and suppliers. But with the help of our startup, she was able to streamline her operations, optimize her pricing, and even start selling her products online. It\'s amazing to see how AI-powered tools could help her scale her business and improve her bottom line. The technology is actually a hybrid approach that combines both machine learning and rule-based systems. It\'s like a smart dashboard that provides real-time insights and recommendations to businesses like Maria\'s. Umm, it\'s pretty fascinating!"),\n ("Speaker 2", "Wow, that\'s incredible! And what about the potential for AI to help solve global problems like climate change? I mean, can AI really help us find sustainable solutions? We need to explore that topic!"),\n ("Speaker 1", "That\'s a great question, Alex. And I think AI can indeed play a crucial role in addressing global challenges. For example, AI can help optimize energy consumption, predict weather patterns, and even aid in disaster response. It\'s amazing to see how AI can be used to drive positive change. Hmm, I think we\'re just scratching the surface of what\'s possible with AI."),\n ("Speaker 2", "I completely agree! And what about the ethics of AI development? I mean, who\'s responsible for ensuring that AI is developed in a way that\'s transparent and fair? We need to have those conversations!"),\n ("Speaker 1", "That\'s a fantastic point, Alex. And I think it\'s essential to have ongoing discussions about the ethics and governance of AI development. We need to work together to ensure that AI is developed in a way that benefits society as a whole. Umm, it\'s a complex issue, but I think we\'re on the right track."),\n]'

Most of the times we argue in life that Data Structures isn't very useful. However, this time the knowledge comes in handy.

We will take the string from the pickle file and load it in as a Tuple with the help of ast.literal_eval()

In [20]:

import ast
ast.literal_eval(PODCAST_TEXT[:-1])

Out[20]:

[('Speaker 1',
"Welcome to 'The Think Tank'! I'm Rachel, and I'm thrilled to have my co-host, Alex, joining me on this thought-provoking journey. We're diving into the fascinating world of artificial intelligence, and I'm excited to share with you the latest advancements and insights in this field. Alex, let's start with the basics. What's your take on AI?"),
('Speaker 2',
"I think AI is like the ultimate game-changer. We're already using chatbots to book our flights and hotels. But with the rise of natural language processing, AI is becoming increasingly sophisticated. It's like having a super-smart personal assistant that can learn and adapt to our needs. Umm, like Alexa or Siri?"),
('Speaker 1',
"That's right, Alex. And I love the analogy of a personal assistant. You know, I was talking to a friend who's a tech entrepreneur, and he said that AI is like the ultimate enabler. It's allowing companies to automate tasks that were previously done by humans, freeing up resources for more strategic and creative work. Hmm, it's amazing to think about how much more efficient our daily lives could be with AI."),
('Speaker 2',
"Hmmm, right... Yeah, that makes sense. I mean, we've seen companies like Netflix and Amazon using AI to personalize their services. It's like having a super-smart algorithm that knows exactly what you want before you even ask for it. [laughs] I mean, have you ever ordered something on Amazon and gotten a personalized recommendation? That's some crazy AI magic!"),
('Speaker 1',
"Exactly! And that's what I love about AI. It's not just about automating tasks, it's about creating new experiences and opportunities. I was talking to a colleague who's a data scientist, and he said that AI is going to revolutionize the way we approach healthcare. We'll be able to analyze medical data in ways that were previously impossible, leading to breakthroughs in disease diagnosis and treatment. Umm, it's a pretty exciting space."),
('Speaker 2',
"Umm, excuse me, Rachel. I think you're glossing over the potential risks of AI. I mean, we've seen movies like 'The Terminator' and 'Ex Machina' that showcase the darker side of AI. What about the job displacement concerns? We can't just create machines that do everything better than humans without thinking about the consequences. Sigh, we have to consider the social implications."),
('Speaker 1',
"That's a great point, Alex. And I think that's where the conversation gets really interesting. I mean, we're not just talking about replacing humans with machines, we're talking about augmenting human capabilities. And that's where the potential for real-world impact comes in. I was talking to a startup founder who's working on an AI-powered platform that's helping small businesses manage their finances. It's amazing to see how AI can be used to level the playing field for entrepreneurs. Hmm, I think that's a really important message."),
('Speaker 2',
"So, Rachel, can you give us an example of how AI is being used to solve real-world problems? Like, what's the story behind that startup you mentioned? I mean, how does it actually work? Is it like machine learning or deep learning? [laughs] I'm so curious!"),
('Speaker 1',
"Well, actually, the story goes like this: Meet Maria, a single mom who owns a small bakery in a low-income neighborhood. She was struggling to manage her finances, juggling multiple clients and suppliers. But with the help of our startup, she was able to streamline her operations, optimize her pricing, and even start selling her products online. It's amazing to see how AI-powered tools could help her scale her business and improve her bottom line. The technology is actually a hybrid approach that combines both machine learning and rule-based systems. It's like a smart dashboard that provides real-time insights and recommendations to businesses like Maria's. Umm, it's pretty fascinating!"),
('Speaker 2',
"Wow, that's incredible! And what about the potential for AI to help solve global problems like climate change? I mean, can AI really help us find sustainable solutions? We need to explore that topic!"),
('Speaker 1',
"That's a great question, Alex. And I think AI can indeed play a crucial role in addressing global challenges. For example, AI can help optimize energy consumption, predict weather patterns, and even aid in disaster response. It's amazing to see how AI can be used to drive positive change. Hmm, I think we're just scratching the surface of what's possible with AI."),
('Speaker 2',
"I completely agree! And what about the ethics of AI development? I mean, who's responsible for ensuring that AI is developed in a way that's transparent and fair? We need to have those conversations!"),
('Speaker 1',
"That's a fantastic point, Alex. And I think it's essential to have ongoing discussions about the ethics and governance of AI development. We need to work together to ensure that AI is developed in a way that benefits society as a whole. Umm, it's a complex issue, but I think we're on the right track.")]

Generating the Final Podcast¶

Finally, we can loop over the Tuple and use our helper functions to generate the audio

In [21]:

final_audio = None

for speaker, text in tqdm(ast.literal_eval(PODCAST_TEXT[:-1]), desc="Generating podcast segments", unit="segment"):
    if speaker == "Speaker 1":
        audio_arr, rate = generate_speaker1_audio(text)
    else:  # Speaker 2
        audio_arr, rate = generate_speaker2_audio(text)
    
    # Convert to AudioSegment (pydub will handle sample rate conversion automatically)
    audio_segment = numpy_to_audio_segment(audio_arr, rate)
    
    # Add to final audio
    if final_audio is None:
        final_audio = audio_segment
    else:
        final_audio += audio_segment

Generating podcast segments:   8%|▊         | 1/13 [00:28<05:46, 28.83s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Generating podcast segments:  23%|██▎       | 3/13 [01:41<05:37, 33.75s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Generating podcast segments:  38%|███▊      | 5/13 [03:04<05:03, 37.98s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Generating podcast segments:  54%|█████▍    | 7/13 [04:30<04:05, 40.86s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Generating podcast segments:  69%|██████▉   | 9/13 [05:59<02:51, 42.78s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Generating podcast segments:  85%|████████▍ | 11/13 [07:10<01:16, 38.48s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Generating podcast segments: 100%|██████████| 13/13 [08:21<00:00, 38.57s/segment]

Output the Podcast¶

We can now save this as a mp3 file

In [22]:

final_audio.export("./resources/_podcast.mp3", 
                  format="mp3", 
                  bitrate="192k",
                  parameters=["-q:a", "0"])

Out[22]:

<_io.BufferedRandom name='./resources/_podcast.mp3'>

Suggested Next Steps:¶

Experiment with the prompts: Please feel free to experiment with the SYSTEM_PROMPT in the notebooks
Extend workflow beyond two speakers
Test other TTS Models
Experiment with Speech Enhancer models as a step 5.

In [ ]:

#fin