#collapse
# setting things for pretty visualization
from rich import print
from pyannote.core import notebook, Segment
SAMPLE_EXTENT = Segment(0, 30)
notebook.crop = SAMPLE_EXTENT
SAMPLE_CHUNK = Segment(15, 20)
SAMPLE_URI = "sample"
SAMPLE_WAV = f"{SAMPLE_URI}.wav"
SAMPLE_REF = f"{SAMPLE_URI}.rttm"
In this blog post, I talk about pyannote.audio pretrained speaker segmentation model, which happens to be one of the most popular audio model available on 🤗 Huggingface model hub.
from pyannote.audio import Model
model = Model.from_pretrained("pyannote/segmentation")
pyannote/segmentation
do?¶Every pyannote.audio
model has a specifications
attribute that tells us a bit more about itself:
print(model.specifications)
Specifications( problem=<Problem.MULTI_LABEL_CLASSIFICATION: 2>, resolution=<Resolution.FRAME: 1>, duration=5.0, warm_up=(0.0, 0.0), classes=['speaker#1', 'speaker#2', 'speaker#3'], permutation_invariant=True )
These specifications tell us the following about pyannote/segmentation
:
duration
problem
...classes
are chosen among speaker#1
, speaker#2
, and speaker#3
...permutation_invariant
(more about that below)We also learn that its output temporal resolution
is the frame (i.e. it outputs a sequence of frame-wise decisions rather than just one decision for the whole chunk). The actual temporal resolution can be obtained through the magic introspection
attribute (approximately 17ms for pyannote/segmentation
):
model.introspection.frames.step
0.016875
pyannote/segmentation
really do?¶To answer this question, let us consider the audio recording of a 30s conversation between two speakers (the blue one and the red one):
#collapse
from pyannote.database.util import load_rttm
reference = load_rttm(SAMPLE_REF)[SAMPLE_URI]
reference
#collapse
from IPython.display import Audio as AudioPlayer
AudioPlayer(SAMPLE_WAV)
Let's apply the model on this 5s excerpt of the conversation:
#collapse
SAMPLE_CHUNK
from pyannote.audio import Audio
audio_reader = Audio(sample_rate=model.hparams.sample_rate)
waveform, sample_rate = audio_reader.crop(SAMPLE_WAV, SAMPLE_CHUNK)
#collapse
import numpy as np
from pyannote.core import SlidingWindowFeature, SlidingWindow
_waveform, sample_rate = Audio()(SAMPLE_WAV)
_waveform = _waveform.numpy().T
_waveform[:round(SAMPLE_CHUNK.start * sample_rate)] = np.NAN
_waveform[round(SAMPLE_CHUNK.end * sample_rate):] = np.NAN
SlidingWindowFeature(_waveform, SlidingWindow(1./sample_rate, 1./sample_rate))
output = model(waveform)
#collapse
_output = output.detach()[0].numpy()
shifted_frames = SlidingWindow(start=SAMPLE_CHUNK.start,
duration=model.introspection.frames.duration,
step=model.introspection.frames.step)
_output = SlidingWindowFeature(_output, shifted_frames)
_output
The model has accurately detected that two speakers are active (the orange one and the blue one) in this 5s chunk, and that they are partially overlapping around t=18s. The third speaker probability (in green) is close to zero for the whole five seconds.
#collapse
reference.crop(SAMPLE_CHUNK)
pyannote.audio
provides a convenient Inference
class to apply the model using a 5s sliding window on the whole file:
from pyannote.audio import Inference
inference = Inference(model, duration=5.0, step=2.5)
output = inference(SAMPLE_WAV)
#collapse
output
output.data.shape
(11, 293, 3)
BATCH_AXIS = 0
TIME_AXIS = 1
SPEAKER_AXIS = 2
For each of the 11 positions of the 5s window, the model outputs a 3-dimensional vector every 16ms (293 frames for 5 seconds), corresponding to the probabilities that each of (up to) 3 speakers is active.
This model has more than one string to its bow and can prove useful for lots of speaker-related tasks such as:
Voice activity detection is the task of detecting speech regions in a given audio stream or recording.
This can be achieved by postprocessing the output of the model using the maximum over the speaker axis for each frame.
to_vad = lambda o: np.max(o, axis=SPEAKER_AXIS, keepdims=True)
to_vad(output)
The Inference
class has a built-in mechanism to apply this transformation automatically on each window and aggregate the result using overlap-add. This is achieved by passing the above to_vad
function via the pre_aggregation_hook
parameter:
vad = Inference("pyannote/segmentation", pre_aggregation_hook=to_vad)
vad_prob = vad(SAMPLE_WAV)
#collapse
vad_prob.labels = ['SPEECH']
vad_prob
Binarize
utility class can eventually convert this frame-based probabiliy to the time domain:
from pyannote.audio.utils.signal import Binarize
binarize = Binarize(onset=0.5)
speech = binarize(vad_prob)
#collapse
speech
#collapse
reference
Overlapped speech detection is the task of detecting regions where at least two speakers are speaking at the same time. This can be achieved by postprocessing the output of the model using the second maximum over the speaker axis for each frame.
to_osd = lambda o: np.partition(o, -2, axis=SPEAKER_AXIS)[:, :, -2, np.newaxis]
osd = Inference("pyannote/segmentation", pre_aggregation_hook=to_osd)
osd_prob = osd(SAMPLE_WAV)
#collapse
osd_prob.labels = ['OVERLAP']
osd_prob
binarize(osd_prob)
#collapse
reference
Instantaneous speaker counting is a generalization of voice activity and overlapped speech detection that aims at returning the number of simultaneous speakers at each frame. This can be achieved by summing the output of the model over the speaker axis for each frame:
to_cnt = lambda probability: np.sum(probability, axis=SPEAKER_AXIS, keepdims=True)
cnt = Inference("pyannote/segmentation", pre_aggregation_hook=to_cnt)
cnt(SAMPLE_WAV)
Speaker change detection is the task of detecting speaker change points in a given audio stream or recording. It can be achieved by taking the absolute value of the first derivative over the time axis, and take the maximum value over the speaker axis:
to_scd = lambda probability: np.max(
np.abs(np.diff(probability, n=1, axis=TIME_AXIS)),
axis=SPEAKER_AXIS, keepdims=True)
scd = Inference("pyannote/segmentation", pre_aggregation_hook=to_scd)
scd_prob = scd(SAMPLE_WAV)
#collapse
scd_prob.labels = ['SPEAKER_CHANGE']
scd_prob
Using a combination of Peak
utility class (to detect peaks in the above curve) and voice activity detection output, we can obtain a decent segmentation into speaker turns:
from pyannote.audio.utils.signal import Peak
peak = Peak(alpha=0.05)
peak(scd_prob).crop(speech.get_timeline())
#collapse
reference
pyannote/segmentation
CANNOT do¶Now, let's take a few steps back and have a closer look at the raw output of the model.
#collapse
output
Did you notice that the blue and orange speakers have been swapped between overlapping windows [15s, 20s]
and [17.5s, 22.5s]
?
This is a (deliberate) side effect of the permutation-invariant training process using when training the model.
The model is trained to discriminate speakers locally (i.e. within each window) but does not care about their global identity (i.e. at conversation scale). Have a look at this paper to learn more about this permutation-invariant training thing.
That means that this kind of model does not actually perform speaker diarization out of the box.
Luckily, pyannote.audio
has got you covered! pyannote/speaker-diarization
pretrained pipeline uses pyannote/segmentation
internally and combines it with speaker embeddings to deal with this permutation problem globally.
For technical questions and bug reports, please check pyannote.audio Github repository.
I also offer scientific consulting services around speaker diarization (and speech processing in general), please contact me if you think this type of technology might help your business/startup!