```python
#collapse
# setting things up for pretty visualization
from rich import print
from pyannote.core import notebook, Segment

# only display the first 30 seconds of the sample file
SAMPLE_EXTENT = Segment(0, 30)
notebook.crop = SAMPLE_EXTENT

SAMPLE_CHUNK = Segment(15, 20)
SAMPLE_URI = "sample"
SAMPLE_WAV = f"{SAMPLE_URI}.wav"
SAMPLE_REF = f"{SAMPLE_URI}.rttm"
```
In this blog post, I talk about the pyannote.audio pretrained speaker segmentation model, which happens to be one of the most popular audio models available on the 🤗 Hugging Face model hub. Loading it takes just two lines of code:

```python
from pyannote.audio import Model

model = Model.from_pretrained("pyannote/segmentation")
```
## What does `pyannote/segmentation` do?

Every pyannote.audio model has a `specifications` attribute that tells us a bit more about itself:
```python
print(model.specifications)
```
```
Specifications(
    problem=<Problem.MULTI_LABEL_CLASSIFICATION: 2>,
    resolution=<Resolution.FRAME: 1>,
    duration=5.0,
    warm_up=(0.0, 0.0),
    classes=['speaker#1', 'speaker#2', 'speaker#3'],
    permutation_invariant=True
)
```
These specifications tell us the following about `pyannote/segmentation`:

- it ingests audio chunks of fixed `duration` (5 seconds)
- the `problem` it addresses is multi-label classification
- its `classes` are chosen among `speaker#1`, `speaker#2`, and `speaker#3`
- it is `permutation_invariant` (more about that below)

We also learn that its output temporal `resolution` is the frame (i.e. it outputs a sequence of frame-wise decisions rather than just one decision for the whole chunk). The actual temporal resolution can be obtained through the magic `introspection` attribute (approximately 17ms for `pyannote/segmentation`):
```python
model.introspection.frames.step
```
```
0.016875
```
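To make these specifications more concrete, here is a small sketch (not part of the original notebook) that feeds one 5-second excerpt of the sample file to the model and checks that the output is indeed a sequence of frame-wise scores for three speaker classes. The use of the `Audio` helper and the exact output shape are assumptions based on the pyannote.audio 2.x API.

```python
# Rough sketch (assumes pyannote.audio 2.x and the SAMPLE_* variables defined above).
import torch
from pyannote.audio import Audio

audio = Audio(sample_rate=16000, mono=True)

# SAMPLE_CHUNK is the 5-second excerpt [15s, 20s] defined at the top of the post
waveform, sample_rate = audio.crop(SAMPLE_WAV, SAMPLE_CHUNK)

model.eval()
with torch.no_grad():
    scores = model(waveform[None])  # add a batch dimension

print(scores.shape)  # expected: (1, num_frames, 3) with num_frames ≈ 5s / 17ms
```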
## What does `pyannote/segmentation` really do?

To answer this question, let us consider the audio recording of a 30-second conversation between two speakers (the blue one and the red one):
```python
#collapse
from pyannote.database.util import load_rttm

reference = load_rttm(SAMPLE_REF)[SAMPLE_URI]
reference
```
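If you have never manipulated a pyannote.core `Annotation` before, the following optional snippet (not part of the original notebook) lists the speaker turns stored in this reference; `itertracks` is the standard pyannote.core way to iterate over them.

```python
# Optional: dump the reference speaker turns as (start, end, speaker) triples.
for turn, _, speaker in reference.itertracks(yield_label=True):
    print(f"{turn.start:5.1f}s  {turn.end:5.1f}s  {speaker}")
```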
```python
#collapse
from IPython.display import Audio as AudioPlayer

AudioPlayer(SAMPLE_WAV)
```