Notebook

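[Note] This program, written by teacher 顏國雄, can "listen" to a video on YouTube and produce a subtitle file for it. It is designed to run on Colab: it uses pytube to download the audio of the YouTube video (to Colab's temporary cloud storage), then uses OpenAI's Whisper for speech recognition.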

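To use it, you only need the share URL of a YouTube video. Once you run the cell, Colab starts working, and by default the subtitle file is downloaded to your computer automatically when it finishes.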

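One caveat: Whisper decides on its own whether to output Traditional or Simplified Chinese, so you may get a Simplified Chinese transcript. This is a minor issue, since converting it to Traditional Chinese is easy.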
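As a rough illustration of that conversion step, the sketch below rewrites only the subtitle text lines of an SRT file and leaves the cue numbers and timing lines untouched. The names `convert_srt_text` and `toy_s2t` are hypothetical, and the character table covers only the handful of characters in the sample; a real conversion would plug in a full Simplified-to-Traditional converter (a library such as OpenCC is one option) in place of `toy_s2t`.

```python
import re

# Matches an SRT timing line such as "00:00:02,500 --> 00:00:07,500".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")

def convert_srt_text(srt_text, convert):
    """Apply `convert` to subtitle text lines; keep cue numbers and timings.

    Limitation of this sketch: a text line made only of digits would be
    mistaken for a cue number and left unconverted.
    """
    out = []
    for line in srt_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.isdigit() or TIMING.match(stripped):
            out.append(line)           # structural line: copy as-is
        else:
            out.append(convert(line))  # subtitle text: convert
    return "\n".join(out)

# Toy table for illustration only; it maps just five characters.
TOY_TABLE = str.maketrans({"进": "進", "实": "實", "让": "讓", "惊": "驚", "讶": "訝"})

def toy_s2t(text):
    return text.translate(TOY_TABLE)

sample = "1\n00:00:02,500 --> 00:00:07,500\n科技的进步实在让人很惊讶\n"
print(convert_srt_text(sample, toy_s2t))
```

Passing the converter in as a function keeps the SRT handling separate from the choice of conversion library.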

OpenAI Whisper Speech Recognition Test

OpenAI's Whisper is an open-source automatic speech recognition system, available at:

https://github.com/openai/whisper

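By combining Whisper with pytube, or a similar tool such as yt-dlp, you can extract the audio from a YouTube video or playlist, save it as an audio file, run speech recognition on it, and generate a subtitle file.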

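In the settings block of the program below, the "url" field (the audio source) accepts the URL of a YouTube video or playlist. To record something for a quick test, you can use the online recording site Vocaroo (https://vocaroo.com/) and paste the link it gives you into the "url" field. For a video or audio file on your own computer, upload it to the file panel on the left, then enter its full file name in the "url" field.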
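The kinds of input the "url" field accepts can be summarised in a small classifier. This is only a sketch: the function name `classify_source` and its labels are hypothetical, and it assumes a playlist link can be recognised by a `list=` query parameter, whereas the program below instead asks pytube's `Playlist` for a title and falls back when that fails.

```python
import re

def classify_source(url):
    """Return a rough label for the kind of audio source `url` points to."""
    if not re.match(r"https://", url):
        return "local"      # anything else is treated as an uploaded file name
    if re.search(r"youtube\.|youtu\.be", url):
        # Assumption: a "list=" query parameter marks a playlist link.
        return "playlist" if re.search(r"[?&]list=", url) else "youtube"
    if re.search(r"voca\.ro|vocaroo\.com", url):
        return "vocaroo"
    return "other"

for example in [
    "https://www.youtube.com/watch?v=I4DZn4z8aRQ",
    "https://voca.ro/15HVH0YvIaa6",
    "vocaroo-台語.mp3",
]:
    print(classify_source(example), example)
```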

Once the other options are set, press the run button on the code cell to start speech recognition.

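Whisper can also recognize Taiwanese. When transcribing it, it is safer to set the language field "lang" to "Chinese"; with "自動判斷" (auto-detect), the audio is sometimes classified as a non-Chinese language and no usable text comes out.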
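One way to express this choice in code is to build the keyword arguments for `model.transcribe` and leave out `language` only when auto-detection is wanted. The helper name `build_transcribe_options` is hypothetical; `fp16`, `verbose` and `language` are the same arguments the program below passes to Whisper.

```python
AUTO_DETECT = "自動判斷"  # the auto-detect choice in the form's "lang" dropdown

def build_transcribe_options(lang, verbose=False):
    """Return keyword arguments for model.transcribe()."""
    options = {"fp16": False, "verbose": verbose}
    if lang != AUTO_DETECT:
        # Forcing the language skips Whisper's own language detection.
        options["language"] = lang
    return options

print(build_transcribe_options("Chinese"))
print(build_transcribe_options(AUTO_DETECT))
```

With a loaded model, the call would then look like `model.transcribe(filename, **build_transcribe_options(lang, verbose))`.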

The first run may take a little while, because the program has to install packages and download the data needed for speech recognition.

In [ ]:

#@title OpenAI Whisper: speech recognition with subtitle file output { vertical-output: true }

#@markdown <b>Set the parameters below, then press the run button on the left</b>

#@markdown Audio source: a URL (YouTube video, playlist, or [Vocaroo](https://vocaroo.com/) link), or the file name of an uploaded video or audio file.
url = "https://voca.ro/15HVH0YvIaa6" #@param {type:"string"}
#@markdown Language of the speech ("自動判斷" means auto-detect)
lang = 'Chinese' #@param ["Chinese", "English", "Japanese", "Korean", "自動判斷"]
#@markdown Output format (.srt: subtitle file, .txt: plain text file)
outputFormat = 'srt' #@param ['srt', 'txt']
#@markdown Recognition model (small: fast/ordinary, medium: slow/accurate)
modelType = 'small' #@param ["small" , "medium"]
#@markdown Download the subtitle file as soon as all recognition has finished
start_downloading_immediately = True #@param { type: 'boolean' }
#@markdown Show recognition results in real time
verbose = False #@param { type: 'boolean' }
#@markdown ---



# Install + Import + Config
try: import whisper
except ImportError:
  print('install whisper ...')
  ! pip -q install git+https://github.com/openai/whisper.git

#try: import yt_dlp
#except ImportError:
#  print('install yt-dlp ...')
#  ! pip -q install yt-dlp

try: from pytube import YouTube
except ImportError:
  print('install pytube ...')
  ! pip -q install pytube

import torch
import whisper
from whisper.utils import get_writer
from pytube import YouTube
from pytube import Playlist
import re
import os.path
import urllib.request
from slugify import slugify

import google.colab.files  # load the submodule used by google.colab.files.download

#import yt_dlp

#url = "https://voca.ro/15HVH0YvIaa6"
#url = 'https://www.youtube.com/watch?v=I4DZn4z8aRQ&list=PLelNvYGEtsV8TpwxL4t7GTTG-7qALZqol'
#url = 'https://www.youtube.com/watch?v=I4DZn4z8aRQ'
#url = 'vocaroo-台語.mp3'

#lang = 'Chinese' # '自動判斷'
#start_downloading_immediately = True

#modelType = 'small' # 'small' 'medium'
#outputFormat = 'txt' # 'srt' 'txt'

audioFile = 'source.mp3'
output_path = '.'
title = ''

# GPU or CPU
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def getAudioFromYoutube(url, output_path, filename) :
  video = YouTube(url)
  video.streams.get_audio_only().download(output_path, filename)
  return video.title

#Get the MP3 URL from Vocaroo record share URL
def getVocarooMP3URL(url) :
  vocarooMP3Base = 'https://media.vocaroo.com/mp3/'
  regex = re.compile(r'(https\:\/\/voca\.ro|https\:\/\/vocaroo\.com)\/(\w{12})')
  match = regex.search(url)
  if match :
    url = vocarooMP3Base+match.group(2)
  return url

def transcribe(filename, output_path, outputFilename, outputFormat) :
  # load whisper model and get transcribe result
  model = whisper.load_model(modelType, device=DEVICE)
  if lang=="自動判斷" :
    print('auto detect language')
    result = model.transcribe(filename, fp16=False, verbose=verbose)    
  else :
    result = model.transcribe(filename, fp16=False, verbose=verbose, language=lang)
  # save result to outputFormat file
  saveToFile(result, output_path, outputFilename, outputFormat)

# save whisper result to a file
def saveToFile(result, output_path, filename, fileType='srt') :
  # save SRT
  file_writer = get_writer(fileType, output_path)
  file_writer(result, filename)


# Create a Playlist object; the title lookup below raises if url is not a playlist
pList = Playlist(url)

isPlayList = True
try : title = pList.title
except :
  isPlayList = False

if isPlayList :
  for video in pList.videos :
    title = video.title
    #filename = title+'.mp3'
    filename = audioFile
  
    print(title)

    #convert title to valid filename
    outputFilename = slugify(title, allow_unicode=True, lowercase=False)
  
    #continue

    #download audio from video stream
    video.streams.get_audio_only().download(output_path, filename)

    # load whisper model and get transcribe result
    transcribe(filename, output_path, outputFilename, outputFormat)
else :
  filename = audioFile
  if re.search(r'https://', url) :
    if re.search(r'youtube\.|youtu\.', url) :
      # YouTube video
      title = getAudioFromYoutube(url, output_path, audioFile)
    elif re.search(r'voca\.ro|vocaroo\.com', url) :
      # Vocaroo or other web audio
      url = getVocarooMP3URL(url)
      urllib.request.urlretrieve(url, audioFile)
      title = 'vocaroo'
    else :
      # other website: the yt-dlp download is disabled, so only stale audio is removed
      ! rm -f {audioFile}
      #! yt-dlp -q --force-overwrites -x --audio-format mp3 -o {audioFile} {url}
  else :
    # local audio file
    title = url
    filename = url

  print(title)
  #convert title to valid filename
  outputFilename = slugify(title, allow_unicode=True, lowercase=False)
  
  # load whisper model and get transcribe result
  if os.path.exists(filename) :
    transcribe(filename, output_path, outputFilename, outputFormat)
  else :
    print('Audio file not found: '+filename)
    start_downloading_immediately = False

if start_downloading_immediately:
  if isPlayList :
    print('\nCompressing and downloading the transcription results')
    title = pList.title
    outputFilename = slugify(title, allow_unicode=True, lowercase=False)+'-'+outputFormat+'.zip'
    # zip all output files; -l converts their line endings from \n to \r\n
    ! zip -l {outputFilename} *.{outputFormat}
  else :
    print('\nDownloading the transcription result')
    outputFilename = outputFilename+'.'+outputFormat
  # download the transcription result
  google.colab.files.download(outputFilename)

vocaroo

100%|███████████████████████████████████████| 461M/461M [00:08<00:00, 55.2MiB/s]
100%|██████████| 4051/4051 [01:08<00:00, 59.35frames/s]

Downloading the transcription result
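The Vocaroo branch of the cell above turns a share link into a direct MP3 link before downloading it. The standalone sketch below mirrors that step so the mapping is easy to see; the name `vocaroo_mp3_url` is hypothetical, and the media base URL and 12-character ID pattern are taken from the program as written, without checking that Vocaroo still serves files this way.

```python
import re

VOCAROO_MP3_BASE = "https://media.vocaroo.com/mp3/"
# A share link ends in a 12-character recording ID.
SHARE_LINK = re.compile(r"https://(?:voca\.ro|vocaroo\.com)/(\w{12})")

def vocaroo_mp3_url(url):
    """Map a Vocaroo share link to its MP3 link; return other URLs unchanged."""
    match = SHARE_LINK.search(url)
    return VOCAROO_MP3_BASE + match.group(1) if match else url

print(vocaroo_mp3_url("https://voca.ro/15HVH0YvIaa6"))
# prints https://media.vocaroo.com/mp3/15HVH0YvIaa6
```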

Comparison of Recognition Results

The results below come from transcribing the same audio file

https://voca.ro/15HVH0YvIaa6

with different models.

Model small

[00:00.000 --> 00:02.500] 大家午安
[00:02.500 --> 00:07.500] 科技的進步實在讓人很驚訝
[00:07.500 --> 00:13.000] 我們現在不怕利用廣的賣力
[00:13.000 --> 00:16.000] 我們說台語
[00:16.000 --> 00:19.000] 電腦跟我們說的台語就變異
[00:19.000 --> 00:23.000] 連鍵盤都不變硬
[00:23.000 --> 00:26.000] 現在我們來砌這個
[00:26.000 --> 00:28.500] OpenAI的Whisper
[00:28.500 --> 00:33.500] 也是跟我們說的微變異
[00:33.500 --> 00:36.500] 好,我們現在就來試試看
[00:36.500 --> 00:58.500] 看它到底有多厲害

Model medium

[00:00.000 --> 00:03.000] 大家午安
[00:03.000 --> 00:08.000] 科技的進步實在讓人很驚險
[00:08.000 --> 00:11.000] 我們現在不用打字
[00:11.000 --> 00:14.000] 用講的也可以
[00:14.000 --> 00:16.000] 我們講台語
[00:16.000 --> 00:20.000] 電腦跟我們講的台語就變字
[00:20.000 --> 00:24.000] 連鍵盤都不用用
[00:24.000 --> 00:26.000] 現在我們來試這個
[00:26.000 --> 00:29.000] OpenAI的Whisper
[00:29.000 --> 00:33.000] 它可以跟我們講的話變字
[00:33.000 --> 00:37.000] 好,我們現在就來試看看
[00:37.000 --> 01:03.000] 看它到底多厲害

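In short, accuracy is bought with space and time: the medium model is larger and slower than the small one, but its transcript is more accurate.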
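To see how far the two transcripts diverge, one option is a character-level comparison with Python's `difflib`. The sketch below uses three line pairs copied from the outputs above; the helper name `similarity` is hypothetical. Neither transcript is a ground-truth reference, so the score measures agreement between the two models, not accuracy.

```python
from difflib import SequenceMatcher

# (small model, medium model) line pairs copied from the transcripts above.
PAIRS = [
    ("大家午安", "大家午安"),
    ("科技的進步實在讓人很驚訝", "科技的進步實在讓人很驚險"),
    ("我們說台語", "我們講台語"),
]

def similarity(a, b):
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()

for small, medium in PAIRS:
    print(f"{similarity(small, medium):.2f}  {small} | {medium}")
```

A proper accuracy figure would need a hand-checked reference transcript to compare each model against.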