Real-Time Speech Recognition

Related spaces: https://huggingface.co/spaces/abidlabs/streaming-asr-paused, https://huggingface.co/spaces/abidlabs/full-context-asr
Tags: ASR, SPEECH, STREAMING

Introduction

Automatic speech recognition (ASR), the conversion of spoken language to text, is a very important and thriving area of machine learning. ASR algorithms run on practically every smartphone, and are becoming increasingly embedded in professional workflows, such as digital assistants for nurses and doctors. Because ASR algorithms are designed to be used directly by customers and end users, it is important to validate that they are behaving as expected when confronted with a wide variety of speech patterns (different accents, pitches, and background audio conditions).

Using gradio, you can easily build a demo of your ASR model and share that with a testing team, or test it yourself by speaking through the microphone on your device.

This tutorial will show how to take a pretrained speech-to-text model and deploy it with a Gradio interface. We will start with a full-context model, in which the user speaks the entire audio before the prediction runs. Then we will adapt the demo to make it streaming, meaning that the audio model will convert speech as you speak. The streaming demo that we create will look something like this (try it below or in a new tab!):

Real-time ASR is inherently stateful, meaning that the model's predictions change depending on what words the user previously spoke. So, in this tutorial, we will also cover how to use state with Gradio demos.

Prerequisites

Make sure you have the gradio Python package already installed. You will also need a pretrained speech recognition model. In this tutorial, we will build demos from 2 ASR libraries:

  • Transformers (for this, pip install transformers and pip install torch)

  • DeepSpeech (pip install deepspeech==0.8.2)

Make sure you have at least one of these installed so that you can follow along the tutorial. You will also need ffmpeg installed on your system, if you do not already have it, to process files from the microphone.
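
If you are unsure whether ffmpeg is available on your machine, one quick way to check from Python is the sketch below (it only looks the binary up on your PATH; nothing here is Gradio-specific):

import shutil

# Warn if the ffmpeg binary cannot be found on the PATH
if shutil.which("ffmpeg") is None:
    print("ffmpeg not found -- please install it with your system's package manager")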

Here's how to build a real-time speech recognition (ASR) app:

  1. Set up the Transformers ASR Model

  2. Create a Full-Context ASR Demo with Transformers

  3. Create a Streaming ASR Demo with Transformers

  4. Create a Streaming ASR Demo with DeepSpeech

1. Set up the Transformers ASR Model

First, you will need an ASR model that you have either trained yourself or downloaded as a pretrained model. In this tutorial, we will start with Wav2Vec2, a pretrained ASR model from Hugging Face.

Here is the code to load Wav2Vec2 from Hugging Face transformers.

from transformers import pipeline

p = pipeline("automatic-speech-recognition")

That's it! By default, the automatic speech recognition model pipeline loads Facebook's facebook/wav2vec2-base-960h model.
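
If you would rather pin the checkpoint explicitly instead of relying on the pipeline's default, you can pass the model name yourself; for example:

# Load the same checkpoint explicitly by name
p = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")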

2. Create a Full-Context ASR Demo with Transformers

We will start by creating a full-context ASR demo, in which the user speaks the full audio before using the ASR model to run inference. This is very easy with Gradio -- we simply create a function around the pipeline object above.

We will use gradio's built-in Audio component, configured to take input from the user's microphone and return a filepath for the recorded audio. The output component will be a plain Textbox.

import gradio as gr

def transcribe(audio):
    text = p(audio)["text"]
    return text

gr.Interface(
    fn=transcribe, 
    inputs=gr.Audio(source="microphone", type="filepath"), 
    outputs="text").launch()

So what's happening here? The transcribe function takes a single parameter, audio, which is a filepath to the audio file that the user has recorded. The pipeline object expects a filepath and converts it to text, which is returned to the frontend and displayed in a textbox.
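
Since transcribe is an ordinary Python function, you can also sanity-check it outside of the UI by calling it on an audio file you already have on disk (the filename below is just a hypothetical placeholder):

# Prints the transcription of a local audio file (hypothetical path)
print(transcribe("sample_recording.wav"))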

Let's see it in action! (Record a short audio clip and then click submit, or open in a new tab):

3. Create a Streaming ASR Demo with Transformers

Ok great! We've built an ASR model that works well for short audio clips. However, if you are recording longer audio clips, you probably want a streaming interface, one that transcribes audio as the user speaks instead of just all-at-once at the end.

The good news is that it's not too difficult to adapt the demo we just made to make it streaming, using the same Wav2Vec2 model.

The biggest change is that we must now introduce a state parameter, which holds the audio that has been transcribed so far. This allows us to process only the latest chunk of audio and simply append it to the audio we previously transcribed.

When adding state to a Gradio demo, you need to do a total of 3 things:

  • Add a state parameter to the function

  • Return the updated state at the end of the function

  • Add the "state" components to the inputs and outputs in Interface

Here's what the code looks like:

def transcribe(audio, state=""):
    text = p(audio)["text"]
    state += text + " "
    return state, state

# Set the starting state to an empty string

gr.Interface(
    fn=transcribe, 
    inputs=[
        gr.Audio(source="microphone", type="filepath", streaming=True), 
        "state" 
    ],
    outputs=[
        "textbox",
        "state"
    ],
    live=True).launch()

Notice that we've also made one other change, which is that we've set live=True. This keeps the Gradio interface running constantly, so it automatically transcribes audio without the user having to repeatedly hit the submit button.

Let's see how it does (try below or in a new tab)!

One thing you may notice is that the transcription quality has dropped: since the chunks of audio are so small, they lack the context needed to be transcribed properly. A "hacky" fix for this is to simply increase the runtime of the transcribe() function so that longer audio chunks are processed. We can do this by adding a time.sleep() inside the function, as shown below (we'll see a proper fix next).

from transformers import pipeline
import gradio as gr
import time

p = pipeline("automatic-speech-recognition")

def transcribe(audio, state=""):
    time.sleep(2)
    text = p(audio)["text"]
    state += text + " "
    return state, state

gr.Interface(
    fn=transcribe, 
    inputs=[
        gr.Audio(source="microphone", type="filepath", streaming=True), 
        "state"
    ],
    outputs=[
        "textbox",
        "state"
    ],
    live=True).launch()

Try the demo below to see the difference (or open in a new tab)!

4. Create a Streaming ASR Demo with DeepSpeech

You're not restricted to ASR models from the transformers library -- you can use your own models or models from other libraries. The DeepSpeech library contains models that are specifically designed to handle streaming audio data. These models perform really well with streaming data as they are able to account for previous chunks of audio data when making predictions.

Going through the DeepSpeech library is beyond the scope of this Guide (check out their excellent documentation here), but you can use Gradio very similarly with a DeepSpeech ASR model as with a Transformers ASR model.

Here's a complete example (on Linux):

First install the DeepSpeech library and download the pretrained models from the terminal:

wget https://github.com/mozilla/DeepSpeech/releases/download/v0.8.2/deepspeech-0.8.2-models.pbmm
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.8.2/deepspeech-0.8.2-models.scorer
apt install libasound2-dev portaudio19-dev libportaudio2 libportaudiocpp0 ffmpeg
pip install deepspeech==0.8.2

Then, create a transcribe() function similar to the one before:

from deepspeech import Model
import numpy as np

model_file_path = "deepspeech-0.8.2-models.pbmm"
lm_file_path = "deepspeech-0.8.2-models.scorer"
beam_width = 100
lm_alpha = 0.93
lm_beta = 1.18

model = Model(model_file_path)
model.enableExternalScorer(lm_file_path)
model.setScorerAlphaBeta(lm_alpha, lm_beta)
model.setBeamWidth(beam_width)

def reformat_freq(sr, y):
    if sr not in (
        48000,
        16000,
    ):  # DeepSpeech only supports 16 kHz, so we accept 48 kHz and convert it to 16 kHz below
        raise ValueError("Unsupported rate", sr)
    if sr == 48000:
        y = (
            ((y / max(np.max(y), 1)) * 32767)
            .reshape((-1, 3))
            .mean(axis=1)
            .astype("int16")
        )
        sr = 16000
    return sr, y

def transcribe(speech, stream):
    _, y = reformat_freq(*speech)
    if stream is None:
        # First chunk: create a new DeepSpeech streaming session (held in the Gradio state)
        stream = model.createStream()
    # Feed the new audio chunk into the stream and decode everything heard so far
    stream.feedAudioContent(y)
    text = stream.intermediateDecode()
    return text, stream

Then, create a Gradio Interface as before (the only difference being that the input type should be numpy instead of a filepath, to be compatible with the DeepSpeech model).

import gradio as gr

gr.Interface(
    fn=transcribe, 
    inputs=[
        gr.Audio(source="microphone", type="numpy"), 
        "state" 
    ], 
    outputs= [
        "text", 
        "state"
    ], 
    live=True).launch()

Running all of this should allow you to deploy your real-time ASR model with a nice GUI. Try it out and see how well it works for you.


And you're done! That's all the code you need to build a web-based GUI for your ASR model.

Fun tip: you can share your ASR model instantly with others simply by setting share=True in launch().
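
For example, reusing the full-context demo from step 2, you would simply pass the flag to the same launch() call:

gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(source="microphone", type="filepath"),
    outputs="text"
).launch(share=True)  # generates a temporary public link you can send to testers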