Named-Entity Recognition

Introduction

Named-entity recognition (NER), also known as token classification or text tagging, is the task of classifying every word (or "token") in a sentence into a category, such as a person's name, a location, or a part of speech.

For example, given the sentence:

Does Chicago have any Pakistani restaurants?

A named-entity recognition algorithm may identify:

"Chicago" as a location
"Pakistani" as an ethnicity

and so on.

Using Gradio (specifically the HighlightedText component), you can easily build a web demo of your NER model and share it with the rest of your team.

Here is an example of a demo that you'll be able to build:

This tutorial shows how to take a pretrained NER model and deploy it with a Gradio interface. We will show two different ways to use the HighlightedText component. Depending on the output format of your NER model, one of the two may be easier to use.

Prerequisites

Make sure you have the gradio Python package installed. You will also need a pretrained named-entity recognition model. You can use your own; in this tutorial, we will use one from the transformers library.
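As a quick sanity check before continuing, the sketch below reports which of the packages used in this tutorial cannot be found in the current environment. The helper name missing_packages is hypothetical, and the package list is an assumption based on the libraries mentioned in this guide (spaCy is only needed for Approach 2).

```python
from importlib.util import find_spec

# Packages this tutorial relies on (spaCy is only needed for Approach 2).
REQUIRED = ["gradio", "transformers", "spacy"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All packages found.")
```

Anything reported as missing can be installed with pip before running the examples.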

Approach 1: List of Entity Dictionaries

Many named-entity recognition models output a list of dictionaries. Each dictionary contains an entity label, a "start" index, and an "end" index. This is, for example, how NER models in the transformers library operate:

from transformers import pipeline 
ner_pipeline = pipeline("ner")
ner_pipeline("Does Chicago have any Pakistani restaurants")

Output:

[{'entity': 'I-LOC',
  'score': 0.9988978,
  'index': 2,
  'word': 'Chicago',
  'start': 5,
  'end': 12},
 {'entity': 'I-MISC',
  'score': 0.9958592,
  'index': 5,
  'word': 'Pakistani',
  'start': 22,
  'end': 31}]
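In the output above, "start" and "end" are character offsets into the input string, with "end" exclusive. The short sketch below copies the two entities from that output (no model is loaded) and slices the original text to recover each labeled span.

```python
text = "Does Chicago have any Pakistani restaurants"

# Entities copied from the pipeline output shown above.
entities = [
    {"entity": "I-LOC", "score": 0.9988978, "index": 2,
     "word": "Chicago", "start": 5, "end": 12},
    {"entity": "I-MISC", "score": 0.9958592, "index": 5,
     "word": "Pakistani", "start": 22, "end": 31},
]

for ent in entities:
    # Slice the original text with the character offsets.
    span = text[ent["start"]:ent["end"]]
    print(f"{span!r} -> {ent['entity']}")
```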

If you have such a model, it is very easy to hook it up to Gradio's HighlightedText component. All you need to do is return this list of entities together with the original text as a single dictionary, with the keys "entities" and "text" respectively.

Here is a complete example:

from transformers import pipeline

import gradio as gr

ner_pipeline = pipeline("ner")

examples = [
    "Does Chicago have any stores and does Joe live here?",
]

def ner(text):
    output = ner_pipeline(text)
    return {"text": text, "entities": output}    

demo = gr.Interface(ner,
             gr.Textbox(placeholder="Enter sentence here..."), 
             gr.HighlightedText(),
             examples=examples)

demo.launch()
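Because the function only has to return a dictionary with "text" and "entities" keys, it is easy to post-process the model output before it reaches the component. The sketch below is one possible variation, not part of the demo above: a hypothetical helper build_payload that drops entities below a confidence threshold. The min_score parameter and its default are assumptions, and the entities are copied from the earlier pipeline output so that no model is needed to try it.

```python
def build_payload(text, entities, min_score=0.9):
    """Build the {"text", "entities"} dictionary, keeping confident entities only.

    `min_score` is an illustrative threshold, not a Gradio or transformers setting.
    """
    kept = [ent for ent in entities if ent["score"] >= min_score]
    return {"text": text, "entities": kept}


text = "Does Chicago have any Pakistani restaurants"
entities = [
    {"entity": "I-LOC", "score": 0.9988978, "start": 5, "end": 12},
    {"entity": "I-MISC", "score": 0.9958592, "start": 22, "end": 31},
]

# With a strict threshold, only the higher-scoring entity is kept.
payload = build_payload(text, entities, min_score=0.997)
print([ent["entity"] for ent in payload["entities"]])
```

Inside the demo, a function like ner could call such a helper on the pipeline output instead of returning the raw list.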

Approach 2: List of Tuples

An alternative way to pass data into the HighlightedText component is as a list of tuples. The first element of each tuple should be the word or words being classified as a particular entity. The second element should be the entity label (or None if the text should be unlabeled). The HighlightedText component automatically strings together the words and labels to display the entities.
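To see how the two formats relate, the sketch below converts the dictionary format from Approach 1 into a list of tuples. The helper name entities_to_tuples is hypothetical, and the code assumes the entities do not overlap; it is meant as an illustration of the tuple format, not as a description of Gradio's internals.

```python
def entities_to_tuples(text, entities):
    """Convert a text plus entity dictionaries into (segment, label) tuples.

    Assumes entities do not overlap and use exclusive "end" offsets.
    """
    tuples = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] > cursor:
            # Unlabeled text between the previous entity and this one.
            tuples.append((text[cursor:ent["start"]], None))
        tuples.append((text[ent["start"]:ent["end"]], ent["entity"]))
        cursor = ent["end"]
    if cursor < len(text):
        # Unlabeled text after the last entity.
        tuples.append((text[cursor:], None))
    return tuples


text = "Does Chicago have any Pakistani restaurants"
entities = [
    {"entity": "I-LOC", "start": 5, "end": 12},
    {"entity": "I-MISC", "start": 22, "end": 31},
]

for segment, label in entities_to_tuples(text, entities):
    print(repr(segment), label)
```

Joining the first elements of all tuples gives back the original text, which is a handy check when building the list by hand.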

In some cases, this can be easier than the first approach. Here is a demo of this approach using spaCy's part-of-speech tagger:

import gradio as gr
import os
# Download the small English spaCy model before it is loaded below.
os.system('python -m spacy download en_core_web_sm')
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

def text_analysis(text):
    doc = nlp(text)
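    html = displacy.render(doc, style="dep", page=True)
    # Wrap the rendered parse in a scrollable container.
    # The inline styling below is an assumption; adjust it to fit your layout.
    html = (
        "<div style='max-width:100%; max-height:360px; overflow:auto'>"
        + html
        + "</div>"
    )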
    pos_count = {
        "char_count": len(text),
        "token_count": len(doc),  # number of spaCy tokens in the input
    }
    pos_tokens = []

    for token in doc:
        pos_tokens.extend([(token.text, token.pos_), (" ", None)])

    return pos_tokens, pos_count, html

demo = gr.Interface(
    text_analysis,
    gr.Textbox(placeholder="Enter sentence here..."),
    ["highlight", "json", "html"],
    examples=[
        ["What a beautiful morning for a walk!"],
        ["It was the best of times, it was the worst of times."],
    ],
)

demo.launch()

And you're done! That's all you need to know to build a web-based GUI for your NER model.

Fun tip: you can share your NER demo instantly with others simply by setting share=True in launch().
