
Parul Pandey

Source: eWeek

Gartner estimates that by 2020, chatbots will be handling 85 percent of customer-service interactions; they already handle about 30 percent of transactions.

I am sure you've heard about Duolingo: a popular language-learning app which gamifies practicing a new language. It is quite popular due to its innovative style of teaching a foreign language. The concept is simple: five to ten minutes of interactive training a day is enough to learn a language.

However, even though Duolingo is enabling people to learn a new language, its users had a concern. People felt they were missing out on learning valuable conversational skills since they were learning on their own. People were also apprehensive about being paired with other language learners due to fear of embarrassment. This was turning out to be a big bottleneck in Duolingo's plans.

So their team solved the problem by building a native chatbot within the app, to help users learn conversational skills and practice what they had learned.

http://bots.duolingo.com/

Since the bots are designed to be conversational and friendly, Duolingo learners can practice conversation any time of the day, using their choice of characters, until they feel brave enough to practice their new language with other speakers. This solved a major consumer pain point and made learning through the app a lot more fun.



A chatbot is artificial-intelligence-driven software, in a device (Siri, Alexa, Google Assistant, etc.), application, website, or other network, that tries to gauge consumers' needs and then assists them in performing a particular task, such as a commercial transaction, a hotel booking, or a form submission. Today almost every company has a chatbot deployed to engage with its users. Some of the ways in which companies are using chatbots are:

  • To deliver flight information
  • To connect customers with their finances
  • As customer support

The possibilities are (almost) limitless.

The history of chatbots dates back to 1966, when a computer program called ELIZA was invented by Weizenbaum. It imitated the language of a psychotherapist from only 200 lines of code. You can still converse with it here: Eliza.
Source: Cognizant


Broadly speaking, there are two variants of chatbots: Rule-Based and Self-learning.

  1. In a rule-based approach, a bot answers questions based on some rules on which it is trained. The rules can range from very simple to very complex. Such bots can handle simple queries but fail to manage complex ones.
  2. Self-learning bots use some machine-learning-based approach and are definitely more efficient than rule-based bots. These bots come in two further types: Retrieval-Based or Generative.

i) In retrieval-based models, a chatbot uses some heuristic to select a response from a library of predefined responses. The chatbot uses the message and the context of the conversation to select the best response from a predefined list of bot messages. The context can include the current position in the dialog tree, all previous messages in the conversation, and previously saved variables (e.g., the username). Heuristics for selecting a response can be engineered in many different ways, from rule-based if-else conditional logic to machine learning classifiers.
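
On the simplest end of that spectrum, the heuristic can be plain keyword matching. Here is a minimal sketch of such a rule-based pick (the keywords and canned replies are made up for illustration):

RESPONSES = {
    "price": "Our plans start at $10/month.",
    "hours": "We are open 9am-5pm, Monday to Friday.",
}
DEFAULT = "Sorry, I did not understand that."

def pick_response(message):
    # Return the first canned reply whose keyword appears in the message.
    for keyword, reply in RESPONSES.items():
        if keyword in message.lower():
            return reply
    return DEFAULT

print(pick_response("What are your hours?"))  # prints the 'hours' reply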


ii) Generative bots can generate the answers instead of always replying with one out of a set of predefined answers. This makes them more intelligent, as they take word-by-word cues from the query and generate the answer.


In this article, we will build a simple retrieval-based chatbot in Python, based on the NLTK library.

Pre-requisites


A hands-on knowledge of the scikit-learn library and NLTK is assumed. However, if you are new to NLP, you can still read the article and then refer back to the resources.

NLP


The field of study that focuses on the interactions between human language and computers is called Natural Language Processing, or NLP for short. It sits at the intersection of computer science, artificial intelligence, and computational linguistics [Wikipedia]. NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

NLTK: A Brief Intro

NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.


NLTK has been called "a wonderful tool for teaching, and working in, computational linguistics using Python," and "an amazing library to play with natural language."

Natural Language Processing with Python provides a practical introduction to programming for language processing. I highly recommend this book to people beginning with NLP.

Downloading and installing NLTK

  1. Install NLTK: run pip install nltk
  2. Test installation: run python, then type import nltk

For platform-specific instructions, read here.

Installing NLTK Packages


Import NLTK and run nltk.download(). This will open the NLTK downloader, from which you can choose the corpora and models to download. You can also download all packages at once.
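
For this tutorial, only two data packages are needed; here is a minimal, non-interactive way of fetching them (the same downloads appear again in the data-reading step later):

import nltk

# nltk.download()         # opens the interactive downloader instead
nltk.download('punkt')    # Punkt sentence tokenizer
nltk.download('wordnet')  # WordNet, used later for lemmatization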

Text Pre-Processing with NLTK


The main issue with text data is that it is all in text format (strings). However, machine learning algorithms need some sort of numerical feature vector in order to perform their task. So before we start on any NLP project, we need to pre-process the text to make it ideal to work with. Basic text pre-processing includes:

  • Converting the entire text into uppercase or lowercase, so that the algorithm does not treat the same word in different cases as different words
  • Tokenization: Tokenization is just the term used to describe the process of converting normal text strings into a list of tokens, i.e., the words we actually want. A sentence tokenizer can be used to find the list of sentences, and a word tokenizer can be used to find the list of words in strings.

The NLTK data package includes a pre-trained Punkt tokenizer for English.

  • Removing noise, i.e., everything that isn't a standard number or letter
  • Removing stop words. Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user's need are excluded from the vocabulary entirely. These words are called stop words.
  • Stemming: Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form --- generally a written word form. For example, if we were to stem the words "Stems", "Stemming", "Stemmed", and "Stemtization", the result would be a single word, "stem".
  • Lemmatization: A slight variant of stemming is lemmatization. The major difference between the two is that stemming can often create non-existent words, whereas lemmas are actual words. So your root stem, meaning the word you end up with, is not something you can necessarily look up in a dictionary, but you can look up a lemma. Examples of lemmatization are that "run" is the base form of words like "running" or "ran", or that "better" and "good" share the same lemma, so they are considered the same. Both steps are illustrated in the sketch after this list.
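
Here is a minimal sketch of these pre-processing steps with NLTK (it assumes the punkt and wordnet data packages are downloaded; the sample sentence is made up):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "Dogs are running faster than the cats ran."

# Lowercase the text, then split it into word tokens.
tokens = nltk.word_tokenize(text.lower())
print(tokens)  # ['dogs', 'are', 'running', 'faster', 'than', 'the', 'cats', 'ran', '.']

# Stemming chops words down and can produce non-words.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])  # 'running' -> 'run', 'cats' -> 'cat'

# Lemmatization returns dictionary words (noun lemmas by default).
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])  # 'dogs' -> 'dog'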

Bag of Words


After the initial preprocessing phase, we need to transform the text into a meaningful vector (or array) of numbers. The bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

  • A vocabulary of known words.
  • A measure of the presence of the known words.

Why is it called a "bag" of words? That is because any information about the order or structure of the words in the document is discarded, and the model is only concerned with whether the known words occur in the document, not where they occur in the document.


The intuition behind the bag-of-words model is that documents are similar if they have similar content. Also, we can learn something about the meaning of a document from its content alone.


For example, if our dictionary contains the words {Learning, is, the, not, great} and we want to vectorize the text "Learning is great", we would have the following vector: (1, 1, 0, 0, 1).
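
The same idea in code, using scikit-learn's CountVectorizer (the two sample documents are made up; CountVectorizer builds its own alphabetically sorted vocabulary rather than the hand-picked dictionary above, and get_feature_names_out requires a recent scikit-learn):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Learning is great", "Learning is not great"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['great' 'is' 'learning' 'not']
print(bow.toarray())                       # [[1 1 1 0], [1 1 1 1]]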

TF-IDF Approach


A problem with the bag-of-words approach is that highly frequent words start to dominate the document (e.g., they get larger scores) but may not carry that much "informational content". It also gives more weight to longer documents than to shorter ones.


One approach is to rescale the frequency of words by how often they appear across all documents, so that the scores for frequent words like "the", which are frequent in all documents, are penalized. This approach to scoring is called Term Frequency-Inverse Document Frequency, or TF-IDF for short, where:

Term Frequency: a scoring of the frequency of the word in the current document.

TF = (Number of times term t appears in a document)/(Number of terms in the document)

Inverse Document Frequency: a scoring of how rare the word is across documents.

IDF = 1+log(N/n), where, N is the number of documents and n is the number of documents a term t has appeared in.

The TF-IDF weight is a weighting often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.

Example:


Consider a document containing 100 words, in which the word "phone" appears 5 times.


The term frequency (i.e., TF) for "phone" is then (5/100) = 0.05. Now, assume we have 10 million documents, and the word "phone" appears in one thousand of them. Then the inverse document frequency (i.e., IDF) is calculated as log(10,000,000/1,000) = 4. The TF-IDF weight is the product of these quantities: 0.05 * 4 = 0.20.
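
As a quick sanity check, the arithmetic can be reproduced in a few lines of Python (the example uses a base-10 logarithm and omits the +1 smoothing shown in the IDF formula above):

import math

tf = 5 / 100                          # term frequency = 0.05
idf = math.log10(10_000_000 / 1_000)  # inverse document frequency = 4.0
print(tf * idf)                       # TF-IDF weight = 0.2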


TF-IDF can be implemented in scikit-learn:


from sklearn.feature_extraction.text import TfidfVectorizer

Cosine Similarity


TF-IDF is a transformation applied to texts to get real-valued vectors in a vector space. We can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing it by the product of their norms. This yields the cosine of the angle between the vectors. Cosine similarity is a measure of similarity between two non-zero vectors. Using this formula, we can find out the similarity between any two documents d1 and d2.

Cosine Similarity (d1, d2) =  Dot product(d1, d2) / ||d1|| * ||d2||

where d1, d2 are two non-zero vectors.
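
Putting the two pieces together, here is a minimal sketch that vectorizes two made-up documents with TfidfVectorizer and compares them with cosine_similarity (the same two imports are used again in the chatbot below):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

d1 = "a chatbot conducts a conversation with a user"
d2 = "the user talks to a chatbot"

# Each document becomes one row of TF-IDF weights.
tfidf = TfidfVectorizer().fit_transform([d1, d2])

# A 1x1 matrix holding the cosine of the angle between the two rows.
print(cosine_similarity(tfidf[0], tfidf[1]))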

For a detailed explanation and a practical example of TF-IDF and cosine similarity, refer to the document below.


Now we have a fair idea of the NLP process. It is time we got to our real task, i.e., chatbot creation. We will name the chatbot here 'ROBO 🤖'.

You can find the entire code, along with the corpus, in the associated GitHub repository here, or you can click the image below to view it on my Binder.

Importing the necessary libraries

import nltk
import numpy as np
import random
import string # to process standard python strings

Corpus


For our example, we will be using the Wikipedia page for chatbots as our corpus. Copy the contents of the page and place them in a text file named "chatbot.txt". However, you can use any corpus of your choice.

Reading in the data


We will read in the chatbot.txt file and convert the entire corpus into a list of sentences and a list of words for further pre-processing.

f = open('chatbot.txt', 'r', errors='ignore')
raw = f.read()
raw = raw.lower()  # converts to lowercase

nltk.download('punkt')    # first-time use only
nltk.download('wordnet')  # first-time use only

sent_tokens = nltk.sent_tokenize(raw)  # converts to list of sentences
word_tokens = nltk.word_tokenize(raw)  # converts to list of words

Let us look at an example of sent_tokens and word_tokens:

sent_tokens[:2]

['a chatbot (also known as a talkbot, chatterbot, bot, im bot, interactive agent, or artificial conversational entity) is a computer program or an artificial intelligence which conducts a conversation via auditory or textual methods.',
 'such programs are often designed to convincingly simulate how a human would behave as a conversational partner, thereby passing the turing test.']

word_tokens[:5]

['a', 'chatbot', '(', 'also', 'known']

Pre-processing the raw text


We shall now define a function called LemTokens which will take the tokens as input and return normalized tokens.

lemmer = nltk.stem.WordNetLemmatizer()
# WordNet is a semantically-oriented dictionary of English included in NLTK.

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

# Maps every punctuation character's code point to None, for str.translate.
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
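
For instance, LemNormalize lowercases the text, strips punctuation, tokenizes, and lemmatizes in a single pass (an illustrative call; WordNetLemmatizer produces noun lemmas by default):

# >>> LemNormalize("The dogs are running!")
# ['the', 'dog', 'are', 'running']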

Keyword matching


Next, we shall define a greeting function for the bot, i.e., if a user's input is a greeting, the bot shall return a greeting response. ELIZA used simple keyword matching for greetings. We will utilize the same concept here.

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey",)
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]

def greeting(sentence):
    # Return a random greeting if any word in the sentence is a known greeting.
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
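
An illustrative call (the reply is chosen at random, and the function returns None when no greeting keyword is found):

# >>> greeting("hi there, robot")
# 'hey'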

Generating Response


To generate a response from our bot for input questions, the concept of document similarity will be used. So we begin by importing the necessary modules.

  • From the scikit-learn library, import the TF-IDF vectorizer to convert a collection of raw documents into a matrix of TF-IDF features.

    from sklearn.feature_extraction.text import TfidfVectorizer

  • Also, import the cosine similarity module from the scikit-learn library:

    from sklearn.metrics.pairwise import cosine_similarity

This will be used to find the similarity between words entered by the user and the words in the corpus. This is the simplest possible implementation of a chatbot.


We define a function response which searches the user's utterance for one or more known keywords and returns one of several possible responses. If it doesn't find any input matching the keywords, it returns the response: "I am sorry! I don't understand you".

def response(user_response):
    robo_response = ''
    # Temporarily add the user's sentence so it is vectorized along with the corpus.
    sent_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # Similarity of the user's sentence (the last row) to every sentence.
    vals = cosine_similarity(tfidf[-1], tfidf)
    # Index of the most similar sentence other than the user's own (which is -1).
    idx = vals.argsort()[0][-2]
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    if(req_tfidf == 0):
        robo_response = robo_response + "I am sorry! I don't understand you"
        return robo_response
    else:
        robo_response = robo_response + sent_tokens[idx]
        return robo_response
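
A hypothetical exchange (the actual reply depends on the contents of chatbot.txt; also note that response appends the user's sentence to sent_tokens, which is why the main loop below removes it again after each turn):

# >>> response("what is a chatbot?")
# 'a chatbot ... is a computer program or an artificial intelligence which
# conducts a conversation via auditory or textual methods.'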

Finally, we will feed in the lines that we want our bot to say while starting and ending a conversation, depending upon the user's input.

flag = True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")

while(flag == True):
    user_response = input()
    user_response=user_response.lower()
    if(user_response!='bye'):
        if(user_response=='thanks' or user_response=='thank you' ):
            flag=False
            print("ROBO: You are welcome..")
        else:
            if(greeting(user_response)!=None):
                print("ROBO: "+greeting(user_response))
            else:
                print("ROBO: ",end="")
                print(response(user_response))
                sent_tokens.remove(user_response)
    else:
        flag=False
        print("ROBO: Bye! take care..")

So that is pretty much it. We have coded our first chatbot in NLTK. Now, let us see how it interacts with humans:


That wasn't too bad. Even though the chatbot couldn't give a satisfactory answer to some questions, it fared pretty well on others.

Conclusion


Though it is a very simple bot with hardly any cognitive skills, it is a good way to get into NLP and learn about chatbots. While 'ROBO' responds to user input, it won't fool your friends, and for a production system you will want to consider one of the existing bot platforms or frameworks. Still, this example should help you think through the design and the challenges of creating a chatbot. The Internet is flooded with resources, and after reading this article I am sure you will want to create a chatbot of your own. So happy tinkering!
