Parul Pandey
Source: eWeek

Gartner estimates that by 2020, chatbots will be handling 85 percent of customer-service interactions; they are already handling about 30 percent of transactions now.

I am sure you've heard about Duolingo: a popular language-learning app that gamifies practicing a new language. It is quite popular due to its innovative style of teaching a foreign language. The concept is simple: five to ten minutes of interactive training a day is enough to learn a language. {#8f64}

However, even though Duolingo is enabling people to learn a new language, its users had a concern. People felt they were missing out on valuable conversational skills since they were learning on their own. People were also apprehensive about being paired with other language learners due to fear of embarrassment. This was turning out to be a big bottleneck in Duolingo's plans. {#b1bf}

So the Duolingo team solved this problem by building a native chatbot within the app, to help users learn conversational skills and practice what they have learned. {#1769}
{#1769} http://bots.duolingo.com/

Since the bots are designed as conversational and friendly, Duolingo learners can practice conversation any time of the day, using their choice of characters, until they feel brave enough to practice their new language with other speakers.
This solved a major consumer pain point and made learning through the app a lot more fun. {#686a}

A chatbot is an artificial intelligence-powered piece of software in a device (Siri, Alexa, Google Assistant, etc.), application, website, or other network that tries to gauge a consumer's needs and then assists them in performing a particular task, such as a commercial transaction, hotel booking, or form submission. Today many companies have a chatbot deployed to engage with their users. Some of the ways in which companies are using chatbots are:{#3ffe}

  • To deliver flight information {#aea3}
  • To connect customers and their finances {#468e}
  • As customer support {#caf0}

The possibilities are (almost) limitless.{#bd13}

The history of chatbots dates back to 1966, when Joseph Weizenbaum created a computer program called ELIZA. It imitated the language of a psychotherapist with only 200 lines of code. You can still converse with it online. {#32e9}
Source: Cognizant

There are broadly two variants of chatbots: Rule-Based and Self-learning. {#7fb3}

  1. In a Rule-based approach, a bot answers questions based on a set of rules on which it is trained. The rules can range from very simple to very complex. Such bots can handle simple queries but fail to manage complex ones. {#df79}
  2. Self-learning bots use Machine Learning-based approaches and are generally more effective than rule-based bots. These bots can be of two further types: Retrieval-Based or Generative. {#ac5a}

i) In retrieval-based models, a chatbot uses some heuristic to select a response from a library of predefined responses. The chatbot uses the message and the context of the conversation to select the best response from a predefined list of bot messages. The context can include the current position in the dialogue tree, all previous messages in the conversation, and previously saved variables (e.g. username). Heuristics for selecting a response can be engineered in many different ways, from rule-based if-else conditional logic to machine learning classifiers.{#2ccd}

ii) Generative bots can generate their answers rather than always replying with one answer from a fixed set. This makes them more intelligent, as they process the query word by word and generate the answer.{#ec03}
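As a rough illustration of the retrieval-based idea in (i), the sketch below picks a reply from a small predefined list using a simple word-overlap heuristic. The response library, the `select_response` helper, and the fallback text are all hypothetical and are not part of the bot built later in this article.

```python
# Hypothetical library of predefined responses, keyed by trigger keywords.
RESPONSES = {
    ("flight", "status", "delay"): "Your flight information is available under 'My Trips'.",
    ("balance", "account", "payment"): "You can check your balance in the 'Accounts' tab.",
    ("refund", "return", "order"): "Refunds are usually processed within a few days.",
}
FALLBACK = "Sorry, I did not understand that."


def select_response(message):
    """Return the predefined response whose keywords overlap most with the message."""
    words = set(message.lower().split())
    best_reply, best_score = FALLBACK, 0
    for keywords, reply in RESPONSES.items():
        score = len(words & set(keywords))  # heuristic: count shared words
        if score > best_score:
            best_reply, best_score = reply, score
    return best_reply


print(select_response("What is the status of my flight"))
print(select_response("Tell me a joke"))
```

The heuristic here is deliberately naive. The chatbot in this article replaces it with TF-IDF vectors and cosine similarity, but the overall shape (score every candidate, return the best one, fall back when nothing matches) stays the same.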

In this article we will build a simple retrieval-based chatbot using the NLTK library in Python.{#0f5f}

Pre-requisites {#e5c2}

Hands-on knowledge of the scikit-learn library and NLTK is assumed. However, if you are new to NLP, you can still read the article and then refer back to the resources.{#cc94}

NLP {#7c53}

The field of study that focuses on the interactions between human language and computers is called Natural Language Processing, or NLP for short. It sits at the intersection of computer science, artificial intelligence, and computational linguistics [Wikipedia]. NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.{#6962}

NLTK: A Brief Intro {#ac14}

NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.{#89e6}

NLTK has been called "a wonderful tool for teaching, and working in, computational linguistics using Python," and "an amazing library to play with natural language."{#2a98}

Natural Language Processing with Python provides a practical introduction to programming for language processing. I highly recommend this book to people beginning in NLP with Python.{#4fd9}

Downloading and installing NLTK {#cc4d}

  1. Install NLTK: run pip install nltk{#0d87} {#0d87}
  2. Test installation: run python then type import nltk{#a6de} {#a6de}

For platform-specific instructions, read

Installing NLTK Packages {#da1b}

Import NLTK and run nltk.download(). This will open the NLTK downloader, from which you can choose the corpora and models to download. You can also download all packages at once.{#6153}

Text Pre-Processing with NLTK {#8efc}

The main issue with text data is that it is all in text format (strings). However, machine learning algorithms need some sort of numerical feature vector in order to perform their task. So before we start with any NLP project, we need to pre-process the text to make it suitable for work. Basic text pre-processing includes:{#ce5e}

  • Converting the entire text into uppercase or lowercase, so that the algorithm does not treat the same words in different cases as different. {#5eaf}
  • Tokenization: Tokenization is the process of converting normal text strings into a list of tokens, i.e., the words that we actually want. A sentence tokenizer can be used to find the list of sentences, and a word tokenizer can be used to find the list of words in strings. {#618c}

The NLTK data package includes a pre-trained Punkt tokenizer for English. {#67fa}

  • Removing noise, i.e., everything that isn't a standard number or letter. {#c702}
  • Removing stop words. Sometimes, extremely common words that appear to be of little value in helping select documents matching a user's need are excluded from the vocabulary entirely. These words are called stop words. {#8609}
  • Stemming: Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form. For example, if we were to stem the words "Stems", "Stemming", "Stemmed", and "Stemtization", the result would be the single word "stem". {#bec2}
  • Lemmatization: A slight variant of stemming is lemmatization. The major difference between the two is that stemming can often create non-existent words, whereas lemmas are actual words. So your root stem, meaning the word you end up with, is not necessarily something you can look up in a dictionary, but you can look up a lemma. For example, "run" is the base form of words like "running" or "ran", and the words "better" and "good" share the same lemma, so they are considered the same. {#c9c5}
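The first few steps above (case folding, tokenization, noise removal, and stop-word removal) can be sketched in plain Python. This is only a stand-in to show the idea: the tiny stop-word list and the `preprocess` helper are assumptions made for illustration, and NLTK's tokenizers and stop-word corpus are far more thorough. Stemming and lemmatization are left out here and handled by NLTK later in the article.

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK ships a much fuller one).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}


def preprocess(text):
    """Lowercase, split into sentences and words, drop noise and stop words."""
    text = text.lower()
    # Naive sentence tokenization: split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    cleaned = []
    for sentence in sentences:
        # Word tokenization plus noise removal: keep only runs of letters and digits.
        words = re.findall(r"[a-z0-9]+", sentence)
        cleaned.append([w for w in words if w not in STOP_WORDS])
    return cleaned


print(preprocess("The chatbot is GREAT! It answers 24 questions, and it learns."))
```

Each inner list holds the surviving tokens of one sentence, which is the shape the later vectorization steps expect.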

Bag of Words {#19c7}

After the initial preprocessing phase, we need to transform the text into a meaningful vector (or array) of numbers. The bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:{#114f}

• A vocabulary of known words.{#6fab}

• A measure of the presence of known words.{#82cf}

Why is it called a "bag" of words? That is because any information about the order or structure of words in the document is discarded and the model is only concerned with whether the known words occur in the document, not where they occur in the document. {#b708}

The intuition behind the Bag of Words is that documents are similar if they have similar content. Also, we can learn something about the meaning of the document from its content alone.{#5b31}

For example, if our dictionary contains the words {Learning, is, the, not, great}, and we want to vectorize the text "Learning is great", we would have the following vector: (1, 1, 0, 0, 1).{#519f}
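The vectorization in this example can be reproduced with a few lines of plain Python. The `bag_of_words` helper below is a hypothetical name used only for this sketch. It records presence (0 or 1) rather than counts, matching the vector given above.

```python
def bag_of_words(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in the text."""
    words = set(text.lower().split())
    return tuple(1 if term.lower() in words else 0 for term in vocabulary)


vocabulary = ["Learning", "is", "the", "not", "great"]
print(bag_of_words("Learning is great", vocabulary))  # (1, 1, 0, 0, 1)
# Word order is discarded, so a shuffled sentence gives the same vector.
print(bag_of_words("great is Learning", vocabulary))  # (1, 1, 0, 0, 1)
```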

TF-IDF Approach {#34c2}

A problem with the Bag of Words approach is that highly frequent words start to dominate in the document (e.g. larger score), but may not contain as much "informational content". Also, it will give more weight to longer documents than shorter documents.{#be02}

One approach is to rescale the frequency of words by how often they appear in all documents, so that the scores for frequent words like "the" that are also frequent across all documents are penalized. This approach to scoring is called Term Frequency-Inverse Document Frequency, or TF-IDF for short, where:{#5e96}

Term Frequency: a scoring of the frequency of the word in the current document.{#a946}

TF = (Number of times term t appears in a document) / (Number of terms in the document)

Inverse Document Frequency: a scoring of how rare the word is across documents.{#07a0}

IDF = 1 + log(N/n), where N is the number of documents and n is the number of documents in which the term t has appeared.

The TF-IDF weight is often used in information retrieval and text mining. It is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.{#dfb4}

Example: {#5ef6}

Consider a document containing 100 words wherein the word 'phone' appears 5 times.{#1d86}

The term frequency (i.e., TF) for phone is then (5 / 100) = 0.05. Now, assume we have 10 million documents and the word phone appears in one thousand of these. Using a base-10 logarithm and, for simplicity, leaving out the constant 1 from the IDF formula above, the inverse document frequency (i.e., IDF) is log(10,000,000 / 1,000) = 4. Thus, the TF-IDF weight is the product of these quantities: 0.05 * 4 = 0.20. (With the constant 1 included, the IDF would be 5 and the weight 0.25.){#ecc2}
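The arithmetic of this example can be checked with a short script. Note the assumptions: it uses a base-10 logarithm and no added constant, which is what the numbers in the example correspond to. The function names are hypothetical and chosen only for this sketch.

```python
import math


def term_frequency(term_count, total_terms):
    """Share of the document's terms taken up by the given term."""
    return term_count / total_terms


def inverse_document_frequency(num_documents, documents_with_term):
    """Plain base-10 IDF as used in the worked example (no added constant)."""
    return math.log10(num_documents / documents_with_term)


tf = term_frequency(5, 100)                           # 'phone' appears 5 times in 100 words
idf = inverse_document_frequency(10_000_000, 1_000)   # 1,000 of 10 million documents
print(tf, idf, tf * idf)
```

Libraries differ in the logarithm base and smoothing they apply, so the absolute values from a library such as scikit-learn will not match these numbers even though the ranking idea is the same.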

Tf-IDF can be implemented in scikit learn as:{#dc73}

from sklearn.feature_extraction.text import TfidfVectorizer{#9f2c}

Cosine Similarity {#2a22}

TF-IDF is a transformation applied to texts to get real-valued vectors in vector space. We can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing that by the product of their norms. That yields the cosine of the angle between the vectors. Cosine similarity is a measure of similarity between two non-zero vectors. Using this formula we can find out the similarity between any two documents d1 and d2.{#4aa1}

Cosine Similarity (d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)

where d1 and d2 are two non-zero vectors.{#f7b3}
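The formula translates almost directly into code. The sketch below uses plain Python lists and a hypothetical helper named `cosine_sim` (so that it does not clash with scikit-learn's `cosine_similarity`, which is used later). It assumes both vectors are non-zero and of equal length.

```python
import math


def cosine_sim(d1, d2):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)


print(cosine_sim([1, 1, 0, 0, 1], [1, 1, 0, 0, 1]))  # same direction: close to 1
print(cosine_sim([1, 1, 0, 0, 1], [1, 0, 0, 0, 0]))  # partial overlap: between 0 and 1
print(cosine_sim([1, 0], [0, 1]))                    # no shared terms: 0
```

Because the result depends only on the angle between the vectors, a long document and a short one that use words in the same proportions still score as highly similar.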

For a detailed explanation and practical example of TF-IDF and Cosine Similarity refer to the document below. {#9dd3}

Now we have a fair idea of the NLP process. It is time to get to our real task, i.e., chatbot creation. We will name the chatbot 'ROBO 🤖'. {#8766}

You can find the entire code with the corpus at the associated GitHub repository here, or you can view it on my Binder by clicking the image below.{#5e5f}

Importing the necessary libraries {#1afa}

import nltk
import numpy as np
import random
import string # to process standard python strings

Corpus {#a20e}

For our example, we will be using the Wikipedia page for chatbots as our corpus. Copy the contents from the page and place them in a text file named 'chatbot.txt'. However, you can use any corpus of your choice.{#ab99}

Reading in the data {#654d}

We will read in the chatbot.txt file and convert the entire corpus into a list of sentences and a list of words for further pre-processing.{#ddc3}

with open('chatbot.txt', 'r', errors='ignore') as f:
    raw = f.read()
raw = raw.lower()  # converts to lowercase
nltk.download('punkt')  # first-time use only
nltk.download('wordnet')  # first-time use only
sent_tokens = nltk.sent_tokenize(raw)  # converts to list of sentences
word_tokens = nltk.word_tokenize(raw)  # converts to list of words

Let us see an example of the sent_tokens and the word_tokens:{#de78}

sent_tokens[:2]
['a chatbot (also known as a talkbot, chatterbot, bot, im bot, interactive agent, or artificial conversational entity) is a computer program or an artificial intelligence which conducts a conversation via auditory or textual methods.',
 'such programs are often designed to convincingly simulate how a human would behave as a conversational partner, thereby passing the turing test.']

word_tokens[:5]
['a', 'chatbot', '(', 'also', 'known']

Pre-processing the raw text {#d5cc}

We shall now define a function called LemTokens which will take as input the tokens and return normalized tokens.{#7e86}

lemmer = nltk.stem.WordNetLemmatizer()
# WordNet is a semantically-oriented dictionary of English included in NLTK.
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

Keyword matching {#e71c}

Next, we shall define a function for a greeting by the bot, i.e., if a user's input is a greeting, the bot shall return a greeting response. ELIZA uses simple keyword matching for greetings. We will utilize the same concept here.{#0876}

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "what's up", "hey",)
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]
def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
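One caveat with splitting on whitespace is that input such as "Hello!" keeps its punctuation and will not match any keyword. A possible variant, shown here only as a sketch (the name `greeting_normalized` and the shortened keyword lists are hypothetical), strips punctuation before matching:

```python
import random
import string

GREETING_INPUTS = ("hello", "hi", "greetings", "sup", "hey")
GREETING_RESPONSES = ["hi", "hey", "*nods*", "hi there", "hello"]

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def greeting_normalized(sentence):
    """Return a random greeting if any word, minus punctuation, is a known greeting."""
    for word in sentence.lower().translate(_STRIP_PUNCT).split():
        if word in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)
    return None  # no greeting keyword found


print(greeting_normalized("Hello!"))
print(greeting_normalized("Tell me about chatbots"))
```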

Generating Response {#1ce3}

To generate a response from our bot for input questions, the concept of document similarity will be used. So we begin by importing the necessary modules.{#599a}

  • From the scikit-learn library, import the TF-IDF vectorizer to convert a collection of raw documents to a matrix of TF-IDF features. {#9af8}

    from sklearn.feature_extraction.text import TfidfVectorizer

  • Also, import the cosine similarity module from the scikit-learn library. {#4490}

    from sklearn.metrics.pairwise import cosine_similarity

This will be used to find the similarity between words entered by the user and the words in the corpus. This is the simplest possible implementation of a chatbot.{#b9a4}

We define a function response which searches the user's utterance for one or more known keywords and returns one of several possible responses. If it doesn't find the input matching any of the keywords, it returns the response: "I am sorry! I don't understand you".{#af78}

def response(user_response):
    robo_response = ''
    sent_tokens.append(user_response)  # the query becomes the last "document"
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf)  # query vs. every sentence
    idx = vals.argsort()[0][-2]  # best match other than the query itself
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]  # similarity score of that best match
    if req_tfidf == 0:
        robo_response = robo_response + "I am sorry! I don't understand you"
    else:
        robo_response = robo_response + sent_tokens[idx]
    return robo_response

Finally, we will feed the lines that we want our bot to say while starting and ending a conversation depending upon the user's input.{#dcf9}

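flag = True
print("ROBO: My name is Robo. I will answer your queries about Chatbots. If you want to exit, type Bye!")
while flag:
    user_response = input().lower()
    if user_response == 'bye':
        flag = False
        print("ROBO: Bye! take care..")
    elif user_response in ('thanks', 'thank you'):
        print("ROBO: You are welcome..")
    elif greeting(user_response) is not None:
        print("ROBO: " + greeting(user_response))
    else:
        print("ROBO: " + response(user_response))
        sent_tokens.pop()  # drop the query that response() appended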

So that's pretty much it. We have coded our first chatbot in NLTK. Now, let us see how it interacts with humans:{#7c96}

This wasn't too bad. Even though the chatbot couldn't give a satisfactory answer for some questions, it fared pretty well on others.{#f5be}

Conclusion {#94dd}

Though it is a very simple bot with hardly any cognitive skills, it's a good way to get into NLP and get to know about chatbots. Though 'ROBO' responds to user input, it won't fool your friends, and for a production system you'll want to consider one of the existing bot platforms or frameworks, but this example should help you think through the design and challenges of creating a chatbot. The Internet is flooded with resources, and after reading this article I am sure you will want to create a chatbot of your own. So happy tinkering!!{#3c38}