Text Recognition (Natural Language Processing, NLP)
Contents: Speech Recognition; NLTK - Natural Language Toolkit; Tokenization; Stemming; Lemmatization; Bag of Words; Term Frequency; Document Frequency (DF); Inverse Document Frequency (IDF); Term Frequency-Inverse Document Frequency (TF-IDF); Sentiment Analysis with Multinomial Naive Bayes; Topic Extraction
Speech Recognition
Speech -----------------------> Text ---------------------> Semantics
NLTK - Natural Language Toolkit
Tokenization
import nltk.tokenize as tk
tk.sent_tokenize(text) -> list of sentences
tk.word_tokenize(text) -> list of words
tokenizer = tk.WordPunctTokenizer() -> slightly different (splits "'s" into "'" and "s")
tokenizer.tokenize(text) -> list of words
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.tokenize as tk

doc = "Are you curious about tokenization? " \
      "Let's see how it works! " \
      "We need to analyze a couple of sentences " \
      "with punctuations to see it in action."
print(doc)
# Split the document into sentences
tokens = tk.sent_tokenize(doc, language='english')
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
# Split the document into words ("Let's" becomes "Let" + "'s")
tokens = tk.word_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
# WordPunctTokenizer splits on punctuation ("Let's" -> "Let", "'", "s")
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
Stemming
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
Note: the extracted stem is not necessarily a complete word; it may be only a fragment of one.
pt.PorterStemmer() -> Porter stemmer, relatively lenient
lc.LancasterStemmer() -> Lancaster stemmer, relatively strict
sb.SnowballStemmer(language) -> Snowball stemmer, somewhere in between
<stemmer>.stem(word) -> stem
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
pt_stemmer = pt.PorterStemmer()
lc_stemmer = lc.LancasterStemmer()
sb_stemmer = sb.SnowballStemmer('english')
for word in words:
    pt_stem = pt_stemmer.stem(word)
    lc_stem = lc_stemmer.stem(word)
    sb_stem = sb_stemmer.stem(word)
    print('%8s %8s %8s %8s' % (
        word, pt_stem, lc_stem, sb_stem))
Lemmatization
Nouns: plural -> singular
Verbs: participle -> base form
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.stem as ns

words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
lemmatizer = ns.WordNetLemmatizer()
for word in words:
    # Lemmatize each word once as a noun and once as a verb
    n_lemma = lemmatizer.lemmatize(word, pos='n')
    v_lemma = lemmatizer.lemmatize(word, pos='v')
    print('%8s %8s %8s' % (word, n_lemma, v_lemma))
Bag of Words
Similar words tend to appear in sentences with similar meanings. Following the idea that similar inputs map to similar outputs, count how many times each word in the vocabulary occurs in each sample; from these count patterns, similar sentences can be found, which is how, for example, a chatbot can pick a suitable reply.
The brown dog is running. The black dog is in the black room. Running in the room is forbidden.
1 The brown dog is running
2 The black dog is in the black room
3 Running in the room is forbidden
    the  brown  dog  is  running  black  in  room  forbidden
1    1     1     1    1     1       0     0    0       0
2    2     0     1    1     0       2     1    1       0
3    1     0     0    1     1       0     1    1       1
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
# Split the document into sentences
sentences = tk.sent_tokenize(doc)
print(sentences)
# Count vectorizer: one row per sentence, one column per word
cv = ft.CountVectorizer()
# fit_transform returns a sparse matrix; toarray() makes it a dense count matrix
bow = cv.fit_transform(sentences).toarray()
print(bow)
# Vocabulary, in the column order of the matrix
# (newer sklearn versions use get_feature_names_out() instead)
words = cv.get_feature_names()
print(words)
Term Frequency (TF)
Term frequency is the row-normalized bag-of-words matrix: dividing each count by its row's total turns raw word counts into the frequency with which each word appears in that sample.
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
import sklearn.preprocessing as sp

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
# Count-based feature extraction (bag of words)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
words = cv.get_feature_names()
print(words)
# Term frequency: L1-normalize each row of the count matrix
tf = sp.normalize(bow, norm='l1')
print(tf)
Document Frequency (DF)
For each word in the vocabulary, divide the number of samples containing the word by the total number of samples. The rarer the word, the smaller its document frequency, and that rarity is what makes the word characteristic of the documents in which it does appear.
Inverse Document Frequency (IDF)
The higher the inverse document frequency, the lower the document frequency: the word is rarer and contributes more to telling documents apart.
Higher term frequency ----------------------------------------------> greater contribution to semantic expressiveness
Term Frequency-Inverse Document Frequency (TF-IDF)
Term frequency multiplied by inverse document frequency jointly reflects how much a word contributes both to a sample's semantic expressiveness and to its distinguishability.
Each element of the term-frequency matrix is multiplied by the corresponding word's inverse document frequency; the larger the result, the more that word contributes to the sample's meaning. A learning model can then be built from each word's contribution.
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
# Feature extractor: count how many times each word appears in each sentence
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
# Vocabulary corresponding to the matrix columns
words = cv.get_feature_names()
print(words)
tt = ft.TfidfTransformer()
# Term frequency-inverse document frequency matrix
tfidf = tt.fit_transform(bow).toarray()
print(tfidf)
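For intuition, the TF-IDF matrix above can also be reproduced by hand. The sketch below is a minimal illustration assuming sklearn's default TfidfTransformer settings (raw counts as term frequency, smoothed IDF, L2 row normalization); the names df, idf and tfidf are illustrative.
# -*- coding: utf-8 -*-
# Minimal sketch: recompute TF-IDF by hand, assuming sklearn's defaults
# (smooth_idf=True, norm='l2'); df/idf follow the definitions above.
import numpy as np
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
sentences = tk.sent_tokenize(doc)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()

n = bow.shape[0]                          # number of samples (sentences)
df = np.count_nonzero(bow, axis=0)        # document frequency of each word
idf = np.log((1 + n) / (1 + df)) + 1      # smoothed inverse document frequency
tfidf = bow * idf                         # term frequency x IDF
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row
print(tfidf)                              # matches the TfidfTransformer output above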
Sentiment Analysis with Multinomial Naive Bayes
Multinomial naive Bayes classifier
Through supervised learning, key words are tied to sentiment labels; an unseen sentence is then judged positive or negative by matching the words it contains.
Sentiment analysis
A B C
1 2 3 -> {'A': 1, 'C': 3, 'B': 2}
4 5 6 -> {'C': 6, 'A': 4, 'B': 5}
7 8 9 ...
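A minimal sketch of that row-to-dictionary conversion, using the made-up header and rows from the illustration above:
# Minimal sketch: turn table rows into the feature dictionaries that
# NLTK classifiers expect; the header and rows mirror the illustration above.
header = ['A', 'B', 'C']
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
samples = [dict(zip(header, row)) for row in rows]
print(samples)  # [{'A': 1, 'B': 2, 'C': 3}, {'A': 4, 'B': 5, 'C': 6}, ...]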
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.corpus as nc
import nltk.classify as cf
import nltk.classify.util as cu

# List of (word-dictionary, label) pairs for the positive reviews
pdata = []
# Built-in data: the positive movie reviews
fileids = nc.movie_reviews.fileids('pos')
for fileid in fileids:
    # nc.movie_reviews.words: instance method of
    # nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader
    words = nc.movie_reviews.words(fileid)
    # Dictionary marking every word that occurs in this review
    sample = {}
    for word in words:
        sample[word] = True
    pdata.append((sample, 'POSITIVE'))
# List of (word-dictionary, label) pairs for the negative reviews
ndata = []
# Built-in data: the negative movie reviews
fileids = nc.movie_reviews.fileids('neg')
for fileid in fileids:
    words = nc.movie_reviews.words(fileid)
    sample = {}
    for word in words:
        sample[word] = True
    ndata.append((sample, 'NEGATIVE'))
# Split into training and test sets (no cross-validation here)
pnumb, nnumb = int(0.8 * len(pdata)), int(0.8 * len(ndata))
train_data = pdata[:pnumb] + ndata[:nnumb]
test_data = pdata[pnumb:] + ndata[nnumb:]
# Train the naive Bayes classification model (NLTK's classifier)
model = cf.NaiveBayesClassifier.train(train_data)
# Evaluate the model's accuracy on the test set
ac = cu.accuracy(model, test_data)
print('%.2f%%' % round(ac * 100, 2))
# Most informative features
tops = model.most_informative_features()
for top in tops[:5]:
    print(top[0])
reviews = [
    'It is an amazing movie.',
    'This is a dull movie. I would never recommend it to anyone.',
    'The cinematography is pretty great in this movie.',
    'The direction was terrible and the story was all over the place.']
sents, probs = [], []
# Build the word dictionary for each review; here the text is simply
# split on whitespace instead of using a proper tokenizer
for review in reviews:
    words = review.split()
    sample = {}
    for word in words:
        sample[word] = True
    # Probability distribution over the classes
    pcls = model.prob_classify(sample)
    # Predicted class
    sent = pcls.max()
    # Probability (confidence) of the predicted class
    prob = pcls.prob(sent)
    sents.append(sent)
    probs.append(prob)
for review, sent, prob in zip(
        reviews, sents, probs):
    print(review, '->', sent, '%.2f%%' % round(
        prob * 100, 2))
Topic Extraction
Code: topic.py
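The referenced topic.py is not included here. As a stand-in, the sketch below shows one common way to extract topics, using sklearn's LatentDirichletAllocation on a bag-of-words matrix; the toy corpus and the choice of two topics are assumptions for illustration only.
# -*- coding: utf-8 -*-
# Minimal topic-extraction sketch (not the original topic.py):
# LDA on a bag-of-words matrix; the corpus and n_components are illustrative.
import sklearn.feature_extraction.text as ft
import sklearn.decomposition as dc

docs = ['The dog runs in the park and plays with another dog.',
        'Cats and dogs are common household pets.',
        'The stock market fell sharply amid economic worries.',
        'Investors are watching interest rates and inflation.']
cv = ft.CountVectorizer(stop_words='english')
bow = cv.fit_transform(docs)
lda = dc.LatentDirichletAllocation(n_components=2, random_state=7)
lda.fit(bow)
# Vocabulary (newer sklearn versions use get_feature_names_out())
words = cv.get_feature_names()
# Print the highest-weighted words of each topic
for i, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]
    print('Topic %d:' % (i + 1), [words[j] for j in top])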
Text classification is usually trained with a statistics-based classifier, since natural language shows strong statistical regularities (see the sketch after the matrix below).
Code: doc.py
1 2 3 4 5 6
2 3 0 0 1 4
0 4 1 1 2 2
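The referenced doc.py is likewise not included. Below is a minimal, hedged sketch of statistics-based text classification built from the pieces already shown (CountVectorizer, TfidfTransformer) plus sklearn's MultinomialNB; the tiny corpus and its labels are made up.
# -*- coding: utf-8 -*-
# Minimal text-classification sketch (not the original doc.py):
# bag-of-words + TF-IDF features fed to a multinomial naive Bayes classifier.
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb

train_docs = ['The dog is running in the room.',
              'A black dog plays with a brown dog.',
              'The market rallied after the earnings report.',
              'Stocks fell as interest rates rose.']
train_labels = ['animals', 'animals', 'finance', 'finance']

cv = ft.CountVectorizer()
bow = cv.fit_transform(train_docs)
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)

model = nb.MultinomialNB()
model.fit(tfidf, train_labels)

test_docs = ['The brown dog is in the room.',
             'Interest rates and the stock market.']
test_tfidf = tt.transform(cv.transform(test_docs))
print(model.predict(test_tfidf))  # likely ['animals' 'finance'] on this toy data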
Gender Identification
Code: gndr.py
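The referenced gndr.py is not included either. A common approach, following the classic NLTK names-corpus example, is to classify a name's gender from its last letters; the feature choice below (last one and two letters) is an assumption, not the original script.
# -*- coding: utf-8 -*-
# Minimal gender-identification sketch (not the original gndr.py):
# classify names as male/female from their last letters, using the
# NLTK names corpus and a naive Bayes classifier.
import random
import nltk.corpus as nc
import nltk.classify as cf
import nltk.classify.util as cu

data = []
for label, fileid in (('male', 'male.txt'), ('female', 'female.txt')):
    for name in nc.names.words(fileid):
        # Feature dictionary: the name's last one and two letters
        sample = {'last1': name[-1].lower(), 'last2': name[-2:].lower()}
        data.append((sample, label))
random.seed(7)
random.shuffle(data)
split = int(0.8 * len(data))
train_data, test_data = data[:split], data[split:]
model = cf.NaiveBayesClassifier.train(train_data)
print('%.2f%%' % round(cu.accuracy(model, test_data) * 100, 2))
for name in ['Olivia', 'Ethan', 'Sakura']:
    sample = {'last1': name[-1].lower(), 'last2': name[-2:].lower()}
    print(name, '->', model.classify(sample))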