Text Recognition (Natural Language Processing, NLP)
Contents: Speech Recognition; NLTK - Natural Language Toolkit; Tokenization; Stemming; Lemmatization; Bag of Words; Term Frequency; Document Frequency (DF); Inverse Document Frequency (IDF); Term Frequency-Inverse Document Frequency (TF-IDF); Sentiment Analysis with Multinomial Naive Bayes; Topic Extraction
Speech Recognition
Speech -----------------------> Text ---------------------> Semantics
NLTK - Natural Language Toolkit
Tokenization
import nltk.tokenize as tk
tk.sent_tokenize(text) -> list of sentences
tk.word_tokenize(text) -> list of words
tokenizer = tk.WordPunctTokenizer() -> slightly different (splits "'s" into "'" and "s")
tokenizer.tokenize(text) -> list of words
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.tokenize as tk

doc = "Are you curious about tokenization? " \
      "Let's see how it works! " \
      "We need to analyze a couple of sentences " \
      "with punctuations to see it in action."
print(doc)
# Split the document into sentences
tokens = tk.sent_tokenize(doc, language='english')
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
# Split the document into words ("Let's" becomes "Let" + "'s")
tokens = tk.word_tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
print('-' * 15)
# WordPunctTokenizer splits on punctuation ("Let's" -> "Let", "'", "s")
tokenizer = tk.WordPunctTokenizer()
tokens = tokenizer.tokenize(doc)
for i, token in enumerate(tokens):
    print("%2d" % (i + 1), token)
Stemming
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb
Note: the extracted stem is not necessarily a complete word; it may be only a fragment of one.
pt.PorterStemmer() -> Porter stemmer, relatively lenient
lc.LancasterStemmer() -> Lancaster stemmer, relatively strict
sb.SnowballStemmer(language) -> Snowball stemmer, somewhere in between
<stemmer>.stem(word) -> stem
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.stem.porter as pt
import nltk.stem.lancaster as lc
import nltk.stem.snowball as sb

words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
pt_stemmer = pt.PorterStemmer()
lc_stemmer = lc.LancasterStemmer()
sb_stemmer = sb.SnowballStemmer('english')
for word in words:
    pt_stem = pt_stemmer.stem(word)
    lc_stem = lc_stemmer.stem(word)
    sb_stem = sb_stemmer.stem(word)
    print('%8s %8s %8s %8s' % (
        word, pt_stem, lc_stem, sb_stem))
Lemmatization
Nouns: plural -> singular
Verbs: participle -> base form
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.stem as ns

words = ['table', 'probably', 'wolves', 'playing',
         'is', 'dog', 'the', 'beaches', 'grounded',
         'dreamt', 'envision']
lemmatizer = ns.WordNetLemmatizer()
for word in words:
    # Lemmatize each word once as a noun and once as a verb
    n_lemma = lemmatizer.lemmatize(word, pos='n')
    v_lemma = lemmatizer.lemmatize(word, pos='v')
    print('%8s %8s %8s' % (word, n_lemma, v_lemma))
Bag of Words
Similar words tend to appear in sentences with similar meanings. Following the idea that similar inputs map to similar outputs, count how many times each word in the vocabulary occurs in each sample; from these count patterns, similar sentences can be found, which is how, for example, a chatbot can pick a suitable reply.
The brown dog is running. The black dog is in the black room. Running in the room is forbidden.
1 The brown dog is running
2 The black dog is in the black room
3 Running in the room is forbidden
    the  brown  dog  is  running  black  in  room  forbidden
1    1     1     1    1     1       0     0    0       0
2    2     0     1    1     0       2     1    1       0
3    1     0     0    1     1       0     1    1       1
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
# Split the document into sentences
sentences = tk.sent_tokenize(doc)
print(sentences)
# Count vectorizer: one row per sentence, one column per word
cv = ft.CountVectorizer()
# fit_transform returns a sparse matrix; toarray() makes it a dense count matrix
bow = cv.fit_transform(sentences).toarray()
print(bow)
# Vocabulary, in the column order of the matrix
# (newer sklearn versions use get_feature_names_out() instead)
words = cv.get_feature_names()
print(words)
Term Frequency (TF)
Term frequency is the row-normalized bag-of-words matrix: dividing each count by its row's total turns raw word counts into the frequency with which each word appears in that sample.
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft
import sklearn.preprocessing as sp

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
# Count-based feature extraction (bag of words)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
words = cv.get_feature_names()
print(words)
# Term frequency: L1-normalize each row of the count matrix
tf = sp.normalize(bow, norm='l1')
print(tf)
Document Frequency (DF)
For each word in the vocabulary, divide the number of samples containing the word by the total number of samples. The rarer the word, the smaller its document frequency, and that rarity is what makes the word characteristic of the documents in which it does appear.
Inverse Document Frequency (IDF)
The higher the inverse document frequency, the lower the document frequency: the word is rarer and contributes more to telling documents apart.
Higher term frequency ----------------------------------------------> greater contribution to semantic expressiveness
Term Frequency-Inverse Document Frequency (TF-IDF)
Term frequency multiplied by inverse document frequency jointly reflects how much a word contributes both to a sample's semantic expressiveness and to its distinguishability.
Each element of the term-frequency matrix is multiplied by the corresponding word's inverse document frequency; the larger the result, the more that word contributes to the sample's meaning. A learning model can then be built from each word's contribution.
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
print(doc)
sentences = tk.sent_tokenize(doc)
print(sentences)
# Feature extractor: count how many times each word appears in each sentence
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()
print(bow)
# Vocabulary corresponding to the matrix columns
words = cv.get_feature_names()
print(words)
tt = ft.TfidfTransformer()
# Term frequency-inverse document frequency matrix
tfidf = tt.fit_transform(bow).toarray()
print(tfidf)
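For intuition, the TF-IDF matrix above can also be reproduced by hand. The sketch below is a minimal illustration assuming sklearn's default TfidfTransformer settings (raw counts as term frequency, smoothed IDF, L2 row normalization); the names df, idf and tfidf are illustrative.
# -*- coding: utf-8 -*-
# Minimal sketch: recompute TF-IDF by hand, assuming sklearn's defaults
# (smooth_idf=True, norm='l2'); df/idf follow the definitions above.
import numpy as np
import nltk.tokenize as tk
import sklearn.feature_extraction.text as ft

doc = 'The brown dog is running. ' \
      'The black dog is in the black room. ' \
      'Running in the room is forbidden.'
sentences = tk.sent_tokenize(doc)
cv = ft.CountVectorizer()
bow = cv.fit_transform(sentences).toarray()

n = bow.shape[0]                          # number of samples (sentences)
df = np.count_nonzero(bow, axis=0)        # document frequency of each word
idf = np.log((1 + n) / (1 + df)) + 1      # smoothed inverse document frequency
tfidf = bow * idf                         # term frequency x IDF
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row
print(tfidf)                              # matches the TfidfTransformer output above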
Sentiment Analysis with Multinomial Naive Bayes
Multinomial naive Bayes classifier
Through supervised learning, key words are tied to sentiment labels; an unseen sentence is then judged positive or negative by matching the words it contains.
Sentiment analysis
A B C
1 2 3 -> {'A': 1, 'C': 3, 'B': 2}
4 5 6 -> {'C': 6, 'A': 4, 'B': 5}
7 8 9 ...
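A minimal sketch of that row-to-dictionary conversion, using the made-up header and rows from the illustration above:
# Minimal sketch: turn table rows into the feature dictionaries that
# NLTK classifiers expect; the header and rows mirror the illustration above.
header = ['A', 'B', 'C']
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
samples = [dict(zip(header, row)) for row in rows]
print(samples)  # [{'A': 1, 'B': 2, 'C': 3}, {'A': 4, 'B': 5, 'C': 6}, ...]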
Code:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import nltk.corpus as nc
import nltk.classify as cf
import nltk.classify.util as cu

# List of (word-dictionary, label) pairs for the positive reviews
pdata = []
# Built-in data: the positive movie reviews
fileids = nc.movie_reviews.fileids('pos')
for fileid in fileids:
    # nc.movie_reviews.words: instance method of
    # nltk.corpus.reader.plaintext.CategorizedPlaintextCorpusReader
    words = nc.movie_reviews.words(fileid)
    # Dictionary marking every word that occurs in this review
    sample = {}
    for word in words:
        sample[word] = True
    pdata.append((sample, 'POSITIVE'))
# List of (word-dictionary, label) pairs for the negative reviews
ndata = []
# Built-in data: the negative movie reviews
fileids = nc.movie_reviews.fileids('neg')
for fileid in fileids:
    words = nc.movie_reviews.words(fileid)
    sample = {}
    for word in words:
        sample[word] = True
    ndata.append((sample, 'NEGATIVE'))
# Split into training and test sets (no cross-validation here)
pnumb, nnumb = int(0.8 * len(pdata)), int(0.8 * len(ndata))
train_data = pdata[:pnumb] + ndata[:nnumb]
test_data = pdata[pnumb:] + ndata[nnumb:]
# Train the naive Bayes classification model (NLTK's classifier)
model = cf.NaiveBayesClassifier.train(train_data)
# Evaluate the model's accuracy on the test set
ac = cu.accuracy(model, test_data)
print('%.2f%%' % round(ac * 100, 2))
# Most informative features
tops = model.most_informative_features()
for top in tops[:5]:
    print(top[0])
reviews = [
    'It is an amazing movie.',
    'This is a dull movie. I would never recommend it to anyone.',
    'The cinematography is pretty great in this movie.',
    'The direction was terrible and the story was all over the place.']
sents, probs = [], []
# Build the word dictionary for each review; here the text is simply
# split on whitespace instead of using a proper tokenizer
for review in reviews:
    words = review.split()
    sample = {}
    for word in words:
        sample[word] = True
    # Probability distribution over the classes
    pcls = model.prob_classify(sample)
    # Predicted class
    sent = pcls.max()
    # Probability (confidence) of the predicted class
    prob = pcls.prob(sent)
    sents.append(sent)
    probs.append(prob)
for review, sent, prob in zip(
        reviews, sents, probs):
    print(review, '->', sent, '%.2f%%' % round(
        prob * 100, 2))
Topic Extraction
Code: topic.py
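The referenced topic.py is not included here. As a stand-in, the sketch below shows one common way to extract topics, using sklearn's LatentDirichletAllocation on a bag-of-words matrix; the toy corpus and the choice of two topics are assumptions for illustration only.
# -*- coding: utf-8 -*-
# Minimal topic-extraction sketch (not the original topic.py):
# LDA on a bag-of-words matrix; the corpus and n_components are illustrative.
import sklearn.feature_extraction.text as ft
import sklearn.decomposition as dc

docs = ['The dog runs in the park and plays with another dog.',
        'Cats and dogs are common household pets.',
        'The stock market fell sharply amid economic worries.',
        'Investors are watching interest rates and inflation.']
cv = ft.CountVectorizer(stop_words='english')
bow = cv.fit_transform(docs)
lda = dc.LatentDirichletAllocation(n_components=2, random_state=7)
lda.fit(bow)
# Vocabulary (newer sklearn versions use get_feature_names_out())
words = cv.get_feature_names()
# Print the highest-weighted words of each topic
for i, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]
    print('Topic %d:' % (i + 1), [words[j] for j in top])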
Text classification is usually trained with a statistics-based classifier, since natural language shows strong statistical regularities (see the sketch after the matrix below).
Code: doc.py
1 2 3 4 5 6
2 3 0 0 1 4
0 4 1 1 2 2
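The referenced doc.py is likewise not included. Below is a minimal, hedged sketch of statistics-based text classification built from the pieces already shown (CountVectorizer, TfidfTransformer) plus sklearn's MultinomialNB; the tiny corpus and its labels are made up.
# -*- coding: utf-8 -*-
# Minimal text-classification sketch (not the original doc.py):
# bag-of-words + TF-IDF features fed to a multinomial naive Bayes classifier.
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb

train_docs = ['The dog is running in the room.',
              'A black dog plays with a brown dog.',
              'The market rallied after the earnings report.',
              'Stocks fell as interest rates rose.']
train_labels = ['animals', 'animals', 'finance', 'finance']

cv = ft.CountVectorizer()
bow = cv.fit_transform(train_docs)
tt = ft.TfidfTransformer()
tfidf = tt.fit_transform(bow)

model = nb.MultinomialNB()
model.fit(tfidf, train_labels)

test_docs = ['The brown dog is in the room.',
             'Interest rates and the stock market.']
test_tfidf = tt.transform(cv.transform(test_docs))
print(model.predict(test_tfidf))  # likely ['animals' 'finance'] on this toy data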
Gender Identification
Code: gndr.py
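The referenced gndr.py is not included either. A common approach, following the classic NLTK names-corpus example, is to classify a name's gender from its last letters; the feature choice below (last one and two letters) is an assumption, not the original script.
# -*- coding: utf-8 -*-
# Minimal gender-identification sketch (not the original gndr.py):
# classify names as male/female from their last letters, using the
# NLTK names corpus and a naive Bayes classifier.
import random
import nltk.corpus as nc
import nltk.classify as cf
import nltk.classify.util as cu

data = []
for label, fileid in (('male', 'male.txt'), ('female', 'female.txt')):
    for name in nc.names.words(fileid):
        # Feature dictionary: the name's last one and two letters
        sample = {'last1': name[-1].lower(), 'last2': name[-2:].lower()}
        data.append((sample, label))
random.seed(7)
random.shuffle(data)
split = int(0.8 * len(data))
train_data, test_data = data[:split], data[split:]
model = cf.NaiveBayesClassifier.train(train_data)
print('%.2f%%' % round(cu.accuracy(model, test_data) * 100, 2))
for name in ['Olivia', 'Ethan', 'Sakura']:
    sample = {'last1': name[-1].lower(), 'last2': name[-2:].lower()}
    print(name, '->', model.classify(sample))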