NLP: Natural Language Processing
From a machine learning perspective, processing text involves five steps (a minimal end-to-end sketch follows the list):

  1. Read the corpus
  2. Tokenization
  3. Cleaning / removing stop words
  4. Stemming
  5. Converting to numerical form
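
Before walking through each step, here is a minimal end-to-end sketch that chains steps 1, 2, 3, and 5 into a single Spark ML Pipeline (the sample sentence, app name, and column names are illustrative; Spark ML has no built-in stemmer, so step 4 is omitted):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer

spark = SparkSession.builder.appName('nlp_basics').getOrCreate()

# Step 1: read the corpus (here a tiny in-memory example)
docs = spark.createDataFrame([(1, 'I really liked this movie')], ['user_id', 'review'])

# Steps 2, 3 and 5: tokenize, remove stop words, convert to numeric vectors
nlp_pipeline = Pipeline(stages=[
    Tokenizer(inputCol='review', outputCol='tokens'),
    StopWordsRemover(inputCol='tokens', outputCol='refined_tokens'),
    CountVectorizer(inputCol='refined_tokens', outputCol='features')])
nlp_pipeline.fit(docs).transform(docs).show(truncate=False)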

Overview of the Basic Steps

Corpus

A corpus is the complete collection of text documents. For example, suppose a collection contains thousands of emails that need to be processed and analyzed; this set of emails is called a corpus because it contains all of the text documents.

Tokenization

Tokenization is the method of splitting a given sentence, or the collection of words in a text document, into separate, independent tokens. It also removes unnecessary characters such as punctuation. For example:

Input:
He really liked the London City. He is there for two more days.
Tokens:
He,really,liked,the,London,City,He,is,there,for,two,more,days

We end up with 13 tokens from the input sentence above.
Let's implement this with Spark.

  1. First, create a DataFrame
df=spark.createDataFrame([(1,'I really liked this movie'),
                         (2,'I would recommend this movie to my friends'),
                         (3,'movie was alright but acting was horrible'),
                         (4,'I am never watching that movie ever again')],
                        ['user_id','review'])
| user_id | review |
| --- | --- |
| 1 | I really liked this movie |
| 2 | I would recommend this movie to my friends |
| 3 | movie was alright but acting was horrible |
| 4 | I am never watching that movie ever again |
  2. Apply the Tokenizer
from pyspark.ml.feature import Tokenizer

tokenization=Tokenizer(inputCol='review',outputCol='tokens')
tokenized_df=tokenization.transform(df)
| user_id | review | tokens |
| --- | --- | --- |
| 1 | I really liked this movie | [i, really, liked, this, movie] |
| 2 | I would recommend this movie to my friends | [i, would, recommend, this, movie, to, my, friends] |
| 3 | movie was alright but acting was horrible | [movie, was, alright, but, acting, was, horrible] |
| 4 | I am never watching that movie ever again | [i, am, never, watching, that, movie, ever, again] |
Stop Words

Notice that the tokens above contain some very common words that carry little actual meaning, such as this, the, to, that, and was. These words are called stop words. They add little value to the analysis; keeping them increases the computational cost and may even hurt the results. We therefore remove these stop words from the tokens.

from pyspark.ml.feature import StopWordsRemover

stopword_removal=StopWordsRemover(inputCol='tokens',outputCol='refined_tokens')
refined_df=stopword_removal.transform(tokenized_df)
| user_id | tokens | refined_tokens |
| --- | --- | --- |
| 1 | [i, really, liked, this, movie] | [really, liked, movie] |
| 2 | [i, would, recommend, this, movie, to, my, friends] | [recommend, movie, friends] |
| 3 | [movie, was, alright, but, acting, was, horrible] | [movie, alright, acting, horrible] |
| 4 | [i, am, never, watching, that, movie, ever, again] | [never, watching, movie, ever] |
Bag of Words

Text data is usually unstructured and of variable length. Bag of words lets us convert text into a numerical vector form based on how often each word occurs in the text document.
For example:
Document 1 - The best thing in life is to travel
Document 2 - Travel is the best medicine
Document 3 - One should travel more often

The list of unique words appearing across all documents is called the vocabulary. In the example above there are 13 unique words, all of which belong to the vocabulary. Each document can therefore be represented by a fixed-size vector of length 13.

The other element is a Boolean value (0/1) indicating whether each word appears in a particular document:

| Document | The | best | thing | in | life | is | to | travel | medicine | One | should | more | often |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Document 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Document 2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| Document 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |

Bag of words does not consider the semantics of the words or their order within a document, so it is the most basic way to represent text data numerically.
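
The same 0/1 representation can be reproduced in Spark using CountVectorizer in binary mode; a small sketch on the three travel documents above (the names travel_docs, doc_id, text, and bow_vector are illustrative):

from pyspark.ml.feature import Tokenizer, CountVectorizer

travel_docs = spark.createDataFrame([
    (1, 'The best thing in life is to travel'),
    (2, 'Travel is the best medicine'),
    (3, 'One should travel more often')], ['doc_id', 'text'])

words_df = Tokenizer(inputCol='text', outputCol='words').transform(travel_docs)
# binary=True records 0/1 presence instead of raw counts, i.e. a bag-of-words vector
bow_model = CountVectorizer(inputCol='words', outputCol='bow_vector', binary=True).fit(words_df)
print(bow_model.vocabulary)
bow_model.transform(words_df).select('doc_id', 'bow_vector').show(truncate=False)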

Count Vectorizer

In bag of words, we saw that words are represented simply by 1 or 0, without considering how often they occur. The count vectorizer instead counts how many times a word appears in a particular document. We use the same text documents created during tokenization earlier.
Each sentence is represented as a sparse vector.
The vocabulary of the count vectorizer can be inspected through the fitted model's vocabulary attribute.

from pyspark.ml.feature import CountVectorizer

count_vec=CountVectorizer(inputCol='refined_tokens',outputCol='features')
cv_df=count_vec.fit(refined_df)
cv_df.vocabulary

['movie', 'horrible', 'liked', 'alright', 'friends', 'recommend', 'acting', 'never', 'really', 'watching', 'ever']

cv_df.transform(refined_df).select(['user_id','refined_tokens','features']).show(10,False)
| user_id | refined_tokens | features |
| --- | --- | --- |
| 1 | [really, liked, movie] | (11,[0,2,8],[1.0,1.0,1.0]) |
| 2 | [recommend, movie, friends] | (11,[0,4,5],[1.0,1.0,1.0]) |
| 3 | [movie, alright, acting, horrible] | (11,[0,1,3,6],[1.0,1.0,1.0,1.0]) |
| 4 | [never, watching, movie, ever] | (11,[0,7,9,10],[1.0,1.0,1.0,1.0]) |

Taking the first sentence as an example:
11: the length of the vector, i.e. the size of the vocabulary
[0,2,8]: the vocabulary indices of the three words present in the sentence
[1.0,1.0,1.0]: the count of the word at each of those indices
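
To make the encoding concrete, the indices can be mapped back to words through the fitted model's vocabulary; a small check for the first review:

# Look up the words behind the indices of the first review's vector
indices = [0, 2, 8]
print([cv_df.vocabulary[i] for i in indices])   # ['movie', 'liked', 'really']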

A drawback of the count vectorizer is that it does not take into account whether a word also appears in other documents.

TF-IDF

TF-IDF tries to normalize a word's frequency with respect to the other documents. The overall idea is to give a word more weight if it occurs frequently within a document, but to penalize it if it also occurs frequently in other documents. This signals that the word may simply be common across the corpus and is therefore not as important as its frequency in the current document suggests.

Term frequency: scores a word based on how often it occurs within a document.

The term frequency TF(t,d) is the number of times term t appears in document d.

Inverse document frequency: scores a word based on how many documents contain it.

The document frequency DF(t,D) is the number of documents that contain term t. The formula uses a log function, so a term that appears in every document gets an IDF of 0; the +1 in the denominator avoids division by zero. The IDF is defined as:

$$IDF(t,D)=\log\frac{|D|+1}{DF(t,D)+1}$$

where |D| is the total number of documents in the corpus.

The TF-IDF measure is then:

$$TFIDF(t,d,D)=TF(t,d)\cdot IDF(t,D)$$
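
As a quick worked example, take the three travel documents from the bag-of-words section, so |D| = 3. 'travel' appears in all three documents while 'medicine' appears in only one (hand computation with the natural logarithm, which Spark's IDF uses):

import math

D = 3                                        # total number of documents
idf_travel = math.log((D + 1) / (3 + 1))     # in all 3 docs -> log(1) = 0.0
idf_medicine = math.log((D + 1) / (1 + 1))   # in only 1 doc -> log(2) ~= 0.693

# TF-IDF for a single occurrence within one document
tfidf_travel = 1 * idf_travel                # 0.0: common everywhere, carries no weight
tfidf_medicine = 1 * idf_medicine            # ~0.693: rarer across the corpus, weighted higher

The same arithmetic explains the IDF output shown further below: with 4 reviews, a word occurring in only one of them gets log(5/2) ≈ 0.916, while a word occurring in all four (the bucket that 'movie' hashes to) gets 0.0.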

In the Spark ML library, TF-IDF is split into two parts: TF (+ hashing) and IDF.

TF: HashingTF is a Transformer that takes a set of terms and converts them into a fixed-length feature vector. The algorithm counts term frequencies while hashing.

IDF: IDF is an Estimator; calling its fit() method on a dataset produces an IDFModel. The IDFModel takes feature vectors (produced by HashingTF) and rescales each feature, reducing the weight of terms that appear frequently across the corpus.

Spark MLlib implements term-frequency counting with feature hashing: each raw feature (term) is mapped to an index by a hash function, and the frequencies of those indices are then counted to obtain the term frequencies. This avoids building a global one-to-one term-to-index mapping, which can take a long time for a large corpus. The trade-off is that hashing can map different terms to the same index (a collision). To reduce the chance of collisions, we can only raise the dimensionality of the feature vector, i.e. increase the number of hash buckets; the default feature dimension is 2^20 = 1,048,576.
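
Conceptually the hashing trick is just index = hash(term) mod numFeatures, with counts accumulated per bucket; a toy illustration in plain Python (using Python's built-in hash, not Spark's actual hash function):

from collections import Counter

def toy_hashing_tf(tokens, num_features=100):
    # Map each token to a bucket and count occurrences per bucket
    counts = Counter(hash(tok) % num_features for tok in tokens)
    indices = sorted(counts)
    # Same sparse layout as Spark: (size, indices, values)
    return num_features, indices, [float(counts[i]) for i in indices]

# Bucket indices differ from run to run because Python salts string hashes
print(toy_hashing_tf(['really', 'liked', 'movie']))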

After tokenizing the documents, we use HashingTF's transform() method to hash each sentence into a feature vector; here the number of hash buckets is set to 100.

from pyspark.ml.feature import HashingTF

hashing_vec=HashingTF(inputCol='refined_tokens',outputCol='tf_features',numFeatures=100)
hashing_df=hashing_vec.transform(refined_df)
| user_id | refined_tokens | tf_features |
| --- | --- | --- |
| 1 | [really, liked, movie] | (100,[12,39,88],[1.0,1.0,1.0]) |
| 2 | [recommend, movie, friends] | (100,[16,39,99],[1.0,1.0,1.0]) |
| 3 | [movie, alright, acting, horrible] | (100,[5,23,39,66],[1.0,1.0,1.0,1.0]) |
| 4 | [never, watching, movie, ever] | (100,[39,75,81,94],[1.0,1.0,1.0,1.0]) |

IDF is then used to rescale the raw term-frequency vectors so that they better reflect how well each word distinguishes one document from another. IDF is an Estimator: calling fit() with the term-frequency vectors produces an IDFModel, and calling the model's transform() method yields the TF-IDF value of every word.

from pyspark.ml.feature import IDF

tf_idf_vec=IDF(inputCol='tf_features',outputCol='tf_idf_features')
tf_idf_df=tf_idf_vec.fit(hashing_df).transform(hashing_df)
| user_id | tf_idf_features |
| --- | --- |
| 1 | (100,[12,39,88],[0.9162907318741551,0.0,0.9162907318741551]) |
| 2 | (100,[16,39,99],[0.9162907318741551,0.0,0.9162907318741551]) |
| 3 | (100,[5,23,39,66],[0.9162907318741551,0.9162907318741551,0.0,0.9162907318741551]) |
| 4 | (100,[39,75,81,94],[0.0,0.9162907318741551,0.9162907318741551,0.9162907318741551]) |

Classification with Machine Learning

Load and inspect the data
from pyspark.sql.functions import rand

text_df=spark.read.csv('Movie_reviews.csv',inferSchema=True,header=True,sep=',')
text_df.orderBy(rand()).show(10,False)
| Review | Sentiment |
| --- | --- |
| My dad’s being stupid about brokeback mountain… | 0 |
| Ok brokeback mountain is such a horrible movie. | 0 |
| I love Brokeback Mountain. | 1 |
| He’s like,'YEAH I GOT ACNE AND I LOVE BROKEBACK MOUNTAIN '… | 1 |
| Harry Potter and the Sorcerer’s Stone is great but I had forgotten what | 1 |
| “Anyway, thats why I love “” Brokeback Mountain.” | 1 |
| Which is why i said silent hill turned into reality coz i was hella like | 1 |
| Apparently the Da Vinci code sucks. | 0 |
| I am going to start reading the Harry Potter series again because that i | 1 |
| So as felicia’s mom is cleaning the table, felicia grabs my keys and we | 1 |
text_df.groupBy("Sentiment").count().show()
| Sentiment | count |
| --- | --- |
| ,0 | 1 |
| . but “” Angel an… | 1 |
| 0 | 3081 |
| “” you see Demen… | 1 |
| but due to the s… | 1 |
| the story of “” … | 1 |
| and not because … | 1 |
| oddly e" | 1 |
| but I still feel" | 1 |
| my God | 1 |
| I decided to wri… | 1 |
| but it was reall… | 1 |
| but I hate the Da" | 1 |
| 1 | 3909 |
| but immensely we… | 1 |
| with f" | 1 |
| also" | 80 |
| or how I love" | 1 |
| Joe | 1 |
| which was really… | 1 |

We can see that the Sentiment column contains garbled records, so we need to filter them out, keeping only rows whose Sentiment is '1' or '0'.

text_df=text_df.filter(((text_df.Sentiment =='1') | (text_df.Sentiment =='0')))
text_df = text_df.withColumn("Label", text_df.Sentiment.cast('float')).drop('Sentiment')

The Sentiment column in the original data is of string type, so it is cast to a numeric (float) Label column.

Tokenize the data and remove stop words

Tokenization

tokenization=Tokenizer(inputCol='Review',outputCol='tokens')
tokenized_df=tokenization.transform(text_df)

Remove stop words

stopword_removal=StopWordsRemover(inputCol='tokens',outputCol='refined_tokens')
refined_text_df=stopword_removal.transform(tokenized_df)
Count vectorizer
count_vec=CountVectorizer(inputCol='refined_tokens',outputCol='features')
cv_text_df=count_vec.fit(refined_text_df).transform(refined_text_df)
| refined_tokens | features | Label |
| --- | --- | --- |
| [da, vinci, code,… | (2302,[0,1,4,43,2… | 1.0 |
| [first, clive, cu… | (2302,[11,51,229,… | 1.0 |
| [liked, da, vinci… | (2302,[0,1,4,53,3… | 1.0 |
| [liked, da, vinci… | (2302,[0,1,4,53,3… | 1.0 |
| [liked, da, vinci… | (2302,[0,1,4,53,6… | 1.0 |
| [even, exaggerati… | (2302,[46,229,271… | 1.0 |
| [loved, da, vinci… | (2302,[0,1,22,30,… | 1.0 |
| [thought, da, vin… | (2302,[0,1,4,228,… | 1.0 |
| [da, vinci, code,… | (2302,[0,1,4,33,2… | 1.0 |
| [thought, da, vin… | (2302,[0,1,4,223,… | 1.0 |
Assemble the modeling data

from pyspark.ml.feature import VectorAssembler

df_assembler = VectorAssembler(inputCols=['features'],outputCol='features_vec')
model_text_df = df_assembler.transform(cv_text_df)
Build the model

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

#split the dataset
training_df,test_df=model_text_df.randomSplit([0.75,0.25])
#train a logistic regression model
log_reg=LogisticRegression(featuresCol='features_vec',labelCol='Label').fit(training_df)
#predict on the test set
results=log_reg.evaluate(test_df).predictions
#accuracy on the test set
accuracy=MulticlassClassificationEvaluator(labelCol='Label',metricName='accuracy').evaluate(results)
print(accuracy)

0.9775219298245614
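
Beyond the single accuracy number, a simple confusion matrix and a weighted precision score can be computed from the same predictions (a sketch reusing the results DataFrame and the evaluator imported above):

# Confusion matrix: counts of predictions per true label
results.groupBy('Label','prediction').count().show()

# Weighted precision on the same test predictions
precision=MulticlassClassificationEvaluator(labelCol='Label',metricName='weightedPrecision').evaluate(results)
print(precision)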
