spark_NLP_(1)

Andy_shenzl · Published 2020-07-20 16:20:05

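NLP (Natural Language Processing)
From a machine learning perspective, processing text involves five steps:

Read the corpus
Tokenize
Clean up / remove stop words
Stem
Convert to numerical form

Overview of the basic steps

Corpus

A corpus is the complete collection of text documents. For example, suppose a collection holds thousands of emails that need to be processed and analyzed before use. That set of emails is called a corpus, because it contains all of the text documents.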

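Tokenization

Tokenization is the process of splitting a given sentence or text document into individual words, called tokens. It typically also strips unnecessary characters such as punctuation, as in the following example:

Input:
He really liked the London City. He is there for two more days.
Tokens:
He,really,liked,the,London,City,He,is,there,for,two,more,days

The input sentences above yield 13 tokens.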
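As a rough illustration of the idea (not Spark's implementation), the example above can be reproduced in plain Python. The `tokenize` helper and its regular expression are assumptions made for this sketch.

```python
import re

def tokenize(text):
    # Keep runs of letters, digits and apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", text)

sentence = "He really liked the London City. He is there for two more days."
tokens = tokenize(sentence)
print(tokens)
print(len(tokens))  # 13
```

Unlike this sketch, Spark's `Tokenizer` lowercases the text and splits on whitespace, so its output may differ for text that contains punctuation.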
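Let's implement this with Spark.

First, create a DataFrame:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark=SparkSession.builder.getOrCreate()

df=spark.createDataFrame([(1,'I really liked this movie'),
                         (2,'I would recommend this movie to my friends'),
                         (3,'movie was alright but acting was horrible'),
                         (4,'I am never watching that movie ever again')],
                        ['user_id','review'])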

user_id	review
1	I really liked this movie
2	I would recommend this movie to my friends
3	movie was alright but acting was horrible
4	I am never watching that movie ever again

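Apply Tokenizer to split each review into tokens

from pyspark.ml.feature import Tokenizer

# Tokenizer lowercases the text and splits it on whitespace
tokenization=Tokenizer(inputCol='review',outputCol='tokens')
tokenized_df=tokenization.transform(df)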

user_id	review	tokens
1	I really liked this movie	[i, really, liked, this, movie]
2	I would recommend this movie to my friends	[i, would, recommend, this, movie, to, my, friends]
3	movie was alright but acting was horrible	[movie, was, alright, but, acting, was, horrible]
4	I am never watching that movie ever again	[i, am, never, watching, that, movie, ever, again]

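Stop words

The tokens above include some very common words that carry little meaning, such as this, the, to, that and was. These are called stop words. They add little value to the analysis, increase the computational cost if kept, and may well hurt the results. We therefore remove them from the tokens.

from pyspark.ml.feature import StopWordsRemover

# Drop common English stop words from the token lists
stopword_removal=StopWordsRemover(inputCol='tokens',outputCol='refined_tokens')
refined_df=stopword_removal.transform(tokenized_df)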

user_id	tokens	refined_tokens
1	[i, really, liked, this, movie]	[really, liked, movie]
2	[i, would, recommend, this, movie, to, my, friends]	[recommend, movie, friends]
3	[movie, was, alright, but, acting, was, horrible]	[movie, alright, acting, horrible]
4	[i, am, never, watching, that, movie, ever, again]	[never, watching, movie, ever]
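The filtering step above amounts to a set-membership test per token. A minimal sketch in plain Python, assuming a tiny hand-written stop-word set (`STOP_WORDS` is illustrative only and is not the list Spark ships with):

```python
# A tiny, illustrative stop-word set; Spark's built-in English list is much longer.
STOP_WORDS = {"i", "this", "the", "to", "that", "was", "am", "my", "would", "but", "again"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Keep only the tokens that are not in the stop-word set.
    return [t for t in tokens if t not in stop_words]

tokens = ["i", "really", "liked", "this", "movie"]
refined = remove_stop_words(tokens)
print(refined)  # ['really', 'liked', 'movie']
```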

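Bag of words

Text data is usually unstructured and of varying length. The bag-of-words model lets us convert text into numerical vectors based on which words occur in each document.
Example:
Document 1 - The best thing in life is to travel
Document 2 - Travel is the best medicine
Document 3 - One should travel more often

The list of unique words that appear across all documents is called the vocabulary. In the example above there are 13 unique words (ignoring case), and all of them are part of the vocabulary. Each document can then be represented by a fixed-size vector of length 13: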

The	best	thing	in	life	is	to	travel	medicine	One	should	more	often

The other ingredient is a Boolean value (0/1) that indicates whether each word is present in a given document.
Document 1:

The	best	thing	in	life	is	to	travel	medicine	One	should	more	often
1	1	1	1	1	1	1	1	0	0	0	0	0

Document 2:

The	best	thing	in	life	is	to	travel	medicine	One	should	more	often
1	1	0	0	0	1	0	1	1	0	0	0	0

Document 3:

The	best	thing	in	life	is	to	travel	medicine	One	should	more	often
0	0	0	0	0	0	0	1	0	1	1	1	1

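Bag of words ignores both the meaning of the words and their order in the document, which makes it the most basic way to represent text data numerically.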
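The three Boolean vectors above can be reproduced with a short plain-Python sketch. It assumes case-insensitive matching and a vocabulary ordered by first appearance; the helper name `to_binary_vector` is made up for illustration.

```python
docs = [
    "The best thing in life is to travel",
    "Travel is the best medicine",
    "One should travel more often",
]

# Build the vocabulary in order of first appearance, ignoring case.
vocabulary = []
for doc in docs:
    for word in doc.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

def to_binary_vector(doc):
    # 1 if the vocabulary term occurs in the document, otherwise 0.
    words = set(doc.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

vectors = [to_binary_vector(d) for d in docs]
print(len(vocabulary))  # 13
for v in vectors:
    print(v)
```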

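Count vectorizer

In the bag-of-words model we only recorded whether a word is present (1) or absent (0), without considering how often it occurs. A count vectorizer counts the number of times each word appears in a given document. Here we reuse the text documents created during tokenization.
Each sentence is represented as a sparse vector.
The vocabulary can be inspected through the vocabulary attribute of the fitted model.

from pyspark.ml.feature import CountVectorizer

count_vec=CountVectorizer(inputCol='refined_tokens',outputCol='features')
# fit() returns a CountVectorizerModel that holds the learned vocabulary
cv_df=count_vec.fit(refined_df)
cv_df.vocabulary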

['movie',
 'horrible',
 'liked',
 'alright',
 'friends',
 'recommend',
 'acting',
 'never',
 'really',
 'watching',
 'ever']

cv_df.transform(refined_df).select(['user_id','refined_tokens','features']).show(10,False)

user_id	refined_tokens	features
1	[really, liked, movie]	(11,[0,2,8],[1.0,1.0,1.0])
2	[recommend, movie, friends]	(11,[0,4,5],[1.0,1.0,1.0])
3	[movie, alright, acting, horrible]	(11,[0,1,3,6],[1.0,1.0,1.0,1.0])
4	[never, watching, movie, ever]	(11,[0,7,9,10],[1.0,1.0,1.0,1.0])

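Taking the first sentence as an example:
11: the vector has length 11
[0,2,8]: there are three non-zero entries, and these are their index positions in the count vectorizer's vocabulary
[1.0,1.0,1.0]: the count at each of those indices

The drawback of the count vectorizer is that it does not take into account whether a word also appears in other documents.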
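To make the sparse notation concrete, here is a plain-Python sketch that builds the same `(size, indices, values)` triple from the vocabulary listed above and expands it into a dense list. The helpers `to_sparse` and `to_dense` are hypothetical names for this illustration, not Spark APIs.

```python
vocabulary = ["movie", "horrible", "liked", "alright", "friends", "recommend",
              "acting", "never", "really", "watching", "ever"]

def to_sparse(tokens):
    # Count each token under its vocabulary index.
    counts = {}
    for t in tokens:
        idx = vocabulary.index(t)
        counts[idx] = counts.get(idx, 0.0) + 1.0
    indices = sorted(counts)
    return (len(vocabulary), indices, [counts[i] for i in indices])

def to_dense(sparse):
    # Expand (size, indices, values) into a full-length list.
    size, indices, values = sparse
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

sparse = to_sparse(["really", "liked", "movie"])
print(sparse)            # (11, [0, 2, 8], [1.0, 1.0, 1.0])
print(to_dense(sparse))
```

The sparse form stores only the non-zero entries, which matters once the vocabulary grows to thousands of words.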

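TF-IDF

TF-IDF tries to normalize word frequencies against the other documents. The idea is that a word that occurs often within one document gets more weight, but is penalized if it also occurs often in other documents. In other words, a word may be common across the corpus without being as important as its frequency in the current document suggests.
Term frequency: a score based on how often a word occurs in a document.

The term frequency TF(t,d) is the number of times term t appears in document d.

Inverse document frequency: a score based on how many documents contain the word.

The document frequency DF(t,D) is the number of documents that contain term t.
The formula uses a logarithm, so when a term appears in all documents its IDF becomes 0. Adding 1 avoids a zero denominator for terms that appear in no document. The IDF is defined as:
$IDF(t,D)=\log\frac{|D|+1}{DF(t,D)+1}$
where |D| is the total number of documents in the corpus.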

TF-IDF

$TFIDF(t,d,D) = TF(t,d) \cdot IDF(t,D)$
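As a worked example of the two formulas, the sketch below plugs in numbers from the four-review DataFrame used earlier: "movie" occurs in all four reviews and "liked" in only one. The function names are chosen for this illustration, and the natural logarithm is assumed.

```python
import math

def idf(num_docs, doc_freq):
    # IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), natural logarithm
    return math.log((num_docs + 1) / (doc_freq + 1))

def tf_idf(tf, num_docs, doc_freq):
    # TFIDF(t, d, D) = TF(t, d) * IDF(t, D)
    return tf * idf(num_docs, doc_freq)

num_docs = 4
movie_score = tf_idf(1, num_docs, 4)  # "movie" is in every review, so the score is 0.0
liked_score = tf_idf(1, num_docs, 1)  # "liked" is in one review: log(5/2), about 0.9163
print(movie_score, liked_score)
```

A word that appears in every document contributes nothing, while a word confined to a single document keeps a positive weight.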

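In the Spark ML library, TF-IDF is split into two parts: TF (+hashing) and IDF.

TF: HashingTF is a Transformer. In text processing it takes sets of terms and converts them into fixed-length feature vectors. The algorithm counts the frequency of each term while hashing.

IDF: IDF is an Estimator. Calling its fit() method on a dataset produces an IDFModel. The IDFModel takes feature vectors (such as those produced by HashingTF) and rescales each feature. IDF reduces the weight of terms that appear frequently across the corpus.

spark.mllib computes term frequencies with feature hashing: each raw feature is mapped to an index by a hash function, and counting those indices then gives the frequency of the corresponding terms. This avoids building a global one-to-one map from terms to indices, which is expensive for a large corpus. Note, however, that hashing can produce collisions, where different raw features are mapped to the same index. The only way to make collisions less likely is to increase the dimension of the feature vector, i.e. the number of hash buckets. The default feature dimension in spark.mllib is 2^20 = 1,048,576 (the DataFrame-based pyspark.ml HashingTF defaults to 2^18 = 262,144).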
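The hashing trick and its collisions can be sketched in plain Python. This is only an illustration under stated assumptions: it uses MD5 as a stand-in hash because it is stable across runs, whereas Spark uses MurmurHash3, so the bucket indices will not match Spark's output. The names `bucket` and `hashing_tf` are hypothetical.

```python
import hashlib

def bucket(term, num_features):
    # Stand-in hash (MD5), reduced modulo the number of buckets.
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_features

def hashing_tf(tokens, num_features):
    # Count term occurrences per bucket; colliding terms share a count.
    vec = [0.0] * num_features
    for t in tokens:
        vec[bucket(t, num_features)] += 1.0
    return vec

tokens = ["never", "watching", "movie", "ever"]
small = hashing_tf(tokens, 2)    # 4 distinct terms in 2 buckets: collisions are unavoidable
large = hashing_tf(tokens, 100)  # more buckets make collisions less likely
print(small)
print([i for i, v in enumerate(large) if v > 0])
```

With only two buckets, at least two different words must share an index, so their counts are merged and can no longer be told apart. Raising the number of buckets trades memory for fewer collisions.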

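Starting from the tokenized documents, we use the transform() method of HashingTF to hash each sentence into a feature vector. The number of hash buckets is set to 100 here.

from pyspark.ml.feature import HashingTF

hashing_vec=HashingTF(inputCol='refined_tokens',outputCol='tf_features', numFeatures=100)
hashing_df=hashing_vec.transform(refined_df)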

user_id	refined_tokens	tf_features
1	[really, liked, movie]	(100,[12,39,88],[1.0,1.0,1.0])
2	[recommend, movie, friends]	(100,[16,39,99],[1.0,1.0,1.0])
3	[movie, alright, acting, horrible]	(100,[5,23,39,66],[1.0,1.0,1.0,1.0])
4	[never, watching, movie, ever]	(100,[39,75,81,94],[1.0,1.0,1.0,1.0])

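IDF is then used to rescale the raw term-frequency vectors so that they better reflect how well each term distinguishes between documents. IDF is an Estimator: calling fit() on the term-frequency vectors produces an IDFModel, and calling the model's transform() method yields the TF-IDF value for every term.

from pyspark.ml.feature import IDF

tf_idf_vec=IDF(inputCol='tf_features',outputCol='tf_idf_features')
# fit() learns the document frequencies; transform() applies the IDF weights
tf_idf_df=tf_idf_vec.fit(hashing_df).transform(hashing_df)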

user_id	tf_idf_features
1	(100,[12,39,88],[0.9162907318741551,0.0,0.9162907318741551])
2	(100,[16,39,99],[0.9162907318741551,0.0,0.9162907318741551])
3	(100,[5,23,39,66],[0.9162907318741551,0.9162907318741551,0.0,0.9162907318741551])
4	(100,[39,75,81,94],[0.0,0.9162907318741551,0.9162907318741551,0.9162907318741551])

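Classification with machine learning

Load the data and inspect it

from pyspark.sql.functions import rand

text_df=spark.read.csv('Movie_reviews.csv',inferSchema=True,header=True,sep=',')
# Show 10 rows in random order
text_df.orderBy(rand()).show(10,False)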

Review	Sentiment
My dad’s being stupid about brokeback mountain…	0
Ok brokeback mountain is such a horrible movie.	0
I love Brokeback Mountain.	1
He’s like,'YEAH I GOT ACNE AND I LOVE BROKEBACK MOUNTAIN '…	1
Harry Potter and the Sorcerer’s Stone is great but I had forgotten what	1
“Anyway, thats why I love “” Brokeback Mountain.”	1
Which is why i said silent hill turned into reality coz i was hella like	1
Apparently the Da Vinci code sucks.	0
I am going to start reading the Harry Potter series again because that i	1
So as felicia’s mom is cleaning the table, felicia grabs my keys and we	1

text_df.groupBy("Sentiment").count().show()

Sentiment	count
,0	1
. but “” Angel an…	1
0	3081
“” you see Demen…	1
but due to the s…	1
the story of “” …	1
and not because …	1
oddly e"	1
but I still feel"	1
my God	1
I decided to wri…	1
but it was reall…	1
but I hate the Da"	1
1	3909
but immensely we…	1
with f"	1
also"	80
or how I love"	1
Joe	1
which was really…	1

The Sentiment column contains malformed values, so we keep only the rows whose Sentiment is '0' or '1':

text_df=text_df.filter(((text_df.Sentiment =='1') | (text_df.Sentiment =='0')))
text_df = text_df.withColumn("Label", text_df.Sentiment.cast('float')).drop('Sentiment')

The Sentiment column in the raw data is a string, so the code above casts it to a numeric type and stores it as Label.

Tokenize the data and remove stop words

Tokenization

tokenization=Tokenizer(inputCol='Review',outputCol='tokens')
tokenized_df=tokenization.transform(text_df)

Remove stop words

stopword_removal=StopWordsRemover(inputCol='tokens',outputCol='refined_tokens')
refined_text_df=stopword_removal.transform(tokenized_df)

Count vectorizer

count_vec=CountVectorizer(inputCol='refined_tokens',outputCol='features')
cv_text_df=count_vec.fit(refined_text_df).transform(refined_text_df)

refined_tokens	features	Label
[da, vinci, code,…	(2302,[0,1,4,43,2…	1.0
[first, clive, cu…	(2302,[11,51,229,…	1.0
[liked, da, vinci…	(2302,[0,1,4,53,3…	1.0
[liked, da, vinci…	(2302,[0,1,4,53,3…	1.0
[liked, da, vinci…	(2302,[0,1,4,53,6…	1.0
[even, exaggerati…	(2302,[46,229,271…	1.0
[loved, da, vinci…	(2302,[0,1,22,30,…	1.0
[thought, da, vin…	(2302,[0,1,4,228,…	1.0
[da, vinci, code,…	(2302,[0,1,4,33,2…	1.0
[thought, da, vin…	(2302,[0,1,4,223,…	1.0

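Prepare the modeling data

from pyspark.ml.feature import VectorAssembler

model_text_df = cv_text_df.select(['features','Label'])  # keep only the feature and label columns
df_assembler = VectorAssembler(inputCols=['features'],outputCol='features_vec')
model_text_df = df_assembler.transform(model_text_df)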

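Build the model

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Split the dataset into training and test sets
training_df,test_df=model_text_df.randomSplit([0.75,0.25])
# Train a logistic regression model
log_reg=LogisticRegression(featuresCol='features_vec',labelCol='Label').fit(training_df)
# Predict on the test set
results=log_reg.evaluate(test_df).predictions
# Accuracy
accuracy=MulticlassClassificationEvaluator(labelCol='Label',metricName='accuracy').evaluate(results)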