[Python] Word Frequency Counting (written in Python and MapReduce)
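I. Word Frequency Counting with Python
(A) The method commonly used in the National Computer Rank Examination
First, a fairly standard approach of the kind used in the exam, for English text: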
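def getText():
    # Read the Hamlet text file; the raw string keeps the backslash in the path literal
    with open(r"E:\hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()  # convert every word to lowercase
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")  # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()  # split the text on whitespace
counts = {}
for word in words:  # count each word's occurrences and store them in the counts dict
    counts[word] = counts.get(word, 0) + 1  # get() returns 0 when word is not yet a key
items = list(counts.items())  # convert the dict to a list of (word, count) pairs so it can be sorted
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending; see the explanation of this call further down
for word, count in items[:10]:  # slicing avoids an IndexError when there are fewer than 10 words
    print("{0:<10}{1:>5}".format(word, count))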
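The character-by-character replace loop above can also be written as a single pass with str.translate. The following is only a minimal sketch, not the exam's reference answer: the helper name clean_text and the sample sentence are made up for illustration, and it relies on string.punctuation from the standard library instead of the hand-typed character list. Note that string.punctuation also contains the apostrophe, which the hand-typed list does not.

```python
import string

def clean_text(txt):
    """Lowercase txt and replace every ASCII punctuation mark with a space."""
    # Hypothetical helper: map each punctuation character to a space in one pass
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return txt.lower().translate(table)

sample = "To be, or not to be: that is the question."  # made-up sample text
words = clean_text(sample).split()
print(words)
```

The resulting word list can be fed into the same counts.get(word, 0) + 1 loop as before.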
For Chinese text, the jieba library is generally used. Here is an example (though it does not come up in the exam very often):
# Word frequency counting with the jieba library
import jieba

# Pass encoding explicitly to avoid encoding problems
with open("Jieba词频统计素材.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue  # skip single-character words such as "的" and "好"
    counts[word] = counts.get(word, 0) + 1  # add the token to the dict
# To leave out words you do not want counted, build a list of them
# up front and remove them from the dict, for example:
# exclude_words = ["统计", "排除"]
# for word in exclude_words:
#     counts.pop(word, None)  # pop() with a default avoids a KeyError for absent words
# Now rank the words in the dict by count
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))
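The commented-out lines above remove unwanted words after counting. An alternative is to skip them while counting. The sketch below assumes that approach: the token list is hard-coded so that it runs without jieba installed (in the script above it would come from jieba.lcut(txt)), and the contents of exclude_words are purely illustrative.

```python
# Tokens are hard-coded here; in the script above they would come from jieba.lcut(txt)
words = ["统计", "词频", "的", "统计", "方法", "排除", "词频", "统计"]
exclude_words = {"排除"}  # hypothetical exclusion set; a set makes the membership test fast

counts = {}
for word in words:
    # Skip single-character tokens and anything in the exclusion set
    if len(word) == 1 or word in exclude_words:
        continue
    counts[word] = counts.get(word, 0) + 1

items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))
```

Filtering inside the loop means the excluded words never enter the dictionary, so no deletion step is needed afterwards.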
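(B) Going further
1. Core syntax for word frequency counting in Python
To get comfortable with word frequency counting in Python (specifically the simplest method shown above), I think the following points are the ones to know well.
(1) Putting words into a dictionary and counting their frequencies at the same time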
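words = txt_file.split()  # split the text on whitespace
words2 = jieba.lcut(txt_file)  # or, for Chinese text, segment it with jieba (lcut is a jieba function, not a str method)
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1  # dict.get(key, default) returns default when key is missing, so this one line also does the counting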
(2) Turning the dictionary's key-value pairs into a list and sorting it along the way
items = list(counts.items())  # items() returns the (key, value) pairs
items.sort(key=lambda x: x[1], reverse=True)
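A quick word on lambda functions first: lambda x: y takes x as input and returns y. The sort method calls the function passed as its key parameter once for each element of the list and orders the elements by the values that function returns. Here x is a single element of items, that is, one (word, count) tuple, so x[1] is the second field of that tuple: the word's frequency.
In other words, items is sorted by frequency, and reverse=True puts the highest counts first.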
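To make the role of the key function concrete, here is a small sketch with made-up counts. It also shows operator.itemgetter(1) from the standard library, which does the same job as lambda x: x[1]; which of the two to use is a matter of taste.

```python
from operator import itemgetter

counts = {"the": 5, "hamlet": 3, "and": 5, "ghost": 1}  # made-up counts for illustration
items = list(counts.items())  # list of (word, count) tuples

# The key function is called once per element; each element is a (word, count) tuple
by_lambda = sorted(items, key=lambda x: x[1], reverse=True)
by_itemgetter = sorted(items, key=itemgetter(1), reverse=True)

print(by_lambda)
print(by_lambda == by_itemgetter)
```

Python's sort is stable, so words with equal counts ("the" and "and" here) keep the order they had in the original list.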
2. Three ways to count word frequencies in Python
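import pandas as pd
from collections import Counter

words_list = ["Monday","Tuesday","Thursday","Zeus","Venus","Monday","Monday","Zeus","Venus","Venus"]

# Method 1: a plain dict with get() (named counts so the built-in dict is not shadowed)
counts = {}
for word in words_list:
    counts[word] = counts.get(word, 0) + 1
print("Result1:\n", counts)

# Method 2: collections.Counter
result2 = Counter(words_list)
print("Result2:\n", result2)

# Method 3: pandas value_counts (the top-level pd.value_counts is deprecated in recent pandas)
result3 = pd.Series(words_list).value_counts()
print("Result3:\n", result3)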
Result1:
{'Monday': 3, 'Tuesday': 1, 'Thursday': 1, 'Zeus': 2, 'Venus': 3}
Result2:
Counter({'Monday': 3, 'Venus': 3, 'Zeus': 2, 'Tuesday': 1, 'Thursday': 1})
Result3:
Monday 3
Venus 3
Zeus 2
Thursday 1
Tuesday 1
dtype: int64
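Of the three, Counter also covers the "top N" step that the earlier scripts did by sorting a list of items. The sketch below reuses the same words_list and calls Counter.most_common; the variable names counter and top3 are chosen just for this example.

```python
from collections import Counter

words_list = ["Monday", "Tuesday", "Thursday", "Zeus", "Venus",
              "Monday", "Monday", "Zeus", "Venus", "Venus"]

counter = Counter(words_list)
top3 = counter.most_common(3)  # the three most frequent (word, count) pairs
print(top3)
```

Words with equal counts come back in the order they were first seen, which is why Monday is listed before Venus.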
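II. Word Frequency Counting with MapReduce
Counting words in large files calls for a cluster. We plan to run Python programs on the Hadoop platform so that the computation is distributed and the word count runs more efficiently, which is why the code is written in the MapReduce style.
(A) Code example and explanation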
Map:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

def main():
    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        # split the line into words
        words = line.split()
        # increase counters
        for word in words:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            # tab-delimited; the trivial word count is 1
            print('%s\t%s' % (word, 1))

if __name__ == "__main__":
    main()
Reduce:
#!/usr/bin/env python
import sys

current_word = None
current_count = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the input we got from mapper.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
    except ValueError:
        # the line had no tab or count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
# (tab-delimited, like every other output line)
if current_word:
    print('%s\t%s' % (current_word, current_count))
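Before submitting anything to a cluster, it can help to check the map, sort, reduce flow in a single process. The following is a rough local simulation under stated assumptions, not the Hadoop job itself: the function names map_words and reduce_counts and the sample lines are invented for this sketch, sorted() stands in for the sort that Hadoop performs between the two phases, and itertools.groupby replaces the manual current_word bookkeeping of the reducer.

```python
from itertools import groupby
from operator import itemgetter

def map_words(lines):
    """Mapper: emit a (word, 1) pair for every word on every line."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reduce_counts(sorted_pairs):
    """Reducer: sum the counts of consecutive pairs that share the same word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

lines = ["to be or not to be", "to see or not to see"]  # made-up sample input
# sorted() stands in for the sort-by-key step Hadoop performs between map and reduce
shuffled = sorted(map_words(lines), key=itemgetter(0))
result = list(reduce_counts(shuffled))
for word, count in result:
    print('%s\t%s' % (word, count))
```

Like the reducer above, groupby only merges adjacent pairs with the same key, so this sketch depends on the input being sorted by word first.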
(B) Studying the core syntax