[Word frequency] -- Counting word frequencies in English text with Python's jieba

Time: 2018-11-07 16:22:15

1. Basic idea: count the 20 most frequent words in a Harry Potter novel, excluding stop words (such as "is")

2. Stop words (excerpt)
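The stop-word list itself is not reproduced here. As a minimal sketch, assuming the file holds whitespace-separated words (an assumption; the sample words and the helper name load_stopwords are hypothetical), loading the words into a set makes every lookup a whole-word match. Testing against the raw file contents as a single string would match substrings instead, so a word such as "he" would be dropped merely because "the" is in the list.

```python
def load_stopwords(text):
    """Split raw stop-word text (assumed whitespace-separated) into a set."""
    return set(text.split())

# Hypothetical sample standing in for the contents of stopwords_EN.txt
sample = "is\nthe\nand\nof\n"
stopwords = load_stopwords(sample)

print("is" in stopwords)  # True: whole-word match against the set
print("he" in stopwords)  # False: "he" is not a stop word in this sample
print("he" in sample)     # True: substring match, because "the" contains "he"
```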

3. The code is as follows

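    # -*- coding: utf-8 -*-
    """
    @File : 04.py
    @author: FxDr
    @Time : /10/19 14:33
    """
    import jieba

    '''Word frequency statistics for English text.'''

    # Open the file and read its contents (assumes the text is UTF-8 encoded)
    with open("Harry Potter and The Half Blood Prince.txt", "r", encoding="utf-8") as f:
        txt = f.read()

    # Convert to lowercase
    txt = txt.lower()

    # Replace special characters with spaces
    for each in '"’—!|“#$%&()*+,-./:;<=>?@[\\]^{|}~”':
        txt = txt.replace(each, " ")

    # Tokenize the text; jieba.lcut returns the tokens as a list
    words = jieba.lcut(txt)

    # Load the stop words (words not worth counting) into a set, so that
    # membership tests match whole words rather than substrings.
    # Assumes the file holds whitespace-separated words.
    with open('stopwords_EN.txt', 'r', encoding="utf-8") as f:
        stopwords = set(f.read().split())

    counts = {}
    for word in words:
        # jieba also emits whitespace tokens for English text; skip them
        if word.strip() and word not in stopwords:
            counts[word] = counts.get(word, 0) + 1

    items = list(counts.items())  # convert the dict to a list of (word, count) tuples
    items.sort(key=lambda x: x[1], reverse=True)  # sort by count, highest first
    print(type(items))  # list
    print(type(items[0]))  # tuple
    for i in range(min(20, len(items))):
        word, count = items[i]  # tuple unpacking
        print("{0:<10}{1:>5}".format(word, count))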
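The final steps of the script (dict to list, sort by count, unpack each tuple, print in aligned columns) can be tried on a tiny hand-made dictionary. The counts below are made up purely for illustration and are not taken from the novel.

```python
# Made-up counts for illustration only
counts = {"wand": 3, "door": 1, "eyes": 2}

# dict.items() yields (word, count) tuples; sort them by count, highest first
items = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

lines = []
for i in range(len(items)):
    word, count = items[i]  # tuple unpacking: one assignment fills both names
    # Left-align the word in 10 columns, right-align the count in 5
    lines.append("{0:<10}{1:>5}".format(word, count))

print("\n".join(lines))
```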

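Here, the statement word, count = items[i] unpacks each (word, count) tuple into two variables in a single assignment.

The output is as follows:

harry 2815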
dumbledore 1034
hermione 694
slughorn 397
snape 379
malfoy 371
professor 280
voldemort 243
ginny 233
hagrid 231
weasley 211
eyes 206
dark 200
voice 195
wand 191
door 179
moment 167
people 165
head 162
told 160
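For English text, similar counts can also be produced with the standard library alone. The sketch below is an alternative, not the code used above: it swaps jieba for a regular expression and uses collections.Counter. The function name top_words, the sample sentence, and the stop-word set are all hypothetical.

```python
import re
from collections import Counter

def top_words(text, stopwords, n=20):
    """Return the n most common words in text, ignoring stop words."""
    # Lowercase, then treat every run of ASCII letters as one word
    # (so "harry's" becomes "harry" and "s")
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)

# Hypothetical sample text and stop-word set
sample = "The wand is in the hand; the wand chooses the wizard."
for word, count in top_words(sample, {"the", "is", "in"}, n=3):
    print("{0:<10}{1:>5}".format(word, count))
```

Counter.most_common already returns (word, count) tuples sorted by count, so the manual list conversion and sort are not needed in this variant.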