
Counting Word Frequencies and Extracting Keywords from English Text with Python

Date: 2020-06-16 15:40:14

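The script below counts word frequencies in a passage of English text. It only strips punctuation; it does not remove prepositions, numerals, personal pronouns, and similar words. If you need that, filter those words out while counting.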

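# Read the text, ignoring undecodable bytes
with open("english.txt", "r", errors="ignore") as f:
    txt = f.read()
# Convert to lowercase
txt = txt.lower()
# Strip punctuation
for ch in '!"#$&()*+,-./:;<=>?@[\\]^_{|}·~‘’':
    txt = txt.replace(ch, "")
# Split on whitespace (spaces, newlines, tabs)
words = txt.split()
# Count occurrences of each word
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Print the 10 most frequent words (fewer if the text has fewer distinct words)
for word, count in items[:10]:
    # Word left-aligned in 10 columns, count right-aligned in 5, space-padded
    print("{0:<10}{1:>5}".format(word, count))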
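The stop-word filtering suggested above could look roughly like this. It is a minimal sketch, not part of the original script: the `STOP_WORDS` set and the `top_words` helper are hypothetical names, the stop-word list is deliberately tiny, and `collections.Counter` is swapped in for the hand-written dictionary.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list; extend it for real use.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and",
              "is", "it", "he", "she", "they", "one", "two"}


def top_words(text, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent words in text, skipping stop words."""
    # Lowercase, then keep runs of letters (apostrophes allowed inside a word).
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)


sample = "The cat sat on the mat. The cat saw a dog, and the dog saw the cat."
for word, count in top_words(sample, n=3):
    # Same column layout as the script above
    print("{0:<10}{1:>5}".format(word, count))
```

For real text, a published stop-word list would likely serve better than a hand-written set.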

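Raw frequency counts are dominated by common words. To better capture the core meaning of the text, we next extract keywords with the jieba package, whose `jieba.analyse.extract_tags` function ranks words by TF-IDF weight.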

# Regular expressions
import re
# Punctuation constants
import string
# HTML entity handling
import html
# Word segmentation and keyword extraction
import jieba
import jieba.analyse

with open('./english.txt', "rb") as x:
    content = x.read().decode("utf-8")
# Convert HTML entities (e.g. &amp;) to characters before punctuation is stripped
content = html.unescape(content)
# Replace runs of punctuation with a space
content = re.sub("[{}]+".format(re.escape(string.punctuation)), " ", content)
# Segment the text, dropping empty tokens
seg = [i for i in jieba.cut(content, cut_all=True) if i != '']
# Extract the top 10 keywords together with their weights
keywords = jieba.analyse.extract_tags("|".join(seg), topK=10, withWeight=True)
# Save the keywords to a file
with open('./keywords0.txt', 'w') as k0:
    k0.write(str(keywords))
print('Keyword extraction complete!')

The result is shown below; the second element of each pair is the keyword's weight.

[('Python', 0.4361421259243218), ('ILM', 0.36904333732058), ('production', 0.19011323437726846), ('was', 0.1789301029433115), ('its', 0.11183131433956968), ('used', 0.11183131433956968), ('into', 0.10064818290561271), ('process', 0.10064818290561271), ('systems', 0.08946505147165575), ('time', 0.08946505147165575)]
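A list of tuples like this is easier to read as aligned columns. The sketch below reuses the first three pairs from the result above; `format_keywords` is a hypothetical helper, and rounding the weights to four decimals is an arbitrary choice.

```python
# First three (keyword, weight) pairs copied from the result above.
keywords = [
    ("Python", 0.4361421259243218),
    ("ILM", 0.36904333732058),
    ("production", 0.19011323437726846),
]


def format_keywords(pairs):
    """Return one aligned 'keyword weight' line per pair."""
    # Keyword left-aligned in 12 columns, weight right-aligned in 8 with 4 decimals
    return ["{0:<12}{1:>8.4f}".format(word, weight) for word, weight in pairs]


for line in format_keywords(keywords):
    print(line)
```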
