
Word frequency statistics with the jieba library: using Python for word cloud analysis, character frequency counts, and character appearance counts on classics such as 《三国演义》 and 《红楼梦》

Posted: 2019-08-03 18:28:55


The walkthrough below uses 《红楼梦》 as the example.

Before making the word cloud and running the statistics, you need to install a few Python libraries: wordcloud, jieba, imageio, and so on. My operating system is Windows 10 and my IDE is IDLE. To install them, search for cmd, open a Command Prompt window, and run pip install wordcloud (and likewise for the other libraries).

When pip finishes like this, the installation has succeeded.
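If you want to double-check that the packages are actually available, a minimal sketch like the one below prints each installed version. The package names are the ones mentioned above; importlib.metadata requires Python 3.8 or later.

# Quick check that the required packages are installed (Python 3.8+).
from importlib.metadata import version, PackageNotFoundError

for pkg in ("wordcloud", "jieba", "imageio"):
    try:
        print(pkg, version(pkg))           # prints the installed version
    except PackageNotFoundError:
        print(pkg, "is NOT installed")     # re-run pip install for this package if this appears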

To analyze a classic novel you also need its text as an ebook. Once the libraries are installed, download the book; the link below provides a txt ebook of 《红楼梦》:

红楼梦txt下载|红楼梦txt全集下载-红楼梦百度云下载-TXT下载站
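The code later in this post assumes the ebook is saved as 红楼梦.txt in UTF-8. Downloaded Chinese txt files are sometimes GBK-encoded, so a small check like the sketch below can save debugging time. The check_encoding helper and the GBK fallback are my own additions for illustration, not part of the original code.

# Verify that 红楼梦.txt can be read as UTF-8; if not, assume GBK and re-save it as UTF-8.
def check_encoding(path="红楼梦.txt"):
    try:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        print("UTF-8 OK, length:", len(text))
    except UnicodeDecodeError:
        with open(path, encoding="gb18030") as f:    # many Chinese txt downloads are GBK/GB18030
            text = f.read()
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)                            # re-save as UTF-8 so the later code works unchanged
        print("Converted to UTF-8, length:", len(text))

check_encoding()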

This is the background image I used (the word cloud mask).

The full code I used is below; the comments are added inline:

import jieba
import wordcloud
from imageio import imread

# 1. Word cloud analysis: generating the word cloud image
def ciyun():
    mask = imread("林黛玉.png")  # load the background/mask image for the word cloud
    tf = open('红楼梦.txt', 'rt', encoding='utf-8')  # open the 《红楼梦》 txt ebook
    txt = ''
    for line in tf.readlines():
        for j in ",.“”?:《》--!":
            line = line.replace(j, '')  # strip punctuation from each line
        txt += line
    jieba_cut = jieba.lcut(txt)  # segment the whole text with jieba
    c = wordcloud.WordCloud(width=1200, font_path='msyh.ttc', height=800,
                            background_color='white', mask=mask)  # canvas size, font, background and mask
    c.generate(' '.join(jieba_cut))
    c.to_file('红楼梦.png')
    tf.close()

ciyun()

# 2. Character appearance counts
# List the words that would distort the result so they can be removed before ranking.
excludes = {"什么","一个","我们","那里","你们","如今","说道","知道","起来","姑娘","这里","出来","他们","众人","自己",
            "一面","只见","怎么","奶奶","两个","没有","不是","不知","这个","听见","这样","进来","咱们","告诉","就是",
            "东西","袭人","回来","只是","大家","只得","老爷","丫头","这些","不敢","出去","所以","不过","的话","不好",
            "姐姐","探春","鸳鸯","一时","不能","过来","心里","如此","今日","银子","几个","答应","二人","还有","只管",
            "这么","说话","一回","那边","这话","外头","打发","自然","今儿","罢了","屋里","那些","听说","小丫头","不用","如何"}
txt = open("红楼梦.txt", "r", encoding='utf-8').read()  # read the 《红楼梦》 txt ebook
words = jieba.lcut(txt)  # segment the whole text with jieba
paixv = {}
for word in words:
    if len(word) == 1:  # single characters are mostly particles and the like, so skip them
        continue
    else:
        paixv[word] = paixv.get(word, 0) + 1
for word in excludes:
    if word in paixv:
        del paixv[word]  # if a listed interference word appears among the segmented words, remove it
items = list(paixv.items())  # convert the dict to a list
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
for i in range(20):  # print the 20 most frequently appearing names
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

# 3. Character (single-character) frequency statistics
import os
import codecs
import jieba
import pandas as pd
from wordcloud import WordCloud
from imageio import imread  # scipy.misc.imread has been removed from SciPy; imageio's imread is the replacement
import matplotlib.pyplot as plt

os.chdir("/Users/Zhaohaibo/Desktop")  # switch to the folder holding the txt files (adjust for your machine)

class Hlm(object):

    def Zipin(self, readdoc, writedoc):
        # readdoc: file to read; writedoc: file to write the results to
        word_lst = []
        word_dict = {}
        exclude_str = ",。!?、()【】<>《》=:+-*—“”…"
        with open(readdoc, "r", encoding='utf-8') as fileIn, open(writedoc, 'w', encoding='utf-8') as fileOut:
            # collect every character into a list
            for line in fileIn:
                for char in line:
                    word_lst.append(char)
            # count each character with a dict
            for char in word_lst:
                if char not in exclude_str:
                    if not char.strip():  # strip removes whitespace of any kind; skip such characters
                        continue
                    if char not in word_dict:
                        word_dict[char] = 1
                    else:
                        word_dict[char] += 1
            # sort: x[1] sorts by frequency, x[0] would sort by character
            lstWords = sorted(word_dict.items(), key=lambda x: x[1], reverse=True)
            # print the results (top 100)
            print('字符\t字频')
            print('=============')
            for e in lstWords[:100]:
                print('%s\t%d' % e)
                fileOut.write('%s, %d\n' % e)

    # word frequency table (DataFrame format)
    def Cipin(self, doc):
        # doc: file to read
        wdict = {}
        f = open(doc, "r", encoding='utf-8')
        for line in f.readlines():
            words = jieba.cut(line)
            for w in words:
                if w not in wdict:
                    wdict[w] = 1
                else:
                    wdict[w] += 1
        f.close()
        # load the stop word list
        stop = pd.read_csv('stoplist.txt', encoding='utf-8', sep='zhao', header=None,
                           engine='python')  # sep: a separator guaranteed not to appear in the stop word list
        stop.columns = ['word']
        stop = [' '] + list(stop.word)  # the space is not read back in, but still needs removing, so add it; stop is a Series and must be turned into a list
        for i in range(len(stop)):
            if stop[i] in wdict:
                wdict.pop(stop[i])
        ind = list(wdict.keys())
        val = list(wdict.values())
        ind = pd.Series(ind)
        val = pd.Series(val)
        data = pd.DataFrame()
        data['词'] = ind
        data['词频'] = val
        return data
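Note that the Hlm class above only defines Zipin and Cipin; the listing never calls them. A minimal usage sketch would look like the lines below. The output file name 字频.csv and the top-20 slice are my own choices for illustration, and Cipin assumes a stoplist.txt file is present in the working directory.

hlm = Hlm()
hlm.Zipin("红楼梦.txt", "字频.csv")   # prints the 100 most frequent characters and writes them to the output file
cipin_df = hlm.Cipin("红楼梦.txt")    # returns a DataFrame with columns 词 / 词频; needs stoplist.txt
print(cipin_df.sort_values("词频", ascending=False).head(20))  # 20 most frequent words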

Screenshots of the final results:

Word cloud:

Character appearance counts:

Character frequency counts:

There is a lot of output, so only part of it is shown.

That covers the word cloud analysis, character frequency counts, and appearance counts for 《红楼梦》. This post mainly records my course design assignment from yesterday; parts of the code were adapted from other sources.

