Python jieba Word Segmentation and Chinese Word Frequency Counting

Date: 2023-07-19 21:10:04

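This post records how I learned to do word frequency counting in Python. It follows on from the previous post about English word frequency counting.

Previous post: Python word frequency counting, sorted by frequency

Reference: the jieba documentation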

1. Introduction to the jieba library

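jieba is a widely used third-party Chinese word segmentation library for Python. It has to be installed with pip. Incidentally, the -i option lets you point pip at a mirror in mainland China, which is usually faster: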

pip install -i https://pypi.tuna./simple jieba

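jieba has three common segmentation modes:

* Precise mode: tries to cut the sentence as accurately as possible. It suits text analysis, though its segmentation speed is less than ideal.

* Full mode: scans out every piece of the sentence that can form a word. It is very fast, but it cannot resolve ambiguity.

* Search engine mode: builds on precise mode and splits long words again. It suits tokenization for search engines.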

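Here are a few commonly used jieba functions:

jieba.lcut(s): precise mode, returns a list. This is the one to use for ordinary segmentation.

jieba.lcut(s, cut_all=True): full mode, returns a list.

jieba.lcut_for_search(s): search engine mode, returns a list.

Let's look at what the three calls produce: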

>>> import jieba
>>> jieba.lcut("青年一代是充满朝气、生机勃勃的")  # precise mode
['青年一代', '是', '充满', '朝气', '、', '生机勃勃', '的']
>>> jieba.lcut("青年一代是充满朝气、生机勃勃的", cut_all=True)  # full mode
['青年', '青年一代', '一代', '是', '充满', '满朝', '朝气', '、', '生机', '生机勃勃', '勃勃', '勃勃的']
>>> jieba.lcut_for_search("青年一代是充满朝气、生机勃勃的")  # search engine mode
['青年', '一代', '青年一代', '是', '充满', '朝气', '、', '生机', '勃勃', '生机勃勃', '的']

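As you can see, precise mode cuts the sentence most accurately, which makes it the right choice for counting word frequencies in an article. The other two modes have their own focus: full mode returns as many words as possible but cannot resolve ambiguity (note the spurious '满朝' above), while the words produced by search engine mode work well as index terms or keywords for a search engine.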
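To make the ambiguity point concrete, here is a toy sketch of what "scan out every possible word" means. It is not jieba's algorithm: `full_mode_toy` and `toy_vocab` are hypothetical names, the dictionary is made up for illustration, and unlike jieba this toy does not emit leftover single characters such as '是'.

```python
def full_mode_toy(text, vocab):
    """Return every substring of `text` that appears in `vocab`, in scan order."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in vocab:
                found.append(piece)
    return found


# Hypothetical mini-dictionary, for illustration only (jieba ships a far larger one).
toy_vocab = {"青年", "一代", "青年一代", "充满", "朝气"}

toy_result = full_mode_toy("青年一代是充满朝气", toy_vocab)
print(toy_result)
```

Because overlapping candidates such as '青年', '一代' and '青年一代' are all kept, nothing in this kind of scan decides which reading is the intended one. That decision is what precise mode adds.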

Next, we will use precise mode to segment Water Margin (《水浒传》) and count the frequencies of the resulting words.

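2. Some preparation

As before, we need a stopword list to filter out stopwords, and a third-party library to deal with Chinese punctuation.

For Chinese punctuation we can use the zhon library directly (it is not part of the standard library and has to be installed manually).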

import zhon.hanzi

punc = zhon.hanzi.punctuation  # Chinese punctuation marks to remove
print(punc)
# includes ＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·！？｡。
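As a rough sketch of how such a punctuation string can be applied to a token list, the snippet below filters tokens against a set of marks. To stay self-contained it uses a small hand-picked subset of punctuation rather than zhon's full list, and `PUNCT` and `drop_punctuation` are hypothetical names. In real use you would presumably build the set from zhon.hanzi.punctuation instead.

```python
# Assumption: a small hand-picked subset of Chinese punctuation, not the full zhon list.
PUNCT = set("，。！？、；：“”‘’《》（）")


def drop_punctuation(tokens, punct=PUNCT):
    """Return the tokens that are not punctuation marks.

    A set is used so that membership is tested per token, not as a substring.
    """
    return [t for t in tokens if t not in punct]


tokens = ['青年一代', '是', '充满', '朝气', '、', '生机勃勃', '的']
cleaned = drop_punctuation(tokens)
print(cleaned)
```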

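That takes care of Chinese punctuation. However, the stopwords corpus of the nltk library I used in the previous post does not include a Chinese stopword list, so I found one online and added it to nltk's stopwords corpus myself. Now it has a Chinese stopword list. (Feeling rather clever about that one.)

Of course, you can also just save the stopword list as a .txt file and read it in when you need it. Both approaches are described below.

Stopword lists shared by a developer on GitHub

I use the Baidu stopword list here. Awkwardly, though, it does not seem to filter the words in my chosen text very well...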

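Method 1: Add the Baidu stopword list to nltk's stopwords corpus

① First copy (download) the Baidu stopword list and save it as a .txt file. Mind the format: it must contain exactly one word (or character) per line.

② Next, locate nltk's stopwords directory. It is usually under the Lib folder of your Python installation, for example: python3.9.7\Lib\nltk_data\corpora\stopwords

You can search for stopwords directly inside Lib, provided you have installed nltk. There may be several stopwords folders under Lib, so be careful to pick the one that belongs to nltk.

③ Copy the .txt file from step ① into the stopwords folder and remove the .txt extension.

That's it. Now let's check whether the newly added stopword list loads: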

>>> from nltk.corpus import stopwords
>>> baidu_stopwords = stopwords.words("baidu_stopwords")
>>> print(baidu_stopwords[:100])
['--', '?', '“', '”', '》', '－－', 'able', 'about', 'above', 'according', 'accordingly', 'across', 'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', "aren't", 'around', 'as', "a's", 'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away', 'awfully', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'both', 'brief', 'but', 'by', 'came', 'can', 'cannot', 'cant', "can't", 'cause', 'causes', 'certain', 'certainly', 'changes', 'clearly', "c'mon", 'co', 'com', 'come', 'comes', 'concerning', 'consequently']
>>> print(baidu_stopwords[-100:])
['起', '起来', '起见', '趁', '趁着', '越是', '跟', '转动', '转变', '转贴', '较', '较之', '边', '达到', '迅速', '过', '过去', '过来', '运用', '还是', '还有', '这', '这个', '这么', '这么些', '这么样', '这么点儿', '这些', '这会儿', '这儿', '这就是说', '这时', '这样', '这点', '这种', '这边', '这里', '这麽', '进入', '进步', '进而', '进行', '连', '连同', '适应', '适当', '适用', '逐步', '逐渐', '通常', '通过', '造成', '遇到', '遭到', '避免', '那', '那个', '那么', '那么些', '那么样', '那些', '那会儿', '那儿', '那时', '那样', '那边', '那里', '那麽', '部分', '鄙人', '采取', '里面', '重大', '重新', '重要', '鉴于', '问题', '防止', '阿', '附近', '限制', '除', '除了', '除此之外', '除非', '随', '随着', '随著', '集中', '需要', '非但', '非常', '非徒', '靠', '顺', '顺着', '首先', '高兴', '是不是', '说说']

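Good, it loaded, which means we have successfully added a Chinese stopword list to nltk. Once you know this trick you can define your own "stopword list" to suit your needs, putting punctuation marks, stopwords and so on into a single file. Then you only need to import nltk when you use it, with no other libraries required.

Method 2: Save the Baidu stopword list as a .txt file and read it when needed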

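# A raw string keeps the backslashes in the Windows path from being read as escapes
with open(r'E:\Python_code\blog\baidu_stopwords.txt', encoding="utf-8") as fp:
    text = fp.read()
print(text[:100], text[-100:])  # first and last 100 characters of the file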
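The snippet above reads the file as one long string. For filtering, a set of words is usually more convenient, so here is a minimal sketch that assumes the one-word-per-line format described earlier. `load_stopwords` and the demo file are hypothetical and stand in for the real Baidu list.

```python
import os
import tempfile


def load_stopwords(path):
    """Read a one-word-per-line UTF-8 file into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as fp:
        return {line.strip() for line in fp if line.strip()}


# Demo with a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_stopwords.txt")
    with open(demo_path, "w", encoding="utf-8") as fp:
        fp.write("的\n是\n\n这个\n")
    demo_stopwords = load_stopwords(demo_path)

print(sorted(demo_stopwords))
```

A set makes each "is this word a stopword?" check a constant-time lookup, which helps when the text being filtered is as long as a novel.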

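3. Chinese word frequency counting

The overall workflow is much the same as for the English word frequency counting in the previous post. The main steps are:

* Read in the document

* Segment it into words

* Remove punctuation and stopwords

* Count word frequencies

* Sort

I don't count words of length 1 here, so the punctuation-removal step can be skipped. Library helpers can also make the counting quicker, and they often come with sorting built in.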

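import jieba
import zhon.hanzi
from nltk.corpus import stopwords

punc = zhon.hanzi.punctuation  # Chinese punctuation marks to remove
baidu_stopwords = stopwords.words('baidu_stopwords')  # load the stopword list

# Read the file (raw string so the backslashes are not treated as escapes)
with open(r'E:\Python_code\Big_data\homework4\水浒传.txt', encoding="utf-8") as fp:
    text = fp.read()

ls = jieba.lcut(text)  # segment into words

# Count word frequencies, ignoring words of length 1
counts = {}
for i in ls:
    if len(i) > 1:
        counts[i] = counts.get(i, 0) + 1

# Remove punctuation (can be skipped, since words of length 1 are not counted)
# for p in punc:
#     counts.pop(p, 0)

for word in baidu_stopwords:  # remove stopwords
    counts.pop(word, 0)

ls1 = sorted(counts.items(), key=lambda x: x[1], reverse=True)  # sort by frequency, descending
print(ls1[:20])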

As in the previous post, we can also let a library do the counting for us.

Using the collections module (part of the standard library)

import collections

import jieba
import zhon.hanzi
from nltk.corpus import stopwords

punc = zhon.hanzi.punctuation  # Chinese punctuation marks to remove
baidu_stopwords = stopwords.words('baidu_stopwords')  # load the stopword list

# Read the file (raw string so the backslashes are not treated as escapes)
with open(r'E:\Python_code\Big_data\homework4\水浒传.txt', encoding="utf-8") as fp:
    text = fp.read()

ls = jieba.lcut(text)

# Drop words of length 1, which also removes punctuation
newls = []
for i in ls:
    if len(i) > 1:
        newls.append(i)

# Count word frequencies
counts = collections.Counter(newls)
for word in baidu_stopwords:  # remove stopwords
    counts.pop(word, 0)
print(counts.most_common(20))
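The same idea can be packaged as a small reusable function. This is only a sketch: `top_words` and the sample tokens are hypothetical, and it filters stopwords before counting instead of popping them afterwards, which is a slight variation on the code above. The tokens are written by hand so the example runs without jieba or the novel.

```python
from collections import Counter


def top_words(tokens, stopwords, n=20, min_len=2):
    """Return the n most common tokens, skipping short tokens and stopwords."""
    stop = set(stopwords)  # set lookup is fast even for large stopword lists
    counts = Counter(t for t in tokens if len(t) >= min_len and t not in stop)
    return counts.most_common(n)


# Hand-written tokens standing in for the output of jieba.lcut(text).
sample_tokens = ["宋江", "说道", "宋江", "李逵", "的", "说道", "宋江", "，"]
sample_top = top_words(sample_tokens, stopwords=["说道"], n=2)
print(sample_top)
```

With real data you would pass the list from jieba.lcut as `tokens` and the loaded stopword list as `stopwords`.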

Using the pandas library

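import jieba
import pandas as pd
import zhon.hanzi
from nltk.corpus import stopwords

punc = zhon.hanzi.punctuation  # Chinese punctuation marks to remove
baidu_stopwords = stopwords.words('baidu_stopwords')  # load the stopword list

# Read the file (raw string so the backslashes are not treated as escapes)
with open(r'E:\Python_code\Big_data\homework4\水浒传.txt', encoding="utf-8") as fp:
    text = fp.read()

ls = jieba.lcut(text)

# Drop words of length 1, which also removes punctuation
newls = []
for i in ls:
    if len(i) > 1:
        newls.append(i)

# Count word frequencies; value_counts() returns them sorted in descending order
ds = pd.Series(newls).value_counts()

for i in baidu_stopwords:
    try:
        ds.pop(i)  # raises KeyError if the word i is not in the Series
    except KeyError:
        continue  # word not present, move on to the next one
print(ds.head(20))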

Test result (shown as a screenshot in the original post)

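4. Custom segmentation and part-of-speech analysis with jieba

jieba supports custom tokenizers and can tag parts of speech through jieba.posseg, but it cannot guarantee that every word gets tagged, and the analysis is fairly slow. I'll show just one example here: a part-of-speech analysis of a commentary on the "14th Five-Year Plan for the Development of the Information and Communications Industry" (《“十四五”信息通信行业发展规划》).

Part-of-speech tag table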

import jieba.posseg as pseg

with open('E:/Python_code/blog/通信行业规划.txt', encoding="utf-8") as f:
    text = f.read()

wordit = pseg.cut(text)  # segmentation with POS tagging; returns an iterable of (word, flag) pairs

count_flag = {}
for word, flag in wordit:
    # Each key is one part-of-speech tag; create its empty list on first sight.
    # setdefault fixes the original if/elif, which dropped the first word of every tag.
    words = count_flag.setdefault(flag, [])
    if len(word) > 1:  # skip words of length 1
        words.append(word)
print(count_flag)
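To show the grouping step in isolation, the sketch below uses hand-written (word, flag) pairs in place of real tagger output, so it runs without jieba. `group_by_flag`, the sample pairs and the tags assigned to them are assumptions for illustration. What jieba would actually assign to these words may differ.

```python
from collections import Counter, defaultdict


def group_by_flag(pairs, min_len=2):
    """Group words by part-of-speech tag, skipping words shorter than min_len."""
    groups = defaultdict(list)
    for word, flag in pairs:
        if len(word) >= min_len:
            groups[flag].append(word)
    return dict(groups)


# Hypothetical (word, flag) pairs standing in for the output of pseg.cut(text).
sample_pairs = [("信息", "n"), ("通信", "n"), ("发展", "v"), ("的", "uj"), ("信息", "n")]

groups = group_by_flag(sample_pairs)
# Most frequent word for each tag
per_flag_top = {flag: Counter(words).most_common(1) for flag, words in groups.items()}
print(groups)
print(per_flag_top)
```

Unlike the loop above, this variant creates a key only for tags that have at least one word of length 2 or more, which keeps empty lists out of the result.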
