
[Python] Clustering Analysis of Public Opinion on Weibo Hot Topics

Time: 2023-01-14 01:44:35

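Python modules to prepare in advance

The implementation in this article relies on several third-party modules. The main ones are:

jieba: the most widely used Chinese word segmentation module
pandas: a commonly used Python module for processing large datasets efficiently
Scikit-learn: a Python toolkit for machine learning
Matplotlib: a Python plotting framework for drawing 2D graphics
requests: a commonly used HTTP library for sending network requests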

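Step 1: Crawling Weibo data

This is a very simple Weibo crawler. The target site is Weibo's mobile search page / (the mobile site is chosen because it is simpler to work with). The code uses Python's requests package.

First, analyze the Weibo page. Enter any keyword on the Weibo search page, press F12 to open Chrome's developer tools, click Network, filter by the XHR tab, and look at the API endpoints the page calls and the JSON data returned in each response.

The URLs follow this pattern: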

/api/container/getIndex?containerid=100103type情人节&type=all&queryVal=情人节&featurecode=20000320&luicode=10000011&lfid=106003type%3D1&title=情人节&page=10
/api/container/getIndex?containerid=100103type情人节&type=all&queryVal=情人节&featurecode=20000320&luicode=10000011&lfid=106003type%3D1&title=情人节&page=11

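Scrolling down the Weibo result list loads the data page by page. Only the page parameter changes as you scroll; all the other parameters stay fixed. That is enough to pin down the endpoint to request.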
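Because only page varies, the URLs for consecutive pages can be generated in a loop. The following is a minimal sketch using the standard library's urlencode. The host `https://weibo.example` is a placeholder assumed for illustration (the real host is not shown in this text), and `build_url` is a hypothetical helper name, not part of the original crawler.

```python
from urllib.parse import urlencode

# Placeholder host: an assumption for illustration only.
BASE_URL = "https://weibo.example/api/container/getIndex?"


def build_url(keyword, page):
    """Return the search URL for one result page (hypothetical helper)."""
    params = {
        "containerid": "100103type=1&q=" + keyword,
        "type": "all",
        "queryVal": keyword,
        "featurecode": "20000320",
        "luicode": "10000011",
        "lfid": "106003type=1",
        "title": keyword,
        "page": page,  # the only value that changes between pages
    }
    # urlencode percent-encodes "=" and "&" inside values, so each
    # value survives as a single query parameter.
    return BASE_URL + urlencode(params)


if __name__ == "__main__":
    for page in (10, 11):
        print(build_url("情人节", page))
```

The two printed URLs should differ only in their trailing page value, which mirrors the pattern observed in the browser.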

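Next, analyze the JSON data returned in the page's `response`. Expanded, the data format looks like this

In this project we only need the text of each post for feature extraction, so we keep only a few useful fields of each post

data
    id
    cards
        mblog
            id # unique identifier
            created_at # publish time
            text # post body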
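To make the nesting concrete, here is a small sketch that walks a hand-written sample shaped like the fields listed above and pulls out the three values. The sample values are invented for illustration, and `extract_posts` is a hypothetical helper, not the article's own parser.

```python
# Hand-written sample that mimics only the fields listed above;
# the values are invented for illustration.
sample = {
    "data": {
        "cards": [
            {"mblog": {"id": "1", "created_at": "Tue Feb 14", "text": "<span>情人节快乐</span>"}},
            {"card_type": 11},  # a card without "mblog" is skipped
        ]
    }
}


def extract_posts(payload):
    """Collect id, created_at and text from every card that carries an mblog."""
    data = (payload or {}).get("data") or {}
    cards = data.get("cards") or []
    posts = []
    for card in cards:
        mblog = (card or {}).get("mblog")
        if not mblog:
            continue
        posts.append({key: mblog.get(key) for key in ("id", "created_at", "text")})
    return posts


if __name__ == "__main__":
    print(extract_posts(sample))
```

Guarding each level with a fallback keeps the loop from raising when a page comes back empty or a card carries no post.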

The implementation:

from urllib.parse import urlencode
import requests
from pyquery import PyQuery as pq
import time
import os
import csv
import json

# The site host is missing from this text; base_url, Host and Referer are kept as given
base_url = '/api/container/getIndex?'

headers = {
    'Host': '',
    'Referer': '/u/2830678474',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}


class SaveCSV(object):

    def save(self, keyword_list, path, item):
        """
        Append one record to a CSV file.
        :param keyword_list: field names, used as the header row
        :param path: path and name of the output file
        :param item: dict to save
        :return:
        """
        try:
            # On first use, create the file and write the header row
            if not os.path.exists(path):
                with open(path, "w", newline='', encoding='utf-8') as csvfile:  # newline='' avoids blank rows
                    writer = csv.DictWriter(csvfile, fieldnames=keyword_list)
                    writer.writeheader()
            # Then append the record
            with open(path, "a", newline='', encoding='utf-8') as csvfile:  # newline='' is required here too
                writer = csv.DictWriter(csvfile, fieldnames=keyword_list)
                writer.writerow(item)
                print("^_^ write success")
        except Exception as e:
            print("write error==>", e)
            # Log the failed record; append so that earlier failures are not overwritten
            with open("error.txt", "a", encoding='utf-8') as f:
                f.write(json.dumps(item) + ",\n")


def get_page(page, title):
    # Build the request; params mirror the Query String parameters shown in the browser
    params = {
        'containerid': '100103type=1&q=' + title,
        'page': page,  # current page number; the one value that must change to turn pages
        'type': 'all',
        'queryVal': title,
        'featurecode': '20000320',
        'luicode': '10000011',
        'lfid': '106003type=1',
        'title': title
    }
    url = base_url + urlencode(params)
    print(url)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            print(page)
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)


# Parse the JSON returned by the endpoint
def parse_page(data, label):
    res = []
    if data:
        items = data.get('data').get('cards')
        for i in items:
            if i is None:
                continue
            item = i.get('mblog')
            if item is None:
                continue
            weibo = {}
            weibo['id'] = item.get('id')
            weibo['label'] = label
            # Strip HTML tags, spaces and line breaks from the post body
            weibo['text'] = pq(item.get('text')).text().replace(" ", "").replace("\n", "")
            res.append(weibo)
    return res


if __name__ == '__main__':
    title = input("Enter a search keyword: ")
    path = "article.csv"
    item_list = ['id', 'text', 'label']
    s = SaveCSV()
    for page in range(10, 20):  # loop over pages
        try:
            time.sleep(1)  # pause between requests to avoid getting the account blocked
            page_json = get_page(page, title)  # not named json, which would shadow the json module
            results = parse_page(page_json, title)
            if results is None:  # the original tested the requests module here by mistake
                continue
            for result in results:
                if result is None:
                    continue
                print(result)
                s.save(item_list, path, result)
        except TypeError:
            print("done")
            continue

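The final results are saved to a CSV file. The saved Weibo data looks like this:

Step 2: Text processing of the Weibo data

The extracted posts are processed with the jieba segmentation module. First, digits, letters, and special symbols are removed from each post with regular expressions. Then jieba segments the post body into words.

The code is as follows: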

import re
import jieba


# Clean the text
def clearTxt(line: str):
    if line != '':
        line = line.strip()
        # Remove English letters and digits
        line = re.sub(r"[a-zA-Z0-9]", "", line)
        # Remove Chinese and English punctuation
        line = re.sub(r"[\s+\.\!\/_,$%^*(+\"\'；：“”．]+|[+——！，。？?、~@#￥%……&*（）]+", "", line)
        return line
    return None


# Segment the text into words
def sent2word(line):
    segList = jieba.cut(line, cut_all=False)
    segSentence = ''
    for word in segList:
        if word != '\t':
            segSentence += word + " "
    return segSentence.strip()
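The text does not show how these two functions are applied to the saved CSV file. One possible way is sketched below, assuming the file has the `text` column written in step 1. `build_corpus` is a hypothetical helper; it takes the cleaning and segmentation functions as arguments so the sketch itself depends only on the standard library.

```python
import csv


def build_corpus(path, clean, segment):
    """Read the CSV from step 1 and return one segmented string per post."""
    corpus = []
    with open(path, newline="", encoding="utf-8") as csvfile:
        for row in csv.DictReader(csvfile):
            cleaned = clean(row.get("text") or "")
            if not cleaned:  # skip rows that are empty after cleaning
                continue
            corpus.append(segment(cleaned))
    return corpus
```

With the functions above, the call would be `corpus = build_corpus("article.csv", clearTxt, sent2word)`, giving a list of space-separated word strings suitable for vectorization.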

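After processing, the data looks like the figure below:

Step 3: Feature vector extraction and KMeans clustering

The KMeans model only accepts numeric vectors as input, so each sentence, now a sequence of words, has to be converted into a numeric vector. In this article we vectorize the documents with the TF-IDF algorithm: all the data is turned into a term-frequency matrix weighted by TF-IDF, which serves as the input to the KMeans model. The maximum number of TF-IDF features (max_features) is set to 20000.

The code is as follows: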

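from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# corpus is the list of segmented sentences produced in step 2

# Converts the words in the texts into a term-frequency matrix; element a[i][j] is the frequency of word j in text i
vectorizer = CountVectorizer(max_features=20000)
# Computes the tf-idf weight of every word
tf_idf_transformer = TfidfTransformer()
# Convert the texts into a term-frequency matrix and compute tf-idf
tfidf = tf_idf_transformer.fit_transform(vectorizer.fit_transform(corpus))
# Convert the sparse tf-idf matrix into a dense array
tfidf_matrix = tfidf.toarray()
# Get all the words in the bag-of-words model
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
word = vectorizer.get_feature_names_out()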

Once the term-frequency matrix is built, we call sklearn's KMeans model directly to cluster all the data.