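Python Web Scraping in Practice: Analyzing Douban Reviews of Wolf Warrior 2 (《战狼2》)

Date: 2022-10-02 13:13:56

I. Introduction

Environment: Windows 10, Jupyter Notebook, Python 3.6, re, bs4, requests

Target: the Douban page for the film Wolf Warrior 2 (《战狼2》).

Home page:

/subject/26363254/

Short-comment page:

/subject/26363254/comments?sort=new_score&status=P

In practice, it is not possible to scrape tens of thousands of comments.

Without logging in, only 10 pages (200 comments) can be fetched; with a logged-in account, about 500 comments can be fetched. Details follow below.

This post covers page analysis, writing the program, and running the scraper.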

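II. Page analysis

To scrape data, you first need to know how it is laid out in the page, and then choose a matching extraction method.

1. Home page analysis

Page 1 of the short comments: /subject/26363254/comments?sort=new_score&status=P

Page 2 of the short comments: /subject/26363254/comments?start=20&limit=20&sort=new_score&status=P

Page 3 of the short comments: /subject/26363254/comments?start=40&limit=20&sort=new_score&status=P

Page 5 of the short comments: /subject/26363254/comments?start=80&limit=20&sort=new_score&status=P

...

From this we can see that only start changes in the URL, increasing by 20 per page. Does the first page work the same way? Yes: setting start to 0 returns exactly the same data as the short-comment home page. So a simple loop over start is enough to fetch the pages: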

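for i in range(0, 10000, 20):
    # i is the start offset of the page (20 comments per page)
    print("Fetching comments from start={0}......".format(int(i)))
    requrl = "/subject/26363254/comments?start=" + str(i) + "&limit=20&sort=new_score&status=P"
    getContent(requrl, headers, cookies, i)
    time.sleep(3)  # pause between requests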
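The pagination rule can also be written as a small helper that builds the page URLs. This is only a sketch: `BASE_URL` is a hypothetical placeholder (the source gives the path but not the host), and `comment_page_urls` is a name introduced here for illustration.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the source shows only the path, so the host is assumed.
BASE_URL = "https://example.invalid/subject/26363254/comments"


def comment_page_urls(pages, page_size=20, base_url=BASE_URL):
    """Yield one URL per page; start advances by page_size each time."""
    for page in range(pages):
        query = urlencode({
            "start": page * page_size,
            "limit": page_size,
            "sort": "new_score",
            "status": "P",
        })
        yield "{0}?{1}".format(base_url, query)


if __name__ == "__main__":
    for url in comment_page_urls(3):
        print(url)
```

Building the query with `urlencode` keeps the parameters in one place, so changing the sort order or page size does not require editing a concatenated string.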

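2. Data analysis

The first page of short comments shows the total number of comments, which does not need to be scraped. The total exceeds 200,000; whether that many can actually be fetched remains to be seen. Each page holds 20 comments, which matches the step of 20 in the URL.

Each comment provides five fields: user ID, star rating, comment time, vote count, and comment text.

Next, look at the page source to see how to extract these fields: right-click the page, choose Inspect to open the developer tools, and use the element picker (the arrow icon) to locate the markup.

Every comment sits in a div tag whose class is "comment-item".

Opening one of them shows the structure of the information we need.

Now find the markup for these five elements in the raw page source. (Why not simply copy what the inspector shows? Because the raw source may differ from the inspector view. For example, for a tag with both a class and a title attribute, the inspector may show class first while the raw source has title first. If the two differ, a regular expression written against the inspector view will not match the source. I learned this the hard way.)

The source for the five elements is: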

<a href="/people/z286424115/" class="">俏皮面</a>33120首映礼看的。太恐怖了这个电影，不讲道理的，完全就是吴京在实现他这个小粉红的英雄梦。各种装备轮番上场，视物理逻辑于不顾，不得不说有钱真好，随意胡闹

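The user ID sits in a tag whose class is empty, and the star rating is hidden in the characters of the tag itself rather than in its text. These two are a bit awkward, so both are extracted with regular expressions.

First build the regular expression:

'<a class="" href="(.*?)">(.*?)</a>'

The corresponding extraction code: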

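# User ID (name)
pattern_Name = re.compile(r'<a class="" href="(.*?)">(.*?)</a>')
patter_name = pattern_Name.findall(str(item))
if patter_name != []:
    name = str(patter_name[0][1])
else:
    print("start={0}: a row has an empty user ID... ".format(int(page)))

# Star rating (score)
# NOTE: the body of this regex is missing from the source text (r'').
# It has to capture the rating digits as its first group for the code below to work.
pattern = re.compile(r'')
patter_score = pattern.findall(str(item))
if patter_score == []:
    print("start={0}: a row has an empty star rating... ".format(int(page)))
    continue
score = str(int(patter_score[0][0]) // 10)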
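Since attribute order was the pitfall mentioned earlier, here is a sketch of a pattern that accepts class and href in either order. The rating pattern is an assumption: the source does not show the rating tag, so the `allstarNN rating` class used here is illustrative and should be checked against the real page. `parse_name` and `parse_score` are names made up for this example.

```python
import re

# Matches an <a> tag that has class="" and an href, in either attribute order.
NAME_RE = re.compile(
    r'<a\s+(?=[^>]*\bclass="")(?=[^>]*\bhref="([^"]*)")[^>]*>(.*?)</a>'
)
# ASSUMPTION: the rating markup looks like <span class="allstar50 rating" ...>.
# The source does not show this tag, so this pattern is illustrative only.
SCORE_RE = re.compile(r'<span\s+[^>]*\bclass="allstar(\d+) rating"')


def parse_name(html):
    """Return the link text of the first matching <a>, or None."""
    match = NAME_RE.search(html)
    return match.group(2) if match else None


def parse_score(html):
    """Return the star count as a string, or None if no rating is found."""
    match = SCORE_RE.search(html)
    return str(int(match.group(1)) // 10) if match else None


if __name__ == "__main__":
    sample = '<a href="/people/z286424115/" class="">俏皮面</a>'
    print(parse_name(sample))
```

The two lookaheads check for each attribute independently before the tag is consumed, which is what removes the dependence on attribute order.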

For the other three fields, the data is in the tag text, so find_all is enough. The corresponding code:

# Comment time (named comment_time so it does not shadow the time module)
if item.find_all('span', class_='comment-time')[0].string is not None:
    # strip() removes the surrounding whitespace; the original str(... .split()) stored a list repr
    comment_time = item.find_all('span', class_='comment-time')[0].string.strip()
else:
    print("start={0}: a row has an empty comment time... ".format(int(page)))

# Vote count
if item.find_all('span', class_="votes")[0].string is not None:
    votes = item.find_all('span', class_="votes")[0].string
else:
    print("start={0}: a row has no votes... ".format(int(page)))

# Comment text
if item.find_all('span', class_="short")[0].string is not None:
    comment = item.find_all('span', class_="short")[0].string
else:
    print("start={0}: a row has an empty comment... ".format(int(page)))
    continue
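The three lookups repeat the same pattern, and `find_all(...)[0]` raises IndexError when the tag is absent. A possible helper, sketched here under the assumption that only the `.string` attribute of the first result matters (`first_string` is a name invented for this example):

```python
from types import SimpleNamespace


def first_string(tags, default=None):
    """Return the stripped .string of the first tag, or default.

    Covers both failure modes: an empty result list and a tag whose
    .string is None.
    """
    if not tags:
        return default
    text = tags[0].string
    return text.strip() if text is not None else default


if __name__ == "__main__":
    # Stand-ins for bs4 tags: only the .string attribute is used here.
    found = [SimpleNamespace(string="  3312 ")]
    print(first_string(found))
    print(first_string([], default="0"))
```

With BeautifulSoup this could be called as `first_string(item.find_all('span', class_='votes'), default='0')`.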

Note that special values need handling here, mainly missing values. The rule is: if either the comment text or the star rating is missing, discard the record.
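The discard rule can be isolated in one function, which makes it easy to change later. A minimal sketch, assuming each comment is held as a dict; the field names `score` and `comment` are chosen for this example and are not taken from the original code.

```python
def clean_record(record, required=("score", "comment")):
    """Return a stripped copy of record, or None if a required field is empty."""
    cleaned = {key: (value or "").strip() for key, value in record.items()}
    if any(not cleaned.get(field) for field in required):
        return None
    return cleaned


if __name__ == "__main__":
    print(clean_record({"name": "a", "score": "4", "comment": " ok "}))
    print(clean_record({"name": "a", "score": "", "comment": "ok"}))
```

Optional fields such as the user name are kept as empty strings, so a record is only dropped when the rating or the comment text is missing.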

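III. Building the code

With the extraction rules settled, the next step is to assemble the whole program.

1. Import the required libraries:

import re
import time
import csv

import requests
from bs4 import BeautifulSoup as bs

Request headers and cookie:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'}
# Placeholder: requests expects a dict mapping each cookie name to its value
cookie = {'cookies': 'your cookies'}

Without logging in, Douban only serves 10 pages of short comments, so to fetch more you have to log in.

For how to obtain the cookie, see: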

/bailixuance/article/details/84715924
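The `cookies` argument of `requests.get` takes a dict that maps cookie names to values, whereas the browser's developer tools show the Cookie request header as one `name=value; name=value` string. The following sketch converts one into the other; it is not necessarily the method of the article linked above, and the cookie names in the demo are invented.

```python
def cookie_header_to_dict(raw):
    """Split a 'k1=v1; k2=v2' Cookie header string into a dict."""
    cookies = {}
    for part in raw.split(";"):
        if "=" not in part:
            continue  # skip empty or malformed fragments
        name, value = part.strip().split("=", 1)
        cookies[name] = value
    return cookies


if __name__ == "__main__":
    # Invented cookie names and values, for illustration only.
    print(cookie_header_to_dict('bid=abc123; ll="108288"'))
```

The result can be passed directly as `requests.get(url, headers=headers, cookies=cookie_header_to_dict(raw))`.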

The complete code:

import re
import time
import csv

import requests
from bs4 import BeautifulSoup as bs


def getContent(requrl, headers, cookies, page):
    resp = requests.get(requrl, cookies=cookies, headers=headers)
    html_data = resp.text
    # Parse the page with BeautifulSoup
    soup = bs(html_data, 'html.parser')
    # Every comment lives in a <div class="comment-item">
    comment_div_lits = soup.find_all('div', class_='comment-item')
    if len(comment_div_lits) == 0:
        print("start={0}: nothing could be scraped.....".format(int(page)))
        print("len(comment_div_lits): ", len(comment_div_lits))
        return
    for item in comment_div_lits:
        name = ''
        score = ''
        comment_time = ''  # not named time, to avoid shadowing the time module
        comment = ''
        votes = ''

        # User ID (name)
        pattern_Name = re.compile(r'<a class="" href="(.*?)">(.*?)</a>')
        patter_name = pattern_Name.findall(str(item))
        if patter_name != []:
            name = str(patter_name[0][1])
        else:
            print("start={0}: a row has an empty user ID... ".format(int(page)))

        # Star rating (score)
        # NOTE: the body of this regex is missing from the source text (r'').
        # It has to capture the rating digits as its first group.
        pattern = re.compile(r'')
        patter_score = pattern.findall(str(item))
        if patter_score == []:
            print("start={0}: a row has an empty star rating... ".format(int(page)))
            continue
        score = str(int(patter_score[0][0]) // 10)

        # Comment time
        if item.find_all('span', class_='comment-time')[0].string is not None:
            comment_time = item.find_all('span', class_='comment-time')[0].string.strip()
        else:
            print("start={0}: a row has an empty comment time... ".format(int(page)))

        # Vote count
        if item.find_all('span', class_="votes")[0].string is not None:
            votes = item.find_all('span', class_="votes")[0].string
        else:
            print("start={0}: a row has no votes... ".format(int(page)))

        # Comment text
        if item.find_all('span', class_="short")[0].string is not None:
            comment = item.find_all('span', class_="short")[0].string
        else:
            print("start={0}: a row has an empty comment... ".format(int(page)))
            continue

        # One row per comment, appended to the CSV file
        each = [name, score, votes, comment_time, comment]
        with open('./zhanlangall.csv', 'a+', encoding='utf-8', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(each)


def main():
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'}
    # Placeholder: requests expects a dict mapping each cookie name to its value
    cookie = {'cookies': 'your cookies'}
    for i in range(0, 10000, 20):
        print("Fetching comments from start={0}......".format(int(i)))
        requrl = "/subject/26363254/comments?start=" + str(i) + "&limit=20&sort=new_score&status=P"
        getContent(requrl, headers, cookie, i)
        time.sleep(3)
    print("All data fetched, scraper finished")


main()

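IV. Scraping results

From start=500 onward, every page comes back empty. To check, open that page in the browser with start set to 500.

The page is empty. In other words, although there are 200,000+ comments, only about 500 of them can actually be viewed. If they cannot even be viewed, there is nothing to scrape.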
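Because the pages simply run out, the loop could stop at the first empty page instead of running on to start=10000. The sketch below assumes a fetcher that returns the rows of a page as a list, which differs from `getContent` above (it writes rows straight to the CSV file and returns nothing); `crawl_until_empty` and `fake_fetch` are hypothetical names.

```python
def crawl_until_empty(fetch_page, page_size=20, max_start=10000):
    """Call fetch_page(start) for start = 0, page_size, ... and stop at the first empty page.

    fetch_page is any callable returning a list of rows for the given offset.
    Returns all rows collected.
    """
    rows = []
    for start in range(0, max_start, page_size):
        page_rows = fetch_page(start)
        if not page_rows:
            break
        rows.extend(page_rows)
    return rows


if __name__ == "__main__":
    # Fake fetcher standing in for the real request: data runs out at offset 60.
    def fake_fetch(start):
        return ["row{0}".format(start + k) for k in range(20)] if start < 60 else []

    print(len(crawl_until_empty(fake_fetch)))
```

Passing the fetcher in as a parameter also makes the loop testable without network access.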

Now for the results. Import the scraped CSV into Excel; if the text is garbled when the file is opened, see:

/bailixuance/article/details/84678133
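One common cause of garbled Chinese text is that Excel does not assume UTF-8 for a CSV file that lacks a byte order mark. Writing the file with Python's `utf-8-sig` codec adds that mark. This is a sketch of that approach, which may or may not be what the article linked above describes; `write_rows_for_excel` is a name introduced here.

```python
import csv
import os
import tempfile


def write_rows_for_excel(path, rows):
    """Write rows as CSV with a UTF-8 byte order mark so Excel detects the encoding."""
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
    write_rows_for_excel(demo_path, [["俏皮面", "2", "3312"]])
    print(demo_path)
```

Reading such a file back with pandas works with `encoding='utf-8-sig'`, which strips the mark again.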

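Result:

About 500 records.

V. Scraping summary

Step 1: inspect the elements in Chrome

Step 2: fetch the HTML text of a single page

Step 3: parse out the required information with regular expressions and store it in a list

Step 4: save the list to a CSV file

Step 5: use the start parameter to fetch the other pages of short comments

Analyze the page elements and the data structure, and pay attention to the data:

class values may be empty; regular expressions can handle that.

When the data sits inside the tag itself, use regular expressions.

Use cookies; trying POST is another option.

Exceptions can be handled with try/except.

No general-purpose framework was used to write this code, but the upside is that it is quick to learn and understand.

The datasets with tens of thousands of comments seem to have been scraped from Maoyan instead?

At first I grouped each field into its own list before writing to CSV: the 20 user IDs on a page formed one list, likewise for the other four fields, and the five lists were then written to CSV. This caused many index-out-of-range errors, probably because some fields were missing for some comments.

I then switched to the current strategy: each person's data forms one list that is written to CSV as a row, so a missing field does no harm.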

VI. Word cloud

1. Preprocessing

import re

import jieba
import pandas as pd
from matplotlib import pyplot as plt

filepath = 'zhanlangall_5.csv'
# Add column headers: user ID, star rating, votes, date, comment text
data = pd.read_csv(filepath, header=None, names=['用户ID', '评分星级', '点赞数', '发布日期', '评论内容'])
# Show overall information about the data
print(data.info())
# Show the first 5 rows
data.head()

# Check for missing values
print(data.isnull().sum())
print(len(data['用户ID']))
print(len(data['评分星级']))
print(len(data['点赞数']))
print(len(data['发布日期']))
print(len(data['评论内容']))

Result:

用户ID    0
评分星级    0
点赞数     0
发布日期    0
评论内容    0
dtype: int64
484
484
484
484
484

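2. Building a single string

# Join all comments into one string
comments = ''
for k in range(len(data['评论内容'])):
    comments = comments + (str(data['评论内容'][k])).strip()
print(comments)

Removing punctuation and emoji:

# Use a regular expression to drop punctuation and emoji (only runs of Chinese characters are kept)
pattern = re.compile(r'[\u4e00-\u9fa5]+')
filterdata = re.findall(pattern, comments)
cleaned_comments = ''.join(filterdata)
print(cleaned_comments)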
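To see what this filter keeps and drops, here is a small standalone sketch on a made-up sample string (`keep_chinese` is a name introduced for the example). Note that the character class covers only CJK ideographs, so digits and Latin letters are removed along with punctuation and emoji.

```python
import re

# Runs of CJK ideographs in the range U+4E00 to U+9FA5.
CJK_RUN = re.compile(r'[\u4e00-\u9fa5]+')


def keep_chinese(text):
    """Return text with everything except Chinese characters removed."""
    return ''.join(CJK_RUN.findall(text))


if __name__ == "__main__":
    print(keep_chinese("太恐怖了!! lol 这个电影,2017"))
```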

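Word segmentation:

# Word segmentation
segment = jieba.lcut(cleaned_comments)
words_df = pd.DataFrame({'segment': segment})
words_df.head()

Removing stop words:

# Remove stop words
# quoting=3 (csv.QUOTE_NONE): quote characters are treated as ordinary text
stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
words_df.head()

Word frequency count:

# Word frequency count
# size() replaces the original agg({"计数": numpy.size}); renaming through a dict in agg
# was deprecated and later removed from pandas
words_stat = words_df.groupby('segment').size().reset_index(name='计数')
words_stat = words_stat.sort_values(by=["计数"], ascending=False)
words_stat.head()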
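For a quick check without pandas, stop-word removal and counting can also be done with the standard library. This is an alternative sketch, not the method used above; `word_counts` and the sample words are made up for illustration.

```python
from collections import Counter


def word_counts(words, stopwords=()):
    """Count the words that are not in stopwords."""
    stop = set(stopwords)
    return Counter(word for word in words if word not in stop)


if __name__ == "__main__":
    counts = word_counts(["电影", "的", "好看", "电影"], stopwords=["的"])
    print(counts.most_common(2))
```

`Counter.most_common(n)` returns the n most frequent words in descending order, which corresponds to the sorted `words_stat` table.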

Word cloud display:

# Word cloud display
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from wordcloud import WordCloud, ImageColorGenerator  # word cloud package

matplotlib.rcParams['figure.figsize'] = (10.0, 5.0)
# Set the font, background color, and maximum font size
wordcloud = WordCloud(font_path="simhei.ttf", background_color="white", max_font_size=80)
# Map each of the 1000 most frequent words to its count
word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}
wordcloud = wordcloud.fit_words(word_frequence)
# image_colors = ImageColorGenerator(bg_pic)  # derive the word colors from an image
plt.imshow(wordcloud)
wordcloud.to_file('show_Chinese.png')  # save the word cloud

VII. Data analysis
