[Python_006] Scraping Douban Movie Reviews with a Python Crawler

Posted: 2018-08-20 03:22:32

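Preface:
In my previous post, [Python_005] Generating Word Frequencies and Word Clouds with jieba and wordcloud, I scraped 100 short reviews of a movie from Douban to test word segmentation and the word cloud. This post shows how to scrape Douban reviews (using Sherlock as the example again, of course, hehe).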

Modules used

Scraping relies mainly on two modules: urllib.request and BeautifulSoup

urllib.request

urllib.request is an extensible library for opening URLs.
See the official documentation for details.

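The Request class in urllib.request builds the request, and urlopen sends it and returns a response object whose read() method gives the page source.

The request needs header information. Without it, the server may answer with HTTP Error 418.

To find the header value (I use Chrome here): open any web page, press F12 to open the developer tools, go to Network -> Headers, and look for the line starting with User-Agent. That string is the header to send.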
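As a minimal sketch of the header step, the snippet below builds a Request carrying a User-Agent and inspects it before anything is sent. The URL https://example.com/, the shortened User-Agent string, and the helper names build_request and fetch are placeholders for illustration, not values from this post.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Placeholder values for illustration only; paste the User-Agent
# string copied from your own browser's developer tools.
EXAMPLE_URL = 'https://example.com/'
EXAMPLE_UA = 'Mozilla/5.0 (placeholder)'


def build_request(url, user_agent):
    """Return a Request that carries the given User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch(url, user_agent):
    """Download a page as text, or return None if the server rejects it."""
    try:
        with urlopen(build_request(url, user_agent)) as resp:
            return resp.read().decode('utf-8')
    except HTTPError as err:
        # A rejected request (for example status 418) ends up here.
        print('Request rejected with HTTP status', err.code)
        return None


req = build_request(EXAMPLE_URL, EXAMPLE_UA)
# urllib stores header names capitalized, so the key is 'User-agent'.
print(req.get_header('User-agent'))
```

Nothing is sent over the network until fetch is called, so the header can be checked first.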

from urllib.request import urlopen, Request

url = '/explore#!type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=20&page_start=0'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)....../537.36'}
req = Request(url, headers=headers)  # build the request with the header attached
resp = urlopen(req)                  # send it and get the response
print(resp.read().decode('utf-8'))   # page source decoded as text

The output is the HTML source of the page.

BeautifulSoup4

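BeautifulSoup4 is a Python library for extracting data from HTML or XML files. It works with the HTML parser in the Python standard library and also supports third-party parsers.
BeautifulSoup4 automatically converts input documents to Unicode and output documents to UTF-8.
See the official documentation for details.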

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

url = '/explore#!type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=recommend&page_limit=20&page_start=0'
headers = {'User-Agent': 'Mozilla/5.0 ..... Chrome/70.0.3538.67 Safari/537.36'}
req = Request(url, headers=headers)
resp = urlopen(req)
page = resp.read().decode('utf-8')
soup = BeautifulSoup(page, 'html.parser')  # parse with the standard-library parser
print(soup.prettify())

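BeautifulSoup(page, 'html.parser') produces a BeautifulSoup object, and soup.prettify() returns that document as a string formatted with standard indentation.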
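To make the difference between the two concrete, here is a small sketch that parses a hand-written HTML fragment (a stand-in for a downloaded page, not taken from Douban) and checks the types involved.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML fragment standing in for a downloaded page.
html = '<div><p>Hello</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# prettify() returns a plain string with one tag per line, indented by depth.
pretty = soup.prettify()
print(type(soup).__name__)    # BeautifulSoup
print(type(pretty).__name__)  # str
print(pretty)
```

Because prettify() gives back a string, searching for tags is done on soup itself rather than on the prettified text.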

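BeautifulSoup can traverse the document and find every tag that matches a condition.

In this example, that means finding the tags that hold the comments: find_all() returns all of the comments together with their enclosing tags.

In the document returned by soup.prettify() above, the short comments sit in span tags whose class is short.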

soup.find_all('span', class_='short')

The return value is a ResultSet object, which behaves like a list of the matching tags.
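The following sketch shows what working with that ResultSet might look like. The markup is hand-written sample data; the real page is only assumed to have a similar shape, with comments in span tags of class short.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup, not real Douban output.
html = '''
<div class="comment"><span class="short">Great show.</span></div>
<div class="comment"><span class="short"> Loved it! </span></div>
<div class="comment"><span class="votes">12</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Only span tags whose class is "short" are returned.
spans = soup.find_all('span', class_='short')

# get_text() returns the text inside each tag; strip=True trims whitespace.
comments = [span.get_text(strip=True) for span in spans]
print(len(spans))
print(comments)
```

Each item in the ResultSet is still a Tag, so get_text() is what turns it into plain comment text.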

To scrape multiple pages, add a loop over the page URLs at the start.
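A rough sketch of how the page URLs could be generated is given here. The template on example.com and the helper name page_urls are placeholders; only the idea of a start offset that grows by the page size reflects this post.

```python
# Placeholder template for illustration; {0} is the offset of the first comment.
TEMPLATE = 'https://example.com/comments?start={0}&limit=20'


def page_urls(template, pages, page_size=20):
    """Return one URL per page, stepping the start offset by page_size."""
    return [template.format(page * page_size) for page in range(pages)]


urls = page_urls(TEMPLATE, 3)
for url in urls:
    print(url)
```

With a page size of 20, the offsets run 0, 20, 40, and so on, one per request.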

The complete code is as follows:

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import pandas as pd

url_init = '/subject/3986493/comments?start={0}&limit=20&sort=new_score&status=P'
urls = [url_init.format(x * 20) for x in range(5)]  # 5 pages, 20 comments each
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) .....36'}

result = []
for url in urls:
    req = Request(url, headers=headers)
    resp = urlopen(req)
    page = resp.read().decode('utf-8')
    soup = BeautifulSoup(page, 'html.parser')
    tb = soup.find_all('span', class_='short')
    for tag in tb:
        result.append(tag.get_text())  # keep the comment text, not the Tag object

result = pd.Series(result)
result.to_excel('E:/Python Project/douban_comment.xlsx')
