
Scraping Douban Books Top 250 and Saving It to a Local CSV File

Posted: 2020-02-12 01:12:26


Goal

Save the Douban Books Top 250 ranking to a local CSV file, including each book's title, author, rating, number of ratings, short review, and URL. The requests, re, BeautifulSoup, and csv libraries are used.
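Before the full script, here is a minimal sketch of just the csv part (the demo.csv filename is only for illustration): opening the file with newline='' prevents blank rows on Windows, and encoding='utf-8-sig' writes a BOM so Excel displays the Chinese headers correctly.

import csv

# Minimal sketch: write a header row the same way the full script does.
# 'utf-8-sig' adds a BOM so Excel renders Chinese text correctly;
# newline='' stops the csv module from inserting blank lines on Windows.
with open('demo.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(('序号', '书名', '作者', '评分', '评论数', '简评', '网址'))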

Douban Books Top 250 URL: https://book.douban.com/top250

Overall Approach

The complete code first (run it in PyCharm):

import requests                # fetch the web pages
import re                      # regular expressions, used to extract the ratings count
from bs4 import BeautifulSoup  # parse the HTML
import csv                     # create the CSV file and write rows


def get_all_url():
    # Build the URLs of all ten list pages (25 books per page).
    urls = []
    for i in range(0, 250, 25):
        url_1 = 'https://book.douban.com/top250?start={}'.format(i)
        urls.append(url_1)
    return urls


def get_book_info(url_2):
    # Fetch one list page, extract the data for its 25 books,
    # and write one CSV row per book.
    global ids  # running row number
    header = {'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) '
                            'AppleWebKit/605.1.15 (KHTML, like Gecko) '
                            'Version/13.0.3 Mobile/15E148 Safari/604.1'}
    res = requests.get(url_2, headers=header)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'lxml')
    for i in range(25):
        data_name = soup.find_all('div', class_='pl2')[i]
        names = data_name.a.get('title')                    # book title
        href = data_name.a.get('href')                      # book URL
        data_author = soup.find_all('p', class_='pl')[i]
        authors = data_author.get_text().split('/')[0]      # author is the first '/'-separated field
        data_score = soup.find_all('span', class_='rating_nums')[i]
        scores = data_score.get_text()                      # rating
        data_msg = soup.find_all('span', class_='pl')[i].get_text()
        msgs = re.findall(r'\d+', data_msg)[0]              # number of ratings
        # The short review sits in the odd-indexed <td valign="top"> cells;
        # some books have none.
        data_com = soup.find_all('td', valign='top')[2 * i + 1].find('span', class_='inq')
        if data_com is not None:
            comments = data_com.get_text()
        else:
            comments = '无'
        ids += 1
        book_info.writerow((ids, names, authors, scores, msgs, comments, href))


if __name__ == '__main__':
    try:
        with open('f:/python_document/豆瓣读书TOP250.csv', 'w',
                  newline='', encoding='utf-8-sig') as file:
            book_info = csv.writer(file)
            book_info.writerow(('序号', '书名', '作者', '评分', '评论数', '简评', '网址'))
            url_list = get_all_url()
            ids = 0
            for url in url_list:
                get_book_info(url)
        print('读取完成!')
    except Exception as e:
        print('Error:', e)

Part 1: A function that collects all the page URLs (covering the full top 250)

def get_all_url():
    # Build the URLs of all ten list pages (25 books per page).
    urls = []
    for i in range(0, 250, 25):
        url_1 = 'https://book.douban.com/top250?start={}'.format(i)
        urls.append(url_1)
    return urls

Analyzing the URL pattern

Open the Douban Books Top 250 page: https://book.douban.com/top250

Page 1: https://book.douban.com/top250

Page 2: https://book.douban.com/top250?start=25

Page 3: https://book.douban.com/top250?start=50

Page 10: https://book.douban.com/top250?start=225

Page 1 can equivalently be written as https://book.douban.com/top250?start=0, which makes the pattern explicit.

The pattern is clear: the start parameter grows by 25 per page, so a single function can build all ten URLs and store them in a list. A quick sanity check is sketched below.
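A minimal, self-contained check of the URL list (it builds the same URLs inline and just prints the endpoints):

# Sanity check: build the list inline and inspect its endpoints.
urls = ['https://book.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]
print(len(urls))   # 10
print(urls[0])     # https://book.douban.com/top250?start=0
print(urls[-1])    # https://book.douban.com/top250?start=225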

Part 2: A function that takes a page URL, extracts the data for all 25 books on that page, and writes the rows to the CSV file

def get_book_info(url_2):
    # Fetch one list page, extract the data for its 25 books,
    # and write one CSV row per book.
    global ids  # running row number
    header = {'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) '
                            'AppleWebKit/605.1.15 (KHTML, like Gecko) '
                            'Version/13.0.3 Mobile/15E148 Safari/604.1'}
    res = requests.get(url_2, headers=header)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'lxml')
    for i in range(25):
        data_name = soup.find_all('div', class_='pl2')[i]
        names = data_name.a.get('title')                    # book title
        href = data_name.a.get('href')                      # book URL
        data_author = soup.find_all('p', class_='pl')[i]
        authors = data_author.get_text().split('/')[0]      # author is the first '/'-separated field
        data_score = soup.find_all('span', class_='rating_nums')[i]
        scores = data_score.get_text()                      # rating
        data_msg = soup.find_all('span', class_='pl')[i].get_text()
        msgs = re.findall(r'\d+', data_msg)[0]              # number of ratings
        # The short review sits in the odd-indexed <td valign="top"> cells;
        # some books have none.
        data_com = soup.find_all('td', valign='top')[2 * i + 1].find('span', class_='inq')
        if data_com is not None:
            comments = data_com.get_text()
        else:
            comments = '无'
        ids += 1
        book_info.writerow((ids, names, authors, scores, msgs, comments, href))

Getting the request header: open Chrome DevTools (Ctrl+Shift+J) and copy the User-Agent from any request's headers in the Network panel.
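As a quick check that the copied header is accepted (a hedged sketch; this status-code test is an addition, not part of the original script):

import requests

# Paste the User-Agent copied from DevTools into a dict and fire a test
# request; a 200 status means the header got past the site's UA check.
header = {'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) '
                        'AppleWebKit/605.1.15 (KHTML, like Gecko) '
                        'Version/13.0.3 Mobile/15E148 Safari/604.1'}
res = requests.get('https://book.douban.com/top250?start=0', headers=header)
print(res.status_code)  # expect 200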

Fields extracted: title, author, rating, number of ratings, short review, and URL.

Right-click each piece of information on the page and choose Inspect.

Pay attention to the tag names and classes; they are exactly what the find_all calls above match on.
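For reference, the tag layout the script targets looks roughly like the simplified fragment below (illustrative only: the real Douban markup is richer, and the href is a placeholder). Each print mirrors one extraction step from get_book_info.

import re
from bs4 import BeautifulSoup

# Simplified fragment mirroring the tags the script targets
# (real Douban markup has more attributes and nesting).
html = '''
<table><tr>
<td valign="top"><a href="https://book.douban.com/subject/0000000/"></a></td>
<td valign="top">
  <div class="pl2"><a href="https://book.douban.com/subject/0000000/" title="红楼梦">红楼梦</a></div>
  <p class="pl">[清] 曹雪芹 著 / 人民文学出版社 / 1996-12 / 59.70元</p>
  <span class="rating_nums">9.6</span>
  <span class="pl">(347759人评价)</span>
  <span class="inq">都云作者痴,谁解其中味?</span>
</td>
</tr></table>
'''
soup = BeautifulSoup(html, 'lxml')
item = soup.find_all('td', valign='top')[1]                       # odd-indexed cell holds the info
print(item.find('div', class_='pl2').a.get('title'))              # title: 红楼梦
print(item.find('p', class_='pl').get_text().split('/')[0])       # author: first '/'-separated field
print(item.find('span', class_='rating_nums').get_text())         # rating: 9.6
print(re.findall(r'\d+', item.find('span', class_='pl').get_text())[0])  # ratings count: 347759
print(item.find('span', class_='inq').get_text())                 # short review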

Original article: /m0_46426889/article/details/104754257
