
Python crawler (part 1): example code for scraping images from a simple web page with requests + BeautifulSoup

Date: 2019-04-17 17:05:43



I recently learned Python and, drawing on articles by various experts, wrote the following code to scrape images from a web page. I hope it helps.
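The core pattern the full script uses is: fetch a page, parse it with BeautifulSoup, and pull image URLs out of the parsed tree. That step can be sketched in isolation like this (the function name and sample HTML are illustrative, not from the original script):

```python
from bs4 import BeautifulSoup

def extract_image_urls(html, selector="img"):
    """Return the src attribute of every <img> tag matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag["src"] for tag in soup.select(selector) if tag.has_attr("src")]

# A static snippet lets us exercise the parsing without any network access.
sample = '<div><img class="pic-large" src="/img/a_1.jpg"><img src="/img/a_2.jpg"></div>'
print(extract_image_urls(sample, "img.pic-large"))  # ['/img/a_1.jpg']
print(extract_image_urls(sample))                   # ['/img/a_1.jpg', '/img/a_2.jpg']
```

Separating the parsing from the download this way also makes the logic testable offline, which the monolithic script below is not.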

The development tool used is IDEA.
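One detail worth calling out before the full script: it builds the URL for each numbered image page by splicing a page number into the `.html` filename (`href[:-5] + '_' + str(num) + '.' + href[-4:]`). A small helper (hypothetical, not in the original code) makes that string surgery explicit:

```python
def paged_url(href, num):
    """Insert a page number before the extension: '/zt/abc.html' -> '/zt/abc_2.html'."""
    base, ext = href.rsplit('.', 1)  # split off the final extension only
    return f"{base}_{num}.{ext}"

print(paged_url('/zt/photo123.html', 2))  # /zt/photo123_2.html
```

`rsplit('.', 1)` is slightly safer than fixed slicing, since it does not silently corrupt the URL if the extension is ever not exactly four characters.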

```python
# coding=utf-8
import requests
from bs4 import BeautifulSoup
import os
import sys

'''
# Python 2 only: reset the default encoding
reload(sys)
sys.setdefaultencoding('utf-8')
'''

if os.name == 'nt':
    print(u'You are on Windows')
else:
    print(u'You are on Linux')

# HTTP request header
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}

# Note: the scheme and domain were stripped from these URLs when the article
# was re-published; requests needs absolute URLs ('http://...') to run.
all_url = '/zt/xinggan.html'
start_html = requests.get(all_url, headers=header)

# Save location -- create this folder manually first
path = 'D:/练习/'

# Find the highest page number
soup = BeautifulSoup(start_html.text, "html.parser")
page = soup.find_all('a', class_='num', rel='nofollow')
max_page = int(page[2].text) + 1

same_url = '/zt/'
for n in range(1, int(max_page) + 1):
    ul = same_url + 'xinggan_' + str(n) + '.html'
    print('ul:' + ul)
    start_html = requests.get(ul, headers=header)
    print(start_html)
    soup = BeautifulSoup(start_html.text, "html.parser")
    all_a = soup.find('div', class_='tab_tj').find_all('a', target='_blank')
    for a in all_a:
        title = a.get_text()  # extract the link text
        if title != '':
            print("About to scrape: " + title)
            # Windows cannot create a directory whose name contains '?'
            dir_name = path + title.strip().replace('?', '')
            if os.path.exists(dir_name):
                # print('directory already exists')
                flag = 1
            else:
                os.makedirs(dir_name)
                flag = 0
            os.chdir(dir_name)
            href = a['href']
            print('href: ' + href)
            html = requests.get(href, headers=header)
            mess = BeautifulSoup(html.text, "html.parser")
            pic_max = mess.select('.ptitle em')[0].text  # number of pictures in the set
            print(pic_max)
            if flag == 1 and len(os.listdir(dir_name)) >= int(pic_max):
                print('Already saved in full, skipping')
                continue
            for num in range(1, int(pic_max) + 1):
                # '/zt/abc.html' -> '/zt/abc_2.html'
                pic = href[:-5] + '_' + str(num) + '.' + href[-4:]
                print(pic)
                html = requests.get(pic, headers=header)
                html.encoding = 'utf8'
                mess = BeautifulSoup(html.text, "html.parser")
                pic_url = mess.select('.pic-large')[0]
                print(pic_url)
                html = requests.get(pic_url['src'], headers=header)
                file_name = pic_url['src'].split('/')[-1]
                with open(file_name, 'wb') as f:
                    f.write(html.content)
        print('Done')
    print('Page', n, 'finished')
```
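The script above only strips `'?'` before creating a directory from a page title, but Windows forbids several other characters in file and directory names (`\ / : * ? " < > |`). A more thorough sanitizer might look like this (a sketch; the name and regex are my own, not part of the original script):

```python
import re

# Characters Windows disallows in file and directory names.
WINDOWS_FORBIDDEN = r'[\\/:*?"<>|]'

def sanitize_dirname(title):
    """Strip forbidden characters and surrounding whitespace from a title."""
    return re.sub(WINDOWS_FORBIDDEN, '', title).strip()

print(sanitize_dirname(' my album? part 1/2 '))  # 'my album part 12'
```

Replacing the three separate `title.strip().replace('?', '')` calls with one `sanitize_dirname(title)` would also remove the repetition in the main loop.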

