
day26 - Scraping - First Look at the Scrapy Framework

Posted: 2022-04-09 18:59:52


1. Framework overview: high-performance asynchronous downloading, parsing, and persistent storage

2. Download, install, and create a project

pip install wheel

Twisted: installed as part of the 5-step Windows install below!

II. Installation

Linux:
    pip3 install scrapy

Windows:
    a. pip3 install wheel
    b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    c. cd into the download directory and run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
    d. pip3 install pywin32
    e. pip3 install scrapy

scrapy startproject <project name>
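For reference (this layout is not shown in the original notes), scrapy startproject generates a project from Scrapy's default template that looks roughly like this:

fristBlood/
    scrapy.cfg            # project config / entry point
    fristBlood/
        __init__.py
        items.py          # Item field declarations
        middlewares.py
        pipelines.py      # item pipelines for persistence
        settings.py       # project settings (USER_AGENT, ROBOTSTXT_OBEY, ITEM_PIPELINES, ...)
        spiders/
            __init__.py   # spider files created by scrapy genspider go here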

3. Using the project -- 5 steps, summarized from the video:

1. Create the project: scrapy startproject fristBlood

2. cd fristBlood, then create a spider file with scrapy genspider chouti (a new chouti.py appears under spiders; check its name and start_urls, and comment out #allowed_domains)

3. Write the parse method in chouti.py

4. Configure the settings file: spoof the UA (USER_AGENT) in settings.py and set ROBOTSTXT_OBEY = False

5. Once configured, run in cmd: scrapy crawl <spider file name>
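As an optional aside (not covered in the notes), the same crawl can also be launched from a small Python script using scrapy.cmdline, which saves switching to the cmd window. A minimal sketch, assuming a hypothetical helper file named run.py placed next to scrapy.cfg:

# run.py -- hypothetical helper script in the project root (next to scrapy.cfg)
from scrapy import cmdline

# equivalent to typing "scrapy crawl chouti --nolog" in cmd
cmdline.execute("scrapy crawl chouti --nolog".split())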

1. Crawling chouti (project: fristBlood)

# -*- coding: utf-8 -*-
import scrapy


class ChoutiSpider(scrapy.Spider):
    # name of the spider: identifies this specific spider file
    name = 'chouti'
    # allowed domains:
    # allowed_domains = ['']
    # list of start urls: when the project runs, the page data for these urls is fetched
    start_urls = ['/']

    # purpose of this method: parse the page data returned for the urls in start_urls
    # the response parameter is the response object for the request sent to a start url
    def parse(self, response):
        print(response)

chouti.py

2. Crawling Qiushibaike -- note what happens inside parse (project: qiubaiPro)

# extract() retrieves the text content stored in a Selector object
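A quick sketch (not from the video) of the difference between extract() and extract_first() on a standalone Selector; the HTML snippet here is made up purely for illustration:

from scrapy.selector import Selector

html = '<div><span>first</span><span>second</span></div>'
sel = Selector(text=html)

print(sel.xpath('//span/text()').extract())        # ['first', 'second'] -- a list of strings
print(sel.xpath('//span/text()').extract_first())  # 'first' -- a single string (or None if nothing matched)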

Wrap the results in an iterable type (parse must return one for terminal-based export)

Terminal-command-based persistence: scrapy crawl qiubai -o data.csv --nolog (not commonly used)

# -*- coding: utf-8 -*-
import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['']
    start_urls = ['/text/']

    def parse(self, response):
        # xpath returns a list whose elements are Selector objects
        div_list = response.xpath('//div[@id="content-left"]/div')
        # list that will hold the parsed data
        all_data = []
        for div in div_list:
            # extract() retrieves the text content stored in a Selector object
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()  # takes the first element directly, no [0] needed -- same meaning as the line above
            content = div.xpath('.//div[@class="content"]/span//text()').extract()  # //text() matches more than one node, so extract() returns a list
            content = "".join(content)  # join the list into a single string
            dict = {
                'author': author,
                'content': content
            }
            all_data.append(dict)
        return all_data
        # Persistence options:
        # 1. Terminal command: parse must return an iterable object
        # 2. Item pipeline

qiubai.py

3. Crawling Qiushibaike -- pipeline-based persistence -- note the item (project: pipeLinePro)

Write pipelines.py

Enable ITEM_PIPELINES in settings.py (around lines 67-69)

In ITEM_PIPELINES, a smaller number means higher priority (among the pipelines)

One pipeline writes to disk, the other writes to the database

Suppress log output: scrapy crawl chouti --nolog
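Alternatively (not mentioned in the notes), log noise can be reduced in settings.py itself via the standard LOG_LEVEL setting:

# settings.py -- only errors are printed, so crawl output stays readable without --nolog
LOG_LEVEL = 'ERROR'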

cls clears the cmd screen

import scrapy


class PipelineproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

items.py

# -*- coding: utf-8 -*-

# Scrapy settings for pipeLinePro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# /en/latest/topics/settings.html
# /en/latest/topics/downloader-middleware.html
# /en/latest/topics/spider-middleware.html

BOT_NAME = 'pipeLinePro'

SPIDER_MODULES = ['pipeLinePro.spiders']
NEWSPIDER_MODULE = 'pipeLinePro.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'pipeLinePro (+)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See /en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See /en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'pipeLinePro.middlewares.PipelineproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See /en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'pipeLinePro.middlewares.PipelineproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See /en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See /en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'pipeLinePro.pipelines.PipelineproPipeline': 300,
    'pipeLinePro.pipelines.MyPipeline': 301,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See /en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See /en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

settings.py

# -*- coding: utf-8 -*-
import scrapy
from pipeLinePro.items import PipelineproItem


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['']
    start_urls = ['/text/']

    def parse(self, response):
        # xpath returns a list whose elements are Selector objects
        div_list = response.xpath('//div[@id="content-left"]/div')
        # list that will hold the parsed data
        all_data = []
        for div in div_list:
            # extract() retrieves the text content stored in a Selector object
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('.//div[@class="content"]/span//text()').extract()
            content = "".join(content)
            # instantiate an item object
            item = PipelineproItem()
            # store the parsed values in the item object
            item['author'] = author
            item['content'] = content
            # submit the item object to the pipeline
            yield item
        # Persistence options:
        # 1. Terminal command: parse must return an iterable object
        # 2. Item pipeline:
        #    1. items.py: instantiate the class defined in this file (the item object stores the parsed values)
        #    2. pipelines.py: the pipeline receives the item objects submitted by the spider and persists their data

qiubai.py (pipeline version)

# -*- coding: utf-8 -*-
import pymysql


class PipelineproPipeline(object):
    fp = None

    # called only once, when the spider starts
    def open_spider(self, spider):  # hook method of the parent class
        print('spider started')
        self.fp = open('./qiubai_data.txt', 'w', encoding='utf-8')

    # called once every time the spider submits an item to the pipeline;
    # the item parameter is the item object submitted by the spider file
    def process_item(self, item, spider):  # hook method of the parent class
        author = item['author']
        content = item['content']
        self.fp.write(author + ":" + content)
        return item

    # called only once, after the spider finishes
    def close_spider(self, spider):  # hook method of the parent class
        print('spider finished')
        self.fp.close()


class MyPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host="192.168.12.65", port=3306, db="scrapyDB", charset="utf8", user="root")
        self.cursor = self.conn.cursor()
        print('mysql')

    # called once every time the spider submits an item to the pipeline
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        sql = "insert into qiubai values('%s','%s')" % (author, content)  # qiubai is the table name
        try:
            self.cursor.execute(sql)  # run the insert
            self.conn.commit()        # transaction handling: commit on success, roll back on error
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

pipelines.py

Pipeline workflow in 4 steps -- my own summary from the video:

Prerequisite: the data must already have been parsed inside the parse method.

1. Store the parsed values in an item object (the item's fields must be declared in items.py first),

2. Use the yield keyword to submit the item object to the pipeline

3. In pipelines.py, write the PipelineproPipeline class and its process_item method

4. Enable the pipeline in the settings file

1. # instantiate the item object

item = PipelineproItem()

2. Declare the fields in items.py

3. # store the parsed values in the item object

item['author'] = author

item['content'] = content

4. # submit the item object to the pipeline

yield item

Writing the data to the database: first create the database and the table (a setup sketch follows below)
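The notes don't show the actual CREATE statements, so here is a minimal one-off setup sketch, assuming the two-column layout implied by the INSERT in MyPipeline above (the column names and types are my assumption):

# create_db.py -- hypothetical one-off setup script; host/user match MyPipeline above,
# the table layout (author, content) is assumed from the INSERT statement it runs
import pymysql

conn = pymysql.Connect(host="192.168.12.65", port=3306, user="root", charset="utf8")
cursor = conn.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS scrapyDB CHARACTER SET utf8")
cursor.execute("USE scrapyDB")
cursor.execute("CREATE TABLE IF NOT EXISTS qiubai (author VARCHAR(100), content TEXT)")
conn.commit()
cursor.close()
conn.close()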

select * from qiubai  -- check what was written
