
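Python Crawler Guide (2): Scraping Lianjia Second-Hand Housing Listings with Selenium and Multithreading

Date: 2018-08-03 16:56:52

Disclaimer: the data scraped in this article is used for learning only. Please do not use scraped data for any commercial activity. This article will be removed on request if it infringes anyone's rights.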

Preparation

Install Selenium:

pip install selenium

If the download is slow, a mirror inside China is recommended:

pip install selenium -i https://pypi.tuna./simple

This crawler uses:

A summary of basic Selenium usage for crawling

The ThreadPoolExecutor thread pool

Target site: Lianjia, second-hand housing in Wuhan

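Inspecting the page

Because we are using selenium, we don't need to pay much attention to requests and responses this time. We can look directly at the rendered page source and use the element picker in the browser's developer tools to select the target elements.

Right-click a selected element to copy its CSS selector or XPath query path.

If you have an Anaconda environment, Jupyter Notebook is recommended for debugging the code; it makes for a smooth workflow.

The target data is shown in the figure:

From the inspection above we can write a crawler demo: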

# Written against Selenium 3; Selenium 4 replaces the find_element_by_* helpers
# with find_element(By.<KIND>, ...).
from selenium import webdriver


class LianJia:
    def __init__(self):
        # Create the Chrome browser object; fill in your own driver path here
        self.driver = webdriver.Chrome(r'E:\chromedriver.exe')

    def house_detail(self, item):
        """Get the details of a single house."""
        self.driver.get(item['houseURL'])  # visit the detail page of one house
        # Read the house information from the page
        item['title'] = self.driver.find_element_by_tag_name('h1').text  # title
        item['price'] = self.driver.find_element_by_css_selector('span.total').text  # price
        house_info = self.driver.find_elements_by_css_selector('div.mainInfo')
        item['room'] = house_info[0].text  # layout
        item['faceTo'] = house_info[1].text  # orientation
        item['area'] = house_info[2].text  # floor area
        # Community name
        item['communityName'] = self.driver.find_element_by_css_selector('munityName a.info').text
        # Release date
        item['releaseDate'] = self.driver.find_element_by_xpath('//div[@class="transaction"]/div[2]/ul/li/span[2]').text
        print(item)

    def house_list(self, item):
        """Get the detail-page links of all houses in one district."""
        # Visit the district page
        self.driver.get(item['partURL'])
        # Switch to the '最新发布' (latest releases) tab
        self.driver.find_element_by_link_text('最新发布').click()
        # Collect all the house links
        house_ls = self.driver.find_elements_by_xpath('//ul[@class="sellListContent"]//div[@class="title"]/a')
        # Build the list of URLs
        house_url_ls = [house.get_attribute("href") for house in house_ls]
        # Iterate over the house links
        for url in house_url_ls:
            item['houseURL'] = url
            self.house_detail(item)

    def run(self):
        """Get the page links of all districts."""
        # Visit the second-hand housing URL
        self.driver.get('/ershoufang/')
        # Get the element objects of all districts
        temp_ls = self.driver.find_elements_by_xpath('//div[@class="position"]/dl[2]/dd/div[1]/div/a')
        # District names
        part_name_ls = [ele.text for ele in temp_ls]
        # District links
        part_url_ls = [ele.get_attribute("href") for ele in temp_ls]
        item = {}  # container that holds the information of one house
        for i in range(len(temp_ls)):
            item['partName'] = part_name_ls[i]  # district name
            item['partURL'] = part_url_ls[i]  # district page link
            self.house_list(dict(item))  # pass a (shallow) copy of item


if __name__ == '__main__':
    lj = LianJia()
    lj.run()

Sample output:

{'partName': '江岸', 'partURL': '/ershoufang/jiangan/', 'houseURL': '/ershoufang/104103247485.html', 'title': '运征大厦 2室2厅 295万', 'price': '295', 'room': '2室2厅', 'faceTo': '东北', 'area': '109.28平米', 'communityName': '运征大厦', 'releaseDate': '-11-13'}

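From single-threaded to multithreaded

The demo above can already do the crawling, but selenium's weakness is fully exposed: navigating page by page is far too slow.

So we need to modify the code and turn the single-threaded script into a multithreaded one to improve crawling efficiency.

First we need to understand what multithreading is. The simplest example:

If one person needs 10 hours to pick an entire orchard, sending 10 people to pick together takes only about 1 hour.

With a rough idea of how multithreading works and what it is for, let's analyze how to implement it.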
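
The orchard analogy can be tried out with the standard library alone. The following is a minimal sketch, not code from the crawler: pick_tree is a hypothetical stand-in for one slow page visit (simulated with time.sleep), and the tree count and worker count are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def pick_tree(tree_id):
    """Hypothetical slow task standing in for one page visit."""
    time.sleep(0.1)  # simulate waiting on the browser or the network
    return f'tree-{tree_id} picked'


def pick_serial(tree_ids):
    """One worker handles every tree in turn."""
    return [pick_tree(t) for t in tree_ids]


def pick_parallel(tree_ids, workers=10):
    """Several workers share the trees; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(pick_tree, tree_ids))


if __name__ == '__main__':
    trees = range(10)
    for picker in (pick_serial, pick_parallel):
        start = time.perf_counter()
        picker(trees)
        print(f'{picker.__name__}: {time.perf_counter() - start:.2f}s')
```

The parallel version should finish much sooner because the threads spend their time waiting rather than computing. That is the situation a Selenium crawler is in, where most of the time goes to waiting for pages to load.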

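Flow analysis

For ordinary single-threaded code, we abstract the methods in the demo above as A, B, and C:

The program starts and a browser is opened.
Method A visits page1, collects data, and passes it to method B.
B uses the data from A to visit page2, then iterates over the data and passes it to C.
C goes to page3 to get the target data, then prints or writes it.
After writing, C moves on to the next link given by B, until everything is done.

The multithreaded code after conversion (one of several possible designs):

The program starts and multiple browsers are opened.
Method A visits page1, collects data, and passes it to method B.
B visits page2, then iterates over the page3 links and passes them to an asyn method.
The asyn method hands the page3 links to C in batches and lets C run in multiple browsers.
Several instances of C run at the same time, go to page3 to get the target data, and print or write it.
After writing, the instances of C get the next page3 links from the asyn method, until everything is done.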

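Code analysis

With the rough flow in mind, we start modifying the earlier code.

First we need to be clear about where multithreading is needed and where it is not.

The run method (A) definitely does not need it, because the data it fetches is simple and small.

The house_list method (B) does not need it for now; if we want paginated crawling, the main thread can take on this role.

The house_detail method (C) receives many entries and processes a large amount of data, so it is well suited to multithreading to improve efficiency.

So we make our changes from top to bottom, in execution order.

The modified code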

from concurrent.futures import ThreadPoolExecutor, wait

from selenium import webdriver


class LianJia:
    def __init__(self):
        # Use the built-in thread pool and set the maximum number of threads
        self.executor = ThreadPoolExecutor(max_workers=2)
        # Create the Chrome browser object
        self.driver = webdriver.Chrome(r'E:\chromedriver.exe')
        # Create more Chrome browser objects
        self.driver2 = webdriver.Chrome(r'E:\chromedriver.exe')
        self.driver3 = webdriver.Chrome(r'E:\chromedriver.exe')

    def house_detail(self, item, url, driver):
        """Get the details of a single house."""
        driver.get(url)  # visit the detail page of one house
        # Read the house information from the page
        item['houseURL'] = url
        # Title
        item['title'] = driver.find_element_by_tag_name('h1').text
        # Price
        item['price'] = driver.find_element_by_css_selector('span.total').text
        house_info = driver.find_elements_by_css_selector('div.mainInfo')
        item['room'] = house_info[0].text  # layout
        item['faceTo'] = house_info[1].text  # orientation
        item['area'] = house_info[2].text  # floor area
        # Community name
        item['communityName'] = driver.find_element_by_css_selector('munityName a.info').text
        # Release date
        item['releaseDate'] = driver.find_element_by_xpath('//div[@class="transaction"]/div[2]/ul/li/span[2]').text
        print(item)

    def asyn_page(self, item, url_list):
        """Asynchronous step: let the two spare drivers visit different pages at the same time."""
        drivers = (self.driver2, self.driver3)
        futures = [
            self.executor.submit(self.house_detail, item=dict(item), url=url, driver=driver)
            for url, driver in zip(url_list, drivers)
        ]
        # Wait for this batch so a driver is never given a new URL while it is still busy
        wait(futures)

    def house_list(self, item):
        """Get the detail-page links of all houses in one district."""
        for page in range(1, 101):
            # Visit the district page; co32 means 'latest releases'
            self.driver.get(item['partURL'] + f'pg{page}co32/')
            # Collect all the house links
            house_ls = self.driver.find_elements_by_xpath('//ul[@class="sellListContent"]//div[@class="title"]/a')
            # Build the list of URLs
            house_url_ls = [house.get_attribute("href") for house in house_ls]
            # Hand the URLs to asyn_page two at a time (the last batch may hold a single URL)
            for i in range(0, len(house_url_ls), 2):
                self.asyn_page(item=dict(item), url_list=house_url_ls[i:i + 2])
            print(f'>>[{item["partName"]}] page [{page}] done')
        print(f'>[{item["partName"]}] done')

    def run(self):
        """Get the page links of all districts."""
        # Visit the second-hand housing URL
        self.driver.get('/ershoufang/')
        # Get the element objects of all districts
        temp_ls = self.driver.find_elements_by_xpath('//div[@class="position"]/dl[2]/dd/div[1]/div/a')
        # District names
        part_name_ls = [ele.text for ele in temp_ls]
        # District links
        part_url_ls = [ele.get_attribute("href") for ele in temp_ls]
        item = {}  # container that holds the information of one house
        for i in range(len(temp_ls)):
            item['partName'] = part_name_ls[i]  # district name
            item['partURL'] = part_url_ls[i]  # district page link
            self.house_list(dict(item))  # pass a (shallow) copy of item

    def __del__(self):
        self.driver.close()  # close browser 1
        self.driver2.close()  # close browser 2
        self.driver3.close()  # close browser 3
        print('>>>>[Well Done]')


if __name__ == '__main__':
    lj = LianJia()
    lj.run()

This gives us multithreaded crawling.
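
Another way to share the browsers, sketched here under assumptions, is to keep the idle drivers in a queue.Queue and submit one task per URL. Each task borrows a driver, uses it, and returns it, so no driver is driven by two threads at once and an odd number of URLs needs no special handling. FakeDriver and crawl are hypothetical names used only for illustration; a real version would put the webdriver.Chrome objects in the queue and call house_detail inside visit.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, so the sketch runs without a browser."""

    def __init__(self, name):
        self.name = name
        self._busy = threading.Lock()

    def get(self, url):
        # A real driver must not be used by two threads at once; fail loudly if that happens.
        if not self._busy.acquire(blocking=False):
            raise RuntimeError(f'{self.name} used by two threads at once')
        try:
            time.sleep(0.01)  # simulate page load time
            return f'{self.name} fetched {url}'
        finally:
            self._busy.release()


def crawl(urls, drivers):
    """Visit every URL, lending each idle driver to exactly one task at a time."""
    idle = queue.Queue()
    for driver in drivers:
        idle.put(driver)

    def visit(url):
        driver = idle.get()  # blocks until some driver is free
        try:
            return driver.get(url)
        finally:
            idle.put(driver)  # hand the driver back for the next task

    with ThreadPoolExecutor(max_workers=len(drivers)) as executor:
        return list(executor.map(visit, urls))


if __name__ == '__main__':
    pages = [f'page-{n}' for n in range(5)]
    for line in crawl(pages, [FakeDriver('driver2'), FakeDriver('driver3')]):
        print(line)
```

With this layout, adding a browser only means putting one more driver in the queue; the number of worker threads follows the number of drivers.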

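Optimizing the code

The code above still has shortcomings, so now it is time to optimize it.

1> Write methods

So far, for ease of debugging, we have only printed the scraped data and never saved it. Now let's complete the write methods, in two flavors: file or database.

1.1> Writing to a file

Here we use a JSON file.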

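# Requires "import json" at the top of the module

@staticmethod  # uses no instance state, so a static method is recommended
def write_item_to_json(item):
    """Write an item to a JSON file."""
    # Convert the item dict to JSON; ensure_ascii=False lets the output
    # contain non-ASCII characters such as Chinese
    json_data = json.dumps(item, ensure_ascii=False)
    # 'a' appends on every write; encoding='utf-8' keeps Chinese text from being garbled
    with open('data.json', 'a', encoding='utf-8') as f:
        # Write the record followed by a newline
        f.write(json_data + '\n')
    print(f'>>>[{item["title"]}] written')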
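
Because each record is appended as one JSON object per line, the file as a whole is not a single JSON document and has to be read back line by line. A minimal sketch of that, where read_items is a hypothetical helper name and the path defaults to the data.json used above:

```python
import json


def read_items(path='data.json'):
    """Load records written one JSON object per line (hypothetical helper)."""
    items = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items
```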

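1.2> Writing to a database

Installing mongodb

Here we choose the mongodb database. Download the mongo client from the linked page and add its executables to your PATH environment variable. You can follow this article to configure it: mongodb installation tutorial (illustrated, with links)

Then install pymongo with the command: pip install pymongo

If you can run from pymongo import MongoClient without an error, the installation succeeded.

Code:

Declare the database objects in the __init__ initializer: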

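# Requires "from pymongo import MongoClient" at the top of the module

def __init__(self, part=None, page=1):
    ...
    # Create the thread pool
    # Create the Chrome browser objects
    ...
    # Create the database objects
    self.client = MongoClient(host="localhost", port=27017)
    self.db = self.client.LianJia
    self.collection = self.db.houseInfo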

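The LianJia database and the houseInfo collection can be created in the mongo shell: use LianJia switches to (and creates) the database, and db.createCollection('houseInfo') creates the collection. MongoDB will also create both automatically on the first insert.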

def write_item_to_mongo(self, item):
    """Insert an item into the database."""
    self.collection.insert_one(item)
    print(f'>>>[{item["title"]}] written')

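2> Incremental crawling

House prices can change at any time, so the price data we collect is only valid for a while. We therefore need to handle how the price is written, so that crawling again later is easier.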

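# Requires "from datetime import datetime" at the top of the module

@staticmethod  # uses no instance state, so a static method is recommended
def clock(obj):
    """Return a dict that maps the current time to the given object."""
    now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    return {now: obj}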

The price handling in house_detail also needs to change:

def house_detail(self, item, url, driver):
    """Get the details of a single house."""
    ...
    # Price
    price = driver.find_element_by_css_selector('span.total').text
    first_price = self.clock(price)  # the price at the first crawl
    item['price'] = [first_price]  # the house price together with its update time
    ...
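
With the price stored as a list of {timestamp: price} entries, a later crawl can append a new entry when the price has changed. The article does not show that step, so the following is only a sketch under that assumption: append_price is a hypothetical helper, and the record is a plain dict shaped like the item built in house_detail.

```python
from datetime import datetime


def clock(obj):
    """Return a dict that maps the current time to the given object."""
    now = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    return {now: obj}


def append_price(record, new_price):
    """Append a timestamped price if it differs from the latest stored one.

    Returns True when the record was changed, False otherwise.
    """
    history = record.setdefault('price', [])
    if history:
        latest = next(iter(history[-1].values()))  # price in the newest entry
        if latest == new_price:
            return False
    history.append(clock(new_price))
    return True
```

Keeping every change rather than overwriting the field makes it possible to see how a listing's price moved between crawls.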

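3> Selective crawling

A city has many districts. Take Wuhan as an example: it has 15 districts, but sometimes we only want to know the current state of housing in a few of them and don't need that much data. This calls for optimizing the code that fetches the districts.

Add parameters to the __init__ initializer: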

def __init__(self, part=None, page=1):
    ...
    self.part = part  # the district to crawl
    self.page = page  # how many pages to crawl per district; defaults to 1
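
One way the page parameter could be used is to replace the fixed range(1, 101) in house_list with a range that stops at self.page. The sketch below isolates just the URL-building part: page_urls is a hypothetical helper, the pg{n}co32/ suffix follows the house_list code earlier, and the example.com address is a placeholder rather than the real site.

```python
def page_urls(part_url, pages=1):
    """Build the list-page URLs for one district, sorted by latest releases (co32)."""
    return [f'{part_url}pg{n}co32/' for n in range(1, pages + 1)]


if __name__ == '__main__':
    # Placeholder district URL, for illustration only
    for url in page_urls('https://example.com/ershoufang/jiangan/', pages=2):
        print(url)
```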

Change the run part of the code to:

def run(self):
    """Get the page links of all districts."""
    # Visit the second-hand housing URL
    self.driver.get('/ershoufang/')
    # Get the element objects of all districts
    temp_ls = self.driver.find_elements_by_xpath('//div[@class="position"]/dl[2]/dd/div[1]/div/a')
    if self.part:
        self.get_one_part(temp_ls, self.part)
    else:
        self.get_all_part(temp_ls)

def get_one_part(self, temp_ls, part):
    """Crawl the houses of one district."""
    # Map each district name to its URL
    part_dict = {ele.text: ele.get_attribute("href") for ele in temp_ls}
    try:
        # Initialize a container that holds the information of one house
        item = {'partName': part, 'partURL': part_dict[part]}
        self.house_list(dict(item))  # pass a (shallow) copy of item
    except KeyError:
        print(f'Please specify a valid district name, one of:\n{list(part_dict.keys())}')

def get_all_part(self, temp_ls):
    """Crawl the houses of all districts."""
    # District names
    part_name_ls = [ele.text for ele in temp_ls]
    # District links
    part_url_ls = [ele.get_attribute("href") for ele in temp_ls]
    item = {}  # container that holds the information of one house
    for i in range(len(temp_ls)):
        item['partName'] = part_name_ls[i]  # district name
        item['partURL'] = part_url_ls[i]  # district page link
        self.house_list(dict(item))  # pass a (shallow) copy of item

Full code: /hao4875/MySpider/tree/master/lianjia_spider

If you liked this article, please give it a like; if there is something you'd like to say, leave a comment below.
