
[Crawler] Scraping JD.com site-wide data with Python (products, shops, categories, comments)

Posted: 2020-05-03 03:29:42


Existing articles about scraping JD.com either no longer work or only cover part of the data. This post walks through scraping the whole site: product information, shop information, comments, and category data.

-------------------------------------------------------------------------------

I. Environment

OS: Windows 10
Python: 3.5
Scrapy: 1.3.2
pymongo: 3.2
IDE: PyCharm

Setting these up is routine; search online if you need a walkthrough.
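Before going further, here is a minimal sketch (not code from this article) of the kind of Scrapy spider the snippets below are assumed to live in. The class name and spider name are illustrative, and the www.jd.com host is an assumption; only the /allSort.aspx path comes from the article.

import scrapy

class JdSpider(scrapy.Spider):
    """Skeleton spider; the real parsing callbacks are shown in section III."""
    name = 'jd_all'
    # start from the full category page; host assumed, path taken from the article
    start_urls = ['https://www.jd.com/allSort.aspx']

    def parse(self, response):
        # in the full spider this delegates to parse_category() from section III.1
        pass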

II. Database schema

1. Product categories

JD has roughly 1,183 categories once virtual products (phone credit, lottery tickets, tickets, etc.) are excluded. They can all be seen on this page:

/allSort.aspx

This page is also where the crawl starts. Some of the entries are actually channel pages, meaning a single entry hides many sub-categories of its own, so a little extra handling is needed to reach every real category; the details come later. For each category we store the following fields:

name  # category name
url   # category URL
_id   # category id
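As a sketch of how this schema might map to a Scrapy Item (the article does not show its items.py, so the class below is an assumption that simply mirrors the three fields above):

import scrapy

class CategoriesItem(scrapy.Item):
    _id = scrapy.Field()   # category id, taken from the first query parameter of the URL
    name = scrapy.Field()  # category name
    url = scrapy.Field()   # category URL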

2. Products

url              # product URL
_id              # product id
category         # product category
reallyPrice      # current price
originalPrice    # original (list) price
description      # product description
shopId           # shop id
venderId         # vender id
commentCount     # total number of reviews
goodComment      # number of positive reviews
generalComment   # number of neutral reviews
poolComment      # number of negative reviews
favourableDesc1  # promotion description 1
favourableDesc2  # promotion description 2

3. Comments

_id               # comment id
productId         # product id
guid
content           # comment text
creationTime      # time the comment was posted
isTop
referenceId
referenceName
referenceType
referenceTypeId
firstCategory
secondCategory
thirdCategory
replyCount        # number of replies
score             # rating
status
title
usefulVoteCount   # number of "useful" votes
uselessVoteCount  # number of "useless" votes
userImage
userImageUrl
userLevelId
userProvince
viewCount
orderId           # order id
isReplyGrade
nickname          # reviewer's nickname
userClient
mergeOrderStatus
discussionId
productColor
productSize
imageCount        # number of images attached to the comment
integral
userImgFlag
anonymousFlag
userLevelName
plusAvailable
recommend
userLevelColor
userClientShow
isMobile          # whether the comment was posted from mobile
days
afterDays         # follow-up review count
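Comments will mostly be queried by product, so it is worth indexing that field. A small pymongo sketch; the database and collection names ('jd', 'comments') are assumptions, not from the article:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['jd']                         # database name assumed
db['comments'].create_index('productId')  # fast per-product lookups
# _id holds the JD comment id, so inserts of the same comment can be deduplicated on _id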

4. Shops

_id       # shop name
name      # shop name
url1      # shop URL 1
url2      # shop URL 2
shopId    # shop id
venderId  # vender id

5. Comment summary

_id
goodRateShow     # positive-review rate (as displayed)
poorRateShow     # negative-review rate (as displayed)
poorCountStr     # negative-review count, as a string
averageScore     # average score
generalCountStr  # neutral-review count, as a string
showCount
showCountStr
goodCount        # positive-review count
generalRate      # neutral-review rate
generalCount     # neutral-review count
skuId
goodCountStr     # positive-review count, as a string
poorRate         # negative-review rate
afterCount       # follow-up review count
goodRateStyle
poorCount
skuIds
poorRateStyle
generalRateStyle
commentCountStr
commentCount
productId        # product id
afterCountStr
goodRate
generalRateShow
jwotestProduct
maxPage
score
soType
imageListCount
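The article relies on pymongo (see the environment section) to persist these collections but does not show the pipeline itself, so here is a minimal sketch of what such a Scrapy pipeline could look like. Routing each item class to a collection named after it, and the 'jd' database name, are my assumptions:

import pymongo

class MongoPipeline(object):
    """Hedged sketch of a MongoDB storage pipeline; not the article's actual code."""

    def open_spider(self, spider):
        self.client = pymongo.MongoClient('localhost', 27017)
        self.db = self.client['jd']  # database name assumed

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # one collection per item class, e.g. CategoriesItem -> 'categoriesitem'
        collection = type(item).__name__.lower()
        # every schema above carries an _id, so upsert to make re-crawls idempotent
        self.db[collection].replace_one({'_id': item['_id']}, dict(item), upsert=True)
        return item

It would be enabled through ITEM_PIPELINES in the project settings as usual.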

III. Scraping details

1. Scraping categories

The code is as follows:

def parse_category(self, response):
    """Parse the category page."""
    selector = Selector(response)
    try:
        texts = selector.xpath('//div[@class="category-item m"]/div[@class="mc"]/div[@class="items"]/dl/dd/a').extract()
        for text in texts:
            # each extracted entry is an <a> tag: capture its href and link text
            items = re.findall(r'<a href="(.*?)">(.*?)</a>', text)
            for item in items:
                # item[0] is a protocol-relative href; keep only whitelisted sub-domains
                if item[0].split('.')[0][2:] in key_word:  # key_word: sub-domain whitelist defined elsewhere in the spider
                    if item[0].split('.')[0][2:] != 'list':
                        # channel page: feed it back into the category parser
                        yield Request(url='https:' + item[0], callback=self.parse_category)
                    else:
                        # real category list page: store it and go on to the product list
                        categoriesItem = CategoriesItem()
                        categoriesItem['name'] = item[1]
                        categoriesItem['url'] = 'https:' + item[0]
                        categoriesItem['_id'] = item[0].split('=')[1].split('&')[0]
                        yield categoriesItem
                        yield Request(url='https:' + item[0], callback=self.parse_list)
    except Exception as e:
        print('error:', e)

As mentioned earlier, some categories contain many sub-categories of their own, so URLs of that kind are fed back into the category parser:

if item[0].split('.')[0][2:] != 'list':
    yield Request(url='https:' + item[0], callback=self.parse_category)
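The expression item[0].split('.')[0][2:] is terse: the hrefs on the category page are protocol-relative (they start with //), so it strips the two leading slashes and keeps only the first label of the host name, which is then checked against key_word, the sub-domain whitelist defined elsewhere in the spider. An illustration with a made-up href of the same shape:

# Illustrative only: how the sub-domain check and the _id extraction behave.
href = '//list.jd.com/list.html?cat=670,671,672'
print(href.split('.')[0][2:])            # 'list'  -> a real category list page
print(href.split('=')[1].split('&')[0])  # '670,671,672' -> used as the category _id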

2. Scraping products

Visiting each category URL gives the product list for that category; from it we take each product's URL and follow it to the detail page to scrape the product:

def parse_list(self, response):
    """Collect the product URLs (and the next-page URL) from a list page."""
    meta = dict()
    meta['category'] = response.url.split('=')[1].split('&')[0]
    selector = Selector(response)
    texts = selector.xpath('//*[@id="plist"]/ul/li/div/div[@class="p-img"]/a').extract()
    for text in texts:
        # each extracted entry is an <a> tag: capture its href
        items = re.findall(r'href="(.*?)"', text)
        yield Request(url='https:' + items[0], callback=self.parse_product, meta=meta)

Most of the basic product information can be read straight from the detail page, but some of it, such as the price and the promotion details, is loaded dynamically and has to be fetched separately.

Let's look at the price first. The URL to request has the form:

/prices/mgets?skuIds=J_(product_id)

The part in parentheses at the end of the URL is the product id, filled in at runtime. The code:

# price_url is the price-API format string shown above; the requests library is used here
response = requests.get(url=price_url + product_id)
price_json = response.json()
productsItem['reallyPrice'] = price_json[0]['p']
productsItem['originalPrice'] = price_json[0]['m']

Everything comes back as JSON, which is easy to parse.
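For reference, the price endpoint answers with a one-element JSON list per sku, and the spider only reads the 'p' (current price) and 'm' (list price) keys. A self-contained illustration with a made-up payload, not captured from JD:

import json

# Made-up sample shaped like the payload the code above expects.
sample = '[{"id": "J_1234567", "p": "49.90", "m": "59.90"}]'
price_json = json.loads(sample)
print(price_json[0]['p'])  # current price  -> reallyPrice
print(price_json[0]['m'])  # original price -> originalPrice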

Now for the promotion information. There are two kinds: coupons and spend-threshold ("spend X, save Y") promotions. Both are loaded dynamically, so they are fetched as follows:

# promotions; favourable_url is a format string for the promotion API, defined elsewhere
res_url = favourable_url % (product_id, shop_id, vender_id, category.replace(',', '%2c'))
# print(res_url)
response = requests.get(res_url)
fav_data = response.json()
if fav_data['skuCoupon']:
    desc1 = []
    for item in fav_data['skuCoupon']:
        start_time = item['beginTime']
        end_time = item['endTime']
        time_dec = item['timeDesc']
        fav_price = item['quota']
        fav_count = item['discount']
        fav_time = item['addDays']
        desc1.append(u'有效期%s至%s,满%s减%s' % (start_time, end_time, fav_price, fav_count))
    productsItem['favourableDesc1'] = ';'.join(desc1)
if fav_data['prom'] and fav_data['prom']['pickOneTag']:
    desc2 = []
    for item in fav_data['prom']['pickOneTag']:
        desc2.append(item['content'])
    productsItem['favourableDesc2'] = ';'.join(desc2)
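To make the coupon formatting concrete: for a hypothetical coupon where quota is the spend threshold and discount is the amount taken off, one favourableDesc1 entry would look like this (values invented):

# Invented coupon record carrying only the fields the loop above reads.
item = {'beginTime': '2020-05-01', 'endTime': '2020-05-07',
        'timeDesc': '', 'quota': 100, 'discount': 20, 'addDays': 0}
print(u'有效期%s至%s,满%s减%s' % (item['beginTime'], item['endTime'],
                                item['quota'], item['discount']))
# -> 有效期2020-05-01至2020-05-07,满100减20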

3. Scraping shop information

The shop id and vender id can be found directly on every product detail page:

ids = re.findall(r"venderId:(.*?),\s.*?shopId:'(.*?)'", response.text)
if not ids:
    ids = re.findall(r"venderId:(.*?),\s.*?shopId:(.*?),", response.text)
vender_id = ids[0][0]
shop_id = ids[0][1]

The shop name is harder: there are several different page layouts and the title sits in a different place on each, and JD self-operated products usually have no shop name on the detail page at all, so the code falls back to '京东自营' (JD self-operated):

try:
    name = response.xpath('//ul[@class="parameter2 p-parameter-list"]/li/a//text()').extract()[0]
except:
    try:
        name = response.xpath('//div[@class="name"]/a//text()').extract()[0].strip()
    except:
        try:
            name = response.xpath('//div[@class="shopName"]/strong/span/a//text()').extract()[0].strip()
        except:
            try:
                name = response.xpath('//div[@class="seller-infor"]/a//text()').extract()[0].strip()
            except:
                name = u'京东自营'
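The nested try/excepts work, but the same fallback reads more clearly as a loop over candidate XPaths. A sketch of that refactor, using the same XPaths and the same default, meant to run inside the same callback (so response is available):

# Equivalent, flatter version of the shop-name fallback above.
SHOP_NAME_XPATHS = [
    '//ul[@class="parameter2 p-parameter-list"]/li/a//text()',
    '//div[@class="name"]/a//text()',
    '//div[@class="shopName"]/strong/span/a//text()',
    '//div[@class="seller-infor"]/a//text()',
]

name = u'京东自营'  # default for self-operated listings with no shop block
for xpath in SHOP_NAME_XPATHS:
    matched = response.xpath(xpath).extract()
    if matched and matched[0].strip():
        name = matched[0].strip()
        break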

4. Scraping comments

The comment data is also loaded dynamically and returned as JSON. The URL has the form:

/comment/productPageComments.action?productId=(product_id)&score=0&sortType=5&page=%s&pageSize=10

Only the product id is needed.

The code that collects the comment data:

"""获取商品comment"""

try:

data = json.loads(response.text)

except Exception as e:

print('get comment failed:', e)

return None

product_id = response.meta['product_id']

commentSummaryItem = CommentSummaryItem()

commentSummary = data.get('productCommentSummary')

commentSummaryItem['goodRateShow'] = commentSummary.get('goodRateShow')

commentSummaryItem['poorRateShow'] = commentSummary.get('poorRateShow')

commentSummaryItem['poorCountStr'] = commentSummary.get('poorCountStr')

commentSummaryItem['averageScore'] = commentSummary.get('averageScore')

commentSummaryItem['generalCountStr'] = commentSummary.get('generalCountStr')

commentSummaryItem['showCount'] = commentSummary.get('showCount')

commentSummaryItem['showCountStr'] = commentSummary.get('showCountStr')

commentSummaryItem['goodCount'] = commentSummary.get('goodCount')

commentSummaryItem['generalRate'] = commentSummary.get('generalRate')

commentSummaryItem['generalCount'] = commentSummary.get('generalCount')

commentSummaryItem['skuId'] = commentSummary.get('skuId')

commentSummaryItem['goodCountStr'] = commentSummary.get('goodCountStr')

commentSummaryItem['poorRate'] = commentSummary.get('poorRate')

commentSummaryItem['afterCount'] = commentSummary.get('afterCount')

commentSummaryItem['goodRateStyle'] = commentSummary.get('goodRateStyle')

commentSummaryItem['poorCount'] = commentSummary.get('poorCount')

commentSummaryItem['skuIds'] = commentSummary.get('skuIds')

commentSummaryItem['poorRateStyle'] = commentSummary.get('poorRateStyle')

commentSummaryItem['generalRateStyle'] = commentSummary.get('generalRateStyle')

commentSummaryItem['commentCountStr'] = commentSummary.get('commentCountStr')

commentSummaryItem['commentCount'] = commentSummary.get('commentCount')

commentSummaryItem['productId'] = commentSummary.get('productId') # 同ProductsItem的id相同

commentSummaryItem['_id'] = commentSummary.get('productId')

commentSummaryItem['afterCountStr'] = commentSummary.get('afterCountStr')

commentSummaryItem['goodRate'] = commentSummary.get('goodRate')

commentSummaryItem['generalRateShow'] = commentSummary.get('generalRateShow')

commentSummaryItem['jwotestProduct'] = data.get('jwotestProduct')

commentSummaryItem['maxPage'] = data.get('maxPage')

commentSummaryItem['score'] = data.get('score')

commentSummaryItem['soType'] = data.get('soType')

commentSummaryItem['imageListCount'] = data.get('imageListCount')

yield commentSummaryItem

for hotComment in data['hotCommentTagStatistics']:

hotCommentTagItem = HotCommentTagItem()

hotCommentTagItem['_id'] = hotComment.get('id')

hotCommentTagItem['name'] = hotComment.get('name')

hotCommentTagItem['status'] = hotComment.get('status')

hotCommentTagItem['rid'] = hotComment.get('rid')

hotCommentTagItem['productId'] = hotComment.get('productId')

hotCommentTagItem['count'] = hotComment.get('count')

hotCommentTagItem['created'] = hotComment.get('created')

hotCommentTagItem['modified'] = hotComment.get('modified')

hotCommentTagItem['type'] = hotComment.get('type')

hotCommentTagItem['canBeFiltered'] = hotComment.get('canBeFiltered')

yield hotCommentTagItem

for comment_item in data['comments']:

comment = CommentItem()

comment['_id'] = comment_item.get('id')

comment['productId'] = product_id

comment['guid'] = comment_item.get('guid')

comment['content'] = comment_item.get('content')

comment['creationTime'] = comment_item.get('creationTime')

comment['isTop'] = comment_item.get('isTop')

comment['referenceId'] = comment_item.get('referenceId')

comment['referenceName'] = comment_item.get('referenceName')

comment['referenceType'] = comment_item.get('referenceType')

comment['referenceTypeId'] = comment_item.get('referenceTypeId')

comment['firstCategory'] = comment_item.get('firstCategory')

comment['secondCategory'] = comment_item.get('secondCategory')

comment['thirdCategory'] = comment_item.get('thirdCategory')

comment['replyCount'] = comment_item.get('replyCount')

comment['score'] = comment_item.get('score')

comment['status'] = comment_item.get('status')

comment['title'] = comment_item.get('title')

comment['usefulVoteCount'] = comment_item.get('usefulVoteCount')

comment['uselessVoteCount'] = comment_item.get('uselessVoteCount')

comment['userImage'] = 'http://' + comment_item.get('userImage')

comment['userImageUrl'] = 'http://' + comment_item.get('userImageUrl')

comment['userLevelId'] = comment_item.get('userLevelId')

comment['userProvince'] = comment_item.get('userProvince')

comment['viewCount'] = comment_item.get('viewCount')

comment['orderId'] = comment_item.get('orderId')

comment['isReplyGrade'] = comment_item.get('isReplyGrade')

comment['nickname'] = comment_item.get('nickname')

comment['userClient'] = comment_item.get('userClient')

comment['mergeOrderStatus'] = comment_item.get('mergeOrderStatus')

comment['discussionId'] = comment_item.get('discussionId')

comment['productColor'] = comment_item.get('productColor')

comment['productSize'] = comment_item.get('productSize')

comment['imageCount'] = comment_item.get('imageCount')

comment['integral'] = comment_item.get('integral')

comment['userImgFlag'] = comment_item.get('userImgFlag')

comment['anonymousFlag'] = comment_item.get('anonymousFlag')

comment['userLevelName'] = comment_item.get('userLevelName')

comment['plusAvailable'] = comment_item.get('plusAvailable')

comment['recommend'] = comment_item.get('recommend')

comment['userLevelColor'] = comment_item.get('userLevelColor')

comment['userClientShow'] = comment_item.get('userClientShow')

comment['isMobile'] = comment_item.get('isMobile')

comment['days'] = comment_item.get('days')

comment['afterDays'] = comment_item.get('afterDays')

yield comment

if 'images' in comment_item:

for image in comment_item['images']:

commentImageItem = CommentImageItem()

commentImageItem['_id'] = image.get('id')

commentImageItem['associateId'] = image.get('associateId') # 和CommentItem的discussionId相同

commentImageItem['productId'] = image.get('productId') # 不是ProductsItem的id,这个值为0

commentImageItem['imgUrl'] = 'http:' + image.get('imgUrl')

commentImageItem['available'] = image.get('available')

commentImageItem['pin'] = image.get('pin')

commentImageItem['dealt'] = image.get('dealt')

commentImageItem['imgTitle'] = image.get('imgTitle')

commentImageItem['isMain'] = image.get('isMain')

yield commentImageItem

# next page

for i in range(1, int(data['maxPage'])):

url = comment_url % (product_id, str(i))

meta = dict()

meta['product_id'] = product_id

yield Request(url=url, callback=self.parse_comments2, meta=meta)

5. Running the crawl

All of the core code has been shown above. It is admittedly a bit rough, and discussion is welcome.
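To close the loop, a minimal sketch of kicking off the crawl programmatically. JdSpider is the assumed spider class from the skeleton in section I, and the MongoDB pipeline would be enabled through ITEM_PIPELINES in the project settings:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # picks up the project's settings.py
process.crawl(JdSpider)                           # JdSpider: assumed spider class
process.start()                                   # blocks until the crawl finishes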
