Python Crawler: Scraping Company Profiles from CNINFO (巨潮资讯) with requests + BeautifulSoup, a Code Example

Date: 2021-08-17 15:40:10

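This is the first crawler I've written that counts as reasonably complete, and honestly I think it's pretty bad: the code is crude, it's slow, it doesn't save anything to a local file or a database, and forcing multithreading into it scrambled the order of the data...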

I'm posting it here as a cautionary tale.

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 18 21:41:34

@author: brave-man
blog: /zrmw/
"""

import json
from threading import Thread  # only used by the disabled thread code in getAll

import requests
from bs4 import BeautifulSoup

# NOTE: the URLs in this script are site-relative, exactly as they appear in
# the post. Prepend the site's scheme and host before running, otherwise
# requests will reject them.

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/0101 Firefox/6.0"}


# Fetch a listed company's full name, English name, address and legal
# representative (any other field on the page can be extracted the same way).
def getDetails(url):
    res = requests.get(url, headers=HEADERS)
    res.encoding = "GBK"
    soup = BeautifulSoup(res.text, "html.parser")
    fields = soup.select(".zx_data2")
    details = {
        # "股票代码：" ("Stock code:") is the label text on the page.
        # lstrip() removes a set of characters rather than a prefix; that is
        # safe here only because the code that follows is all digits.
        "code": soup.select(".table")[0].td.text.lstrip("股票代码：")[:6],
        "Entire_Name": fields[0].text.strip("\r\n "),
        "English_Name": fields[1].text.strip("\r\n "),
        "Address": fields[2].text.strip("\r\n "),
        "Legal_Representative": fields[4].text.strip("\r\n "),
    }
    # Convert details to a JSON string, intended for later storage
    jd = json.dumps(details)
    jd1 = json.loads(jd)
    print(jd1)


# Fetch the stock codes of the listed companies
def getCode():
    res = requests.get("/cninfo-new/information/companylist", headers=HEADERS)
    res.encoding = "gb2312"  # the post had the typo "gb1232"
    soup = BeautifulSoup(res.text, "html.parser")
    # print(soup.select(".company-list"))
    # The first four ".company-list" blocks hold one board each; the first
    # six characters of every link text are the stock code.
    blocks = soup.select(".company-list")
    L = [[a.text[:6] for a in blocks[idx].find_all("a")] for idx in range(4)]
    print(L[0])
    return getAll(L)


def getAll(L):
    def t1(L):
        for i in L[0]:
            url_sszb = "/information/brief/szmb{}.html".format(i)
            getDetails(url_sszb)

    def t2(L):
        for i in L[1]:
            url_zxqyb = "/information/brief/szsme{}.html".format(i)
            getDetails(url_zxqyb)

    def t3(L):
        for i in L[2]:
            url_cyb = "/information/brief/szcn{}.html".format(i)
            getDetails(url_cyb)

    def t4(L):
        for i in L[3]:
            url_hszb = "/information/brief/shmb{}.html".format(i)
            getDetails(url_hszb)

    # Threaded version, disabled: it changed the order of the output.
    # tt1 = Thread(target = t1, args = (L, ))
    # tt2 = Thread(target = t2, args = (L, ))
    # tt3 = Thread(target = t3, args = (L, ))
    # tt4 = Thread(target = t4, args = (L, ))
    #
    # tt1.start()
    # tt2.start()
    # tt3.start()
    # tt4.start()
    #
    # tt1.join()
    # tt2.join()
    # tt3.join()
    # tt4.join()
    t1(L)
    t2(L)
    t3(L)
    t4(L)


if __name__ == "__main__":
    getCode()
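Two of the shortcomings admitted at the top, threads scrambling the order and nothing being saved, can probably be handled together. Below is a minimal sketch, not part of the original script: `fetch_in_order` and `save_jsonl` are hypothetical helper names, and `fetch` is assumed to be a variant of `getDetails` that returns its dict instead of printing it. It relies on the fact that `ThreadPoolExecutor.map` yields results in the order of its inputs, no matter which request finishes first.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def fetch_in_order(urls, fetch, max_workers=4):
    """Call fetch(url) concurrently; return results in the same order as urls."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, even when a later
        # request completes before an earlier one.
        return list(pool.map(fetch, urls))


def save_jsonl(records, path):
    """Write one JSON object per line, keeping Chinese text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False stores non-ASCII characters as-is instead
            # of \uXXXX escapes.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

With a `fetch` function of that shape, something like `save_jsonl(fetch_in_order(urls, fetch), "companies.jsonl")` would write one company per line in the same order as `urls`; the file name here is only an example.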

It also doesn't account for the things that go wrong in real production use, such as network latency and stalled connections.
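One common way to guard against slow or flaky connections is to give every request a timeout and let a `requests.Session` retry transient failures. The sketch below is an assumption about how that could look here, not the author's code: `make_session` and `fetch_html` are hypothetical names, and the retry count, backoff factor, status codes and timeouts are illustrative values only.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(retries=3, backoff=0.5):
    """Build a Session that retries failed requests with increasing delays."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,
        status_forcelist=[500, 502, 503, 504],  # also retry on these statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def fetch_html(session, url, timeout=(5, 15)):
    """Return the decoded page text, or None if the request ultimately fails."""
    try:
        # timeout is (connect seconds, read seconds); without it a stalled
        # connection can block indefinitely.
        res = session.get(url, timeout=timeout)
        res.raise_for_status()
    except requests.RequestException as exc:
        print("skipped {}: {}".format(url, exc))
        return None
    res.encoding = "GBK"  # same encoding the script sets in getDetails
    return res.text
```

Returning `None` on failure lets the calling loop skip one bad page and carry on, rather than losing the whole run to a single exception.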

It really is slow. When I have time I'll share code for scraping CNINFO with selenium + a browser. Good night~
