Scraping the title, date and time, source, author, and body of a Sina article with a web crawler (Python)

Posted: 2023-05-13 04:52:51

1. Preparation

BeautifulSoup4 must be installed for Python.

requests should be installed too (optional, since I show both approaches).

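2. On to the main topic

First, get the URL of a Sina article. The URL I used is: /c/-04-22/doc-ifznefkh5284628.shtml

Here is the code:

(1) First approach: fetch the page with urllib.request from Python's standard library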

# Scrape the article title, publication time, source, author, and body
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Open the article URL (urlopen needs the full http(s):// address)
response = urlopen("/c/-04-22/doc-ifznefkh5284628.shtml")
# Parse the downloaded page with the named parser
soup = BeautifulSoup(response, "html.parser")

head = soup.select(".main-title")[0].text    # article title
date = soup.select(".date")[0].text          # date
source = soup.select(".source")[0].text      # source

article = []                                 # list of paragraphs
for p in soup.select("#article p")[:-1]:     # every paragraph except the last
    article.append(p.text.strip())           # append the stripped text
article = '\n\n'.join(article)               # blank line between paragraphs for readability
# The same thing as a one-liner:
# article = '\n\n'.join([p.text.strip() for p in soup.select("#article p")[:-1]])

author = soup.select(".show_author")[0].text.strip()    # author
# Print the result (rjust imitates the layout of an article)
print(head.rjust(60), "\n", date.rjust(60) + ' ' + source, "\t\n", author.rjust(70), "\n", article)
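The selector logic above can be tried without any network access. The following is a minimal sketch, not the code from the video course: SAMPLE_HTML is made-up markup that only mimics the class names and id the script relies on (.main-title, .date, .source, .show_author, #article), and text_or_empty and extract_article are hypothetical helper names. It assumes, as the [:-1] slice suggests, that the last paragraph inside #article is not body text. It uses select_one so that a missing element gives an empty string instead of an IndexError.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real Sina page may be structured differently.
SAMPLE_HTML = """
<html><body>
  <h1 class="main-title">Sample headline</h1>
  <span class="date">2000-01-01 08:00</span>
  <a class="source">Sample Source</a>
  <div id="article">
    <p>First paragraph.</p>
    <p>  Second paragraph.  </p>
    <p class="show_author">Editor: Someone</p>
  </div>
</body></html>
"""


def text_or_empty(soup, selector):
    """Return the stripped text of the first match, or "" if nothing matches."""
    node = soup.select_one(selector)
    return node.get_text(strip=True) if node is not None else ""


def extract_article(html):
    """Collect the fields the post scrapes into a dictionary."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop the last <p> inside #article, mirroring the [:-1] slice in the post.
    paragraphs = [p.get_text(strip=True) for p in soup.select("#article p")[:-1]]
    return {
        "title": text_or_empty(soup, ".main-title"),
        "date": text_or_empty(soup, ".date"),
        "source": text_or_empty(soup, ".source"),
        "author": text_or_empty(soup, ".show_author"),
        "body": "\n\n".join(paragraphs),
    }


info = extract_article(SAMPLE_HTML)
print(info["title"])
print(info["body"])
```

To use it on a real page, pass the downloaded HTML text to extract_article in place of SAMPLE_HTML.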

(2) Second approach: fetch the page with requests

# Scrape the article title, publication time, source, author, and body
import requests
from bs4 import BeautifulSoup

# res holds the response (requests.get needs the full http(s):// address)
res = requests.get("/c/-04-22/doc-ifznefkh5284628.shtml")
# Parse the response text with the named parser
soup = BeautifulSoup(res.text, "html.parser")

head = soup.select(".main-title")[0].text    # article title
date = soup.select(".date")[0].text          # date
source = soup.select(".source")[0].text      # source
# One-liner that collects the article body, skipping the last paragraph
article = '\n\n'.join([p.text.strip() for p in soup.select("#article p")[:-1]])
author = soup.select(".show_author")[0].text.strip()    # author
# Print the result (rjust imitates the layout of an article)
print(head.rjust(60), "\n", date.rjust(60) + ' ' + source, "\t\n", author.rjust(70), "\n", article)
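Both scripts rely on str.rjust to imitate an article layout, so here is a small standard-library sketch of what that call does. The layout function and its width parameter are hypothetical names, and the sketch pads the combined date and source line as a whole, which differs slightly from the print call above (there only the date is padded). Note that rjust counts characters, not display columns, so full-width Chinese text will not line up exactly in a terminal.

```python
def layout(head, date, source, author, article, width=60):
    """Right-align the header lines to `width`, then append the article body."""
    lines = [
        head.rjust(width),                   # pad the title on the left with spaces
        (date + " " + source).rjust(width),  # date and source share one line
        author.rjust(width),
        "",                                  # blank line before the body
        article,
    ]
    return "\n".join(lines)


# Sample values for illustration only.
page = layout("Title", "2000-01-01", "Source", "Editor: A", "Body text.", width=20)
print(page)
```

If a string is already longer than the width, rjust returns it unchanged, so long titles are never truncated.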

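That's all. I'm a beginner learning web scraping and put this together from a video course, so experts, please go easy on me.

For anyone interested in learning web scraping, here is the link:

Video link: /course/courseMain.htm?courseId=1003285002

The instructor explains things well.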
