
Python Crawler Basics: A Small Hands-On Project (Automatically Private-Messaging Commenters on a cnblogs Post…

Date: 2021-10-09 19:42:52

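My earlier posts all covered fixes for problems that come up while writing crawlers and rarely touched a real case. This time, using cnblogs (博客园) as the target, I write a script that automatically sends a private message to everyone who comments on a blog post. (Anyone who comments on this post will be messaged automatically too; if you have a question but do not want to receive the message, send me a private message instead.)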

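1). Pick the blog post to monitor. I use /hearzeus/p/5226546.html as the example here; later I will switch it to the address of this post.

2). Get the commenters on the post.

Open the Network panel in the browser console to inspect the requests the page makes.

Analysis shows that the request that fetches the commenters is: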

/mvc/blog/GetComments.aspx?postId=5226546&blogApp=hearzeus&pageIndex=0&anchorCommentId=0&_=1456989055561
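As a rough illustration of how that request is put together, the sketch below rebuilds the path and query string from its parts. It is written for Python 3 (urllib.parse), whereas the listings in this post are Python 2. The names build_comments_url and COMMENTS_PATH are made up for the example, and the host is left out because the post does not give it. The trailing _ value looks like a cache-busting timestamp, but that reading is an assumption, so the sketch omits it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Path copied from the post; the host is not given there, so it is omitted here too.
COMMENTS_PATH = "/mvc/blog/GetComments.aspx"


def build_comments_url(post_id, blog_app, page_index):
    """Return the GetComments path plus query string for one page of comments."""
    params = {
        "postId": post_id,
        "blogApp": blog_app,
        "pageIndex": page_index,
        "anchorCommentId": 0,
    }
    return COMMENTS_PATH + "?" + urlencode(params)


url = build_comments_url("5226546", "hearzeus", 0)
query = parse_qs(urlsplit(url).query)  # parse it back to inspect individual fields
print(url)
print(query["pageIndex"])
```

Only pageIndex changes from one request to the next, which is why it is the natural argument to vary when paging through the comments.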

The Python code (Python 2, as throughout this post) is:

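import json
import urllib
import urllib2

def getCommentsHtml(index):
    # The host is omitted in the source; prepend the blog's domain before running.
    url = "/mvc/blog/GetComments.aspx"
    params = {"postId": "5226546", "blogApp": "hearzeus", "pageIndex": str(index),
              "anchorCommentId": "0", "_": "1456908852216"}  # the key is "_", not "_="
    url_params = urllib.urlencode(params)
    return json.loads(urllib2.urlopen(url, data=url_params).read())['commentsHtml']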

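Incrementing index walks through all the commenters page by page. If the comments fill only one page and index is set to 2, the request returns no comment data. Comparing a response that has data with one that does not yields a marker the crawler can use to tell that it has reached the end. The check I use is:

if html.count(u"comment_date") < 1:
    print "Traversal finished: " + str(i)

That is, the crawler checks whether the response contains the keyword "comment_date" to decide whether the traversal is finished.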
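The stop condition can be sketched in isolation as follows. This is a Python 3 illustration, not the post's Python 2 code. The names collect_pages and fake_fetch and the sample markup are hypothetical, and fake_fetch stands in for the real HTTP call so the loop can be run without network access.

```python
def collect_pages(fetch_page, max_pages=10, marker="comment_date"):
    """Fetch pages 0, 1, 2, ... until one lacks `marker`; return the pages that had it."""
    pages = []
    for index in range(max_pages):
        html = fetch_page(index)
        if marker not in html:
            break  # an empty page means every comment has been seen
        pages.append(html)
    return pages


# Stand-in for the real request: two pages of comments, then nothing.
fake_pages = [
    '<span class="comment_date">2016-03-03</span>',
    '<span class="comment_date">2016-03-04</span>',
]


def fake_fetch(index):
    return fake_pages[index] if index < len(fake_pages) else ""


result = collect_pages(fake_fetch)
print(len(result))
```

The max_pages cap mirrors the fixed range(10) in the full script and keeps the loop bounded even if the marker never disappears.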

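Opening this link directly in the browser shows the raw response.

Pasting the response into a JSON viewer (/jsonviewernew/) reveals its structure.

The useful data is in the commentsHtml field. The commenter list comes back as HTML, so the value still has to be parsed.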
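A minimal Python 3 sketch of pulling that field out of the response text is shown below. The function name extract_comments_html and the sample payload are invented for illustration. Only the commentsHtml key comes from the post, and the real response may carry other fields that are not modelled here.

```python
import json


def extract_comments_html(response_text):
    """Return the HTML fragment stored under commentsHtml, or "" if the key is absent."""
    payload = json.loads(response_text)
    return payload.get("commentsHtml", "")


# Hypothetical response body; a real one would contain a much longer HTML fragment.
sample = json.dumps({"commentsHtml": '<div class="post">...</div>'})
fragment = extract_comments_html(sample)
print(fragment)
```

Returning an empty string for a missing key lets the caller treat "no field" and "no comments" the same way when it looks for the end-of-data marker.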

After working through the markup step by step, I arrived at the following parsing code:

# parseHtml.py
# encoding=utf8
from bs4 import BeautifulSoup

def parse(html):
    soup = BeautifulSoup(html, "html.parser")
    # One div.post per comment; the third link in its div.posthead is the commenter's name.
    post_count = len(soup.find_all("div", "post"))
    postheads = soup.find_all("div", "posthead")
    name_list = []
    for i in range(post_count):
        name_list.append(postheads[i].find_all("a")[2].string)
    return name_list
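For readers without BeautifulSoup installed, the same extraction rule (take the text of the third link inside each div.posthead) can be approximated with the standard library's html.parser. This is a swapped-in technique, not the author's method. It is written for Python 3, and the class name CommenterNames and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class CommenterNames(HTMLParser):
    """Collect the text of the third <a> inside every <div class="posthead">."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._div_depth = 0          # 0 means "not inside a posthead div"
        self._link_count = 0         # links seen in the current posthead
        self._in_third_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self._div_depth:
                self._div_depth += 1         # nested div inside a posthead
            elif "posthead" in classes:
                self._div_depth = 1
                self._link_count = 0
        elif tag == "a" and self._div_depth:
            self._link_count += 1
            if self._link_count == 3:
                self._in_third_link = True
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
        elif tag == "a" and self._in_third_link:
            self._in_third_link = False
            self.names.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._in_third_link:
            self._buffer.append(data)


# Hypothetical fragment shaped like the structure the post's parser expects.
sample = (
    '<div class="post"><div class="posthead">'
    '<a href="#c1">#1</a><a name="c1"></a><a href="/u/alice/">alice</a>'
    '</div><div class="postbody">first</div></div>'
    '<div class="post"><div class="posthead">'
    '<a href="#c2">#2</a><a name="c2"></a><a href="/u/bob/">bob</a>'
    '</div><div class="postbody">second</div></div>'
)
parser = CommenterNames()
parser.feed(sample)
parser.close()
names = parser.names
print(names)
```

Both versions depend on the commenter's name being the third link in the header, so either one will need adjusting if the site changes its markup.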

3). Save the user names so that nobody is messaged twice. The code is:

# FileOperation.py
# encoding=utf8
import sys

reload(sys)
sys.setdefaultencoding("utf-8")

def checkName(name):
    # Return True if the name already appears in the record file (one name per line).
    file = open("../src/comments")
    contents = file.read().split("\n")
    file.close()
    for i in range(len(contents)):
        if contents[i].count(name) > 0:
            return True
    return False

def wirteName(name):  # spelling kept: spider.py calls the function by this name
    file = open("../src/comments", "a")
    file.write(name + "\n")
    file.close()
    return True

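checkName checks whether the user has already been sent a private message.

wirteName (spelled this way in the code) appends the user to the text file after the message has been sent.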
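The bookkeeping described above can also be sketched with a set, which gives exact-name matching rather than the substring test used in checkName. This is a Python 3 illustration under stated assumptions: load_names and record_name are hypothetical helpers, and the record file lives in a temporary directory instead of ../src/comments so the example can run anywhere.

```python
import os
import tempfile


def load_names(path):
    """Return the set of names already recorded, or an empty set if the file is missing."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def record_name(path, name):
    """Append one name to the record file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(name + "\n")


# Demo: "alice" comments twice but is recorded (and would be messaged) only once.
path = os.path.join(tempfile.mkdtemp(), "comments")
for name in ["alice", "bob", "alice"]:
    if name not in load_names(path):
        record_name(path, name)

recorded = load_names(path)
print(sorted(recorded))
```

Exact matching avoids skipping a user whose name happens to be contained in another recorded name, at the cost of being sensitive to stray whitespace, which is why each line is stripped on load.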

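4). Send the private message. (You can capture this endpoint yourself by sending a private message on cnblogs and watching the Network panel, as above.) The code is: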

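# sendMessage.py
# encoding=utf8
import urllib
import urllib2

def send(name, content):
    # The host is omitted in the source; prepend the site's domain before running.
    url = "/ajax/msg/send"
    header = {"Cookie": "**********"}  # paste the cookie from your own logged-in session
    params = {
        "incept": name,      # recipient's user name
        "title": "脚本私信",  # message title ("Script message")
        "content": content,  # message body, sent as-is rather than as its repr
    }
    url_param = urllib.urlencode(params)
    request = urllib2.Request(url=url, headers=header, data=url_param)
    print urllib2.urlopen(request).read()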

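a). The Cookie value in header must be copied from a logged-in cnblogs session (the original screenshot masks it).

b). The key names in params indicate what each parameter does.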
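To make the shape of that request concrete without sending anything, the Python 3 sketch below only builds the request object and inspects it. The helper build_send_request is hypothetical, https://example.invalid is a placeholder host (the post does not give the real one), and PLACEHOLDER stands for the login cookie. Only the /ajax/msg/send path and the three form keys come from the post.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request


def build_send_request(base_url, cookie, recipient, title, content):
    """Build (but do not send) the form POST that delivers one private message."""
    form = {"incept": recipient, "title": title, "content": content}
    body = urlencode(form).encode("ascii")  # percent-encoded, so plain ASCII bytes
    return Request(base_url + "/ajax/msg/send", data=body, headers={"Cookie": cookie})


# Placeholder host and cookie: both are assumptions for the sake of a runnable example.
request = build_send_request(
    "https://example.invalid", "PLACEHOLDER", "alice", "脚本私信", "hello"
)
fields = parse_qs(request.data.decode("ascii"))
print(request.get_method(), request.full_url)
print(fields)
```

Supplying a body is what makes urllib issue a POST, and the Cookie header is what ties the request to a logged-in account.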

5). Timer

A Python timer example:

import threading

def sayhello():
    print "hello world"
    # Schedule the next call two seconds from now, so the function keeps repeating.
    t = threading.Timer(2.0, sayhello)
    t.start()
    return

sayhello()
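The example above reschedules itself forever. A bounded variant is sketched here in Python 3 to show the same threading.Timer pattern with a clear stopping point. The helper run_every and its times argument are assumptions added for illustration, and the interval is shortened so the demo finishes quickly.

```python
import threading


def run_every(interval, action, times):
    """Call `action` `times` times, `interval` seconds apart; return an Event set when done."""
    done = threading.Event()
    if times <= 0:
        done.set()
        return done

    def tick(remaining):
        action()
        if remaining > 1:
            # Re-arm a one-shot timer for the next run, as sayhello does.
            threading.Timer(interval, tick, args=(remaining - 1,)).start()
        else:
            done.set()

    tick(times)  # the first call runs immediately in the caller's thread
    return done


calls = []
finished = run_every(0.01, lambda: calls.append("hello world"), times=3)
finished.wait(timeout=5)  # block until the last run has happened
print(calls)
```

Each Timer fires once on its own thread, so repetition comes from the callback arming a fresh timer, and the Event gives the caller a way to wait for the final run.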

Appendix: complete code

# spider.py
# encoding=utf8
import urllib2
import urllib
import json
import parseHtml
import sendMessage
import FileOperation
import threading

def getCommentsHtml(index):
    # The host is omitted in the source; prepend the blog's domain before running.
    url = "/mvc/blog/GetComments.aspx"
    params = {
        "postId": "5226546",    # do not monitor mine
        "blogApp": "hearzeus",  # do not monitor mine
        "pageIndex": str(index),
        "anchorCommentId": "0",
        "_": "1456908852216",
    }
    url_params = urllib.urlencode(params)
    return json.loads(urllib2.urlopen(url, data=url_params).read())['commentsHtml']

def getCommentsUser(html):
    return parseHtml.parse(html)

def sendHello(name):
    # Message body: "Sent by a script. Apologies for any disturbance."
    sendMessage.send(name, "脚本私信。如有打扰，还望海涵")

def main():
    for i in range(10):
        html = getCommentsHtml(i)
        if html.count(u"comment_date") < 1:
            # No comments on this page: stop and check again in 10 seconds.
            print "Traversal finished: " + str(i)
            t = threading.Timer(10.0, main)
            t.start()
            return
        list_name = getCommentsUser(html)
        for name in list_name:
            if FileOperation.checkName(name) != True:
                sendHello(name)
                FileOperation.wirteName(name)

main()

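The other three .py files are given above.

Note:

Be sure to change the blog page being monitored. Do not monitor mine!!!!

I have said it three times!!!!!!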