How to crawl a WeChat official account's pushed articles every day with a Python crawler

Date: 2020-11-17 10:45:03

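The article before last: How to crawl WeChat official account articles with a crawler

The previous article: How to crawl WeChat official account articles with a Python crawler (Part 2)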

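Those two articles covered how to collect the URLs of an official account's historical articles in bulk, and how to crawl the articles in bulk, extract the required data, and save it to a database.

This article shows how to automatically crawl the articles an official account pushes each day, extract the data, and save it to the database.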

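First, a word about wxpy, a WeChat API library. wxpy is built on top of itchat; it improves usability through extensive interface refinements and adds a rich set of extra features.

wxpy can receive the articles pushed by official accounts, but out of the box it only exposes each article's title (title), summary (summary), link (url), and cover image (cover). I added two more attributes on top of that: the article's publish time (pub_time) and its source (source).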

@property
def articles(self):
    """
    List of articles in an official-account push
    (the first article's title/URL match the message's text/url).

    Each article has the following attributes:

    * `title`: title
    * `summary`: summary
    * `url`: article URL
    * `cover`: cover or thumbnail URL
    * `pub_time`: publish time (added)
    * `source`: source (added)
    """
    from wxpy import MP

    if self.type == SHARING and isinstance(self.sender, MP):
        tree = ETree.fromstring(self.raw["Content"])
        # noinspection SpellCheckingInspection
        items = tree.findall(".//mmreader/category/item")

        article_list = list()

        for item in items:
            def find_text(tag):
                found = item.find(tag)
                if found is not None:
                    return found.text

            article = Article()
            article.title = find_text("title")
            article.summary = find_text("digest")
            article.url = find_text("url")
            article.cover = find_text("cover")
            # The two attributes added for this article
            article.pub_time = find_text("pub_time")
            article.source = find_text(".//name")
            article_list.append(article)

        return article_list

The same two attributes also need to be added in article.py:

# Publish time
self.pub_time = None
# Source
self.source = None

The message XML actually carries a few other attributes as well. All of them can be read with ElementTree, so pick whichever ones you need.
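To make the ElementTree lookups concrete, here is a minimal sketch that runs the same paths against a made-up payload. The XML below is an assumption: it only mirrors the tag names used in the code above (`mmreader/category/item`, `digest`, `pub_time`, a nested `name`), the wrapper elements and the timestamp-style `pub_time` value are hypothetical, and `parse_items` is an illustrative helper, not part of wxpy.

```python
import xml.etree.ElementTree as ETree

# Made-up payload for illustration only; a real push message has more fields
# and its exact layout may differ.
SAMPLE_XML = """
<msg>
  <appmsg>
    <mmreader>
      <category>
        <item>
          <title>Example title</title>
          <digest>Example summary</digest>
          <url>https://example.com/a1</url>
          <cover>https://example.com/a1.jpg</cover>
          <pub_time>1535500000</pub_time>
          <sources><source><name>Example Account</name></source></sources>
        </item>
      </category>
    </mmreader>
  </appmsg>
</msg>
"""


def parse_items(xml_text):
    """Return one dict per <item>, using the same lookups as the property above."""
    tree = ETree.fromstring(xml_text.strip())
    result = []
    for item in tree.findall(".//mmreader/category/item"):
        def find_text(tag, item=item):
            # A missing tag yields None instead of raising.
            found = item.find(tag)
            return found.text if found is not None else None

        result.append({
            "title": find_text("title"),
            "summary": find_text("digest"),
            "url": find_text("url"),
            "cover": find_text("cover"),
            "pub_time": find_text("pub_time"),
            "source": find_text(".//name"),
        })
    return result


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_XML):
        print(entry)
```

Any other tag inside `<item>` can be read the same way by passing its name or path to `find_text`.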

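With that change in place, the main-logic function from the previous article, How to crawl WeChat official account articles with a Python crawler (Part 2), also needs to be updated: it now takes the URL and the publish time as parameters instead of a list.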

# Requires at module level: import re, import sys, import pymysql
def wechat_run(self, url, pub_time):  # Main logic
    # Open the database connection (host / user / password / database name)
    db = pymysql.connect(host="localhost", user="root", password="root",
                         database="weixin_database")
    # Create a cursor object with cursor()
    cursor = db.cursor()

    html_str = self.parse_url(url)
    content_list = self.get_content_list(html_str)
    title = "".join(content_list[0]["title"])
    # other1 = "".join(content_list[0]["other"])
    other = "\n".join(content_list[0]["other"])
    create_time = pub_time
    # print(other)

    # Case number
    p1 = re.compile(r"\s*[（|(]20\d+[）|)]\s*[\u4e00-\u9fa5]*[\d]*[\u4e00-\u9fa5]+[\d]+号", re.S)
    anhao = re.search(p1, other)
    if anhao:
        anhao = anhao.group().replace("\n", "")
    else:
        anhao = ""

    # Summary of the ruling
    p2 = re.compile(r"\s[【]*裁判要[\u4e00-\u9fa5]\s*.*?(?=[【]|裁判文)", re.S)
    zhaiyao = "".join(re.findall(p2, other)).replace("\n", "")
    # print(zhaiyao)

    p3 = re.compile("<div>.*?</div>", re.S)
    html = re.search(p3, html_str)
    if html:
        html = html.group().replace("\n", "")
    else:
        html = html_str.replace("\n", "")

    sql = """INSERT INTO weixin_table(title,url,anhao,yaozhi,other,html,create_time,type_id)
        VALUES ({},{},{},{},{},{},{},{})""".format(
        '"' + title + '"', '"' + url + '"', '"' + anhao + '"', '"' + zhaiyao + '"',
        '"' + other + '"', '"' + html + '"', create_time, 4)
    # print(sql)
    try:
        # Execute the SQL statement
        cursor.execute(sql)
        # Commit to the database
        db.commit()
        print("Data inserted successfully")
    except Exception:
        print("Data insertion failed:")
        info = sys.exc_info()
        print(info[0], ":", info[1])
        # Roll back on error
        db.rollback()

    # 3. Save the HTML
    page_name = title
    self.save_html(html, page_name)
    # Close the database connection
    db.close()
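Building the INSERT statement by wrapping each value in double quotes fails as soon as a title or article body itself contains a double quote, and it leaves the query open to SQL injection. A common alternative is to pass the values as query parameters. The sketch below shows the idea with the standard-library `sqlite3` module so that it runs without a MySQL server. The table definition and column types are assumptions, and `insert_article` and `demo` are hypothetical helpers. With pymysql the same pattern applies, using `%s` placeholders instead of `?`.

```python
import sqlite3


def insert_article(conn, row):
    """Insert one article row using placeholders instead of string formatting."""
    sql = (
        "INSERT INTO weixin_table"
        "(title, url, anhao, yaozhi, other, html, create_time, type_id) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)"
    )
    try:
        conn.execute(sql, row)
        conn.commit()
        return True
    except sqlite3.Error:
        # Roll back on error, mirroring the original function
        conn.rollback()
        return False


def demo():
    conn = sqlite3.connect(":memory:")
    # Assumed schema, for illustration only
    conn.execute(
        "CREATE TABLE weixin_table ("
        "title TEXT, url TEXT, anhao TEXT, yaozhi TEXT, "
        "other TEXT, html TEXT, create_time INTEGER, type_id INTEGER)"
    )
    row = (
        'A "quoted" title',           # would break the quote-wrapping approach
        "https://example.com/a1",
        "",
        "",
        "body with 'single' quotes",
        "<div>x</div>",
        1535500000,
        4,
    )
    ok = insert_article(conn, row)
    stored = conn.execute("SELECT title, other FROM weixin_table").fetchone()
    conn.close()
    return ok, stored


if __name__ == "__main__":
    print(demo())
```

The driver handles quoting and escaping, so the values reach the table unchanged.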

Next, write the function that receives WeChat messages and official-account pushes:

# -*- coding: utf-8 -*-
# @Time : /8/29 8:31 AM
# @Author : jingyoushui
# @Email : jingyoushui@
# @File : wechat.py
# @Software: PyCharm
from beijing import WeixinSpider_1
from wxpy import *
import pandas as pd

bot = Bot(cache_path=True, console_qr=True)


# Print messages from other friends, group chats, and official accounts
@bot.register()
def print_others(msg):
    print("msg:" + str(msg))
    articles = msg.articles
    if articles is not None:
        for article in articles:
            a = str(article.source)
            print("title:" + str(article.title))
            print("url:" + str(article.url))
            # str() guards against a missing pub_time (None)
            print("pub_time:" + str(article.pub_time))
            print("source:" + a)
            if a != "KMTV" and a != "北京行政裁判观察":
                pass
            else:
                content_list = []
                items = []
                items.append(str(article.title))
                url = str(article.url)
                items.append(url)
                pub_time = article.pub_time
                items.append(pub_time)
                content_list.append(items)
                name = ["title", "link", "create_time"]
                test = pd.DataFrame(columns=name, data=content_list)
                if a == "KMTV":
                    test.to_csv("everyday_url/kmtv.csv", mode="a", encoding="utf-8")
                    print("Saved successfully")
                if a == "北京行政裁判观察":
                    test.to_csv("everyday_url/beijing.csv", mode="a", encoding="utf-8")
                    print("Saved successfully")
                    weixin_spider_1 = WeixinSpider_1()
                    weixin_spider_1.wechat_run(url, pub_time)


if __name__ == "__main__":
    # Block the thread
    bot.join()

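The handler first reads the title, URL, publish time, source, and other details of each article pushed by the target official accounts and saves them to a CSV file. It then calls the wechat_run function of the WeixinSpider_1 class, which parses the URL, extracts the data, and saves it to the database.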
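One detail of the CSV step: `DataFrame.to_csv` with `mode="a"` and default settings writes the header row (and the index column) again on every call, so the file ends up with repeated header lines. A minimal sketch of an alternative using the standard-library `csv` module follows. `append_row` is a hypothetical helper, and the column names are taken from the code above.

```python
import csv
import os

FIELDS = ["title", "link", "create_time"]


def append_row(path, title, link, create_time):
    """Append one record; write the header only when the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    # Create the target directory (e.g. everyday_url/) if it does not exist yet
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(FIELDS)
        writer.writerow([title, link, create_time])


if __name__ == "__main__":
    append_row("everyday_url/beijing.csv", "Example title",
               "https://example.com/a1", "1535500000")
```

With pandas, passing `header=False` and `index=False` on later calls would be one way to get a similar result.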

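Run the program in a terminal and it prints a QR code; scan it with WeChat on your phone to log in.

The program runs by blocking the thread, so it stays logged in indefinitely, unless you log in to the same account on the web, which kicks this session out and makes the program exit.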

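Update, August 30:

The function that receives official-account pushes can be registered for a specific official account.

The registration function: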

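def register(
        self, chats=None, msg_types=None,
        except_self=True, run_async=True, enabled=True
):
    """
    Decorator: registers a message configuration

    :param chats: chat objects the message belongs to: a single chat object or chat type, or a list of several; when empty, all chat objects match
    :param msg_types: message types: a single message type or a list of several; when empty, all message types match (except SYSTEM messages)
    :param except_self: exclude messages sent by yourself
    :param run_async: whether to run the configured function asynchronously, which can improve response speed
    :param enabled: default enabled state of this configuration; it can be switched on or off dynamically later
    """

    def do_register(func):
        self.registered.append(MessageConfig(
            bot=self, func=func, chats=chats, msg_types=msg_types,
            except_self=except_self, run_async=run_async, enabled=enabled
        ))

        return func

    return do_register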

The message-receiving function can then be changed to:

bot = Bot(cache_path=True, console_qr=True)
found1 = bot.mps().search("北京行政裁判观察")
print(found1)


@bot.register(found1)
def print_found1(msg):
    articles = msg.articles
    if articles is not None:
        # A single push may contain several articles
        for article in articles:
            a = str(article.source)
            print("title:" + str(article.title))
            print("url:" + str(article.url))
            # str() guards against a missing pub_time (None)
            print("pub_time:" + str(article.pub_time))
            print("source:" + a)
            content_list = []
            items = []
            # Article title
            items.append(str(article.title))
            # Article link
            url = str(article.url)
            items.append(url)
            # Publish time
            pub_time = article.pub_time
            items.append(pub_time)
            content_list.append(items)
            # Save to the CSV file
            name = ["title", "link", "create_time"]
            test = pd.DataFrame(columns=name, data=content_list)
            test.to_csv("everyday_url/beijing.csv", mode="a", encoding="utf-8")
            print("Saved successfully")
            # Call WeixinSpider_1 to parse the URL, extract the data, and save it to the database
            weixin_spider_1 = WeixinSpider_1()
            weixin_spider_1.wechat_run(url, pub_time)
