Using the jsoup crawler library in Java to scrape web page content and images and write them into a Word document

Date: 2021-06-22 12:31:42

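Background:

A friend recently asked me to help write a small feature. The requirement was roughly this: given 10,000 links, scrape a particular passage of the article behind each link together with one image, and write every five links into one Word document.

The basic idea is to first find a crawler framework and get the content and images of a linked page written into Word, and then to group the 10,000 links by the remainder of a division (modulo) and start several threads to do the writing.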
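The grouping step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code that was finally used: the class name `LinkBatcher` and both method names are hypothetical, batches hold at most five links (one batch per Word document), and batch `i` goes to worker `i % workerCount`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: the class and method names are illustrative only.
public class LinkBatcher {

    /** Splits the links into consecutive batches of at most batchSize (one batch per Word document). */
    public static List<List<String>> partition(List<String> links, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < links.size(); start += batchSize) {
            int end = Math.min(start + batchSize, links.size());
            batches.add(new ArrayList<>(links.subList(start, end)));
        }
        return batches;
    }

    /** Assigns batch i to worker (i % workerCount); returns, per worker, the indices of its batches. */
    public static List<List<Integer>> assignByModulo(int batchCount, int workerCount) {
        if (workerCount <= 0) {
            throw new IllegalArgumentException("workerCount must be positive");
        }
        List<List<Integer>> perWorker = new ArrayList<>();
        for (int w = 0; w < workerCount; w++) {
            perWorker.add(new ArrayList<>());
        }
        for (int i = 0; i < batchCount; i++) {
            perWorker.get(i % workerCount).add(i);
        }
        return perWorker;
    }
}
```

With 10,000 links and a batch size of 5, `partition` yields 2,000 batches; `assignByModulo(2000, 4)` would then give each of four workers 500 batch indices.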

1. Add the Maven dependency

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

2. Write a unit test

Pick a simple Baidu page and write a small demo program to try out what the framework can do (test link).

The test code is as follows:

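import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

@org.junit.Test
public void testJsoup() {
    try {
        // Note: the host part of this URL is missing here; Jsoup.connect needs an absolute http(s) URL.
        String allUrl = "/newspage/data/landingshare?context=%7B%22nid%22%3A%22news_9881067036128581241%22%2C%22sourceFrom%22%3A%22bjh%22%7D";
        Document docAll = Jsoup.connect(allUrl)
                .data("query", "Java")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36")
                .get();

        // Title (the tag name is "title")
        Elements title = docAll.getElementsByTag("title");
        System.out.println(title.text());

        // Paragraph text: concatenate the text of every <p> element
        Elements contexts = docAll.getElementsByTag("p");
        StringBuilder context = new StringBuilder();
        for (Element element : contexts) {
            context.append(element.text());
        }
        System.out.println(context);

        // Images: print the src attribute of every <img> element
        Elements img = docAll.getElementsByTag("img");
        for (Element image : img) {
            String src = image.attr("src");
            System.out.println(src);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}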

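Running it shows that the returned Document holds the content of the whole page. The jsoup website has detailed documentation on parsing that content; here I only do some simple parsing.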

3. Reading and downloading images

// Needs java.io.BufferedInputStream, java.io.File and java.io.FileOutputStream.
// docAll is the Document fetched in the previous test.
Elements img = docAll.getElementsByTag("img");
int i = 1;
for (Element image : img) {
    // Image URL
    String src = image.attr("src");
    File file = new File("D://" + i + ".JPEG");
    // try-with-resources closes both streams even if the copy fails
    try (BufferedInputStream in = Jsoup.connect(src).ignoreContentType(true).execute().bodyStream();
         FileOutputStream fo = new FileOutputStream(file)) {
        byte[] buf = new byte[1024];
        int length;
        while ((length = in.read(buf, 0, buf.length)) != -1) {
            fo.write(buf, 0, length);
        }
    }
    System.out.println(src + " downloaded");
    i++;
}

Downloading images in the test also works without problems.
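One thing the download loop glosses over is that a `src` attribute may be a relative path, and that not every image is a JPEG. The following is a minimal sketch using only the JDK, assuming the page URL is known; the class name `ImageUrls` and its two methods are hypothetical helpers, not part of jsoup.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper: names are illustrative and not part of jsoup.
public class ImageUrls {

    /**
     * Resolves a possibly relative src against the URL of the page it came from.
     * URI.create throws IllegalArgumentException for strings that are not valid URIs
     * (for example, ones containing spaces).
     */
    public static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src.trim()).toString();
    }

    /** Returns the lower-case file extension found in the URL path, or the fallback if there is none. */
    public static String extensionOf(String url, String fallback) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return fallback;
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot > slash && dot < path.length() - 1) {
            return path.substring(dot + 1).toLowerCase(Locale.ROOT);
        }
        return fallback;
    }
}
```

In the loop, the resolved URL would be passed to `Jsoup.connect`, and the extension would replace the hard-coded `.JPEG` in the file name. jsoup also offers `absUrl("src")` on an element, which resolves against the document's base URI.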

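4. Finally, write the scraped content into a Word document

Writing text into the document is fairly simple; the key part is writing the images. After looking up some material and running various tests, I found a very simple tool.

(Reference link attached.)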

5. Add the tool dependencies

<dependency>
    <groupId>com.lowagie</groupId>
    <artifactId>itext</artifactId>
    <version>2.1.7</version>
</dependency>
<dependency>
    <groupId>com.lowagie</groupId>
    <artifactId>itext-rtf</artifactId>
    <version>2.1.7</version>
</dependency>
<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itext-asian</artifactId>
    <version>5.2.0</version>
</dependency>

The test code is as follows:

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.lowagie.text.Font;
import com.lowagie.text.Image;
import com.lowagie.text.PageSize;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.BaseFont;
import com.lowagie.text.rtf.RtfWriter2;

@org.junit.Test
public void test() throws Exception {
    String filePath = "D:\\a.doc";
    try {
        // Note: the host part of this URL is missing here; Jsoup.connect needs an absolute http(s) URL.
        String allUrl = "/newspage/data/landingshare?context=%7B%22nid%22%3A%22news_9881067036128581241%22%2C%22sourceFrom%22%3A%22bjh%22%7D";
        Document docAll = Jsoup.connect(allUrl)
                .data("query", "Java")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36")
                .cookie("auth", "token")
                .timeout(3000)
                .get();

        // Set the page size
        com.lowagie.text.Document document = new com.lowagie.text.Document(PageSize.A4);
        // Attach a writer to the document; the writer is what saves the document to disk
        File file = new File(filePath);
        RtfWriter2.getInstance(document, new FileOutputStream(file));
        document.open();

        // Base font: Helvetica with WinAnsi encoding, which does not cover Chinese glyphs
        BaseFont bfChinese = BaseFont.createFont(BaseFont.HELVETICA, BaseFont.WINANSI, BaseFont.NOT_EMBEDDED);
        // Title font style (only used if the setFont call below is enabled)
        Font titleFont = new Font(bfChinese, 12, Font.BOLD);
        // Body font style
        // Font contextFont = new Font(bfChinese, 10, Font.NORMAL);

        // Title
        String webTitle = docAll.getElementsByTag("title").text();
        // Body: concatenate the text of every <p> element
        Elements contexts = docAll.getElementsByTag("p");
        StringBuilder contextString = new StringBuilder();
        for (Element context : contexts) {
            contextString.append(context.text());
        }

        Paragraph title = new Paragraph(webTitle);
        // Center the title
        title.setAlignment(com.lowagie.text.Element.ALIGN_CENTER);
        // title.setFont(titleFont);
        document.add(title);

        // Body paragraph, left-aligned
        Paragraph context = new Paragraph(contextString.toString());
        context.setAlignment(com.lowagie.text.Element.ALIGN_LEFT);
        // context.setFont(contextFont);
        // Space between this paragraph and the previous one (the title)
        context.setSpacingBefore(5);
        // Indentation of the first line
        context.setFirstLineIndent(20);
        document.add(context);

        // FontFactory combined with Font and Color can produce other styles, for example underlined text:
        // Paragraph underline = new Paragraph("underlined text", FontFactory.getFont(
        //         FontFactory.HELVETICA_BOLDOBLIQUE, 18, Font.UNDERLINE, new Color(0, 0, 255)));
        // document.add(underline);

        // Images: Image.getInstance accepts either a path or a byte array
        Elements imgs = docAll.getElementsByTag("img");
        for (Element image : imgs) {
            // Image URL
            String src = image.attr("src");
            // If a site checks the Referer header, Jsoup.connect(src).referrer(src) can be used here instead.
            try (BufferedInputStream in = Jsoup.connect(src).ignoreContentType(true).execute().bodyStream();
                 ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                byte[] buf = new byte[1024];
                int length;
                while ((length = in.read(buf, 0, buf.length)) != -1) {
                    out.write(buf, 0, length);
                }
                Image img = Image.getInstance(out.toByteArray());
                img.setAbsolutePosition(0, 0);
                // Image alignment
                img.setAlignment(Image.LEFT);
                // img.scaleAbsolute(60, 60);  // set the display size directly
                // img.scalePercent(50);       // display at 50% of the original size
                // img.scalePercent(25, 12);   // scale width and height by separate percentages
                // img.setRotation(30);        // rotate the image by an angle
                document.add(img);
            }
        }
        document.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

With the basic tests done, implementing the rest of the feature is straightforward.
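For the remaining multithreaded part, one possible shape is sketched below. It is only a sketch under assumptions: `BatchRunner` and `runAll` are hypothetical names, each batch is a list of up to five links, and the `handler` stands in for the scrape-and-write logic from the test above. It hands batches to a fixed thread pool through a shared queue, which is a swapped-in alternative to assigning batches to threads by modulo.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper: names are illustrative only.
public class BatchRunner {

    /**
     * Runs the handler once per batch on a fixed-size thread pool.
     * Returns true if every batch finished before the timeout, false otherwise.
     */
    public static boolean runAll(List<List<String>> batches, int threads,
                                 Consumer<List<String>> handler, long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (List<String> batch : batches) {
            pool.submit(() -> {
                try {
                    handler.accept(batch);
                } catch (RuntimeException e) {
                    // A failed batch is logged and skipped so the other batches still run.
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown(); // no new tasks; already submitted batches still run
        try {
            return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A call might look like `BatchRunner.runAll(batches, 4, batch -> writeWordFile(batch), 3600)`, where `writeWordFile` is a hypothetical method that writes one Word document per batch to its own file, so that threads never share a writer.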
