Zong Laoshi's Self-Taught Jsoup

Keywords: Java Apache Maven nexus Attribute

Introduction to Jsoup

jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

In web crawling, the usual division of labor is this: HttpClient fetches the web page, and Jsoup then extracts the specific information we need from it, using its powerful jQuery-style CSS selectors to get the required data.
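To make the selector-based extraction concrete, here is a minimal sketch that parses an HTML string and queries it with CSS selectors. The sample markup, the class name SelectorSketch, and the helper textsOf are made up for illustration; the jsoup calls used are Jsoup.parse, select, text, and attr.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class SelectorSketch {
    // Hypothetical sample markup, invented for this sketch
    static final String HTML =
        "<html><head><title>Sample page</title></head><body>"
      + "<div id=\"posts\">"
      + "<a class=\"post-title\" href=\"/p/1\">First post</a>"
      + "<a class=\"post-title\" href=\"/p/2\">Second post</a>"
      + "<a class=\"other\" href=\"/about\">About</a>"
      + "</div></body></html>";

    // Returns the text of every element matching the given CSS selector
    static List<String> textsOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);        // parse HTML text into a Document
        Elements matches = doc.select(cssQuery); // jQuery-style CSS selection
        List<String> texts = new ArrayList<>();
        for (Element e : matches) {
            texts.add(e.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Links with class "post-title" inside the element with id "posts"
        System.out.println(textsOf(HTML, "#posts a.post-title"));

        // Every link that has an href attribute, with its target and text
        Document doc = Jsoup.parse(HTML);
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}
```

The same selector strings should work on a page fetched over the network, since select only depends on the parsed Document and not on where the HTML came from.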

 

Official address of Jsoup: https://jsoup.org/

 

Jsoup latest download: https://jsoup.org/download

Jsoup documentation: https://jsoup.org/cookbook/introduction/parsing-a-document

Maven dependency:

  

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>

Aside: when building a Maven project, downloading jar packages can be very slow. You can use the Alibaba Cloud (Aliyun) repository mirror, which is much faster.

How to set it up:

Copy the settings.xml file from the conf directory of the Maven installation to the ~/.m2 folder, and add the following inside the <mirrors></mirrors> tag:

<mirror>
    <id>nexus-aliyun</id>
    <mirrorOf>central</mirrorOf>
    <name>Nexus aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>

Then restart Eclipse; the project should now build much faster.

Add the jsoup and httpclient dependencies to the pom.xml file:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.5</version>
</dependency>

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.2</version>
</dependency>

 

Example code that fetches the content of the cnblogs (Blog Park) homepage:

package com.zhjxtf.jsoup;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Demo01 {
    public static void main(String[] args) throws Exception {
        HttpGet httpGet = new HttpGet("https://www.cnblogs.com/"); // Create the GET request
        // try-with-resources closes the client and the response, releasing system resources
        try (CloseableHttpClient httpClient = HttpClients.createDefault(); // Create the HttpClient instance
             CloseableHttpResponse response = httpClient.execute(httpGet)) { // Execute the GET request
            HttpEntity httpEntity = response.getEntity(); // Get the returned entity
            String content = EntityUtils.toString(httpEntity, "utf-8"); // Read the page body as UTF-8
            // System.out.println("The content of the web page is: " + content); // Print the raw page
            Document doc = Jsoup.parse(content); // Parse the page into a Document object

            Elements elements = doc.getElementsByTag("title"); // Find the title elements
            Element elementTitle = elements.first(); // First match, or null if there is none
            if (elementTitle != null) {
                System.out.println(elementTitle.text()); // Text of the element
                System.out.println(elementTitle.html()); // Inner HTML of the element
            }

            Element element2 = doc.getElementById("site_nav_top"); // Find an element by id (null if absent)
            if (element2 != null) {
                System.out.println(element2.text()); // Text of the element
                System.out.println(element2.html()); // Inner HTML of the element
            }
        }
    }
}
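As an alternative to HttpClient, jsoup can also fetch and parse a URL by itself through Jsoup.connect. The following is only a sketch under assumptions: the class name ConnectSketch, the helper names, the user agent string, and the 10-second timeout are illustrative choices, not something the example above prescribes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class ConnectSketch {
    // Fetches the page with jsoup's own HTTP support and returns the <title> text
    static String fetchTitle(String url) throws IOException {
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0") // assumed value; some sites reject the default agent
                .timeout(10_000)          // assumed timeout, in milliseconds
                .get();                   // send a GET request and parse the response
        return doc.title();
    }

    // Same extraction for HTML text that is already in memory
    static String titleOf(String html) {
        return Jsoup.parse(html).title();
    }

    public static void main(String[] args) throws IOException {
        // Offline: parse a small, made-up HTML string
        System.out.println(titleOf("<html><head><title>Offline sample</title></head></html>"));
        // Online: requires network access to the site used in the example above
        System.out.println(fetchTitle("https://www.cnblogs.com/"));
    }
}
```

HttpClient remains the better fit when you need fine control over connections, cookies, or retries; Jsoup.connect is convenient for simple one-off fetches.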

Looking up DOM elements with Jsoup:

getElementsByTag(String tagName) finds DOM elements by tag name

getElementById(String id) finds the DOM element with the given id

getElementsByClass(String className) finds DOM elements by CSS class name

getElementsByAttribute(String key) finds DOM elements that have the given attribute

getElementsByAttributeValue(String key, String value) finds DOM elements by attribute name and attribute value
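The five lookup methods above can be tried side by side on a small document. This is a minimal sketch; the HTML fragment and the class name LookupSketch are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LookupSketch {
    // Hypothetical markup, invented for this sketch
    static final String HTML =
        "<html><body>"
      + "<div id=\"nav\" class=\"menu\"><a href=\"/home\" target=\"_blank\">Home</a></div>"
      + "<p class=\"menu\">Second menu</p>"
      + "<a href=\"/about\">About</a>"
      + "</body></html>";

    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();

        // By tag name: both <a> elements
        System.out.println(doc.getElementsByTag("a").size());                         // 2
        // By id: the single element with id "nav"
        System.out.println(doc.getElementById("nav").text());                         // Home
        // By class name: the <div> and the <p> share class "menu"
        System.out.println(doc.getElementsByClass("menu").size());                    // 2
        // By attribute name: only the first link has a target attribute
        System.out.println(doc.getElementsByAttribute("target").size());              // 1
        // By attribute name and value: the link whose href is exactly "/about"
        System.out.println(doc.getElementsByAttributeValue("href", "/about").text()); // About
    }
}
```

Note that getElementById returns a single Element (or null when nothing matches), while the other four return an Elements collection that may be empty.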

Posted by aceconcepts on Fri, 31 Jan 2020 08:48:06 -0800