Introduction to Jsoup
jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
In web crawling, HttpClient is mainly used to fetch the pages we need. Once a page has been fetched, we use Jsoup to extract the required data with its powerful jQuery-style CSS selectors.
Official address of Jsoup: https://jsoup.org/
Jsoup latest download: https://jsoup.org/download
Jsoup documentation: https://jsoup.org/cookbook/introduction/parsing-a-document
Maven dependency:

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.2</version>
    </dependency>
Aside: when building a Maven project, downloading JAR packages can be very slow. You can use the Alibaba Cloud (Aliyun) repository mirror, which is much faster.
Steps:
Copy the settings.xml file from the conf directory of the Maven installation to the ~/.m2 folder, and
add the following code inside the <mirrors></mirrors> tag:

    <mirror>
        <id>nexus-aliyun</id>
        <mirrorOf>central</mirrorOf>
        <name>Nexus aliyun</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public</url>
    </mirror>

Then restart Eclipse; the project should now build much faster.
Add the jsoup and httpclient dependencies to the pom.xml file:

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.5</version>
    </dependency>

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.2</version>
    </dependency>

Example code that fetches the homepage of cnblogs (Blog Park):

    package com.zhjxtf.jsoup;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class Demo01 {
        public static void main(String[] args) throws Exception {
            CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
            HttpGet httpGet = new HttpGet("https://www.cnblogs.com/");    // create the GET request
            CloseableHttpResponse response = httpClient.execute(httpGet); // execute the GET request
            HttpEntity httpEntity = response.getEntity();                 // get the returned entity
            // Read the body with an explicit encoding (an entity can only be consumed once)
            String content = EntityUtils.toString(httpEntity, "utf-8");
            Document doc = Jsoup.parse(content);                  // parse the page into a Document
            Elements elements = doc.getElementsByTag("title");    // get the <title> elements
            Element elementTitle = elements.get(0);               // take the first one
            System.out.println(elementTitle.text());              // text content of the element
            System.out.println(elementTitle.html());              // inner HTML of the element
            Element element2 = doc.getElementById("site_nav_top"); // look up an element by id (null if absent)
            System.out.println(element2.text());                  // text content of the element
            System.out.println(element2.html());                  // inner HTML of the element
            response.close();   // close the response and release system resources
            httpClient.close(); // close the client as well
        }
    }

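The introduction mentions jQuery-style CSS selectors, which the example above does not use. Here is a minimal sketch of `select`, assuming jsoup is on the classpath. The class name `SelectorDemo`, its helper methods, and the inline HTML are hypothetical and only for illustration; the markup is not the real cnblogs page, so no network access is needed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class SelectorDemo {
    // Hypothetical markup used only for illustration.
    static final String HTML =
        "<html><head><title>Sample page</title></head><body>"
        + "<div id=\"nav\"><a href=\"/home\">Home</a><a href=\"/news\">News</a></div>"
        + "<div class=\"post\"><h3><a href=\"/p/1\">First post</a></h3></div>"
        + "<div class=\"post\"><h3><a href=\"/p/2\">Second post</a></h3></div>"
        + "</body></html>";

    // Collects the text of every link inside an <h3> within an element of class "post".
    static List<String> postTitles(Document doc) {
        List<String> titles = new ArrayList<>();
        Elements links = doc.select("div.post h3 a"); // descendant selector, as in CSS/jQuery
        for (Element link : links) {
            titles.add(link.text());
        }
        return titles;
    }

    // Returns the href of the first link under the element with the given id,
    // or null when nothing matches.
    static String firstHref(Document doc, String id) {
        Element link = doc.select("#" + id + " a[href]").first();
        return link == null ? null : link.attr("href");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(postTitles(doc));       // titles of both posts
        System.out.println(firstHref(doc, "nav")); // href of the first navigation link
    }
}
```

In a crawler, the `Document` would come from the HttpClient response body instead of the inline string, and the selectors would need to match the target page's actual markup.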
Jsoup methods for looking up DOM elements:
getElementsByTag(String tagName) finds DOM elements by tag name
getElementById(String id) finds the DOM element with the given id
getElementsByClass(String className) finds DOM elements by CSS class name
getElementsByAttribute(String key) finds DOM elements that have the given attribute
getElementsByAttributeValue(String key, String value) finds DOM elements by attribute name and attribute value
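The five lookup methods listed above can be tried together on a small in-memory document. This is a minimal sketch that assumes jsoup is on the classpath. The class name `LookupDemo`, the `summarize` helper, and the inline HTML are hypothetical and only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class LookupDemo {
    // Hypothetical markup used only for illustration.
    static final String HTML =
        "<html><body>"
        + "<div id=\"header\" class=\"bar\">Header</div>"
        + "<p class=\"note\">First note</p>"
        + "<p class=\"note\" data-kind=\"tip\">Second note</p>"
        + "<a href=\"https://jsoup.org/\" target=\"_blank\">jsoup</a>"
        + "</body></html>";

    // Runs each lookup method once and returns one summary line per method.
    static List<String> summarize(Document doc) {
        List<String> lines = new ArrayList<>();

        Elements paragraphs = doc.getElementsByTag("p");          // by tag name
        lines.add("p tags: " + paragraphs.size());

        Element header = doc.getElementById("header");            // by id; null if absent
        lines.add("header: " + (header == null ? "(none)" : header.text()));

        Elements notes = doc.getElementsByClass("note");          // by CSS class name
        lines.add("notes: " + notes.size());

        Elements withHref = doc.getElementsByAttribute("href");   // by attribute name
        lines.add("href: " + withHref.attr("href"));              // attr of the first match

        Elements tips = doc.getElementsByAttributeValue("data-kind", "tip"); // by name and value
        lines.add("tip: " + tips.text());

        return lines;
    }

    public static void main(String[] args) {
        for (String line : summarize(Jsoup.parse(HTML))) {
            System.out.println(line);
        }
    }
}
```

Note that `getElementById` returns a single `Element` (or null when no element has that id), while the other four methods return an `Elements` collection that may be empty.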