Java crawler learning-01

Keywords: Java crawler Ajax

1, Crawling tools

  1. httpclient
    Simulates a browser request to fetch an HTML page. Once the page is fetched, the data can be extracted with regular expressions.
  2. fastjson
    Parses JSON. httpclient cannot capture HTML that JavaScript generates later through Ajax requests. Instead, find the URL of the Ajax request, request that URL with httpclient to get the JSON string returned by the backend, and parse the string to get the data.
  3. jsoup
    Also parses HTML. It reads the content of nodes through CSS-like selectors.
  4. htmlunit
    httpclient only fetches static HTML and cannot handle HTML generated dynamically by JavaScript. htmlunit fills that gap: after loading a page it can execute the page's asynchronous JavaScript much as a browser would, and once the script has rendered the page, the node information can be read from it.
  5. java-selenium
    Drives a real browser, so the driver for that browser has to be downloaded, and the driver version must match the browser version. selenium operates the website the way a user would. Sites with stricter security usually require a login before data can be read, and requests made after login generally have to carry a token, which makes crawling by URL alone troublesome. Driving a real browser is one way to capture that data.
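Item 1 pairs httpclient with regular expressions. The following is a minimal sketch of that idea using only `java.util.regex`. It pulls the link and title out of anchors shaped like the ones on the page crawled later in this post. The class name `LinkExtractor` and the pattern are assumptions made for illustration, and regular expressions are brittle on real-world HTML, so jsoup (item 3) is usually the sturdier choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
class LinkExtractor {
    // Matches <a href="..."> followed by <div class="tit">...</div> inside the same anchor.
    // (?:(?!</a>).)*? advances lazily but refuses to cross a closing </a>.
    // DOTALL lets '.' span the line breaks between the tags.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s+href=\"([^\"]+)\"[^>]*>(?:(?!</a>).)*?<div class=\"tit\">(.*?)</div>",
            Pattern.DOTALL);

    /** Returns one "url | title" string per matched anchor, in page order. */
    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            result.add(m.group(1) + " | " + m.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String html =
                "<li class=\"clearfix\">\n"
              + "<a href=\"http://www.xinhuanet.com/legal/fzldzt/quanmgjaqjy6.htm\" target=\"_blank\">\n"
              + "<div class=\"img\"><img src=\"a.jpg\" alt=\"x\" /></div>\n"
              + "<div class=\"tit\">2021 The Sixth National Security Education Day</div>\n"
              + "</a>\n</li>";
        extract(html).forEach(System.out::println);
    }
}
```

Anchors without a `<div class="tit">` child, such as the section headings, are skipped because the pattern will not match across `</a>`.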

2, httpclient crawling example

Crawl some data from this url: http://www.xinhuanet.com/legal/ej.htm?page=fzzt

(1) Fetch the page with an httpclient GET request and inspect the HTML

  1. Get the HTML code
import org.apache.http.HttpEntity;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.utils.HttpClientUtils;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class HttpClientDemo {
    public static void main(String[] args) {
        //1. Create the httpclient, which is equivalent to opening a browser
        CloseableHttpClient httpClient = HttpClients.createDefault();
        CloseableHttpResponse response = null;
        //2. Create the GET request, which is equivalent to typing the address into the browser address bar
        HttpGet request = new HttpGet("http://www.xinhuanet.com/legal/ej.htm?page=fzzt");
        //Set a browser-like User-Agent header so the request looks like it comes from a browser
        request.setHeader("User-Agent","Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36");
        //Optional IP proxy (uncommenting these lines also needs imports for
        //org.apache.http.HttpHost and org.apache.http.client.config.RequestConfig)
        //HttpHost proxy = new HttpHost("112.85.168.223", 9999);
        //RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
        //request.setConfig(config);
        try {
            //3. Execute the GET request, which is equivalent to pressing Enter in the address bar
            response = httpClient.execute(request);

            //4. Process the response only if the status is 200
            if(response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                //5. Read the response body as a UTF-8 string
                HttpEntity httpEntity = response.getEntity();
                String html = EntityUtils.toString(httpEntity, "utf-8");
                System.out.println(html);
            } else {
                //Any other status, such as 404 (page does not exist), should be handled case by case; omitted here
                System.out.println("Return status is not 200");
                System.out.println(EntityUtils.toString(response.getEntity(), "utf-8"));
            }
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //6. Close the response and the client
            HttpClientUtils.closeQuietly(response);
            HttpClientUtils.closeQuietly(httpClient);
        }
    }
}
  2. HTML obtained after running
<!DOCTYPE html>
<html>
<head>
<link href="/favicon.ico" rel="shortcut icon" type="image/x-icon" />
<meta charset="utf-8" /><meta name="publishid" content="11165612.0.100.0"/>
<meta name="nodeid" content="0"/>
<meta name="nodename" content="" />

<meta name="viewport" content="width=device-width,initial-scale=1.0,user-scalable=0,minimum-scale=1.0,maximum-scale=1.0" />
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black" />
<meta content="telephone=no" name="format-detection" />
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<script src="http://www.news.cn/global/js/pageCore.js"></script>
<title> </title>
    <meta name="description" content=" " />
<link rel="stylesheet" href="http://lib.news.cn/common/reset.css" />
<script src="http://res.wx.qq.com/open/js/jweixin-1.6.0.js"></script>
<script src="http://lib.news.cn/common/share.js"></script>
<script src="http://lib.news.cn/jquery/jquery1.12.4/jquery.min.js"></script>
<script src="http://lib.news.cn/xpage/xpage.min.js"></script>
<link rel="stylesheet" href="http://lib.news.cn/swiper/swiper3.4.2/swiper.min.css" />
<script src="http://lib.news.cn/swiper/swiper3.4.2/swiper.min.js"></script>
<!--[if lt IE 10]>
<link rel="stylesheet" href="http://lib.news.cn/swiper/swiper2.7.6/idangerous.swiper.css">
<script src="http://lib.news.cn/swiper/swiper2.7.6/idangerous.swiper.min.js"></script>
<style>
  .item-style1 {display:inline-block;width:100%}
  .item-style1 .img{ float:left}
  .item-style5 .img a img{margin-right:1%}
</style>
<![endif]-->
<link rel="stylesheet" href="/politics/newpage2020/styles/ej.css" />
<style>
.lib-top { background:#2f65c0 !important; }
.nav .nav-cont a:first-child { background:#2f65c0 !important; }
.nav .nav-cont a.active { color:##2f65c0 !important; }
.nav .nav-cont a.active:before { background:#2f65c0 !important; } 
.card .title a:before { background:#2f65c0 !important; }
.list .xpage-more-btn { background:#2f65c0 !important; }
.lib-foot { background:#2f65c0 !important; }
</style>
</head>
<body>
<script src="http://lib.news.cn/common/top.js"></script>
<script src="http://lib.news.cn/common/mobHead.js"></script>
<div class="main">
<div class="grid-1200 box clearfix">
<div class="grid-120 mr-30 nav domPc">
<div class="grid-120 nav-cont" id="navCont"></div>
</div>
<div class="grid-700 box-cont">
<div class="list">
<div class="xpage-container" id="list">
<ul class="xpage-content xpage-content-list"></ul>
<div class="xpage-more-btn"></div>
</div>
</div>
</div>
<div class="grid-320 ml-30 hot domPc">
<div class="grid-320 card">
<div class="title"><a href="ej.htm?page=fy" target="_blank">Magic eye</a></div>
<div class="list-style2">
<ul>
<li class="clearfix">
<a href="http://www.news.cn/2021-10/05/c_1127930385.htm" target="_blank">
<div class="img"><img src="titlepic/112793/1127930558_1633396049044_title0h.png" alt="Hiding in the online pharmacy, the quality of some Meitong is in doubt" /></div>
<div class="tit">Hiding in the online pharmacy, the quality of some Meitong is in doubt</div>
</a>
</li>
<li class="clearfix">
<a href="http://www.news.cn/legal/2021-09/27/c_1127906926.htm" target="_blank">
<div class="img"><img src="titlepic/112790/1127907358_1632711219452_title0h.jpg" alt="We will promote the integration of Punishing Financial Corruption and preventing and controlling financial risks" /></div>
<div class="tit">We will promote the integration of Punishing Financial Corruption and preventing and controlling financial risks</div>
</a>
</li>
<li class="clearfix">
<a href="http://www.news.cn/legal/2021-09/27/c_1127906945.htm" target="_blank">
<div class="img"><img src="titlepic/112790/1127907354_1632711121596_title0h.jpg" alt="granary&quot;large rat&quot;Present record" /></div>
<div class="tit">granary"large rat"Present record</div>
</a>
</li>
</ul>
</div>
</div>
<div class="grid-320 card observe" id="observe">
<div class="title"><a href="ej.htm?page=fztj" target="_blank">Illustration of rule of law</a></div>
<div class="swiper-container">
<div class="swiper-wrapper">
<div class="swiper-slide">
<a href="http://www.news.cn/legal/2021-10/06/c_1127932076.htm" target="_blank">
<div class="img"><img src="titlepic/112793/1127932991_1633508973562_title0h.jpg" alt="Explain the evidence requirements and standards of supervision organs in detail" /></div>
<div class="tit">Explain the evidence requirements and standards of supervision organs in detail</div>
</a>
</div>
<div class="swiper-slide">
<a href="http://www.news.cn/legal/2021-08/18/c_1211336053.htm" target="_blank">
<div class="img"><img src="titlepic/112777/1127776844_1629362752818_title0h.jpg" alt="Latest legislation&ldquo;Knowledge points&rdquo;coming!" /></div>
<div class="tit">The latest legislation "knowledge point" is coming!</div>
</a>
</div>
<div class="swiper-slide">
<a href="http://www.xinhuanet.com/politics/2021-08/17/c_1211334769.htm" target="_blank">
<div class="img"><img src="titlepic/112777/1127771007_1629248807162_title0h.jpg" alt="Epidemic prevention and control, these four awareness must have" /></div>
<div class="tit">Epidemic prevention and control, these four awareness must have</div>
</a>
</div>
<div class="swiper-slide">
<a href="http://www.xinhuanet.com/politics/2021-08/05/c_1211317032.htm" target="_blank">
<div class="img"><img src="titlepic/112773/1127735858_1628214570620_title0h.png" alt="These four types of obstacles to epidemic prevention and control&ldquo;evil&rdquo;, Bring to justice!" /></div>
<div class="tit">These four kinds of "evil deeds" that hinder epidemic prevention and control will be brought to justice!</div>
</a>
</div>
<div class="swiper-slide">
<a href="http://www.xinhuanet.com/legal/2021-07/30/c_1211268206.htm" target="_blank">
<div class="img"><img src="titlepic/112771/1127713933_1627634740365_title0h.jpg" alt="eliminate&ldquo;Face thief&rdquo;be imperative!" /></div>
<div class="tit">It is imperative to eliminate the "face thief"!</div>
</a>
</div>
</div>
<div class="swiper-pagination pagination"></div>
</div>
</div>
<div class="grid-320 card special" id="special">
<div class="title"><a href="ej.htm?page=fzzt" target="_blank">Topic of rule of law</a></div>
<div class="list-style3">
<ul class="clearfix">
<li class="clearfix">
<a href="http://www.xinhuanet.com/legal/fzldzt/quanmgjaqjy6.htm" target="_blank">
<div class="img"><img src="titlepic/112735/1127352605_1618907407390_title0h.jpg" alt="2021 The Sixth National Security Education Day" /></div>
<div class="tit">2021 The Sixth National Security Education Day</div>
</a>
</li>
<li class="clearfix">
<a href="http://www.xinhuanet.com/legal/fzldzt/2021zjw5qh.htm" target="_blank">
<div class="img"><img src="titlepic/112702/1127026351_1611631384945_title0h.png" alt="Focus on the Fifth Plenary Session of the 19th Central Commission for Discipline Inspection" /></div>
<div class="tit">Focus on the Fifth Plenary Session of the 19th Central Commission for Discipline Inspection</div>
</a>
</li>
</ul>
</div>
</div>
</div>
</div>
</div>
<script src="http://lib.news.cn/common/foot.js"></script>
<script src="http://lib.news.cn/common/rightFixed.js"></script>
<script src="/legal/newpage/js/ej.js"></script>
<div style="display:none"><div id="fwl">010020020120000000000000011200000000000000</div><script type="text/javascript" src="//webd.home.news.cn/webdig.js?z=1"></script><script type="text/javascript">wd_paramtracker("_wdxid=010020020120000000000000011200000000000000")</script><noscript><img src="//webd.home.news.cn/1.gif?z=1&_wdxid=010020020120000000000000011200000000000000" border="0" /></noscript></div>   </body>
</html>
  3. The HTML is incomplete
    After running, the HTML turns out to be incomplete: the <ul class="xpage-content xpage-content-list"> element that should hold the article list is empty, so part of the data is missing.
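The gap can be seen in the captured markup itself: the `<ul class="xpage-content xpage-content-list">` element has no children, presumably because a script on the page fills it in later. The sketch below checks for that condition with a regular expression. `EmptyListCheck` is a hypothetical name, and the class attribute is taken from the output above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
class EmptyListCheck {
    // Captures whatever sits between the opening <ul ... xpage-content-list ...>
    // tag and the first closing </ul>.
    private static final Pattern LIST = Pattern.compile(
            "<ul[^>]*class=\"[^\"]*xpage-content-list[^\"]*\"[^>]*>(.*?)</ul>",
            Pattern.DOTALL);

    /** True when the list element exists but contains only whitespace. */
    static boolean listIsEmpty(String html) {
        Matcher m = LIST.matcher(html);
        return m.find() && m.group(1).trim().isEmpty();
    }

    public static void main(String[] args) {
        String fetched = "<div class=\"xpage-container\" id=\"list\">\n"
                + "<ul class=\"xpage-content xpage-content-list\"></ul>\n"
                + "</div>";
        System.out.println(listIsEmpty(fetched));
    }
}
```

An empty result here is a hint that the content comes from a separate request, which is what the next step tracks down.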

(2) Get the dynamically loaded data

1. Analyze the URL requests

  1. Open the browser developer tools and capture the requests and responses the page makes
  2. Look through the JS-related requests
  3. After finding the request that carries the target data, copy its URL
  4. Open that URL in a new tab. The request returns JSON, which is exactly the data missing from the HTML page.
  5. The URL carries many parameters, and some of them are not required. Try deleting them and check that the response still contains the data.
    Modified url: http://da.wa.news.cn/nodeart/page?nid=11227931&pgnum=1
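Step 5 trims the request down to the parameters the server actually needs. The same trimming can be sketched in code. `UrlTrimmer` and `keepParams` are hypothetical names, and the extra parameters in the usage example (`callback`, `_`) are made up, since the post does not show the original captured URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the original post.
class UrlTrimmer {
    /** Rebuilds url with only the query parameters whose names are in keep. */
    static String keepParams(String url, Set<String> keep) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to trim
        }
        List<String> kept = new ArrayList<>();
        for (String pair : url.substring(q + 1).split("&")) {
            String name = pair.split("=", 2)[0];
            if (keep.contains(name)) {
                kept.add(pair); // keep the original name=value text and order
            }
        }
        String base = url.substring(0, q);
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }

    public static void main(String[] args) {
        // Made-up stand-in for the captured URL
        String captured = "http://da.wa.news.cn/nodeart/page?nid=11227931&pgnum=1&callback=cb&_=1";
        // prints http://da.wa.news.cn/nodeart/page?nid=11227931&pgnum=1
        System.out.println(keepParams(captured, Set.of("nid", "pgnum")));
    }
}
```

Which parameters are safe to drop still has to be found by trial, as step 5 describes; the code only automates rebuilding the URL.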

(3) Parse the JSON with a JSON library

1. Add the dependency

<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.75</version>
</dependency>

2. Usage

  1. The JSONObject and JSONArray objects
  • First, the JSON format. JSON is built from {} and []. The content of {} is an object; the content of [] is a list (an array).
  • In fastjson, JSONObject represents data written as {}, while JSONArray represents an array, typically of JSONObjects, written as [{}, {},..., {}].
  • For example:
    JSONObject: {"id": "123", "courseID": "huangt test", "title": "submit job", "content": null}
    JSONArray: [{"id": "123", "courseID": "huangt test", "title": "submit job"}, {"content": null, "beginTime": 1398873600000}]
    A JSONArray can also appear inside a JSONObject.

Note:

  • JSONObject [comparable to a Map]: the value (the part after the colon) is obtained through the key (the name before the colon)
  • JSONArray [comparable to a List]: it can be traversed. Calling iterator() on a JSONArray returns an iterator
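The Map and List comparison can be made concrete with JDK collections alone. The sketch below does not use fastjson. It models `{"data": {"list": [{"Title": ...}]}}` as nested `Map` and `List` values and walks them the same way a JSONObject and a JSONArray are walked: look up by key, then iterate. `JsonShape` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not part of the original post.
class JsonShape {
    /** Collects the "Title" of every element under data.list. */
    @SuppressWarnings("unchecked")
    static List<String> titles(Map<String, Object> root) {
        // {} behaves like a map: look a value up by its key
        Map<String, Object> data = (Map<String, Object>) root.get("data");
        // [] behaves like a list: iterate over its elements
        List<Map<String, Object>> list = (List<Map<String, Object>>) data.get("list");
        List<String> titles = new ArrayList<>();
        Iterator<Map<String, Object>> it = list.iterator();
        while (it.hasNext()) {
            titles.add((String) it.next().get("Title"));
        }
        return titles;
    }

    public static void main(String[] args) {
        Map<String, Object> root = Map.of(
                "status", 0,
                "data", Map.of("list", List.of(
                        Map.of("Title", "first"),
                        Map.of("Title", "second"))));
        System.out.println(titles(root));
    }
}
```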
  2. Example
    Goal: extract the value of every Title in the following JSON string.

jsonStr =

{
 "status":0,
 "data":{
     "list":[
         {
             "DocID":1127940962,
             "Title":"2021 National cyber security publicity week",
             "NodeId":11227931,
             "PubTime":"2021-10-11 08:50:18",
             "LinkUrl":"http://www.news.cn/politics/ldzt/wlaqxcz2021/2021fld.htm",
             "Abstract":null,
             "keyword":null,
             "Editor":null,
             "Author":"Yu Ziru",
             "IsLink":1,
             "SourceName":null,
             "PicLinks":"http://www.news.cn/legal/titlepic/112794/1127940962_1633940116800_title0h.jpg",
             "IsMoreImg":0,
             "imgarray":[

             ],
             "SubTitle":null,
             "Attr":63,
             "m4v":null,
             "tarray":[

             ],
             "uarray":[

             ],
             "allPics":[
                 "http://www.news.cn/legal/titlepic/112794/1127940962_1633940116800_title0h.jpg"
             ],
             "IntroTitle":null,
             "Ext1":null,
             "Ext2":null,
             "Ext3":null,
             "Ext4":null,
             "Ext5":null,
             "Ext6":null,
             "Ext7":null,
             "Ext8":null,
             "Ext9":null,
             "Ext10":null
         },
         {
             "DocID":1127352605,
             "Title":"2021 The Sixth National Security Education Day",
             "NodeId":11227931,
             "PubTime":"2021-04-20 16:30:09",
             "LinkUrl":"http://www.xinhuanet.com/legal/fzldzt/quanmgjaqjy6.htm",
             "Abstract":null,
             "keyword":null,
             "Editor":null,
             "Author":"Yu Ziru",
             "IsLink":1,
             "SourceName":null,
             "PicLinks":"http://www.xinhuanet.com/legal/titlepic/112735/1127352605_1618907407390_title0h.jpg",
             "IsMoreImg":0,
             "imgarray":[

             ],
             "SubTitle":null,
             "Attr":63,
             "m4v":null,
             "tarray":[

             ],
             "uarray":[

             ],
             "allPics":[
                 "http://www.xinhuanet.com/legal/titlepic/112735/1127352605_1618907407390_title0h.jpg"
             ],
             "IntroTitle":null,
             "Ext1":null,
             "Ext2":null,
             "Ext3":null,
             "Ext4":null,
             "Ext5":null,
             "Ext6":null,
             "Ext7":null,
             "Ext8":null,
             "Ext9":null,
             "Ext10":null
         },
         {
             "DocID":1127026351,
             "Title":"Focus on the Fifth Plenary Session of the 19th Central Commission for Discipline Inspection",
             "NodeId":11227931,
             "PubTime":"2021-01-26 11:23:06",
             "LinkUrl":"http://www.xinhuanet.com/legal/fzldzt/2021zjw5qh.htm",
             "Abstract":null,
             "keyword":null,
             "Editor":null,
             "Author":"Lu Junyu",
             "IsLink":1,
             "SourceName":null,
             "PicLinks":"http://www.xinhuanet.com/legal/titlepic/112702/1127026351_1611631384945_title0h.png",
             "IsMoreImg":0,
             "imgarray":[

             ],
             "SubTitle":null,
             "Attr":63,
             "m4v":null,
             "tarray":[

             ],
             "uarray":[

             ],
             "allPics":[
                 "http://www.xinhuanet.com/legal/titlepic/112702/1127026351_1611631384945_title0h.png"
             ],
             "IntroTitle":null,
             "Ext1":null,
             "Ext2":null,
             "Ext3":null,
             "Ext4":null,
             "Ext5":null,
             "Ext6":null,
             "Ext7":null,
             "Ext8":null,
             "Ext9":null,
             "Ext10":null
         }
     ]
 },
 "totalnum":23
}
  • Convert the JSON string to a JSONObject.
    Note that the structure has to be unwrapped one layer at a time.

    JSONObject jsonObject = JSON.parseObject(jsonStr);
    
  • Get the value whose key is data. That value is enclosed in {} braces, so it is itself a JSONObject, and we read it with getJSONObject("data").

    JSONObject data = jsonObject.getJSONObject("data");
    
  • From data (the new JSONObject), get the value whose key is list. That value is enclosed in [] brackets, so it is a JSONArray, and we read it with getJSONArray("list").

    JSONArray list = data.getJSONArray("list");
    
  • With the JSONArray in hand, traverse it

    Iterator<Object> iterator = list.iterator();
    while (iterator.hasNext()){
        //Each element of list is enclosed in {}, so each element is a JSONObject.
        JSONObject next = (JSONObject)iterator.next();
        //Title, PubTime and LinkUrl are plain strings, so nothing is left to unwrap; read them with get(key).
        //articles is the List<Article> declared in the complete code below.
        articles.add(new Article((String) next.get("Title"),(String) next.get("PubTime"),null));
        System.out.println(next.get("Title"));
        System.out.println(next.get("PubTime"));
        System.out.println(next.get("LinkUrl"));
    }
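The loop above appends to `articles` and constructs `Article` objects, but the post does not show the `Article` class. The following is a sketch of what it might look like. The field names are guesses, in particular the third constructor argument, which the post always passes as `null`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for com.lihua.crawlingfzzt.pojo.Article, which the post does not show.
class Article {
    private final String title;
    private final String pubTime;
    private final String linkUrl; // guessed meaning of the third argument

    Article(String title, String pubTime, String linkUrl) {
        this.title = title;
        this.pubTime = pubTime;
        this.linkUrl = linkUrl;
    }

    String getTitle() { return title; }
    String getPubTime() { return pubTime; }
    String getLinkUrl() { return linkUrl; }

    @Override
    public String toString() {
        return title + " (" + pubTime + ")";
    }
}

class ArticleDemo {
    public static void main(String[] args) {
        // The list that the loop appends to
        List<Article> articles = new ArrayList<>();
        articles.add(new Article("2021 National cyber security publicity week",
                "2021-10-11 08:50:18", null));
        System.out.println(articles);
    }
}
```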
    
  • Complete code

    package com.lihua.crawlingfzzt.service.impl;
    
    import com.alibaba.fastjson.JSON;
    import com.alibaba.fastjson.JSONArray;
    import com.alibaba.fastjson.JSONObject;
    import com.lihua.crawlingfzzt.pojo.Article;
    import com.lihua.crawlingfzzt.service.CrawlingFzztService;
    import org.apache.http.HttpEntity;
    import org.apache.http.HttpStatus;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.client.utils.HttpClientUtils;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;
    import org.springframework.stereotype.Service;
    
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    
    /**
     * @author 15594
     */
    @Service
    public class CrawlingFzztServiceImpl implements CrawlingFzztService {
    
        @Override
        public List<Article> getArticles(int pageNum, int cnt) {
    
    
            //List that collects the articles
            List<Article> articles = new ArrayList<>();
            //Request address
            String url = "http://da.wa.news.cn/nodeart/page?nid=11227931&pgnum="+pageNum+"&cnt="+cnt;
            //String url = "http://da.wa.news.cn/nodeart/page?nid=11227931&pgnum=2&cnt=10";
            //Create the httpclient, which is equivalent to opening a browser
            CloseableHttpClient httpClient = HttpClients.createDefault();
            CloseableHttpResponse response = null;
            //Create the GET request, which is equivalent to typing the address into the browser address bar
            HttpGet request = new HttpGet(url);
            //Set a browser-like User-Agent header so the request looks like it comes from a browser
            request.setHeader("User-Agent","Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36");
            //Optional IP proxy
            //HttpHost proxy = new HttpHost("112.85.168.223", 9999);
            //RequestConfig config = RequestConfig.custom().setProxy(proxy).build();
            //request.setConfig(config);
            try {
                //Execute the GET request, which is equivalent to pressing Enter in the address bar
                response = httpClient.execute(request);
    
                //Process the response only if the status is 200
                if(response.getStatusLine().getStatusCode() == HttpStatus.SC_OK) {
                    //Read the response body; despite the variable name, it holds the JSON string
                    HttpEntity httpEntity = response.getEntity();
                    String html = EntityUtils.toString(httpEntity, "utf-8");
                    System.out.println(html);
                    JSONObject jsonObject = JSON.parseObject(html);
                    JSONObject data = jsonObject.getJSONObject("data");
                    JSONArray list = data.getJSONArray("list");
                    Iterator<Object> iterator = list.iterator();
                    while (iterator.hasNext()){
                        JSONObject next = (JSONObject)iterator.next();
    
                        articles.add(new Article((String) next.get("Title"),(String) next.get("PubTime"),null));
                        System.out.println(next.get("Title"));
                        System.out.println(next.get("PubTime"));
                        System.out.println(next.get("LinkUrl"));
                    }
    
                } else {
                    System.out.println("Return status is not 200");
                    System.out.println(EntityUtils.toString(response.getEntity(), "utf-8"));
                }
            } catch (ClientProtocolException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                //Close the response and the client
                HttpClientUtils.closeQuietly(response);
                HttpClientUtils.closeQuietly(httpClient);
            }
            return articles;
        }
    }
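`getArticles` fetches one page per call. To walk the whole list, the number of pages has to be known. The sample response carries `"totalnum":23`; assuming this is the total number of articles (the post does not confirm it), the page count is the total divided by `cnt`, rounded up. `PageMath` is a hypothetical name, and numbering pages from 1 follows the `pgnum=1` URL shown earlier.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post.
class PageMath {
    private static final String BASE = "http://da.wa.news.cn/nodeart/page?nid=11227931";

    /** Pages needed for totalNum items at cnt items per page, rounded up. */
    static int pageCount(int totalNum, int cnt) {
        if (cnt <= 0) {
            throw new IllegalArgumentException("cnt must be positive");
        }
        return (totalNum + cnt - 1) / cnt;
    }

    /** Builds one request URL per page, with pages numbered from 1. */
    static List<String> pageUrls(int totalNum, int cnt) {
        List<String> urls = new ArrayList<>();
        int pages = pageCount(totalNum, cnt);
        for (int page = 1; page <= pages; page++) {
            urls.add(BASE + "&pgnum=" + page + "&cnt=" + cnt);
        }
        return urls;
    }

    public static void main(String[] args) {
        // 23 items at 10 per page need 3 requests
        pageUrls(23, 10).forEach(System.out::println);
    }
}
```

Each generated URL has the same shape as the one `getArticles` builds from `pageNum` and `cnt`.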
    
    

3, Reference

https://blog.csdn.net/gududedabai/article/details/78637186

Posted by Soogn on Mon, 11 Oct 2021 11:33:17 -0700