Java crawler: implementing information capture

Keywords: Java, network, JUnit, snapshot

For reprint, please indicate the source: http://blog.csdn.net/lmj623565791/article/details/23272657

Today the company needed to grab some data from the results of a query on a specified website, so I took some time to write a demo for demonstration.

The idea is simple: access the link from Java, get the HTML as a string, then parse out the links and whatever other data you need.

For page analysis I use Jsoup, which is convenient and simple; a single chained statement shows how it is used:

    Document doc = Jsoup.connect("http://www.oschina.net/")
            .data("query", "Java")    // request parameter
            .userAgent("I'm jsoup")   // set the User-Agent
            .cookie("auth", "token")  // set a cookie
            .timeout(3000)            // connection timeout in milliseconds
            .post();                  // access the URL using the POST method
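
Once you have the Document, extracting data is just as terse. As a minimal sketch (assuming `doc` is the Document fetched above, and using Jsoup's Element/Elements types), pulling out every link looks like this:

    // assuming `doc` is the Document fetched above
    Elements links = doc.select("a[href]");   // every anchor that has an href
    for (Element link : links)
    {
        // print the link text and the absolute URL
        System.out.println(link.text() + " -> " + link.attr("abs:href"));
    }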
The whole implementation process is described below:

1. Analyze the page that needs to be parsed:

Website: http://www1.sxcredit.gov.cn/public/infocomquery.do?method=publicIndexQuery

First, run a query on this page and observe the URL, the parameters, and the method of the request.

Here we use Chrome's built-in developer tools (shortcut F12) to capture the request; the query's URL, method, and parameters can all be read from it (screenshots omitted). Knowing how the query is made, I can start on the code. To make it reusable and extensible, I define several classes:

1. Rule.java specifies the query URL, method, parameters, and so on.

    package com.zhy.spider.rule;

    /**
     * Rule class
     *
     * @author zhy
     */
    public class Rule
    {
        public final static int GET = 0;
        public final static int POST = 1;

        public final static int CLASS = 0;
        public final static int ID = 1;
        public final static int SELECTION = 2;

        /**
         * Link
         */
        private String url;

        /**
         * Parameter names
         */
        private String[] params;

        /**
         * Values corresponding to the parameters
         */
        private String[] values;

        /**
         * Tag used for the first pass of filtering the returned HTML
         */
        private String resultTagName;

        /**
         * CLASS / ID / SELECTION
         * How resultTagName is interpreted; defaults to ID
         */
        private int type = ID;

        /**
         * GET / POST
         * Type of request, defaults to GET
         */
        private int requestMethod = GET;

        public Rule()
        {
        }

        public Rule(String url, String[] params, String[] values,
                String resultTagName, int type, int requestMethod)
        {
            super();
            this.url = url;
            this.params = params;
            this.values = values;
            this.resultTagName = resultTagName;
            this.type = type;
            this.requestMethod = requestMethod;
        }

        public String getUrl()
        {
            return url;
        }

        public void setUrl(String url)
        {
            this.url = url;
        }

        public String[] getParams()
        {
            return params;
        }

        public void setParams(String[] params)
        {
            this.params = params;
        }

        public String[] getValues()
        {
            return values;
        }

        public void setValues(String[] values)
        {
            this.values = values;
        }

        public String getResultTagName()
        {
            return resultTagName;
        }

        public void setResultTagName(String resultTagName)
        {
            this.resultTagName = resultTagName;
        }

        public int getType()
        {
            return type;
        }

        public void setType(int type)
        {
            this.type = type;
        }

        public int getRequestMethod()
        {
            return requestMethod;
        }

        public void setRequestMethod(int requestMethod)
        {
            this.requestMethod = requestMethod;
        }
    }

To put it simply: the Rule class holds all the information our query process needs, which makes the code easy to extend and reuse; we can't afford to write a separate set of code for every website we need to crawl.
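
For example, a rule for a hypothetical site whose search form posts a single keyword could be built like this (the URL, parameter name, and CSS class here are made up for illustration; the real tests appear later):

    // hypothetical values; substitute the real URL, parameter names, and result container
    Rule rule = new Rule("http://www.example.com/search",
            new String[] { "keyword" }, new String[] { "jsoup" },
            "result-list", Rule.CLASS, Rule.POST);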


2. The data object; at present we only need links: LinkTypeData.java

    package com.zhy.spider.bean;

    public class LinkTypeData
    {
        private int id;
        /**
         * Link address
         */
        private String linkHref;
        /**
         * Title of the link
         */
        private String linkText;
        /**
         * Abstract
         */
        private String summary;
        /**
         * Content
         */
        private String content;

        public int getId()
        {
            return id;
        }

        public void setId(int id)
        {
            this.id = id;
        }

        public String getLinkHref()
        {
            return linkHref;
        }

        public void setLinkHref(String linkHref)
        {
            this.linkHref = linkHref;
        }

        public String getLinkText()
        {
            return linkText;
        }

        public void setLinkText(String linkText)
        {
            this.linkText = linkText;
        }

        public String getSummary()
        {
            return summary;
        }

        public void setSummary(String summary)
        {
            this.summary = summary;
        }

        public String getContent()
        {
            return content;
        }

        public void setContent(String content)
        {
            this.content = content;
        }
    }

3. Core query class: ExtractService.java

    package com.zhy.spider.core;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import com.zhy.spider.bean.LinkTypeData;
    import com.zhy.spider.rule.Rule;
    import com.zhy.spider.rule.RuleException;
    import com.zhy.spider.util.TextUtil;

    /**
     * @author zhy
     */
    public class ExtractService
    {
        /**
         * @param rule the query description
         * @return the links extracted from the result page
         */
        public static List<LinkTypeData> extract(Rule rule)
        {
            // Necessary validation of the rule
            validateRule(rule);

            List<LinkTypeData> datas = new ArrayList<LinkTypeData>();
            LinkTypeData data = null;
            try
            {
                /**
                 * Parse the rule
                 */
                String url = rule.getUrl();
                String[] params = rule.getParams();
                String[] values = rule.getValues();
                String resultTagName = rule.getResultTagName();
                int type = rule.getType();
                int requestType = rule.getRequestMethod();

                Connection conn = Jsoup.connect(url);
                // Set the query parameters
                if (params != null)
                {
                    for (int i = 0; i < params.length; i++)
                    {
                        conn.data(params[i], values[i]);
                    }
                }

                // Issue the request with the configured method
                Document doc = null;
                switch (requestType)
                {
                case Rule.GET:
                    doc = conn.timeout(100000).get();
                    break;
                case Rule.POST:
                    doc = conn.timeout(100000).post();
                    break;
                default:
                    doc = conn.timeout(100000).get();   // fall back to GET
                }

                // First-pass filtering of the returned document
                Elements results = new Elements();
                switch (type)
                {
                case Rule.CLASS:
                    results = doc.getElementsByClass(resultTagName);
                    break;
                case Rule.ID:
                    Element result = doc.getElementById(resultTagName);
                    if (result != null)   // guard against a missing id
                    {
                        results.add(result);
                    }
                    break;
                case Rule.SELECTION:
                    results = doc.select(resultTagName);
                    break;
                default:
                    // Default to the body tag when resultTagName is empty
                    if (TextUtil.isEmpty(resultTagName))
                    {
                        results = doc.getElementsByTag("body");
                    }
                }

                for (Element result : results)
                {
                    Elements links = result.getElementsByTag("a");

                    for (Element link : links)
                    {
                        // Necessary screening: collect the href and the link text
                        String linkHref = link.attr("href");
                        String linkText = link.text();

                        data = new LinkTypeData();
                        data.setLinkHref(linkHref);
                        data.setLinkText(linkText);

                        datas.add(data);
                    }
                }
            } catch (IOException e)
            {
                e.printStackTrace();
            }

            return datas;
        }

        /**
         * Necessary validation of the incoming rule
         */
        private static void validateRule(Rule rule)
        {
            String url = rule.getUrl();
            if (TextUtil.isEmpty(url))
            {
                throw new RuleException("url must not be empty!");
            }
            if (!url.startsWith("http://") && !url.startsWith("https://"))
            {
                throw new RuleException("url format is incorrect!");
            }

            if (rule.getParams() != null && rule.getValues() != null)
            {
                if (rule.getParams().length != rule.getValues().length)
                {
                    throw new RuleException("the numbers of parameter names and values do not match!");
                }
            }
        }
    }
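
ExtractService also references a small helper, TextUtil, that the post does not list. A minimal sketch is enough to make the code compile, assuming isEmpty is all it needs to provide:

    package com.zhy.spider.util;

    public class TextUtil
    {
        /**
         * true if the string is null or contains nothing but whitespace
         */
        public static boolean isEmpty(String str)
        {
            return str == null || str.trim().length() == 0;
        }
    }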

4. It also uses an exception class: RuleException.java

    package com.zhy.spider.rule;

    public class RuleException extends RuntimeException
    {
        public RuleException()
        {
            super();
        }

        public RuleException(String message, Throwable cause)
        {
            super(message, cause);
        }

        public RuleException(String message)
        {
            super(message);
        }

        public RuleException(Throwable cause)
        {
            super(cause);
        }
    }

5. Finally, the tests: here we test against two websites with different rules; see the code for the specifics.

    package com.zhy.spider.test;

    import java.util.List;

    import com.zhy.spider.bean.LinkTypeData;
    import com.zhy.spider.core.ExtractService;
    import com.zhy.spider.rule.Rule;

    public class Test
    {
        @org.junit.Test
        public void getDatasByClass()
        {
            Rule rule = new Rule(
                    "http://www1.sxcredit.gov.cn/public/infocomquery.do?method=publicIndexQuery",
                    new String[] { "query.enterprisename", "query.registationnumber" },
                    new String[] { "Xing Wan", "" },
                    "cont_right", Rule.CLASS, Rule.POST);
            List<LinkTypeData> extracts = ExtractService.extract(rule);
            printf(extracts);
        }

        @org.junit.Test
        public void getDatasByCssQuery()
        {
            Rule rule = new Rule("http://www.11315.com/search",
                    new String[] { "name" }, new String[] { "Xing Wan" },
                    "div.g-mn div.con-model", Rule.SELECTION, Rule.GET);
            List<LinkTypeData> extracts = ExtractService.extract(rule);
            printf(extracts);
        }

        public void printf(List<LinkTypeData> datas)
        {
            for (LinkTypeData data : datas)
            {
                System.out.println(data.getLinkText());
                System.out.println(data.getLinkHref());
                System.out.println("***********************************");
            }
        }
    }

Output results:

    Shenzhen Netxing Technology Co., Ltd.
    http://14603257.11315.com
    ***********************************
    Jingzhou Xingnet Highway Material Co., Ltd.
    http://05155980.11315.com
    ***********************************
    Quanxing Internet Bar in Xi'an City
    #
    ***********************************
    Zichang County New Network City
    #
    ***********************************
    Shaanxi Tongxing Network Information Co., Ltd. Third Branch
    #
    ***********************************
    Xi'an Happy Network Technology Co., Ltd.
    #
    ***********************************
    Shaanxi Tongxing Network Information Co., Ltd. Xi'an Branch
    #
    ***********************************

Finally, we test our code against Baidu News, to show that it is generic:

    /**
     * Using Baidu News: only the url, the keyword, and the request type are set
     */
    @org.junit.Test
    public void getDatasByCssQueryUserBaidu()
    {
        Rule rule = new Rule("http://news.baidu.com/ns",
                new String[] { "word" }, new String[] { "Alipay" },
                null, -1, Rule.GET);
        List<LinkTypeData> extracts = ExtractService.extract(rule);
        printf(extracts);
    }
We only set the link, the keyword, and the request type, without any specific filter conditions.

Result: admittedly there is some garbage in the data, but the data we need is certainly captured as well. We could switch to Rule.SELECTION and impose further restrictions on the filter conditions.

    Sort by time
    /ns?word=Alipay&ie=utf-8&bs=Alipay&sr=0&cl=2&rn=20&tn=news&ct=0&clk=sortbytime
    ***********************************
    x
    javascript:void(0)
    ***********************************
    Alipay will work together to build a safety fund; the first batch invested 40 million yuan
    http://finance.ifeng.com/a/20140409/12081871_0.shtml
    ***********************************
    7 pieces of the same news
    /ns?word=%E6%94%AF%E4%BB%98%E5%AE%9D+cont:2465146414%7C697779368%7C3832159921&same=7&cl=1&tn=news&rn=30&fm=sd
    ***********************************
    Baidu snapshot
    http://cache.baidu.com/c?m=9d78d513d9d437ab4f9e91697d1cc0161d4381132ba7d3020cd0870fd33a541b0120a1ac26510d19879e20345dfe1e4bea876d26605f75a09bbfd91782a6c1352f8a2432721a844a0fd019adc1452fc423875d9dad0ee7cdb168d5f18c&p=c96ec64ad48b2def49bd9b780b64&newp=c4769a4790934ea95ea28e281c4092695912c10e3dd796&user=baidu&fm=sc&query=%D6%A7%B8%B6%B1%A6&qid=a400f3660007a6c5&p1=1
    ***********************************
    OpenSSL vulnerabilities involve many websites; Alipay says there has been no data leakage
    http://tech.ifeng.com/internet/detail_2014_04/09/35590390_0.shtml
    ***********************************
    26 pieces of the same news
    /ns?word=%E6%94%AF%E4%BB%98%E5%AE%9D+cont:3869124100&same=26&cl=1&tn=news&rn=30&fm=sd
    ***********************************
    Baidu snapshot
    http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece7631050803743438014678387492ac3933fc239045c1c3aa5ec677e4742ce932b2152f4174bed843670340537b0efca8e57dfb08f29288f2c367117845615a71bb8cb31649b66cf04fdea44a7ecff25e5aac5a0da4323c044757e97f1fb4d7017dd1cf4&p=8b2a970d95df11a05aa4c32013&newp=9e39c64ad4dd50fa40bd9b7c5253d8304503c52251d5ce042acc&user=baidu&fm=sc&query=%D6%A7%B8%B6%B1%A6&qid=a400f3660007a6c5&p1=2
    ***********************************
    YAHOO Japan began supporting Alipay payment in June.
    http://www.techweb.com.cn/ucweb/news/id/2025843
    ***********************************
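
To screen out the garbage, we could pass Rule.SELECTION with a more specific CSS selector as the first-pass filter. As an illustrative sketch only (the selector below is hypothetical; the real class names on Baidu's result page would need to be checked in the developer tools):

    // hypothetical selector; inspect the actual result page to find the right one
    Rule rule = new Rule("http://news.baidu.com/ns",
            new String[] { "word" }, new String[] { "Alipay" },
            "div.result h3", Rule.SELECTION, Rule.GET);
    List<LinkTypeData> extracts = ExtractService.extract(rule);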
