[Jsoup in action] Simulated Browser: A Jsoup Utility Class and a Retry Strategy for Failed Requests (3)

Keywords: github Maven Mac OS X

Other chapters in this series on getting a Document object from a URL:

Simulated Browser: Getting Web Page Data Simply (1)
Simulated Browser: Simulating a Login with POST to Get Web Page Data (2)
Simulated Browser: A Jsoup Utility Class and a Retry Strategy for Failed Requests (3)

Utility class: as the name implies, a utility class is a tool for other code to use. It is a stateless class that provides only static methods (its name usually ends in Helper or Util(s)). It carries no business logic of its own and is never instantiated in a project.
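To make that description concrete, here is a minimal sketch of the shape such a class usually takes. The class name UrlUtil and its method hasHttpScheme are hypothetical and are not part of this article's project; they only illustrate the pattern of a final class with a private constructor and static methods.

```java
import java.util.Locale;

/** Hypothetical example of a stateless utility class; not taken from the article's project. */
final class UrlUtil {

    // Private constructor: the class only hosts static helpers and is never instantiated.
    private UrlUtil() {
        throw new AssertionError("UrlUtil is not instantiable");
    }

    /** Returns true if the given string starts with http:// or https:// (case-insensitive). */
    static boolean hasHttpScheme(String url) {
        if (url == null) {
            return false;
        }
        String lower = url.trim().toLowerCase(Locale.ROOT);
        return lower.startsWith("http://") || lower.startsWith("https://");
    }

    public static void main(String[] args) {
        System.out.println(hasHttpScheme("https://example.com"));
        System.out.println(hasHttpScheme("ftp://example.com"));
    }
}
```

Callers use it as UrlUtil.hasHttpScheme(url) without creating an object, and because the class holds no fields it can be called safely from several crawler threads.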
In a real crawler, every visit to a different website, or to a different address on the same website, creates a new Jsoup Connection, which leads to a lot of repeated logic in the code.
Encapsulating the creation of the connection in a utility class solves this problem well.
Crawlers generally run unattended, and many websites have anti-crawler strategies, so a crawler is likely to start failing after it has run normally for a while. The following utility class provides a simple retry strategy for failed requests. In a real project the retry strategy can be richer, for example by adding dynamic proxies, keeping track of request timing, and so on.
import java.io.PrintWriter;
import java.io.StringWriter;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

/**
 * Utility class that fetches and parses a website's HTML with <code>Jsoup</code>,
 * simulating a browser.
 *
 * @since crawler(datasnatch) version(1.0)
 * @author bluetata / https://github.com/bluetata
 * @version 1.0
 */
public final class JsoupUtil {

    // Utility class: static methods only, never instantiated.
    private JsoupUtil() {
    }

    /**
     * Simulates a browser and returns the source of the requested page as a Document.
     *
     * @param url the page to fetch; must be an absolute http:// or https:// URL
     * @return the parsed page, or null if every attempt failed
     * @throws Exception if the retry settings cannot be read or parsed, or if the
     *         thread is interrupted while waiting between attempts
     */
    public static Document getDocument(String url) throws Exception {

        Document doc = null;

        // Read the maximum retry count from the properties file (com-constants.properties).
        int maxRetry = Integer.parseInt(PropertyReader.getProperties(SystemConstants.COM_CONSTANTS)
                .getProperty(UtilsConstants.MAX_RETRY_COUNT));
        // Read the sleep time between attempts (in milliseconds) from the same properties file.
        int sleepTime = Integer.parseInt(PropertyReader.getProperties(SystemConstants.COM_CONSTANTS)
                .getProperty(UtilsConstants.SLEEP_TIME_COUNT));

        // If an exception occurs, the loop continues with the next attempt.
        for (int j = 1; j <= maxRetry; j++) {

            try {
                if (j != 1) {
                    Thread.sleep(sleepTime);
                }
                doc = Jsoup.connect(url).timeout(10 * 1000)
                        .userAgent(
                                // TODO: load the userAgent from a properties file instead of hard-coding it.
                                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.122 Safari/534.30")
                        .get();

                // Success: leave the retry loop.
                break;

            } catch (InterruptedException ie) {
                // Restore the interrupt flag and stop retrying instead of swallowing the interrupt.
                Thread.currentThread().interrupt();
                throw ie;

            } catch (Exception ex) {
                // Do not rethrow here: the loop moves on to the next attempt.

                // Capture this attempt's stack trace as a string. The writers are created
                // per failure so that each log entry holds exactly one trace.
                StringWriter strWriter = new StringWriter();
                PrintWriter prtWriter = new PrintWriter(strWriter);
                ex.printStackTrace(prtWriter);
                prtWriter.flush();

                // Write the stack trace to the info log.
                Log4jUtil.info(strWriter.toString());
            }
        }
        return doc;
    }
}
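
As noted above, a project's retry strategy can be richer than a fixed sleep. One common option is exponential backoff with random jitter, so that repeated failures wait progressively longer and several crawler threads do not retry in lockstep. The sketch below is only an illustration under stated assumptions: RetryUtil, retry and backoffMillis are hypothetical names that are not part of this article's project, and it uses only the JDK so it can be tried without Jsoup on the classpath.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical generic retry helper; the names here are not from the article's project. */
final class RetryUtil {

    private RetryUtil() {
    }

    /**
     * Runs the task up to maxAttempts times. After each failed attempt except the last,
     * it sleeps for backoffMillis(attempt, baseDelayMillis) before trying again.
     * If every attempt fails, the last exception is rethrown to the caller.
     */
    static <T> T retry(Callable<T> task, int maxAttempts, long baseDelayMillis) throws Exception {
        if (maxAttempts < 1 || baseDelayMillis < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and baseDelayMillis >= 0");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (InterruptedException ie) {
                // Keep the interrupt visible to the caller and stop retrying.
                Thread.currentThread().interrupt();
                throw ie;
            } catch (Exception ex) {
                last = ex;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis(attempt, baseDelayMillis));
                }
            }
        }
        throw last;
    }

    /** Delay after the given failed attempt: base * 2^(attempt - 1) plus a jitter in [0, base]. */
    static long backoffMillis(int attempt, long baseDelayMillis) {
        // Cap the exponent so the delay cannot grow without bound.
        long exponential = baseDelayMillis << Math.min(attempt - 1, 10);
        long jitter = ThreadLocalRandom.current().nextLong(baseDelayMillis + 1);
        return exponential + jitter;
    }
}
```

With Jsoup available, a call could look like RetryUtil.retry(() -> Jsoup.connect(url).get(), 3, 1000L). Unlike getDocument above, this sketch rethrows the last exception instead of returning null, which is a design choice: the caller is forced to handle a page that could not be fetched.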

Jsoup learning and discussion QQ group: 50695115

Jsoup crawler code examples and the source code used in this blog can be downloaded from: https://github.com/bluetata/crawler-jsoup-maven

For more articles on Jsoup, please refer to the column: [Jsoup in action]


Note: This article was originally published on blog.csdn.net by 'bluetata'. Please credit the source when reproducing it.


Posted by FVxSF on Mon, 31 Dec 2018 21:00:08 -0800