Java crawler technology: HttpClient / Jsoup / WebMagic

Keywords: Java crawler

1. Course plan

  1. Entry program
  2. Introduction to web crawler
  3. HttpClient grabs data
  4. Jsoup parses data
  5. Crawler case

2. Web crawler

A web crawler is a program that automatically crawls information from the World Wide Web according to certain rules.

2.1. Entry program

2.1.1. Environmental preparation

  1. JDK1.8
  2. IntelliJ IDEA
  3. Maven (the version bundled with IDEA)

2.1.2. Add dependencies

Create a Maven project and add the following dependencies to pom.xml:

<dependencies>
    <!-- HttpClient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.3</version>
    </dependency>

    <!-- Logging -->
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.25</version>
    </dependency>
</dependencies>

2.1.3. Add log4j.properties

log4j.rootLogger=DEBUG,A1
log4j.logger.cn.itcast = DEBUG

log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%-d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c]-[%p] %m%n
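
To verify that the configuration above is picked up (slf4j-log4j12 binds SLF4J to log4j, which reads log4j.properties from the classpath), a small test class such as the following can be used; the class name LoggerTest is only an illustration, not part of the course code:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggerTest {
    //slf4j-log4j12 binds SLF4J to log4j, so this logger is configured by log4j.properties
    private static final Logger logger = LoggerFactory.getLogger(LoggerTest.class);

    public static void main(String[] args) {
        //Should print a DEBUG line to the console via appender A1
        logger.debug("log4j configuration test");
    }
}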

2.1.4. Code writing

  Write the simplest crawler and crawl the itcast home page: http://www.itcast.cn/

public static void main(String[] args) throws Exception {
    //Create HttpClient object
    CloseableHttpClient httpClient = HttpClients.createDefault();

    //Create HttpGet request
    HttpGet httpGet = new HttpGet("http://www.itcast.cn/");

    //Initiate the request
    CloseableHttpResponse response = httpClient.execute(httpGet);

    //If the status code is 200, the request succeeded; print the page content
    if (response.getStatusLine().getStatusCode() == 200) {
        String content = EntityUtils.toString(response.getEntity(), "UTF-8");
        System.out.println(content);
    }

    //Release resources
    response.close();
    httpClient.close();
}

3. Web crawler

3.1. Introduction to web crawler

In the era of big data, information collection is an important task, and the amount of data on the Internet is massive. Relying on manpower alone to collect information is not only inefficient and cumbersome, it also increases the cost of collection. How to automatically and efficiently obtain the information we are interested in on the Internet and put it to use is an important problem, and crawler technology was born to solve it.

A web crawler, also known as a web robot, can replace people in automatically collecting and organizing data from the Internet. It is a program or script that automatically captures information from the World Wide Web according to certain rules, and it can collect the content of every page it is able to access in order to obtain the relevant data.

Functionally, a crawler is generally divided into three parts: data acquisition, processing, and storage. Starting from the URLs of one or more initial web pages, the crawler downloads those pages and, while crawling, continuously extracts new URLs from the current page and puts them into a queue, until certain stop conditions of the system are met (see the sketch below).
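
As a rough illustration of this queue-driven process, the following sketch crawls a handful of pages starting from a seed URL. It assumes only the HttpClient dependency added above and uses a crude regular expression instead of a real HTML parser; the class name, page limit, and regex are illustrative assumptions rather than course code:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueueCrawlerSketch {
    //Crude link pattern used for illustration only; a real crawler would use an HTML parser
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        Queue<String> queue = new LinkedList<>();   //URLs waiting to be crawled
        Set<String> visited = new HashSet<>();      //URLs already crawled
        queue.add("http://www.itcast.cn/");         //seed URL

        CloseableHttpClient httpClient = HttpClients.createDefault();
        int maxPages = 10;                          //simple stop condition

        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue;                           //skip URLs we have already seen
            }
            try (CloseableHttpResponse response = httpClient.execute(new HttpGet(url))) {
                if (response.getStatusLine().getStatusCode() != 200) {
                    continue;
                }
                //Data acquisition: download the page content
                String html = EntityUtils.toString(response.getEntity(), "UTF-8");
                //Processing and storage would happen here; this sketch only extracts new URLs
                Matcher m = LINK.matcher(html);
                while (m.find()) {
                    queue.add(m.group(1));          //put newly discovered URLs into the queue
                }
            } catch (Exception e) {
                e.printStackTrace();                //skip pages that fail to download
            }
        }
        httpClient.close();
        System.out.println("Crawled " + visited.size() + " pages");
    }
}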

3.2. Why learn web crawler

We now have a preliminary understanding of web crawlers, but why should we learn about them? Only by clearly knowing our learning purpose can we learn this knowledge well. The common reasons for learning crawlers are summarized here:

1. A search engine can be implemented.

After we learn how to write a crawler, we can use it to automatically collect information from the Internet and store or process it accordingly. When we need to look something up, we only need to search within the collected information; in other words, we have implemented a private search engine.

2. In the era of big data, we can obtain more data sources.

When conducting big data analysis or data mining, we need data sources to work with. We can obtain data from websites that provide statistics, or from literature and internal materials, but these methods sometimes fail to meet our needs, and manually hunting for such data on the Internet takes too much effort. In this case, we can use crawler technology to automatically obtain the data we are interested in from the Internet, crawl it back as our data source, perform deeper analysis, and obtain more valuable information.

3. It enables better search engine optimization (SEO).

For many SEO practitioners, doing the job well requires a very clear understanding of how search engines work, and in particular of how the search engine's crawler works. Learning about crawlers provides a deep understanding of how search engine crawlers operate, so that when optimizing for search engines you know both yourself and your opponent and can win every battle.

4. HttpClient

A web crawler uses a program to access resources on the network on our behalf. We normally use the HTTP protocol to access web pages on the Internet; a web crawler likewise needs a program that accesses web pages over the same HTTP protocol.

Here we use HttpClient, a Java HTTP client technology, to fetch web page data.

4.1. GET request

public static void main(String[] args) throws IOException {
    //Create HttpClient object
    CloseableHttpClient httpClient = HttpClients.createDefault();

    //Create HttpGet request
    HttpGet httpGet = new HttpGet("http://www.itcast.cn/");

    CloseableHttpResponse response = null;
    try {
        //Initiate a request using HttpClient
        response = httpClient.execute(httpGet);

        //Judge whether the response status code is 200
        if (response.getStatusLine().getStatusCode() == 200) {
            //If it is 200, the request is successful and the returned data is obtained
            String content = EntityUtils.toString(response.getEntity(), "UTF-8");
            //Print the page content
            System.out.println(content);
        }

    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        //Release connection
        if (response != null) {
            try {
                response.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        httpClient.close();
    }
}
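
As a side note, on Java 7 and later the same GET request can be written with try-with-resources, since both CloseableHttpClient and CloseableHttpResponse implement Closeable, so the resources are released automatically. A minimal sketch, with an illustrative class name:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpGetWithTryWithResources {
    public static void main(String[] args) throws Exception {
        //try-with-resources closes the response and the client automatically
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(new HttpGet("http://www.itcast.cn/"))) {
            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                System.out.println(content);
            }
        }
    }
}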
