Java Crawler Series I - Introduction to Crawlers

Keywords: Java Lombok Maven Spring

1. Overview

What does the Java crawler series cover?

  1. Introduction to the Java crawler framework WebMagic
  2. Using WebMagic to crawl the movie resources on http://ady01.com (action movie list page, movie download addresses, etc.)
  3. Using WebMagic to crawl Geek Time course resources (article series and video series)

This article covers the following:

  1. Recommend a useful Java crawler framework
  2. Introduction to the Java crawler framework WebMagic
  3. Using WebMagic to crawl action movie list information

2. A useful crawler framework in Java

How do we judge whether a framework is good?

  1. It is easy to learn and use, with plenty of fairly complete learning material available online.
  2. It has many users, so most of its pitfalls have already been found and fixed by others, which makes it more comfortable to use.
  3. It is updated frequently and has an active community, so you get new features quickly and can talk to the author.
  4. It is stable and easy to extend.

By these criteria, we recommend a very useful Java crawler framework: WebMagic.

3. Introduction to WebMagic

  • WebMagic is a simple and flexible Java crawler framework. With WebMagic, you can quickly develop an efficient, easy-to-maintain crawler.
  • WebMagic website: http://webmagic.io/
  • WebMagic Chinese documentation: http://webmagic.io/docs/zh/

4. Use WebMagic to crawl action movie lists

Use WebMagic to crawl the film list information from the Love Movies site.

Sample source address

1. Create a new Spring Boot project named java-pachong

2. Add the Maven dependencies

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
    </dependency>

    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>

    <!-- webmagic start -->
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.7.3</version>
        <exclusions>
            <exclusion>
                <artifactId>fastjson</artifactId>
                <groupId>com.alibaba</groupId>
            </exclusion>
            <exclusion>
                <artifactId>commons-io</artifactId>
                <groupId>commons-io</groupId>
            </exclusion>
            <exclusion>
                <artifactId>log4j</artifactId>
                <groupId>log4j</groupId>
            </exclusion>
            <exclusion>
                <artifactId>slf4j-log4j12</artifactId>
                <groupId>org.slf4j</groupId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-selenium</artifactId>
        <version>0.7.3</version>
    </dependency>
    <dependency>
        <groupId>net.minidev</groupId>
        <artifactId>json-smart</artifactId>
        <version>2.2.1</version>
    </dependency>
    <!-- webmagic end -->
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.49</version>
    </dependency>
    <dependency>
        <groupId>commons-lang</groupId>
        <artifactId>commons-lang</artifactId>
        <version>2.6</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.6</version>
    </dependency>
    <dependency>
        <groupId>commons-codec</groupId>
        <artifactId>commons-codec</artifactId>
        <version>1.11</version>
    </dependency>
    <dependency>
        <groupId>commons-collections</groupId>
        <artifactId>commons-collections</artifactId>
        <version>3.2.2</version>
    </dependency>
</dependencies>

3. Write code to capture movie data

  • Open the Love Movies action film list in Chrome.

  • Press F12 to open the developer tools. The data on the list page is loaded through an ajax request, whose address is:

    http://m.ady01.com/rs/film/listJson/1/2?_=1555726508180

  • Write the crawling code:

package com.ady01.demo1;

import lombok.extern.slf4j.Slf4j;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

/**
 * <b>description</b>: The first crawler example, crawls action movie list information <br>
 * <b>time</b>: 2019/4/20 10:58 <br>
 * <b>author</b>: ready likun_557@163.com
 */
@Slf4j
public class Ady01comPageProcessor implements PageProcessor {
    @Override
    public void process(Page page) {
        // process() is called once for each successfully downloaded page
        log.info("Crawl succeeded!");
        log.info("Crawled content: {}", page.getRawText());
    }

    @Override
    public Site getSite() {
        return Site.me().setSleepTime(1000).setRetryTimes(3);
    }

    public static void main(String[] args) {
        String url = "http://m.ady01.com/rs/film/listJson/1/2?_=1555726508180";
        Spider.create(new Ady01comPageProcessor()).addUrl(url).thread(1).run();
    }
}
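
The request address captured earlier ends in `_=1555726508180`, which looks like a millisecond timestamp of the kind browsers append to defeat caching. That reading is an assumption, and so is the meaning of the two numeric path segments, which this article does not explain. Under those assumptions, a small hypothetical helper (`FilmListUrl`, not part of WebMagic or the sample project) could rebuild the address with a fresh timestamp instead of hard-coding it:

```java
import java.net.URI;

/** Hypothetical helper: rebuilds the list request address seen in the browser's network tab. */
final class FilmListUrl {
    private static final String BASE = "http://m.ady01.com/rs/film/listJson";

    private FilmListUrl() {
    }

    /**
     * first and second mirror the two numeric path segments ("1" and "2") of the
     * captured address. Their meaning is not documented here, so the names are placeholders.
     */
    static String build(int first, int second, long timestampMillis) {
        String url = BASE + "/" + first + "/" + second + "?_=" + timestampMillis;
        // URI.create only validates the syntax; it throws IllegalArgumentException if malformed
        return URI.create(url).toString();
    }

    /** Same as above, using the current time as the assumed cache-busting value. */
    static String build(int first, int second) {
        return build(first, second, System.currentTimeMillis());
    }

    public static void main(String[] args) {
        // Reproduces the exact address used in the example above
        System.out.println(build(1, 2, 1555726508180L));
        // Same path with a fresh timestamp
        System.out.println(build(1, 2));
    }
}
```

The returned string could be passed to `addUrl(...)` in place of the hard-coded one.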

4. Run the crawler code

Run the main method in Ady01comPageProcessor, and the results are as follows:

5. Summary

  1. This article uses one example to show how little code WebMagic needs to fetch data. As the code shows, WebMagic hides the complex plumbing, so we only have to write the business logic.
  2. This article does not explain in detail how WebMagic is used, mainly because WebMagic already provides very complete learning documentation; see the WebMagic Chinese documentation. It is also worth reading the WebMagic source code, which is very helpful when writing crawlers.
  3. Tomorrow we will crawl the details page of each action movie and collect the movie's download address from it.
  4. To import the sample code into IDEA, you need Maven and Lombok support installed.
  5. For more technical articles, follow the public account: Javacode 2018
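
To give a rough sense of the plumbing mentioned in point 1, here is a minimal sketch of the retry-and-pause pattern that settings such as `setRetryTimes(3)` and `setSleepTime(1000)` spare us from writing by hand. This is not WebMagic's implementation, and the exact semantics of those settings are defined by the WebMagic documentation; `RetryingFetcher` and its parameters are hypothetical names used only for illustration.

```java
import java.util.concurrent.Callable;

/** Hypothetical illustration of a retry loop; not taken from WebMagic. */
final class RetryingFetcher {
    private RetryingFetcher() {
    }

    /**
     * Runs the task once, then retries up to retryTimes more times if it throws,
     * sleeping sleepMillis between attempts. Rethrows the last failure if all attempts fail.
     */
    static <T> T fetch(Callable<T> task, int retryTimes, long sleepMillis) throws Exception {
        if (retryTimes < 0) {
            throw new IllegalArgumentException("retryTimes must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < retryTimes) {
                    Thread.sleep(sleepMillis); // pause before the next attempt
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated download that fails twice before succeeding
        String body = fetch(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated failure " + calls[0]);
            }
            return "{\"ok\":true}";
        }, 3, 10);
        System.out.println(body + " after " + calls[0] + " attempts");
    }
}
```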

Posted by ldsmike88 on Wed, 08 May 2019 21:42:39 -0700