Multi-topic crawler keyword switching based on the webmagic framework

Keywords: database, encoding, network, JSON

 

1. Background introduction

In a multi-topic crawler, we usually first analyze the URL characteristics of the target website (focusing on the list pages), then pre-set keywords according to the project requirements and use them to precisely control the crawled URLs, i.e. the seed URLs.

1.1. Analysis I

There are many URL scenarios that involve keywords, such as a specific section of a website, the AJAX requests sent by a particular module, and so on.

e.g. Suppose we need to crawl information about Hangzhou's tourist attractions from Tongcheng Travel (ly.com). The URL is: https://so.ly.com/hot?q=%E6%9D%AD%E5%B7%9E

Here, %E6%9D%AD%E5%B7%9E is the URL-encoded (percent-encoded UTF-8) form of the keyword "Hangzhou" (杭州).
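
This can be reproduced with the JDK's URLEncoder (the same call the koToUrl method below relies on); a minimal check:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        //Percent-encode the UTF-8 bytes of "杭州" (Hangzhou)
        String encoded = URLEncoder.encode("杭州", "UTF-8");
        System.out.println(encoded);                             //%E6%9D%AD%E5%B7%9E
        System.out.println("https://so.ly.com/hot?q=" + encoded);
    }
}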

e.g. For domestic tours from Hangzhou to Beijing on Tongcheng Travel, the URL is:

       https://gny.ly.com/list?src=%E6%9D%AD%E5%B7%9E&dest=%E5%8C%97%E4%BA%AC&prop=0

which is equivalent to https://gny.ly.com/list?src=Hangzhou&dest=Beijing&prop=0, and in practice the same as https://gny.ly.com/list?src=Hangzhou&dest=Beijing. Opening the URL above in a browser shows the first page of the topic list. Clicking "next page", we find that the URL of the second page is:

       https://gny.ly.com/list?src=%E6%9D%AD%E5%B7%9E&dest=%E5%8C%97%E4%BA%AC&start=2

The third page is:

       https://gny.ly.com/list?src=%E6%9D%AD%E5%B7%9E&dest=%E5%8C%97%E4%BA%AC&start=3

       ····

       https://gny.ly.com/list?src=%E6%9D%AD%E5%B7%9E&dest=%E5%8C%97%E4%BA%AC&start=n

From this, we can conclude that the URL splicing rule of this module is: https://gny.ly.com/list?src= + keyword 1 (URL-encoded) + "&dest=" + keyword 2 (URL-encoded) + "&start=" + index (page index).
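
A minimal sketch of that rule in Java (the helper name buildGnyListUrl is made up for illustration):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class GnyUrlDemo {
    //Hypothetical helper expressing the splicing rule above
    static String buildGnyListUrl(String keyword1, String keyword2, int page)
            throws UnsupportedEncodingException {
        return "https://gny.ly.com/list?src=" + URLEncoder.encode(keyword1, "UTF-8")
                + "&dest=" + URLEncoder.encode(keyword2, "UTF-8")
                + "&start=" + page;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        //Prints https://gny.ly.com/list?src=%E6%9D%AD%E5%B7%9E&dest=%E5%8C%97%E4%BA%AC&start=2
        System.out.println(buildGnyListUrl("杭州", "北京", 2));
    }
}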

Another example: Baidu News, keyword search url:

       https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&rsv_dl=ns_pc&word=Zhejiang+fire&pn=10

       https://www.baidu.com/s?rtt=1&bsst=1&cl=2&tn=news&rsv_dl=ns_pc&word=Zhejiang+fire&pn=20

1.2. Analysis II

Abstracting further: we usually put the keywords in a configuration file or in a database, from which the crawler reads them and stores them in two lists, kw1List and kw2List.

Examples of the two configuration modes are shown below (a sketch for loading the YAML configuration follows these examples):

  • Configuration 1 (yaml file)
filters:
  searchfilter:
    kwfixvalue: [ Zhejiang, Jiangsu, Shanghai, Beijing, Tianjin ]
    kwvalue: [ fire, collapse, blast, Accident, security, casualties ]
  • Configuration 2 (database)
Origin City ID    Origin City Name    Destination ID    Destination Name
0510              Wuxi                0571              Hangzhou
001               Beijing             021               Nanjing
0519              Changzhou           0996              Urumqi
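
For Configuration 1, a minimal loading sketch with SnakeYAML follows; the file name config.yml and the mapping of kw1List/kw2List onto kwvalue/kwfixvalue are assumptions, since the original does not show the loading code:

import org.yaml.snakeyaml.Yaml;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;
import java.util.Map;

public class ConfigLoadDemo {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws Exception {
        //config.yml is an assumed file name holding the "filters" block above
        try (InputStream in = new FileInputStream("config.yml")) {
            Map<String, Object> root = new Yaml().load(in);
            Map<String, Object> filters = (Map<String, Object>) root.get("filters");
            Map<String, Object> search = (Map<String, Object>) filters.get("searchfilter");

            List<String> kw1List = (List<String>) search.get("kwvalue");     //fire, collapse, ...
            List<String> kw2List = (List<String>) search.get("kwfixvalue");  //Zhejiang, Jiangsu, ...
            System.out.println(kw1List + " / " + kw2List);
        }
    }
}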

 

When splicing the URL of the next list page (i.e. turning or switching pages), we need the following state: the current keyword 1, the current keyword 2, the index of keyword 1 in list 1, the index of keyword 2 in list 2, and the index of the page that has already been crawled (i.e. which page the site is currently showing).

2. Solution

Based on the above analysis, the keyword selection and switching state used in the list page URL splicing logic is extracted into a POJO class, named KeywordOptions. Its fields are described below, followed by the code:

  • currentPage: index of the current page (which page of the search results for keyword 1 + keyword 2)
  • currentFixIndex: index in list 2 of the current keyword 2
  • kwFixValue: keyword 2
  • currentIndex: index in list 1 of the current keyword 1
  • kwValue: keyword 1
public class KeywordOptions {
    private Long currentPage;           //current page index
    private Integer currentFixIndex;    //index of keyword 2 in list 2
    private String kwFixValue = null;   //keyword 2
    private Integer currentIndex;       //index of keyword 1 in list 1
    private String kwValue = null;      //keyword 1

    public KeywordOptions() {
    }

    public Long getCurrentPage() {
        return this.currentPage;
    }

    public void setCurrentPage(Long currentPage) {
        this.currentPage = currentPage;
    }

    public Integer getCurrentFixIndex() {
        return this.currentFixIndex;
    }

    public void setCurrentFixIndex(Integer currentFixIndex) {
        this.currentFixIndex = currentFixIndex;
    }

    public Integer getCurrentIndex() {
        return this.currentIndex;
    }

    public void setCurrentIndex(Integer currentIndex) {
        this.currentIndex = currentIndex;
    }

    public String getKwFixValue() {
        return this.kwFixValue;
    }

    public void setKwFixValue(String kwFixValue) {
        this.kwFixValue = kwFixValue;
    }

    public String getKwValue() {
        return this.kwValue;
    }

    public void setKwValue(String kwValue) {
        this.kwValue = kwValue;
    }
}
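
For illustration, the first seed's KeywordOptions might be initialized as follows. This is a sketch under assumed sample values, not code from the original, where the initial values would come from configuration:

import java.util.Arrays;
import java.util.List;

public class KeywordOptionsInitDemo {
    public static void main(String[] args) {
        //Assumed sample lists; in BasePageProcessor they are read from configuration
        List<String> kwValues = Arrays.asList("fire", "collapse", "blast");   //keyword 1 list
        List<String> kwFixValues = Arrays.asList("Zhejiang", "Jiangsu");      //keyword 2 list

        KeywordOptions ko = new KeywordOptions();
        ko.setCurrentPage(1L);                  //the configured "firstpage"
        ko.setCurrentIndex(0);                  //start at the first keyword 1
        ko.setCurrentFixIndex(0);               //start at the first keyword 2
        ko.setKwValue(kwValues.get(0));
        ko.setKwFixValue(kwFixValues.get(0));

        System.out.println(ko.getKwFixValue() + " + " + ko.getKwValue());   //Zhejiang + fire
    }
}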

The abstract class BasePageProcessor is built on webmagic's PageProcessor interface. This abstract class implements the methods shared by the concrete crawling tasks. First, the keyword switching logic:

private boolean nextKeyword(KeywordOptions ko) {
    //No search filter is configured, so there are no keywords to switch
    if (this.searchFilterConfig == null) {
        return false;
    } else {
        int kwSize = this.kwValues.size();
        int kwFixSize;

        if (this.kwFixValues == null) {
            kwFixSize = 0;
        } else {
            kwFixSize = this.kwFixValues.size();
        }

        //Keyword 1 has reached the end of list 1
        if (ko.getCurrentIndex() >= kwSize - 1) {
            ko.setCurrentIndex(0);
            //Keyword 2 has also reached the end of list 2: every combination is done
            if (ko.getCurrentFixIndex() >= kwFixSize - 1) {
                return false;
            } else {
                //Advance keyword 2 and restart keyword 1 from the head of list 1
                ko.setCurrentFixIndex(ko.getCurrentFixIndex() + 1);
                ko.setKwValue((String)this.kwValues.get(ko.getCurrentIndex()));
                if (this.kwFixValues != null) {
                    ko.setKwFixValue((String)this.kwFixValues.get(ko.getCurrentFixIndex()));
                }
                return true;
            }
        } else {
            //Otherwise simply advance keyword 1 within list 1
            ko.setCurrentIndex(ko.getCurrentIndex() + 1);
            ko.setKwValue((String)this.kwValues.get(ko.getCurrentIndex()));
            if (this.kwFixValues != null) {
                ko.setKwFixValue((String)this.kwFixValues.get(ko.getCurrentFixIndex()));
            }
            return true;
        }
    }
}
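
In effect, nextKeyword walks the two lists as a nested traversal: keyword 1 changes fastest, and keyword 2 only advances once every keyword 1 has been tried. A standalone illustration of that order (sample values assumed, not the class's actual fields):

import java.util.Arrays;
import java.util.List;

public class KeywordOrderDemo {
    public static void main(String[] args) {
        //Sample lists standing in for kwValues (keyword 1) and kwFixValues (keyword 2)
        List<String> kwValues = Arrays.asList("fire", "collapse", "blast");
        List<String> kwFixValues = Arrays.asList("Zhejiang", "Jiangsu");

        //Same order that repeated nextKeyword(ko) calls produce:
        //keyword 1 cycles through list 1 before keyword 2 moves on
        for (String fix : kwFixValues) {
            for (String kw : kwValues) {
                System.out.println(fix + " + " + kw);
            }
        }
    }
}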

The method that splices a URL from a KeywordOptions object is shown below. It is declared public so that subclasses can inherit and override it for sites with different splicing rules.

public String koToUrl(KeywordOptions ko) {
    //Start from the base list url and append the current page index
    StringBuilder builder = new StringBuilder(this.baseUrl);
    builder.append(ko.getCurrentPage());
    if (this.searchFilterConfig == null) {
        return builder.toString();
    } else if (ko.getKwValue() == null && ko.getKwFixValue() == null) {
        return builder.toString();
    } else {
        builder.append("&");
        //Append keyword 1, URL-encoded when a charset is configured
        if (ko.getKwValue() != null) {
            if (this.kwCharset != null) {
                try {
                    builder.append(URLEncoder.encode(ko.getKwValue(), this.kwCharset));
                } catch (UnsupportedEncodingException e) {
                    e.printStackTrace();
                }
            } else {
                builder.append(ko.getKwValue());
            }
        }

        //Append keyword 2, joined with "+", URL-encoded when a charset is configured
        if (ko.getKwFixValue() != null) {
            builder.append("+");
            if (this.kwCharset != null) {
                try {
                    builder.append(URLEncoder.encode(ko.getKwFixValue(), this.kwCharset));
                } catch (UnsupportedEncodingException e) {
                    e.printStackTrace();
                }
            } else {
                builder.append(ko.getKwFixValue());
            }
        }

        return builder.toString();
    }
}
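
As an example, for the gny.ly.com rule from Analysis I, a concrete subclass might override koToUrl roughly as follows. This is a sketch, not the project's actual code:

@Override
public String koToUrl(KeywordOptions ko) {
    //src = keyword 1, dest = keyword 2, start = page index (the splicing rule from Analysis I)
    try {
        return "https://gny.ly.com/list?src=" + URLEncoder.encode(ko.getKwValue(), "UTF-8")
                + "&dest=" + URLEncoder.encode(ko.getKwFixValue(), "UTF-8")
                + "&start=" + ko.getCurrentPage();
    } catch (UnsupportedEncodingException e) {
        throw new IllegalStateException(e);
    }
}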

Finally, the request for the next list page (the wrapped URL) is obtained:

public synchronized Request nextListPage(KeywordOptions ko) {
    //Proceed only when list switching is not locked and the whole task is not finished
    if (!this.listAddLock && !this.isComplete) {
        //Get the configuration file parser instance
        ConfigParser parser = ConfigParser.getInstance();
        Boolean fixed = (Boolean)parser.getValue(this.commonConfig, "fixed", false, this.spiderConfig.getConfigPath() + ".common");
        //Determine whether the page url is fixed
        if (fixed) {
            return null;
        } else {
            String url;
            //Check whether the current page is the last page of the list
            if (ko.getCurrentPage() >= this.totalPages) {
                //If so, reset to the first page and switch to the next keyword combination
                ko.setCurrentPage(Long.valueOf(String.valueOf(this.commonConfig.get("firstpage"))));
                if (this.nextKeyword(ko)) {
                    url = this.koToUrl(ko);
                    return this.nextListPageHook(this.pushRequest(url, ko));
                } else {
                    this.isComplete = true;
                    return this.nextListPageHook((Request)null);
                }
            } else {
                //Not the last page yet: simply advance the current page index by one
                ko.setCurrentPage(ko.getCurrentPage() + 1L);
                url = this.koToUrl(ko);
                return this.nextListPageHook(this.pushRequest(url, ko));
            }
        }
    } else {
        return null;
    }
}
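
The pushRequest helper is not shown above; judging from how process() later reads the "ko" extra back with fastjson, it plausibly looks something like the sketch below (it may also set listAddLock to true, which process() releases after the list page is handled):

//Sketch only; the original implementation of pushRequest is not shown
protected Request pushRequest(String url, KeywordOptions ko) {
    Request request = new Request(url);
    //Serialize the KeywordOptions with fastjson, matching JSON.parseObject(...) in process()
    request.putExtra("ko", JSON.toJSONString(ko));
    return request;
}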

The page processing logic is written in BasePageProcessor. The relevant code is as follows:

public void process(Page page) {
    Iterator var4;
    if (page.getUrl().toString().contains(this.baseUrl)) {
        //Check for a download failure, signalled by the custom status code 600
        if (page.getStatusCode() == 600) {
            this.listAddLock = false;
            return;
        }
        //Parse the list page; concrete crawlers override the processList(page) method
        if (this.processList(page)) {
            this.processSuccessListPageCount.incrementAndGet();
            logger.info("list page crawl success url={}", page.getUrl());
            this.listAddLock = false;
        } else {
            this.processErrorListPageCount.incrementAndGet();
            logger.warn("list page crawl failed url={}", page.getUrl());
        }
        //Restore the KeywordOptions from this list request's extras and copy the current keyword into each extracted target request
        KeywordOptions ko = (KeywordOptions)JSON.parseObject((String)page.getRequest().getExtra("ko"), KeywordOptions.class);
        if (ko != null) {
            List<Request> requests = page.getTargetRequests();
            var4 = requests.iterator();

            while(var4.hasNext()) {
                Request request = (Request)var4.next();
                request.putExtra("kw", ko.getKwValue());
            }
        }
        //Get the next list page
        Request listpage = this.nextListPage(ko);
        if (listpage != null) {
            listpage.putExtra("nocheckdup", true);
            page.putField("listPage", listpage);
        }
    } else {
        //Detail page parsing, with the same download error check
        if (page.getStatusCode() == 600) {
            return;
        }
        
        try {
            //The processPage method is likewise overridden by concrete crawlers
            this.processPage(page);
            this.processSuccessPageCount.incrementAndGet();
        } catch (Exception var7) {
            this.processErrorPageCount.incrementAndGet();
            logger.warn("page process failed url={} , error:{}", new Object[]{page.getUrl(), var7});
        }

        ResultItems items = page.getResultItems();
        String keyword = (String)page.getRequest().getExtra("kw");
        if (keyword == null) {
            keyword = this.kwValues != null ? (String)this.kwValues.get(0) : null;
        }

        if (keyword != null) {
            var4 = items.getAll().entrySet().iterator();

            while(var4.hasNext()) {
                Map.Entry<String, Object> entry1 = (Map.Entry)var4.next();
                Map<String, Object> map = (Map)entry1.getValue();
                map.put("keyword", keyword);
            }
        }
    }
}
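
Finally, a concrete subclass of BasePageProcessor (here called MyListProcessor, a made-up name) that implements processList and processPage can be handed to webmagic's Spider in the usual way; a minimal sketch:

import us.codecraft.webmagic.Spider;

public class CrawlerMain {
    public static void main(String[] args) {
        //MyListProcessor is a hypothetical concrete subclass of BasePageProcessor
        Spider.create(new MyListProcessor())
              //Seed with the first list page url from Analysis I
              .addUrl("https://gny.ly.com/list?src=%E6%9D%AD%E5%B7%9E&dest=%E5%8C%97%E4%BA%AC&prop=0")
              .thread(4)
              .run();
    }
}

Note that process() exposes the next list page request through the "listPage" result field, which implies a custom Pipeline (or a similar hook) that pushes that request back into the scheduler; that part is not shown in the original.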

 
