WebMagic: crawling data through a proxy IP and fixing the HTTP 407 error

Keywords: Java JSON


I have a small task at hand: crawling data from a website, which I implemented with WebMagic. Crawling from the company's own IP is risky, because if that IP gets blocked, normal business access is affected. As it happens, another project at the company had purchased proxy IP resources from "Station Master"; that project was shelved, so I applied to borrow them.

Getting hold of the proxy IP API provided by Station Master was not technically difficult. However, when the crawler ran, it received an HTTP 407 error. 407 (Proxy Authentication Required) means the proxy itself is demanding credentials. Station Master's technical support advised us to check the product configuration, and it turned out that the authorization mode currently set for "Private Private Agent" was "User Name + Password". Looking at WebMagic's Proxy class, there is a constructor that accepts a user name and password in addition to the required IP and port. That was the fix: after switching to that constructor, the test passed.
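To make the 407 less mysterious, here is a minimal sketch of what "User Name + Password" authorization usually amounts to on the wire, assuming the proxy uses the HTTP Basic scheme (the post's own code uses a "Basic " prefix, but the scheme the vendor actually negotiates is not stated). The class name, method name, and the "user"/"pass" values are placeholders for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class ProxyAuthSketch {
    // Builds the value of a Proxy-Authorization header for the Basic scheme:
    // "Basic " + base64(username + ":" + password).
    static String basicProxyAuth(String username, String password) {
        String raw = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(raw.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Placeholder credentials, not real ones.
        System.out.println("Proxy-Authorization: " + basicProxyAuth("user", "pass"));
    }
}
```

In the crawler itself there is no need to build this header by hand: passing the user name and password to the Proxy constructor lets the HTTP client answer the proxy's challenge.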


The relevant part of the WebMagic crawler code that uses the proxy IP:

// Uses java.util.Base64 and java.nio.charset.StandardCharsets.
Request request = new Request("https://www.xxx.com/a/b");
request.setMethod("POST");
// Basic credentials in the form "username:password" (values masked here).
String credentials = Base64.getEncoder()
        .encodeToString("20190430**********:password".getBytes(StandardCharsets.UTF_8));
// The same credentials are also sent as request headers; the change that
// resolved the 407 was the Proxy constructor further down.
request.addHeader("Proxy-Authorization", "Basic " + credentials);
request.addHeader("Authorization", "Basic " + credentials);
request.setRequestBody(HttpRequestBody.json("{pageIdx:'" + pageIdx + "'}", "utf-8"));

HttpClientDownloader httpClientDownloader = new HttpClientDownloader();


// Call the vendor API to get the proxy IP list.
List<ZdoIpVO> proxyIPList = spiderConfig.getIps();
if (!CollectionUtils.isEmpty(proxyIPList)) {
    ZdoIpVO zdoIpVO = proxyIPList.get(0);
    // Pass the proxy user name and password along with the IP and port;
    // without them the proxy answers with HTTP 407.
    httpClientDownloader.setProxyProvider(SimpleProxyProvider.from(
            new Proxy(zdoIpVO.getIp(), zdoIpVO.getPort(), spiderConfig.getZdoId(), spiderConfig.getZdoPassword())
    ));
}

Spider.create(this)
        .addRequest(request)
        .setDownloader(httpClientDownloader)
        // Crawl with 2 threads
        .thread(2)
        // Start the crawler (blocks until it finishes)
        .run();
}
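One detail in the listing above: the request body is concatenated as {pageIdx:'...'}, with an unquoted key and single quotes, which is not strict JSON. The post does not say whether the target endpoint cares. If it does, a minimal sketch of building a strict JSON body could look like the following; the class and method names are hypothetical, and treating pageIdx as a string is an assumption based on the quotes in the original.

```java
final class JsonBodySketch {
    // Escapes the two characters that would break a JSON string literal.
    // Control characters are not handled; a JSON library would be safer.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Produces a body such as {"pageIdx":"3"} with a quoted key and value.
    static String pageBody(String pageIdx) {
        return "{\"pageIdx\":\"" + escape(pageIdx) + "\"}";
    }
}
```

The result could then be passed to HttpRequestBody.json(JsonBodySketch.pageBody(pageIdx), "utf-8") in place of the hand-built string.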

 

According to Station Master's product information, what we connected to is a first-hand private proxy IP service: each IP stays alive for 1 to 4 hours, and about 1,000 IPs can be extracted at a time (most of them in Jiangsu, Zhejiang and Guangdong), which suggests the vendor is technically strong. You get what you pay for, though: it costs as much as 18,000 yuan a year. With the company's project stalled and the proxies sitting idle ever since, that has been a waste of the boss's money.

Posted by Xu Wei Jie on Thu, 03 Oct 2019 02:40:38 -0700