This is the fourth post in my series of Java crawler blogs. In the previous one (Java crawler meets asynchronously loaded data: try these two methods!) we talked briefly about handling asynchronously loaded data from two angles: an embedded browser kernel and reverse-engineering the requests. In this article, we will briefly talk about how resource websites block crawler programs based on user access behavior, and the corresponding countermeasures.
Blocking crawler programs is a protection measure taken by resource websites, and the most commonly used anti-crawler strategy is based on access behavior: for example, each IP may only access the site X times within a certain period, and anything beyond that is judged to be a crawler. Behavior-based detection does not rely on the visit count alone; it also considers the User-Agent request header of each request, the interval between visits, and so on. Generally it is a combination of many factors, with the number of visits as the main one.
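To make the counting strategy concrete, here is a minimal, hypothetical sketch of a fixed-window visit counter of the kind a server might run; the threshold, window size, and class name are all invented for illustration, and this is not Douban's actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical fixed-window visit counter: at most MAX_VISITS requests
 * per IP per window; anything above that is treated as a crawler.
 */
public class VisitCounter {
    private static final int MAX_VISITS = 60;         // assumed per-minute threshold
    private static final long WINDOW_MILLIS = 60_000; // one-minute window

    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();
    private volatile long windowStart = System.currentTimeMillis();

    public boolean allow(String ip) {
        long now = System.currentTimeMillis();
        if (now - windowStart > WINDOW_MILLIS) {
            // A new window begins: reset all counters
            synchronized (this) {
                if (now - windowStart > WINDOW_MILLIS) {
                    counts.clear();
                    windowStart = now;
                }
            }
        }
        // Count this visit and check it against the threshold
        int visits = counts.computeIfAbsent(ip, k -> new AtomicInteger()).incrementAndGet();
        return visits <= MAX_VISITS;
    }
}
```

A site following this strategy would call allow(ip) for every incoming request and answer with a 403 once it returns false, which is exactly the behavior we are about to see from Douban.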
Anti-crawler measures are a form of self-protection for every resource website, aimed at keeping resources from being monopolized by crawler programs. Douban, which we used in earlier posts, blocks crawlers based on access behavior: after an IP exceeds a certain number of requests per minute, its requests directly return a 403 error for a period of time, meaning you have no permission to access the page. So today we take Douban as an example again and reproduce this phenomenon with a program. Below is a program I wrote to collect Douban movies.
```java
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Collecting Douban movies
 */
public class CrawlerMovie {

    public static void main(String[] args) {
        try {
            CrawlerMovie crawlerMovie = new CrawlerMovie();
            // Douban movie detail-page links
            List<String> movies = crawlerMovie.movieList();
            // Create a thread pool with 10 threads
            ExecutorService exec = Executors.newFixedThreadPool(10);
            for (String url : movies) {
                // Submit a collection thread
                exec.execute(new CrawlMovieThread(url));
            }
            // Shut down the thread pool
            exec.shutdown();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Douban movie list links,
     * obtained with the reverse-parsing approach
     *
     * @return
     */
    public List<String> movieList() throws Exception {
        // Get 100 movie links
        String url = "https://movie.douban.com/j/search_subjects?type=movie&tag=hot&sort=recommend&page_limit=200&page_start=0";
        CloseableHttpClient client = HttpClients.createDefault();
        List<String> movies = new ArrayList<>(100);
        try {
            HttpGet httpGet = new HttpGet(url);
            CloseableHttpResponse response = client.execute(httpGet);
            System.out.println("Fetched the Douban movie list, status code: " + response.getStatusLine().getStatusCode());
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity entity = response.getEntity();
                String body = EntityUtils.toString(entity, "utf-8");
                // Parse the response body as JSON
                JSONObject jsonObject = JSON.parseObject(body);
                JSONArray data = jsonObject.getJSONArray("subjects");
                for (int i = 0; i < data.size(); i++) {
                    JSONObject movie = data.getJSONObject(i);
                    movies.add(movie.getString("url"));
                }
            }
            response.close();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            client.close();
        }
        return movies;
    }
}

/**
 * Thread that collects one Douban movie detail page
 */
class CrawlMovieThread extends Thread {
    // Link to collect
    String url;

    public CrawlMovieThread(String url) {
        this.url = url;
    }

    public void run() {
        try {
            Connection connection = Jsoup.connect(url)
                    .method(Connection.Method.GET)
                    .timeout(50000);
            Connection.Response response = connection.execute();
            System.out.println("Collecting Douban movie, status code: " + response.statusCode());
        } catch (Exception e) {
            System.out.println("Collecting Douban movie, exception: " + e.getMessage());
        }
    }
}
```
The logic of this program is fairly simple: we collect Douban's popular movies by calling the Ajax endpoint directly to get their links, then parse out the detail-page links and visit them with multiple threads, because only multi-threaded access reaches a request rate that meets Douban's blocking condition. The Douban popular movie page looks like this:
Run the above program many times and you'll get the results shown below.
From the figure above, we can see that the status code returned to HttpClient is 403, which means we no longer have permission to access the page; in other words, Douban has decided we are a crawler and refuses our requests. Let's analyze our current access architecture. Since we access Douban's server directly, the architecture at this point looks like this:
If we want to break through this restriction, we cannot access Douban's server directly. We need to pull in third parties to visit on our behalf, using a different one for each visit so that we never hit the limit. This is what an IP proxy does. With a proxy in the middle, the access architecture becomes the following diagram:
To use IP proxies, we need an IP proxy pool. Next, let's talk about IP proxy pools.
IP proxy pool
Many vendors operate proxy servers; I won't name any specifically, since a quick Baidu search for "IP proxy" turns up plenty. These vendors offer both paid and free proxy IPs. Paid proxy IPs have better availability and speed, so if you need proxies in a production environment, paid proxy IPs are recommended. If we are only studying on our own, we can collect the free public proxy IPs these vendors publish; their performance and availability are poor, but that does not affect our use.
Since ours is a demo project, we will build our own IP proxy pool. How should an IP proxy pool be designed? Below is a simple architecture diagram of an IP proxy pool that I drew.
From the schematic diagram above, we can see that an IP proxy pool system involves four modules: IP acquisition, IP storage, IP detection, and an API interface (a minimal sketch of this decomposition follows the list below).
- IP Acquisition Module
Responsible for collecting proxy IPs from the major proxy vendors. The more sites it collects from, the higher the availability of the proxy IPs.
- IP Storage Module
Stores the collected proxy IPs; Redis is commonly used as a high-performance store. Two kinds of data need to be kept: proxy IPs that have been verified as usable, and newly collected proxy IPs that have not yet been checked.
- IP Detection Module
Checks whether the collected proxy IPs are usable, filtering out the unavailable ones first so as to improve the availability of the IPs we serve.
- API interface module
Provides the available proxy IPs in the form of an API.
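As promised above, here is a minimal in-memory Java sketch of this four-module decomposition. It only illustrates the structure: a real pool like the one described would scrape vendor sites for input, store the IPs in Redis, and expose the pool over HTTP, and every name in this sketch is invented.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Illustrative skeleton of the four proxy-pool modules. */
public class SimpleProxyPool {
    // IP storage module: unverified and verified IPs kept separately
    private final Queue<String> rawProxies = new ConcurrentLinkedQueue<>();
    private final Queue<String> usableProxies = new ConcurrentLinkedQueue<>();

    // IP acquisition module: a real pool would scrape "host:port" pairs from vendor pages
    public void acquire(List<String> scraped) {
        rawProxies.addAll(scraped);
    }

    // IP detection module: keep only the proxies we can actually connect to
    public void detect() {
        String proxy;
        while ((proxy = rawProxies.poll()) != null) {
            String[] parts = proxy.split(":");
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(parts[0], Integer.parseInt(parts[1])), 3000);
                usableProxies.add(proxy); // reachable, keep it
            } catch (IOException ignored) {
                // unreachable, drop it
            }
        }
    }

    // API interface module: a real pool would expose this over HTTP
    public Optional<String> get() {
        return Optional.ofNullable(usableProxies.poll());
    }
}
```

Note that the detection here only proves the proxy port accepts TCP connections; a more faithful check would send a real HTTP request through the proxy and verify the response.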
That is roughly how an IP proxy pool is designed. A brief understanding is enough for us, because these days we rarely need to write an IP proxy pool service ourselves: there are plenty of excellent open-source projects on GitHub, and there is no need to reinvent the wheel. I chose proxy_pool, an open-source IP proxy pool project with 8K stars on GitHub, to serve as our IP proxy pool. For proxy_pool, visit https://github.com/jhao104/proxy_pool
Deployment of proxy_pool
proxy_pool is written in Python, but that doesn't matter, because it can now be deployed as a container. Containerization shields us from installing the environment: we only need to run an image to run the service, without knowing anything about the implementation inside, so Java programmers who don't know Python can use this project too. proxy_pool uses Redis to store the collected IPs, so you need to start a Redis service before starting proxy_pool. The docker startup steps for proxy_pool are as follows.
- Pull the image
```shell
docker pull jhao104/proxy_pool
```
- Run the image
```shell
docker run --env db_type=REDIS --env db_host=127.0.0.1 --env db_port=6379 --env db_password=pwd_str -p 5010:5010 jhao104/proxy_pool
```
After starting the image, wait a while, because the first round of collecting and processing data takes some time. Then visit http://{your_host}:5010/get_all/. If you get the result shown in the figure below, you have deployed the proxy_pool project successfully.
Using IP proxies
With the IP proxy pool set up, we can use proxy IPs to collect Douban movies. We already know that, besides the IP, the User-Agent request header is another factor Douban uses to judge whether a visit comes from a crawler program, so we forge the User-Agent request header as well, using a different one for each visit.
We add the IP proxy and a random User-Agent request header to the Douban movie collection program. The code is as follows:
```java
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.apache.commons.lang3.StringUtils;
import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class CrawlerMovieProxy {

    /**
     * List of common User-Agents
     */
    static List<String> USER_AGENT = new ArrayList<String>(10) {
        {
            add("Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Safari/535.19");
            add("Mozilla/5.0 (Linux; U; Android 4.0.4; en-gb; GT-I9300 Build/IMM76D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30");
            add("Mozilla/5.0 (Linux; U; Android 2.2; en-gb; GT-P1000 Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1");
            add("Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0");
            add("Mozilla/5.0 (Android; Mobile; rv:14.0) Gecko/14.0 Firefox/14.0");
            add("Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36");
            add("Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19");
            add("Mozilla/5.0 (iPad; CPU OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A334 Safari/7534.48.3");
            add("Mozilla/5.0 (iPod; U; CPU like Mac OS X; en) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/3A101a Safari/419.3");
        }
    };

    /**
     * Get a random User-Agent
     *
     * @return
     */
    public String randomUserAgent() {
        Random random = new Random();
        int num = random.nextInt(USER_AGENT.size());
        return USER_AGENT.get(num);
    }

    /**
     * Fill the proxy ip queue
     *
     * @param queue queue
     * @throws IOException
     */
    public void proxyIpPool(LinkedBlockingQueue<String> queue) throws IOException {
        // Fetch a batch of proxy IPs from the pool service
        String proxyUrl = "http://192.168.99.100:5010/get_all/";
        CloseableHttpClient httpclient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(proxyUrl);
        CloseableHttpResponse response = httpclient.execute(httpGet);
        if (response.getStatusLine().getStatusCode() == 200) {
            HttpEntity entity = response.getEntity();
            String body = EntityUtils.toString(entity, "utf-8");
            // Parse the response body as a JSON array
            JSONArray jsonArray = JSON.parseArray(body);
            int size = Math.min(100, jsonArray.size());
            for (int i = 0; i < size; i++) {
                JSONObject data = jsonArray.getJSONObject(i);
                String proxy = data.getString("proxy");
                queue.add(proxy);
            }
        }
        response.close();
        httpclient.close();
    }

    /**
     * Get one random proxy ip
     *
     * @return
     * @throws IOException
     */
    public String randomProxyIp() throws IOException {
        // Fetch one proxy ip at a time
        String proxyUrl = "http://192.168.99.100:5010/get/";
        String proxy = "";
        CloseableHttpClient httpclient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(proxyUrl);
        CloseableHttpResponse response = httpclient.execute(httpGet);
        if (response.getStatusLine().getStatusCode() == 200) {
            HttpEntity entity = response.getEntity();
            String body = EntityUtils.toString(entity, "utf-8");
            // Parse the response body as JSON
            JSONObject data = JSON.parseObject(body);
            proxy = data.getString("proxy");
        }
        response.close();
        httpclient.close();
        return proxy;
    }

    /**
     * Douban movie link list
     *
     * @return
     */
    public List<String> movieList(LinkedBlockingQueue<String> queue) {
        // Get 40 movie links
        String url = "https://movie.douban.com/j/search_subjects?type=movie&tag=hot&sort=recommend&page_limit=40&page_start=0";
        List<String> movies = new ArrayList<>(40);
        CloseableHttpClient client = HttpClients.createDefault();
        try {
            HttpGet httpGet = new HttpGet(url);
            // Set up the ip proxy
            HttpHost proxy = null;
            // Get one random proxy ip
            String proxy_ip = randomProxyIp();
            if (StringUtils.isNotBlank(proxy_ip)) {
                String[] proxyList = proxy_ip.split(":");
                System.out.println(proxyList[0]);
                proxy = new HttpHost(proxyList[0], Integer.parseInt(proxyList[1]));
            }
            // Set a random User-Agent request header
            httpGet.setHeader("User-Agent", randomUserAgent());
            RequestConfig requestConfig = RequestConfig.custom()
                    .setProxy(proxy)
                    .setConnectTimeout(10000)
                    .setSocketTimeout(10000)
                    .setConnectionRequestTimeout(3000)
                    .build();
            httpGet.setConfig(requestConfig);
            CloseableHttpResponse response = client.execute(httpGet);
            System.out.println("Fetched the Douban movie list, status code: " + response.getStatusLine().getStatusCode());
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity entity = response.getEntity();
                String body = EntityUtils.toString(entity, "utf-8");
                // Parse the response body as JSON
                JSONObject jsonObject = JSON.parseObject(body);
                JSONArray data = jsonObject.getJSONArray("subjects");
                for (int i = 0; i < data.size(); i++) {
                    JSONObject movie = data.getJSONObject(i);
                    movies.add(movie.getString("url"));
                }
            }
            response.close();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                client.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return movies;
    }

    public static void main(String[] args) {
        // Queue that stores the proxy IPs
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(100);
        try {
            CrawlerMovieProxy crawlerProxy = new CrawlerMovieProxy();
            // Initialize the ip proxy queue
            crawlerProxy.proxyIpPool(queue);
            // Get the Douban movie list
            List<String> movies = crawlerProxy.movieList(queue);
            // Create a fixed-size thread pool
            ExecutorService exec = Executors.newFixedThreadPool(5);
            for (String url : movies) {
                // Submit a collection thread
                exec.execute(new CrawlMovieProxyThread(url, queue, crawlerProxy));
            }
            // Shut down the thread pool
            exec.shutdown();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

/**
 * Thread that collects a Douban movie detail page through a proxy
 */
class CrawlMovieProxyThread extends Thread {
    // Link to collect
    String url;
    // Proxy ip queue
    LinkedBlockingQueue<String> queue;
    // Proxy pool helper
    CrawlerMovieProxy crawlerProxy;

    public CrawlMovieProxyThread(String url, LinkedBlockingQueue<String> queue, CrawlerMovieProxy crawlerProxy) {
        this.url = url;
        this.queue = queue;
        this.crawlerProxy = crawlerProxy;
    }

    public void run() {
        String proxy;
        String[] proxys = new String[2];
        try {
            Connection connection = Jsoup.connect(url)
                    .method(Connection.Method.GET)
                    .timeout(50000);
            // If the proxy ip queue is empty, refill it
            if (queue.size() == 0) crawlerProxy.proxyIpPool(queue);
            // Take a proxy ip from the queue
            proxy = queue.poll();
            // Split the proxy ip into host and port
            proxys = proxy.split(":");
            // Set the proxy ip
            connection.proxy(proxys[0], Integer.parseInt(proxys[1]));
            // Set a random User-Agent request header
            connection.header("User-Agent", crawlerProxy.randomUserAgent());
            Connection.Response response = connection.execute();
            System.out.println("Collecting Douban movie, status code: " + response.statusCode() + ", request ip: " + proxys[0]);
        } catch (Exception e) {
            System.out.println("Collecting Douban movie, exception: " + e.getMessage() + ", request ip: " + proxys[0]);
        }
    }
}
```
The modified collection program may need to be run several times, because your proxy IPs will not always be valid. If a proxy IP is valid, you will get the following result.
From the results we can see that, across the 40 visits to movie detail pages, a large number of proxy IPs were invalid and only a small portion worked. This directly proves that the availability of free proxy IPs is not high, so if you need proxies in production, you had better use paid proxy IPs. Even though our self-built IP proxy pool's availability is not great, the accesses to Douban movies that we routed through it succeeded: using IP proxies successfully circumvented Douban's restriction.
There are many reasons why a crawler's server gets blocked. This article mainly introduced how to circumvent Douban's access restriction by setting up an IP proxy and forging the User-Agent request header. To keep our programs from being treated as crawlers by resource websites, the following three points need to be done well:
- Forge the User-Agent request header
- Use IP proxies
- Use a non-fixed collection interval (see the sketch below)
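The first two points are demonstrated by the program above; the third is not, so here is a minimal sketch of a randomized delay (the 1-5 second bounds are arbitrary, and the class name is invented):

```java
import java.util.concurrent.ThreadLocalRandom;

public class RandomDelay {
    /** Sleep a random 1-5 seconds so the interval between requests is not fixed. */
    public static void pause() {
        try {
            Thread.sleep(ThreadLocalRandom.current().nextLong(1000, 5000));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling RandomDelay.pause() before connection.execute() in CrawlMovieProxyThread would spread the requests out instead of firing them at a fixed rate.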
I hope this article helps you. The next post will explore multi-threaded crawlers. If you are interested in crawlers, feel free to follow along. Where this article falls short, I hope you will point it out, so we can learn from each other and make progress together.
Finally
A small advertisement: welcome to follow the WeChat public account "Hirago's Technological Blog", and let's make progress together.