Compared with C# and Java crawlers, a Python crawler is simpler and more convenient. First, Python's urllib2 package provides a fairly complete API for fetching web documents. Second, for the fetched pages, BeautifulSoup provides simple document-parsing functions. These are the advantages of a Python crawler.
But a newbie programmer full of ideas who wants to become an expert doesn't have to be devoted to any particular language, or only to the one he finds easiest to use.
So today I'm going to share a Java crawler, using the language I like even though it is not the easiest tool for the job.
The code comes first, then a screenshot of the result:
package org.lq.wzq.Test;
/**
 * Read and analyze the data of Youth Network
 * xutao 2018-11-22 09:09
 */
import java.io.*;
import java.net.*;

public class pachong {
    public static void main(String args[]) {
        // Determine the address of the web page to be crawled; this is a Youth Network hot-news page.
        // The address is http://news.youth.cn/sz/201811/t20181121_11792273.htm
        String strurl = "http://news.youth.cn/sz/201811/t20181121_11792273.htm";
        try {
            // Build the URL object, the core object for crawling
            URL url = new URL(strurl);
            // Establish a connection to the web page through the URL
            URLConnection conn = url.openConnection();
            // Get the data returned by the web page through the connection
            InputStream is = conn.getInputStream();
            System.out.println(conn.getContentEncoding());
            // Web page data is usually read and analyzed line by line, so use
            // BufferedReader and InputStreamReader, a buffered stream that converts
            // the byte stream into a character stream.
            // The encoding must be handled during the conversion: it is usually GBK
            // or UTF-8 (switch to the other one if the output is garbled).
            BufferedReader br = new BufferedReader(new InputStreamReader(is, "GBK"));
            // Read and print line by line
            String line = null;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
            br.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Check the source code of the website in your browser, and you will find that what the program crawls is actually the entire web page.
The code prints the page line by line. The key to cleaning it up is applying regular expressions to extract just the data you need, and finally storing it in a txt file or an Excel sheet.
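As a minimal sketch of that idea, the line-by-line loop above can be extended with a java.util.regex.Pattern and the matches written to a txt file. The pattern here (grabbing the text inside the <title> tag) and the output file name result.txt are illustrative assumptions, not from the original article; adapt them to the data you actually want.

import java.io.*;
import java.net.*;
import java.util.regex.*;

public class PachongRegex {
    public static void main(String[] args) {
        // Illustrative pattern: capture the text inside the <title> tag.
        // Replace it with a pattern that matches the data you need.
        Pattern p = Pattern.compile("<title>(.*?)</title>");
        String strurl = "http://news.youth.cn/sz/201811/t20181121_11792273.htm";
        try {
            URL url = new URL(strurl);
            URLConnection conn = url.openConnection();
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "GBK"));
            // Write every match into result.txt (file name is an assumption)
            PrintWriter out = new PrintWriter(new FileWriter("result.txt"));
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = p.matcher(line);
                while (m.find()) {
                    out.println(m.group(1)); // group(1) is the captured text
                }
            }
            out.close();
            br.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}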
For details, please visit
2. Java reads a txt file, operates on the strings, and exports a new txt file
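The post does not include the code for this step, but a minimal sketch might look like the following. The file names in.txt and out.txt, the GBK encoding, and the string operation (trimming whitespace and skipping blank lines) are all assumptions made for illustration.

import java.io.*;

public class TxtProcessor {
    public static void main(String[] args) {
        // in.txt / out.txt and the trim-and-filter operation are illustrative assumptions
        try {
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(new FileInputStream("in.txt"), "GBK"));
            PrintWriter out = new PrintWriter(
                    new OutputStreamWriter(new FileOutputStream("out.txt"), "GBK"));
            String line;
            while ((line = br.readLine()) != null) {
                String processed = line.trim(); // the string operation: strip whitespace
                if (!processed.isEmpty()) {     // skip blank lines
                    out.println(processed);
                }
            }
            out.close();
            br.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}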