Introduction to Java crawlers

Keywords: Java, Python, network

Compared with C# and Java crawlers, a Python crawler is more convenient and simpler to write. First, Python's urllib2 package provides a fairly complete API for fetching web documents. Second, for the fetched pages, BeautifulSoup provides simple document-processing functions. These are the advantages of the Python crawler.

But a beginner with ambition, who wants to become an expert, doesn't have to be devoted to any one language, or only to whichever one seems easiest to use.

So today I'm starting a series on Java crawlers, which I like even though they are not as easy to use.

The code comes first, followed by the results.

package org.lq.wzq.Test;
/**
 * Read and analyze the news data of Youth Network
 * xutao   2018-11-22  09:09
 */
import java.io.*;
import java.net.*;

public class pachong {
    public static void main(String[] args) {
        // The address of the page to crawl: a hot-news article on Youth Network
        // http://news.youth.cn/sz/201811/t20181121_11792273.htm
        String strurl = "http://news.youth.cn/sz/201811/t20181121_11792273.htm";
        try {
            // Build the core URL object for the crawl
            URL url = new URL(strurl);
            // Open a connection to the web page through the URL
            URLConnection conn = url.openConnection();
            // Get the data returned by the web page through the connection
            InputStream is = conn.getInputStream();
            System.out.println(conn.getContentEncoding());
            // Web page data is usually read and analyzed line by line, so wrap
            // the byte stream in an InputStreamReader and a BufferedReader to
            // get a buffered character stream.
            // The conversion has to handle the encoding: Chinese pages are
            // usually GBK or UTF-8 (switch the charset name if the output is
            // garbled). try-with-resources closes the reader even on errors.
            try (BufferedReader br = new BufferedReader(new InputStreamReader(is, "GBK"))) {
                // Read and print line by line
                String line;
                while ((line = br.readLine()) != null) {
                    System.out.println(line);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
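One detail worth noting: conn.getContentEncoding() returns the Content-Encoding header (for example gzip), not the page's character set, so it will usually print null here. Below is a minimal sketch, with class and helper names of my own, that derives the charset from the Content-Type header instead of hard-coding GBK:

import java.io.*;
import java.net.*;

public class CharsetSniffer {
    // Parse the charset out of a header such as "text/html; charset=GBK"
    static String detectCharset(URLConnection conn) {
        String contentType = conn.getContentType();
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                part = part.trim();
                if (part.toLowerCase().startsWith("charset=")) {
                    return part.substring("charset=".length());
                }
            }
        }
        // Fall back to the article's default when the header says nothing
        return "GBK";
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://news.youth.cn/sz/201811/t20181121_11792273.htm");
        URLConnection conn = url.openConnection();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), detectCharset(conn)))) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}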

Compare this output with the page source in your browser, and you will find that the program actually fetches the entire web page.

The code prints the page line by line. The key step in cleaning it up is applying regular expressions to extract just the data you need, and finally storing it in a txt file or an Excel spreadsheet; a minimal sketch follows.
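As an illustration (the class name, pattern, and file name below are my own, not from the article), here is a minimal sketch that extracts the page title with a regular expression and appends it to a local txt file:

import java.io.*;
import java.net.*;
import java.util.regex.*;

public class TitleExtractor {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://news.youth.cn/sz/201811/t20181121_11792273.htm");
        URLConnection conn = url.openConnection();
        // Read the whole page into one string first
        StringBuilder html = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "GBK"))) {
            String line;
            while ((line = br.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        // Pull the <title> element out of the raw HTML with a regular expression
        Matcher m = Pattern.compile("<title>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(html);
        // Append every match to a local txt file
        try (PrintWriter out = new PrintWriter(new FileWriter("result.txt", true))) {
            while (m.find()) {
                out.println(m.group(1).trim());
            }
        }
    }
}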

For details, please visit

1. Java imports an Excel table, modifies it, arranges the table data, and exports a local table (a minimal Excel-export sketch follows this list)

2. Java reads a txt file and exports a new txt file after operating on the strings
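For the Excel side, here is a minimal export sketch, assuming Apache POI is on the classpath (the class, sheet, and file names are my own, and the linked article may take a different approach):

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ExcelExporter {
    // Write one extracted string per row into data.xlsx
    public static void export(List<String> rows) throws IOException {
        try (Workbook wb = new XSSFWorkbook();
             FileOutputStream out = new FileOutputStream("data.xlsx")) {
            Sheet sheet = wb.createSheet("crawl");
            for (int i = 0; i < rows.size(); i++) {
                Row row = sheet.createRow(i);
                row.createCell(0).setCellValue(rows.get(i));
            }
            wb.write(out);
        }
    }
}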
