Acquisition and Analysis of Web Pages

Keywords: Windows Java JQuery JSON

Language: Java. Required packages and tools: Jsoup, PhantomJS
Objective: fetch complete pages, including content that is loaded asynchronously, and parse them with Jsoup

1. Preparations

jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

download Jsoup

PhantomJS is a headless (no user interface), scriptable WebKit browser engine. It natively supports a variety of web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG.

download PhantomJS

2. Getting Static Web Pages with Jsoup

Jsoup's connect method fetches a static web page directly.
For example:

Document doc = Jsoup.connect("https://www.zhihu.com/question/19551007")
        .userAgent("Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)") // Browser user agent (this string identifies as Internet Explorer 9)
        .get();

Parse the retrieved Document:

String content = doc.html();  // Source code of the static page
String title = doc.title();   // Page title

// The advantage of Jsoup is that it supports jQuery-like selectors.
// For me, this is much more convenient than extracting data with regular expressions.
Elements writer = doc.select(".UserLink-link");
// getElementsByClass is meant for a single class name, so a CSS selector is used
// here to match elements that carry both classes.
Elements contents = doc.select(".RichText.CopyrightRichText-richText");

// Print the text of each matched element
for (Element span : contents) {
    String cText = span.text();
    System.out.println("\n" + cText);
}

Jsoup makes parsing web pages convenient, but it has a limitation: the connect method only retrieves the static HTML returned by the server. It does not execute JavaScript, so content that the page loads asynchronously is missing. Jsoup can parse a complete dynamic page only if the rendered HTML is already available, for example loaded from a local file with parse.

For details, see the Jsoup Cookbook.

To get a dynamic web page, we turn to PhantomJS.

3. Basic Use of PhantomJS

PhantomJS is a headless browser with a JavaScript API, built on WebKit. It uses QtWebKit for its core browser functionality and relies on WebKit to compile, interpret, and execute JavaScript code.

In other words, PhantomJS can do most of what an ordinary browser does, including loading pages whose content arrives through asynchronous requests. It can reportedly simulate button clicks as well; more interesting uses are left for later. Here it is used only to fetch the complete page.

The open method of the webpage module can send both GET and POST requests.

The code of JScode.js is as follows:

var system = require('system');          // Load the system module to read command-line arguments
var address = system.args[1];            // URL passed on the command line: phantomjs JScode.js <url>
var page = require('webpage').create();  // Load the webpage module and create a page instance
var url = address;
phantom.outputEncoding = "gbk";          // Encode output as GBK; with the wrong encoding, Chinese text comes out garbled
page.open(url, function (status) {
    // Called once the page has finished loading
    if (status !== 'success') {
        console.log('Unable to load the page!');
        phantom.exit(1);                 // Exit so the calling process does not hang
    } else {
        window.setTimeout(function () {
            //page.render("test.png");  // Optional: save a screenshot
            console.log(page.content);   // Print the rendered page source
            phantom.exit();              // Quit PhantomJS
        }, 5000);                        // Wait 5 seconds for asynchronous content
    }
});

Call PhantomJS to execute the JS file above and get the content of the web page.
The command is run from Java through the Runtime.exec() method.

The declaration of the java.lang.Runtime.exec() method is:

public Process exec(String command) throws IOException

Parameters:
command - the system command to execute

The system command here should therefore be:

String execc = "F:\\PhantomJS\\phantomjs-2.1.1-windows\\bin\\phantomjs  F:\\PhantomJS\\phantomjs-2.1.1-windows\\bin\\JScode.js " + url;
// The first path is the phantomjs executable; the second is the location of JScode.js.
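
Runtime.exec(String) splits the command string on whitespace, so a phantomjs path, script path, or URL that contains a space would be broken into several arguments. As a minimal sketch of one way around this, the command can be passed to ProcessBuilder as a list of separate arguments. The class name PhantomCommand, the method builder, and the paths below are hypothetical placeholders, not part of the original code.

```java
import java.util.Arrays;
import java.util.List;

class PhantomCommand {
    // Hypothetical helper: keeps each part of the command as its own argument,
    // so spaces inside a path or URL do not split it.
    static ProcessBuilder builder(String phantomjsPath, String scriptPath, String url) {
        List<String> command = Arrays.asList(phantomjsPath, scriptPath, url);
        return new ProcessBuilder(command);
    }

    public static void main(String[] args) {
        // Placeholder paths; replace them with the real install locations.
        ProcessBuilder pb = builder(
                "C:\\Program Files\\phantomjs\\bin\\phantomjs.exe",
                "C:\\Program Files\\phantomjs\\bin\\JScode.js",
                "https://example.com/");
        // Only prints the argument list; call pb.start() to actually launch PhantomJS.
        System.out.println(pb.command());
    }
}
```

Calling start() on the returned ProcessBuilder yields a Process, just as Runtime.exec() does, so the rest of the code can stay the same.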

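Wrapped in a method:

    // Requires: java.io.IOException, java.io.InputStream, org.jsoup.Jsoup, org.jsoup.nodes.Document
    /**
     * Fetches the page at the given address through PhantomJS and parses it.
     * @param url address of the page to fetch
     * @return the parsed Document of the page
     * @throws IOException if PhantomJS cannot be started or its output cannot be read
     */
    public static Document getHtml(String url) throws IOException {
        Runtime rt = Runtime.getRuntime();
        String execc = "F:\\PhantomJS\\phantomjs-2.1.1-windows\\bin\\phantomjs  F:\\PhantomJS\\phantomjs-2.1.1-windows\\bin\\JScode.js " + url;
        Process p = rt.exec(execc);
        // The charset must match phantom.outputEncoding in JScode.js
        try (InputStream getW = p.getInputStream()) {
            return Jsoup.parse(getW, "gbk", url);
        }
    }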

That's it: the getHtml() method can now be used to fetch the specified page directly.
Because it throws IOException, calls to it must be wrapped in try/catch (or the caller must declare the exception).
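
The getHtml() method above reads from PhantomJS until the process closes its output, so it would wait indefinitely if PhantomJS never exits, and it never looks at the exit code. A hedged sketch of one alternative: send the output to a temporary file, wait with a timeout, and only then decode the result. The names ProcessCapture and runAndCapture, and the 30-second timeout, are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

class ProcessCapture {
    // Hypothetical helper: runs a command, waits at most timeoutSeconds,
    // and returns everything it printed, decoded with the given charset.
    static String runAndCapture(List<String> command, Charset charset, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("process-output", ".txt");
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)      // merge stderr into stdout
                    .redirectOutput(out.toFile())   // write output to the temp file
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly().waitFor();      // kill the hung process
                throw new IOException("Timed out after " + timeoutSeconds + " s: " + command);
            }
            if (p.exitValue() != 0) {
                throw new IOException("Exit code " + p.exitValue() + ": " + command);
            }
            return new String(Files.readAllBytes(out), charset);
        } finally {
            Files.deleteIfExists(out);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java ProcessCapture <command> [args...]");
            return;
        }
        // GBK matches phantom.outputEncoding = "gbk" in JScode.js.
        System.out.println(runAndCapture(Arrays.asList(args), Charset.forName("GBK"), 30));
    }
}
```

The returned string can then be handed to Jsoup.parse(html, url) to obtain a Document, in place of parsing the process stream directly.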

.................................................................................................................

In fact, PhantomJS alone can also be used to parse web pages.
For details on using PhantomJS, see the PhantomJS API reference.

Posted by horstuff on Thu, 27 Jun 2019 17:20:25 -0700