Implementing Web Page Crawling with Node

Keywords: node.js, zlib, big data, encoding

I. Preface

Crawlers have always seemed like a fairly advanced thing, and in the era of big data they are especially important. After plenty of exploration, we finally implemented one with Node, including analysis of the crawled content.

II. Main Content

1. First, build an HTTP service. Here we use the familiar koa (this is not required; you can also crawl with pure Node. The service mainly exists for convenient interaction, viewing results, and letting non-technical people use the tool).
Server-side index.js code:

const Koa = require('koa');
const Router = require('koa-router'); // Routing
const fs = require('fs'); // Needed for streaming index.html below
const {greenFmt, green, purple} = require('color7log'); // log tools

const app = new Koa();
const router = new Router();

// Default page
router.get('/', async (ctx, next) => {
    ctx.response.type = 'html';
    ctx.response.body = fs.createReadStream('./index.html');
});

app.use(router.routes());

app.listen(3000);
green('Service is running, port: 3000');

Run node index.js to start the service and open your page. You will also need an index.html file in the project directory; its content is up to you.
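Since the service exists mainly so non-technical users can trigger the crawl from a browser, the crawl logic from section 2 below could also be exposed as a route. A minimal sketch, assuming the crawl logic is wrapped in a hypothetical promise-returning fetchPage function (fetchPage is not part of the original code):

// Hypothetical route exposing the crawler; fetchPage is an assumed
// promise-returning wrapper around the http.get logic in section 2
router.get('/crawl', async (ctx, next) => {
    try {
        ctx.response.type = 'json';
        ctx.response.body = await fetchPage('http://kaijiang.500.com/shtml/ssq/03001.shtml');
    } catch (err) {
        ctx.response.status = 500;
        ctx.response.body = 'Error acquiring data!';
    }
});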

2. Core code: request an HTML page using Node's http module.

Dependency modules (install cheerio, iconv-lite, and color7log yourself; zlib and http ship with Node):

const cheerio = require('cheerio'); // jQuery-like library that makes handling HTML on the server easy
const zlib = require('zlib');
const iconv = require('iconv-lite');
const fs = require('fs');
const {green} = require('color7log'); // log tool
const http = require('http'); // Import module

var url = 'http://kaijiang.500.com/shtml/ssq/03001.shtml';

// Get the page source, parse it, and output the results
http.get(url, function(res) {
    var arr = [];
    var chunks;
    res.on('data', function(data) {
        arr.push(data);
    });

    res.on('end', function() {
        chunks = Buffer.concat(arr);
        chunks = zlib.unzipSync(chunks); // This page is served gzip-compressed; a plain-text response would not need this step
        var body = iconv.decode(chunks, 'gbk'); // The page is GBK-encoded; decode it to a readable string

        var $ = cheerio.load(body); // Initialize the DOM object
        let content = $(".iSelectList a");
        let params = [];
        for (let i = 0; i < content.length; i++) {
            params.push($(content[i]).html()); // Collect the code of each draw period for later traversal
        }
        let nums = $(".ball_box01 li");
        for (let i = 0; i < nums.length; i++) {
            green($(nums[i]).html()); // Print each winning lottery number
        }

        // Write the period codes to a file for later use
        fs.writeFile("./data.txt", params.join(','), function() {
            console.log("complete");
        });
    });
}).on('error', function() {
    console.log('Error acquiring data!');
});

To see whether a page is gzip-compressed, check the Content-Encoding header of the response in your browser's developer tools (Network panel); a value of gzip means the body must be decompressed as above.
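The same check can be made in code, so decompression only happens when the server actually declares gzip. A minimal sketch using only the standard Node http and zlib APIs:

const http = require('http');
const zlib = require('zlib');

http.get('http://kaijiang.500.com/shtml/ssq/03001.shtml', function(res) {
    const arr = [];
    res.on('data', (chunk) => arr.push(chunk));
    res.on('end', () => {
        let buf = Buffer.concat(arr);
        // Only unzip when the Content-Encoding response header says gzip
        if (res.headers['content-encoding'] === 'gzip') {
            buf = zlib.gunzipSync(buf);
        }
        console.log('Got %d bytes of HTML', buf.length);
    });
});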

Complete runnable code address: node crawler

Posted by itisme on Fri, 11 Oct 2019 11:39:34 -0700