I. Preface
Web crawlers have long had a reputation as an advanced topic, and in the era of big data they are more relevant than ever. After a fair amount of exploration, we finally implemented one with Node, including analysis of the crawled content.
II. Main Text
1. First, set up an HTTP service. Here we use the familiar Koa (this is not required; you could also use plain Node. The service mainly exists for convenient interaction, for viewing the results, and for non-technical users).
Server-side index.js code
```js
const Koa = require('koa');
const Router = require('koa-router');   // routing
const fs = require('fs');               // needed for createReadStream below
const { green } = require('color7log'); // log tool

const app = new Koa();
const router = new Router();

// Default page
router.get('/', async (ctx, next) => {
  ctx.response.type = 'html';
  ctx.response.body = fs.createReadStream('./index.html');
});

app.use(router.routes());
app.listen(3000);
green('Service is running, port: 3000');
```
Run `node index.js` to start the service, and you can access the page. You will need an index.html file in the project root; its contents are up to you.
2. Core code: request an HTML page with Node's built-in http module.
Install the dependency modules yourself (`cheerio` and `iconv-lite`; `zlib` is built into Node).
```js
const http = require('http');
const fs = require('fs');
const zlib = require('zlib');           // decompress gzipped responses
const cheerio = require('cheerio');     // jQuery-like library; makes handling HTML easy on the Node back end
const iconv = require('iconv-lite');    // decode GBK-encoded pages
const { green } = require('color7log'); // log tool

const url = 'http://kaijiang.500.com/shtml/ssq/03001.shtml';

// Fetch the page source, then parse and output the results
http.get(url, function(res) {
  const arr = [];
  res.on('data', function(data) {
    arr.push(data);
  });
  res.on('end', function() {
    let chunks = Buffer.concat(arr);
    // This page is served gzipped; skip this step for plain-text responses
    chunks = zlib.unzipSync(chunks);
    // The page is GBK-encoded; convert it to readable characters
    const body = iconv.decode(chunks, 'gbk');
    const $ = cheerio.load(body); // initialize the DOM object

    // Collect the issue number of each draw for easy traversal later
    const content = $('.iSelectList a');
    const params = [];
    for (let i = 0; i < content.length; i++) {
      params.push($(content[i]).html());
    }

    // Print the winning numbers of this draw
    const nums = $('.ball_box01 li');
    for (let i = 0; i < nums.length; i++) {
      green($(nums[i]).html());
    }

    // Write the issue numbers to a file for later use
    fs.writeFile('./data.txt', params.join(','), function() {
      console.log('complete');
    });
  });
}).on('error', function() {
  console.log('Error acquiring data!');
});
```
You can tell whether a page is gzipped from the `Content-Encoding: gzip` response header, visible in the browser devtools Network panel, or on `res.headers['content-encoding']` in code.
The complete runnable code is available at: node crawler