Node.js crawler project

Keywords: SQL MySQL Session Database

News data from several websites has already been crawled. The next task is to organize and display that data. The specific requirements are as follows

The first step is to install the dependency packages by running npm install in the final project folder.
I ran into a problem here: the installation kept failing and progress was very slow. A search on Baidu showed that npm fetches the packages from a registry server overseas, so I guessed the cause was my home network. Sure enough, after connecting to the school VPN, npm install finished quickly (to be honest, the mobile network really is not good).

Next, connect to the MySQL database installed earlier and create two new tables: one for user information and one for the users' operation logs. The SQL is as follows

Specific code
-- Create user information data table
CREATE TABLE crawl.user (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    username VARCHAR(45) NOT NULL,
    password VARCHAR(45) NOT NULL,
    registertime datetime DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id),
    UNIQUE KEY username_UNIQUE (username))
ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Record the user's login and query operations (including the specific query statement)
CREATE TABLE crawl.user_action (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    username VARCHAR(45) NOT NULL,
    request_time VARCHAR(45) NOT NULL,
    request_method VARCHAR(20) NOT NULL,
    request_url VARCHAR(300) NOT NULL,
    status int(4),
    remote_addr VARCHAR(100) NOT NULL,
    PRIMARY KEY (id))
ENGINE=InnoDB DEFAULT CHARSET=utf8;
Then create a MySQL configuration file in the project folder.
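A minimal sketch of what such a configuration file could look like. The DAO code later in this post reads mysqlConf.mysql from the conf folder, so the settings are exported under a mysql key. The host, user, password and port values are placeholders and assumptions, not the original settings.

```javascript
// conf/mysqlConf.js (sketch; every value below is a placeholder assumption)
const mysqlConf = {
    mysql: {
        host: 'localhost',          // assumed: MySQL runs on the same machine
        user: 'root',               // assumed account name
        password: 'your_password',  // replace with your own password
        database: 'crawl',          // the schema used by the CREATE TABLE statements
        port: 3306                  // MySQL default port
    }
};

module.exports = mysqlConf;
```

The object under mysql is what gets handed to mysql.createPool in the DAO.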
Next, users must register and log in to use the website, while unregistered users are not allowed to log in and view the data. Login and registration errors should also produce appropriate prompts. At login: the user name or password is wrong, or the user does not exist. At registration: the two passwords are not the same, or the user already exists. A successful registration jumps to the login page.
First, the code of the login page.
angular.js must be imported first. After a successful login, the page jumps to news.html. Then comes the code for the registration page.
The rule that registered users can log in and unregistered users cannot view the data is implemented in JavaScript. Because there is little code, it is written directly in the HTML of the login page.
In the route for the login page, first call userDAO and then save the user in the session. Otherwise, when the user operation log is recorded, there is no way to know which user performed the operation.
The implementation of userDAO and the session setup mentioned above are as follows
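As an illustration of the idea only, here is a small sketch of a login handler that checks the user through a DAO and then stores the user name in the session. The getByUsername method, the in-memory user list and the message texts are hypothetical stand-ins; the real userDAO would query the crawl.user table.

```javascript
// Hypothetical stand-in for userDAO: looks a user up in an in-memory list.
// The real DAO would run a SELECT against the crawl.user table instead.
const fakeUserDAO = {
    users: [{ username: 'alice', password: 'secret' }],
    getByUsername(username, callback) {
        const user = this.users.find(u => u.username === username) || null;
        callback(null, user);
    }
};

// Login handler in the Express (request, response) style.
function login(userDAO, request, response) {
    const { username, password } = request.body;
    userDAO.getByUsername(username, function (err, user) {
        if (err) {
            response.json({ msg: 'Database error' });
        } else if (!user) {
            response.json({ msg: 'The user does not exist' });
        } else if (user.password !== password) {
            response.json({ msg: 'The user name or password is wrong' });
        } else {
            // Remember who is logged in, so that later requests can be
            // attributed to this user in the operation log.
            request.session['username'] = username;
            response.json({ msg: 'ok' });
        }
    });
}

// Usage with plain objects standing in for Express's request and response
const request = { body: { username: 'alice', password: 'secret' }, session: {} };
const response = { json(payload) { this.sent = payload; } };
login(fakeUserDAO, request, response);
console.log(request.session['username']); // alice
```

Passing the DAO in as an argument keeps the handler easy to try out without a database.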

Next is the implementation of registration and logout. Remember to clear the session on logout.
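A rough sketch of how registration and logout could be handled, under the same assumptions as before: memoryUserDAO, its exists and add methods, and the password2 field name are invented for the example, and /index.html is assumed to be the login page.

```javascript
// Hypothetical in-memory stand-in for userDAO; the real one would query crawl.user.
const registered = new Map();
const memoryUserDAO = {
    exists(username, callback) { callback(null, registered.has(username)); },
    add(username, password, callback) { registered.set(username, password); callback(null); }
};

// Registration: compare the two passwords, check for duplicates, then insert.
function register(userDAO, request, response) {
    const { username, password, password2 } = request.body;
    if (password !== password2) {
        return response.json({ msg: 'The two passwords are not the same' });
    }
    userDAO.exists(username, function (err, found) {
        if (err) return response.json({ msg: 'Database error' });
        if (found) return response.json({ msg: 'The user already exists' });
        userDAO.add(username, password, function (addErr) {
            if (addErr) return response.json({ msg: 'Database error' });
            // On success the front end can jump to the login page.
            response.json({ msg: 'Registration successful', url: '/index.html' });
        });
    });
}

// Logout: clear the session so the next request counts as not logged in.
function logout(request, response) {
    request.session['username'] = undefined;
    response.json({ message: 'url', result: '/index.html' });
}

// Usage with plain objects standing in for Express's request and response
const res = { json(payload) { this.sent = payload; } };
register(memoryUserDAO, { body: { username: 'alice', password: 'a', password2: 'b' } }, res);
console.log(res.sent.msg); // The two passwords are not the same
```

With express-session, calling request.session.destroy() is another way to clear the session on logout.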

Then implement the query function. First write the code of the query page, then include it in news.html.

<div class="row" style="margin-bottom: 10px;">
    <label class="col-lg-2 control-label">Title Key</label>
    <div class="col-lg-3">
        <input type="text" class="form-control" placeholder="Title Key" ng-model="$parent.title1">
    </div>
    <div class="col-lg-1">
        <select class="form-control" autocomplete="off" ng-model="$parent.selectTitle">
            <option selected="selected">AND</option>
            <option>OR</option>

        </select>
    </div>
    <div class="col-lg-3">
        <input type="text" class="form-control" placeholder="Title Key" ng-model="$parent.title2">
    </div>
</div>



<div class="row" style="margin-bottom: 10px;">
    <label class="col-lg-2 control-label">Content keywords</label>
    <div class="col-lg-3">
        <input type="text" class="form-control" placeholder="Content keywords" ng-model="$parent.content1">
    </div>
    <div class="col-lg-1">
        <select class="form-control" autocomplete="off" ng-model="$parent.selectContent">
            <option selected="selected">AND</option>
            <option>OR</option>
        </select>
    </div>
    <div class="col-lg-3">
        <input type="text" class="form-control" placeholder="Content keywords" ng-model="$parent.content2">
    </div>
</div>


<div class="form-group">
    <div class="col-md-offset-9">
        <button type="submit" class="btn btn-default" ng-click="search()">query</button>
    </div>
</div>
<table class="table table-striped">
    <thead>
        <tr>
            <td>No</td>
            <td>title</td>
            <td>author</td>
            <td>key word</td>
            <td>link</td>
            <td>Release time</td>
        </tr>

    </thead>
    <tbody>
    <tr ng-repeat="(key, item) in items">
        <td>{{index+key}}</td>
        <td>{{item.title}}</td>
        <td>{{item.author}}</td>
        <td>{{item.keywords}}</td>
        <td>{{item.url}}</td>
        <td>{{item.publish_date}}</td>
    </tr>

    </tbody>
</table>

<div class="row">
    <div class="pull-left" style="margin-top: 12px;">
        <button type="submit" class="btn btn-primary" ng-click="searchsortASC()" >Publish time ascending</button>
        <button type="submit" class="btn btn-primary" ng-click="searchsortDESC()">Release time descending</button>
    </div>
    <div class="pull-right">
        <nav>
            <ul class="pagination">
                <li>
                    <a ng-click="Previous()" role="button"><span role="button">previous page</span></a>
                </li>
                <li ng-repeat="page in pageList" ng-class="{active:isActivePage(page)}" role="button">
                    <a ng-click="selectPage(page)" >{{ page }}</a>
                </li>
                <li>
                    <a ng-click="Next()" role="button"><span role="button">next page</span></a>
                </li>
            </ul>
        </nav>
    </div>
</div>



Line 52 splices the query parameters onto the route and sends them to the back end with a GET request. The results are sorted by publish time, and the sort order is also passed as a parameter. The routing code for the query page is as follows
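To illustrate how the front end might assemble that GET request, here is a small sketch. The parameter names (t1, ts, t2, c1, cs, c2, stime) are the keys that newsDAO.search reads; the /news/search path and the buildSearchUrl helper are assumptions made for the example.

```javascript
// Build the GET URL for a search. The field names match the keys read by
// newsDAO.search; the '/news/search' path is an assumption.
function buildSearchUrl(form) {
    const keys = ['t1', 'ts', 't2', 'c1', 'cs', 'c2', 'stime'];
    const pairs = keys.map(function (key) {
        // Empty fields are sent as the literal string "undefined",
        // which is the value the back-end code compares against.
        const empty = form[key] === undefined || form[key] === '';
        const value = empty ? 'undefined' : form[key];
        return key + '=' + encodeURIComponent(value);
    });
    return '/news/search?' + pairs.join('&');
}

console.log(buildSearchUrl({ t1: 'virus', ts: 'AND', t2: 'vaccine', stime: '1' }));
// /news/search?t1=virus&ts=AND&t2=vaccine&c1=undefined&cs=undefined&c2=undefined&stime=1
```

encodeURIComponent keeps spaces and non-ASCII keywords from breaking the query string.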

The newsDAO.search function implements the query. The search terms support Boolean expressions (AND / OR), and most of the work is assembling the SQL string.

var mysql = require('mysql');
var mysqlConf = require('../conf/mysqlConf');
var pool = mysql.createPool(mysqlConf.mysql);

module.exports = {
query_noparam: function(sql, callback) {
    pool.getConnection(function(err, conn) {
        if (err) {
            callback(err, null, null);
        } else {
            conn.query(sql, function(qerr, vals, fields) {
                conn.release(); // Release connection
                callback(qerr, vals, fields); // Event driven callback
            });
        }
    });
},
search: function(searchparam, callback) {
    // Combine the query criteria
    var sql = 'select * from fetches ';

    if(searchparam["t2"]!="undefined"){
        sql +=(`where title like '%${searchparam["t1"]}%' ${searchparam['ts']} title like '%${searchparam["t2"]}%' `);
    }else if(searchparam["t1"]!="undefined"){
        sql +=(`where title like '%${searchparam["t1"]}%' `);
    };

    if(searchparam["t1"]=="undefined"&&searchparam["t2"]=="undefined"&&searchparam["c1"]!="undefined"){
        sql+='where ';
    }else if(searchparam["t1"]!="undefined"&&searchparam["c1"]!="undefined"){
        sql+='and ';
    }

    if(searchparam["c2"]!="undefined"){
        sql +=(`content like '%${searchparam["c1"]}%' ${searchparam['cs']} content like '%${searchparam["c2"]}%' `);
    }else if(searchparam["c1"]!="undefined"){
        sql +=(`content like '%${searchparam["c1"]}%' `);
    }

    if(searchparam['stime']!="undefined"){
        if(searchparam['stime']=="1"){
            sql+='ORDER BY publish_date ASC ';
        }else {
            sql+='ORDER BY publish_date DESC ';
        }
    }

    sql+=';';
    pool.getConnection(function(err, conn) {
        if (err) {
            callback(err, null, null);
        } else {
            conn.query(sql, function(qerr, vals, fields) {
                conn.release(); //Release connection
                callback(qerr, vals, fields); //Event driven callback
            });
        }
    });
}
};
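The search function above splices user input straight into the SQL string. As a sketch of an alternative (not the original code), the same Boolean conditions can be assembled with ? placeholders, which the mysql package fills in when the values are passed as the second argument to conn.query. The buildSearchSql helper is made up for this example, and its behavior differs slightly from the original: each OR group is wrapped in parentheses, and a lone second keyword is treated as a single condition.

```javascript
// Build the search SQL with placeholders instead of string interpolation.
// Missing fields arrive as the string "undefined", as in the original code.
function buildSearchSql(p) {
    const has = key => p[key] !== undefined && p[key] !== 'undefined';
    // Only AND / OR are accepted as connectives; anything else falls back to AND.
    const op = value => (value === 'OR' ? 'OR' : 'AND');
    const clauses = [];
    const values = [];

    function addField(column, k1, k2, kop) {
        if (has(k1) && has(k2)) {
            clauses.push(`(${column} LIKE ? ${op(p[kop])} ${column} LIKE ?)`);
            values.push(`%${p[k1]}%`, `%${p[k2]}%`);
        } else if (has(k1) || has(k2)) {
            clauses.push(`${column} LIKE ?`);
            values.push(`%${has(k1) ? p[k1] : p[k2]}%`);
        }
    }
    addField('title', 't1', 't2', 'ts');
    addField('content', 'c1', 'c2', 'cs');

    let sql = 'SELECT * FROM fetches';
    if (clauses.length > 0) sql += ' WHERE ' + clauses.join(' AND ');
    if (has('stime')) sql += ' ORDER BY publish_date ' + (p.stime === '1' ? 'ASC' : 'DESC');
    return { sql, values };
}

const built = buildSearchSql({ t1: 'virus', ts: 'OR', t2: 'vaccine', c1: 'china', c2: 'undefined', stime: '1' });
console.log(built.sql);
// SELECT * FROM fetches WHERE (title LIKE ? OR title LIKE ?) AND content LIKE ? ORDER BY publish_date ASC
console.log(built.values); // [ '%virus%', '%vaccine%', '%china%' ]
```

The result would then be run as conn.query(built.sql, built.values, callback), so the keywords never become part of the SQL text.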

Display of query results

In line 47, ng-show hides the query results before the chart is displayed when the chart button is clicked. It also hides the chart before the query results are displayed when the chart is clicked first and a query is run afterwards.
When the list of crawled data is too long for one page, the content needs to be paginated. AngularJS pagination is used here, and no back-end cooperation is needed: the front end fetches all the data at once and then displays it page by page. The disadvantage is that page loading is relatively slow when the amount of data is large, but the interface is more user-friendly. Here is the code that implements paging.

During initialization, the content of the first page is displayed and the total number of pages is calculated (line 75). pageList is an array with a maximum length of 5, which means the pager in the lower right corner of the screenshot shows at most 5 page numbers.
When another page is selected, the page numbers shown in the lower right corner change accordingly, still limited to the maximum number displayed. The specific code is as follows.
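The paging logic described above can be sketched with two small functions. This is an illustration under assumptions, not the original listing: the page size of 5 items, the helper names, and the choice to centre the page window on the current page are all made up for the example.

```javascript
// Slice out the items shown on one page (pages are numbered from 1).
function pageItems(allItems, page, pageSize) {
    const start = (page - 1) * pageSize;
    return allItems.slice(start, start + pageSize);
}

// Page numbers to show in the pager: at most maxButtons entries,
// centred on the current page where possible.
function pageWindow(currentPage, totalPages, maxButtons) {
    const count = Math.min(maxButtons, totalPages);
    let first = currentPage - Math.floor(count / 2);
    first = Math.max(1, Math.min(first, totalPages - count + 1));
    const list = [];
    for (let i = 0; i < count; i++) list.push(first + i);
    return list;
}

const items = Array.from({ length: 23 }, (_, i) => 'news ' + (i + 1));
const pageSize = 5;
const totalPages = Math.ceil(items.length / pageSize); // 5
console.log(pageWindow(1, 12, 5));  // [ 1, 2, 3, 4, 5 ]
console.log(pageWindow(7, 12, 5));  // [ 5, 6, 7, 8, 9 ]
console.log(pageWindow(12, 12, 5)); // [ 8, 9, 10, 11, 12 ]
console.log(pageItems(items, 5, pageSize)); // [ 'news 21', 'news 22', 'news 23' ]
```

In the AngularJS controller, the two results would be assigned to $scope.items and $scope.pageList each time a page is selected.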


The next step is to add data analysis charts, taking the code of the histogram as an example.
Front-end code

$scope.histogram = function () {
    $scope.isShow = false;
    $http.get("/news/histogram")
        .then(
            function (res) {

                if(res.data.message=='url'){
                    window.location.href=res.data.result;
                }else {

                    // var newdata = washdata(data);
                    let xdata = [], ydata = [], newdata;

                    var pattern = /\d{4}-(\d{2}-\d{2})/;
                    res.data.result.forEach(function (element) {
                        // "x":"2020-04-28T16:00:00.000Z", process x, take only the month and day
                        xdata.push(pattern.exec(element["x"])[1]);
                        ydata.push(element["y"]);
                    });
                    newdata = {"xdata": xdata, "ydata": ydata};

                    var myChart = echarts.init(document.getElementById('main1'));

                    // Specify configuration items and data for the chart
                    var option = {
                        title: {
                            text: 'Press releases over time'
                        },
                        tooltip: {},
                        legend: {
                            data: ['Number of press releases']
                        },
                        xAxis: {
                            data: newdata["xdata"]
                        },

                        yAxis: {},
                        series: [{
                            name: 'Number of news',
                            type: 'bar',
                            data: newdata["ydata"]
                        }]
                    };
                    // Use the configuration items and data you just specified to display the chart.
                    myChart.setOption(option);
                }
            },
            function (err) {
                $scope.msg = err.data;
            });

};
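The data-cleaning step inside that callback can be pulled out into a standalone function, which makes the regular expression easier to follow. This is a sketch: the toChartData name and the sample rows are made up, and unlike the code above it skips rows whose date does not match instead of failing on them.

```javascript
// Turn rows like { x: "2020-04-28T16:00:00.000Z", y: 12 } into the two arrays
// ECharts needs: month-day labels for the x axis and counts for the y axis.
function toChartData(rows) {
    const pattern = /\d{4}-(\d{2}-\d{2})/;
    const xdata = [];
    const ydata = [];
    rows.forEach(function (row) {
        const match = pattern.exec(row.x);
        // Skip rows whose date does not match, rather than throwing on null.
        if (match) {
            xdata.push(match[1]);
            ydata.push(row.y);
        }
    });
    return { xdata, ydata };
}

// Made-up sample rows in the shape returned by the /news/histogram route
const sample = [
    { x: '2020-04-28T16:00:00.000Z', y: 12 },
    { x: '2020-04-29T16:00:00.000Z', y: 7 },
    { x: null, y: 3 }
];
console.log(toChartData(sample));
// { xdata: [ '04-28', '04-29' ], ydata: [ 12, 7 ] }
```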

Routing code

router.get('/histogram', function(request, response) {
    console.log(request.session['username']);

    // Users who are not logged in are sent back to the login page
    if (request.session['username'] === undefined) {
        // response.redirect('/index.html')
        response.json({message:'url',result:'/index.html'});
    } else {
        // SQL string: count the news items for each publish date
        var fetchSql = "select publish_date as x,count(publish_date) as y from fetches group by publish_date order by publish_date;";
        newsDAO.query_noparam(fetchSql, function (err, result, fields) {
            response.writeHead(200, {
                "Content-Type": "application/json",
                "Cache-Control": "no-cache, no-store, must-revalidate",
                "Pragma": "no-cache",
                "Expires": 0
            });
            response.write(JSON.stringify({message:'data',result:result}));
            response.end();
        });
    }
});
Finally, user registration, login, query and other operations are recorded in the database log. This is done directly in app.js, using var logger = require('morgan');
The information is saved by middleware.
The saved operation log can be queried in the MySQL database by entering select * from user_action
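To show what saving a log row from middleware involves, here is a hand-written sketch rather than the morgan setup used in the project. The column list comes from the user_action table created earlier. makeActionLogger, the runQuery function and the 'anonymous' fallback name are assumptions for the example; runQuery would forward the statement to the MySQL pool.

```javascript
// Returns Express-style middleware that writes one row to crawl.user_action
// per request. runQuery(sql, values) is a hypothetical function that would
// forward the statement to the MySQL pool.
function makeActionLogger(runQuery) {
    const sql = 'INSERT INTO crawl.user_action ' +
        '(username, request_time, request_method, request_url, status, remote_addr) ' +
        'VALUES (?, ?, ?, ?, ?, ?)';
    return function (request, response, next) {
        // Wait for "finish" so that the final status code is known.
        response.on('finish', function () {
            const session = request.session || {};
            runQuery(sql, [
                session['username'] || 'anonymous', // username is NOT NULL
                new Date().toISOString(),
                request.method,
                request.originalUrl,
                response.statusCode,
                request.ip
            ]);
        });
        next();
    };
}

// Usage with stand-ins for the request, the response and the database
const logged = [];
const actionLogger = makeActionLogger((sqlText, values) => logged.push(values));
const fakeRequest = { session: { username: 'alice' }, method: 'GET', originalUrl: '/news/histogram', ip: '127.0.0.1' };
const fakeResponse = {
    statusCode: 200,
    handlers: {},
    on(event, handler) { this.handlers[event] = handler; },
    finish() { this.handlers['finish'](); }
};
actionLogger(fakeRequest, fakeResponse, function next() {});
fakeResponse.finish();
console.log(logged[0][0], logged[0][2], logged[0][3], logged[0][4]); // alice GET /news/histogram 200
```

In a real app the middleware would be registered with app.use after the session middleware, so that request.session is already populated.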
Full demo
In the final_project folder, run node bin/www from cmd.
Open http://localhost:3000/ and register.
The results of the search are as follows

Data analysis chart

Posted by moonman89 on Sun, 28 Jun 2020 20:05:01 -0700