Sogou WeChat Search offers two kinds of keyword search: one searches the content of official account articles, the other searches WeChat official accounts directly. Searching for an official account returns its basic information and its 10 most recently published articles. Today we will crawl the account information of WeChat official accounts.
Crawler
Starting from the home page, you can crawl by category. The URL pattern of the paginated listing pages can be found by clicking "View more":
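    import re
    import requests as req

    def getUTF8(response):
        # Helper not shown in the original; assumed to decode the body as UTF-8.
        response.encoding = 'utf-8'
        return response.text

    # Each category link on the home page carries an id such as pc_0, pc_1, ...
    reTypes = r'id="pc_\d*" uigs="(pc_\d*)">([\s\S]*?)</a>'
    Entry = "http://weixin.sogou.com/"
    entryPage = req.get(Entry)
    allTypes = re.findall(reTypes, getUTF8(entryPage))

    for (pcid, category) in allTypes:
        for page in range(1, 100):
            url = 'http://weixin.sogou.com/pcindex/pc/{}/{}.html'.format(pcid, page)
            print(url)
            categoryList = req.get(url)
            # A non-200 response means there are no more pages in this category.
            if categoryList.status_code != 200:
                break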
The code above fetches the listing pages that "View more" loads for each category. The next step extracts the links to the official account detail pages from each listing page:
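    # Intended to run inside the page loop above, once per listing page.
    reProfile = r'<li id[\s\S]*?<a href="([\s\S]*?)"'
    allProfiles = re.findall(reProfile, getUTF8(categoryList))
    for profile in allProfiles:
        profilePage = req.get(profile)
        # Skip detail pages that could not be fetched.
        if profilePage.status_code != 200:
            continue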
The detail page provides the official account's name, ID, description, account owner, avatar, QR code, and its 10 most recent articles.
Things to Note
Detail page link: http://mp.weixin.qq.com/profile?src=3&timestamp=1477208282&ver=1&signature=8rYJ4QV2w5FXSOy6vGn37s UdcSLa8uoyHv3Ft7CrhZhB4wO-bbWG94aUCNexyB7lqNSua-2MROwk835g==
1. CAPTCHA
A CAPTCHA may be required when accessing detail pages. Recognizing CAPTCHAs automatically is still very difficult, so the crawler should be disguised well as a normal browser.
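As a rough illustration of what disguising the crawler can mean, the sketch below rotates the User-Agent header and pauses for a random interval between requests. The names USER_AGENTS, build_headers and polite_delay, the header values, and the delay range are all assumptions made for this example, and nothing here guarantees that the CAPTCHA will not appear.

```python
import random
import time

# Hypothetical pool of browser User-Agent strings (assumption, not from the
# original article); replace them with strings from current browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 "
    "(KHTML, like Gecko) Version/10.0.1 Safari/602.2.14",
]


def build_headers(referer="http://weixin.sogou.com/"):
    """Return browser-like request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": referer,
        "Accept-Language": "zh-CN,zh;q=0.8,en;q=0.6",
    }


def polite_delay(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random time in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

With the requests import used earlier, a call could then look like req.get(url, headers=build_headers()), followed by polite_delay() before the next request.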
2. Do not save detail page links
The detail page link contains two important parameters, timestamp and signature, which indicates that the link is time-limited, so saving it for later is probably useless.
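To make the timestamp parameter concrete, here is a minimal sketch that reads it from a detail page link and computes how old the link is. The helper names link_timestamp and link_age_seconds are hypothetical, the sample URL uses a placeholder signature, and how long a link actually stays valid is not known from this article.

```python
import time
from urllib.parse import urlparse, parse_qs


def link_timestamp(profile_url):
    """Extract the integer `timestamp` query parameter from a detail page link."""
    query = parse_qs(urlparse(profile_url).query)
    return int(query["timestamp"][0])


def link_age_seconds(profile_url, now=None):
    """Return how many seconds have passed since the link's timestamp."""
    if now is None:
        now = time.time()
    return now - link_timestamp(profile_url)


# Sample link with the timestamp from the article and a placeholder signature.
sample = ("http://mp.weixin.qq.com/profile"
          "?src=3&timestamp=1477208282&ver=1&signature=PLACEHOLDER")
```

Since the validity window is unknown, the safer design is to re-crawl the listing page and fetch each detail page immediately, and to store the extracted account data rather than the link.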
3. QR code
QR code image links are also time-limited, so download the image right away if you need it.
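One possible way to keep the QR code is to write the image bytes to disk as soon as they are fetched. In the sketch below, save_qr_code, the qrcodes directory, and the .jpg extension are assumptions for illustration; the actual image format is not stated in this article.

```python
from pathlib import Path


def save_qr_code(image_bytes, account_id, directory="qrcodes"):
    """Write image bytes to <directory>/<account_id>.jpg and return the path."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    # The .jpg extension is an assumption; adjust it to the real image type.
    target = target_dir / "{}.jpg".format(account_id)
    target.write_bytes(image_bytes)
    return target
```

With the requests import used earlier, the bytes could come from req.get(qr_url).content, where qr_url stands for the image link found on the detail page.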
Display the results with Flask
Recently an asynchronous, Flask-like framework has emerged in the Python community: Sanic. Built on uvloop and httptools, it is asynchronous and faster while keeping a concise syntax consistent with Flask. The project has only just started and many basic features are still missing, but it has already received a lot of attention (2,222 stars). I originally planned to use it to build a simple interactive application on top of the crawled WeChat official account information, but it has no template support and no asynchronous redis driver, and I ran into bugs, so after a brief attempt I switched back to Flask + SQLite. For now I only present the crawl results; I will update this when I get the chance.
Install Sanic
Debug Sanic
Flask + SQLite App
    from flask import g, Flask, render_template
    import sqlite3

    app = Flask(__name__)
    DATABASE = "./db/wx.db"

    def get_db():
        # Reuse one SQLite connection per application context.
        db = getattr(g, '_database', None)
        if db is None:
            db = g._database = sqlite3.connect(DATABASE)
        return db

    @app.teardown_appcontext
    def close_connection(exception):
        # Close the connection when the application context ends.
        db = getattr(g, '_database', None)
        if db is not None:
            db.close()

    @app.route("/<int:page>")
    @app.route("/")
    def hello(page=0):
        # Show 30 accounts per page; page numbers start at 0.
        cur = get_db().cursor()
        cur.execute("SELECT * FROM wxoa LIMIT 30 OFFSET ?", (page*30, ))
        rows = cur.fetchall()
        return render_template("app.html", wx=rows, cp=page)

    if __name__ == "__main__":
        app.run(debug=True, port=8000)
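The LIMIT 30 OFFSET ? pagination used in the route above can be tried in isolation with an in-memory SQLite database. This is only a sketch: the two-column wxoa schema (id, name), the helper names, and the added ORDER BY id (there to make page order deterministic) are assumptions, since the article does not show the real table definition.

```python
import sqlite3

PAGE_SIZE = 30  # same page size as the query in the Flask route


def fetch_page(conn, page):
    """Return one page of rows from wxoa; page numbers start at 0."""
    cur = conn.cursor()
    cur.execute(
        "SELECT * FROM wxoa ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE, page * PAGE_SIZE),
    )
    return cur.fetchall()


def make_demo_db(n_rows=65):
    """Build an in-memory database with a hypothetical wxoa schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wxoa (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO wxoa (id, name) VALUES (?, ?)",
        [(i, "account-{}".format(i)) for i in range(1, n_rows + 1)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    for page in range(4):
        print(page, len(fetch_page(demo, page)))
```

A page past the end of the table simply returns an empty list, which a template can use to decide whether to show a "next page" link.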