Last year our community ran an activity called "Q&A with Seniors", which accumulated a lot of valuable interactive content before it was discontinued for various reasons. Today, after chatting with Tu Teng, I felt it was a waste to let all that information sit idle, so I tried using Python to crawl the content from Knowledge Planet (zsxq.com).
I picked up some new knowledge along the way, which I have written down as comments in the code. One problem remains unsolved: a question can have several comments under it, and for now I cannot output "one question + n comments" as a single unit; I can only dump the comments as a raw dictionary, which is not very clean. I'll see whether a better solution turns up later.
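For future reference, one possible direction is to flatten each comment dict into a readable string before writing it to the csv. This is only a sketch, not part of the working script below, and it assumes each entry in show_comments carries an owner name and a text field (key names guessed from the raw dictionary dump, so they may need adjusting):

def flatten_comments(show_comments):
    # Turn the list of comment dicts into one "name: text" string per comment.
    # Key names ('owner', 'name', 'text') are assumptions based on the raw dump.
    lines = []
    for c in show_comments or []:
        name = c.get('owner', {}).get('name', '')
        text = c.get('text', '')
        lines.append(f'{name}: {text}')
    return ' | '.join(lines)

# Inside get_info below, the comment column could then be written as:
# comment = flatten_comments(data.get('show_comments'))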
import requests
import json
import urllib.parse
import csv

# Header information. The site only offers QR-code login, with no account/password form,
# so I expected simulating a login to be troublesome. But once you find the Authorization
# value in the request headers, you can keep the logged-in state by sending it directly.
# Note: if you open the inner-page URL directly in the browser, it returns
# {"succeeded": false, "code": 401, "info": "", "resp_data": {}}, which is very similar to the
# not-logged-in error from the original Node.js data centre, where simulated login was also
# achieved by adding Authorization to the headers.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Referer': 'https://wx.zsxq.com/dweb/',
    'Authorization': '51EC09CA-6BCC-8847-4419-FA04A2FC9E00'
}

# Open the csv file and write the header row
f = open('/Users/damo/Desktop/wendatuan.csv', 'w+', encoding='utf-8', newline='')
writer = csv.writer(f)
writer.writerow(['created_time', 'ask_name', 'ask_content', 'comment'])

# Function that crawls one page of information
def get_info(url):
    res = requests.get(url, headers=headers)
    json_data = json.loads(res.text)
    datas = json_data['resp_data']['topics']
    for data in datas:
        if 'talk' in data.keys():  # check whether this topic contains the 'talk' key
            ask_name = data['talk']['owner']['name']
            ask_content = data['talk']['text']
        else:
            ask_name = ''
            ask_content = ''
        if 'show_comments' in data.keys():
            comment = data['show_comments']
        else:
            comment = ''
        created_time = data['create_time']
        writer.writerow([created_time, ask_name, ask_content, comment])
    # The code above is already enough to crawl one page. The rest handles
    # "how to crawl multiple pages automatically".
    # Multi-page crawling is driven by the Query String Parameters shown in the Network panel:
    # two parameters are involved. Observation shows that count is a fixed value, while end_time
    # equals the create_time of the last topic on the previous page; the page only applies a
    # urlencode transformation to it. That part was new knowledge for me.
    # So the core logic for building the URLs is: "the create_time of the last item in the
    # previous batch is exactly the end_time parameter of the next request".
    # Stop once a page returns fewer than a full batch of 20 topics, otherwise
    # the code below would fail on the last page.
    if len(datas) < 20:
        return
    end_time = datas[19]['create_time']
    url_encode = urllib.parse.quote(end_time)  # urlencode the timestamp so it is safe in the URL
    next_url = 'https://api.zsxq.com/v1.10/groups/518282858584/topics?count=20&end_time=' + url_encode  # next page's URL, worked out through observation
    get_info(next_url)  # calling the function inside itself is a neat way to keep the crawl going

if __name__ == '__main__':
    url = 'https://api.zsxq.com/v1.10/groups/518282858584/topics?count=20'
    get_info(url)
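The recursive call works, but Python caps recursion depth (around 1000 by default), so a group with a very large number of pages could eventually hit RecursionError. If that ever becomes a problem, the same pagination logic can be written as a plain loop. This is just a sketch under the same assumptions as the script above (same endpoint, headers, writer and field names), not a change to the original approach:

def crawl_all(start_url):
    # Iterative version of the pagination: keep requesting pages until one
    # returns fewer than a full batch of 20 topics.
    url = start_url
    while True:
        res = requests.get(url, headers=headers)
        datas = res.json()['resp_data']['topics']
        for data in datas:
            talk = data.get('talk', {})
            writer.writerow([
                data['create_time'],
                talk.get('owner', {}).get('name', ''),
                talk.get('text', ''),
                data.get('show_comments', ''),
            ])
        if len(datas) < 20:
            break  # last page reached
        # build the next URL from the create_time of the last topic on this page
        end_time = urllib.parse.quote(datas[-1]['create_time'])
        url = ('https://api.zsxq.com/v1.10/groups/518282858584/topics'
               '?count=20&end_time=' + end_time)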