Last week in the web crawler course, a practice was left: download multi-page topic content from Baidu Post Bar. What I accomplished was to crawl multi-page content from a post in the post bar, which was different from the topic asked by the teacher. Moreover, after the teacher commented, I found the gap between myself and the teacher's code in an instant. There were still many irregular and imprecise places in my code writing, and it did not reflect object-oriented. Thought, so, do this topic again, learn how the big man wrote.
Example: Download multi-page topic content from Baidu Post Bar
Let's know Baidu Tie first.( http://tieba.baidu.com/f? ) We define several functions:
- loadPage(url) for accessing web pages
- WrittePage (html, filename) is used to store acquired web pages as local files
- Tieba Crawler (url, beginpPage, endPage, keyword) is used for scheduling, providing page URLs to be crawled
- Main: the main control module of the program, completing the basic command line interface
import urllib.request import urllib.parse def loadPage(url): ''' Function: Fetching url and accessing the webpage content url: the wanted webpage url ''' headers = { 'Accept': 'text/html', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36', } print('To send HTTP request to %s ' % url) request = urllib.request.Request(url, headers=headers) return urllib.request.urlopen(request).read().decode('utf-8') def writePage(html, filename): ''' Function : To write the content of html into a local file html : The response content filename : the local filename to be used stored the response ''' print('To write html into a local file %s ...' % filename) with open(filename, 'w', encoding='utf-8') as f: f.write(str(html)) print('Work done!') print('---'*10) def tiebaCrawler(url, beginPage, endPage, keyword): ''' Function: The scheduler of crawler, is used to access every wanted url in turns url: the url of tieba's webpage beginPage: initial page endPage: end page keyword: the wanted keyword ''' for page in range(beginPage,endPage+1): pn = (page - 1) * 50 queryurl = url + '&pn=' + str(pn) filename = keyword + str(page) + '_tieba.html' writePage(loadPage(queryurl), filename) if __name__ == '__main__': kw = input('Please input the wanted tieba\'s name:') beginPage = int(input('The beginning page number:')) endPage = int(input('The ending page number:')) # Baidu Post Bar query url example: http://tieba.baidu.com/f?Ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&red_tag=m2239217474 url = 'http://tieba.baidu.com/f?ie=utf-8&' key = urllib.parse.urlencode({'kw':kw}) queryurl = url + key tiebaCrawler(queryurl, beginPage, endPage, kw)
Operation result: Please input the wanted tieba's name:Zhao Liying The beginning page number:1 The ending page number:3 To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=0 To write html into a local file Zhao Liying 1_tieba.html ... Work done! ------------------------------ To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=50 To write html into a local file Zhao Liying 2_tieba.html ... Work done! ------------------------------ To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=100 To write html into a local file Zhao Liying 3_tieba.html ... Work done! ------------------------------
Gain from experience
- Separate the functions as far as possible. Each function only does one thing, controls the input and output of the function, and embodies the object-oriented idea.
- Each function should be noted (as in the code above), and two things should be clarified. What does A. function do? b. What do the parameters of a function represent? Explain these clearly, not only can greatly increase the readability of the code, but also can urge themselves to standardize the code.