Python crawler example: download multi-page topic content from Baidu Post Bar

Keywords: Big Data IE Windows encoding

Last week's web crawler course left us an exercise: download multi-page topic content from Baidu Post Bar (Tieba). What I actually built crawled the multiple pages of a single post, which was not what the teacher had asked for. After the teacher's review, I immediately saw the gap between my code and the teacher's: mine was irregular and imprecise in many places and did not reflect object-oriented thinking. So I am doing the exercise again to learn how an expert writes it.

 

Example: Download multi-page topic content from Baidu Post Bar

Let's first get to know Baidu Tieba (http://tieba.baidu.com/f?). We define several functions:

  • loadPage(url): fetches a web page
  • writePage(html, filename): saves the fetched page as a local file
  • tiebaCrawler(url, beginPage, endPage, keyword): the scheduler; builds the URL of each page to be crawled
  • main: the main control module of the program, providing a basic command-line interface

import urllib.request
import urllib.parse

def loadPage(url):
    '''
        Function: fetch the url and return the page content as text
        url: the url of the wanted webpage
    '''
    headers = {
            'Accept': 'text/html',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    }
    print('To send HTTP request to %s ' % url)
    request = urllib.request.Request(url, headers=headers)
    
    return urllib.request.urlopen(request).read().decode('utf-8')

def writePage(html, filename):
    '''
        Function: write the html content into a local file
        html: the response content (already decoded text)
        filename: the name of the local file that stores the response
    '''
    print('To write html into a local file %s ...' % filename)
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)
    print('Work done!')
    print('---'*10)

def tiebaCrawler(url, beginPage, endPage, keyword):
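    '''
        Function: the scheduler of the crawler, visits every wanted url in turn
        url: the url of the tieba's first list page
        beginPage: first page to download
        endPage: last page to download (inclusive)
        keyword: the tieba's name, used as the filename prefix
    '''
    for page in range(beginPage, endPage + 1):
        # pn is the offset of the first post on the page; it grows by 50 per page
        pn = (page - 1) * 50
        queryurl = url + '&pn=' + str(pn)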
        filename = keyword + str(page) + '_tieba.html'
        writePage(loadPage(queryurl), filename)
        

if __name__ == '__main__':
    kw = input('Please input the wanted tieba\'s name:')
    beginPage = int(input('The beginning page number:'))
    endPage = int(input('The ending page number:'))
    
    # Baidu Post Bar query url example: http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&red_tag=m2239217474
    url = 'http://tieba.baidu.com/f?ie=utf-8&'
    key = urllib.parse.urlencode({'kw':kw})
    queryurl = url + key
    
    tiebaCrawler(queryurl, beginPage, endPage, kw)

Output:

Please input the wanted tieba's name:赵丽颖
The beginning page number:1
The ending page number:3
To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=0 
To write html into a local file 赵丽颖1_tieba.html ...
Work done!
------------------------------
To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=50 
To write html into a local file 赵丽颖2_tieba.html ...
Work done!
------------------------------
To send HTTP request to http://tieba.baidu.com/f?ie=utf-8&kw=%E8%B5%B5%E4%B8%BD%E9%A2%96&pn=100 
To write html into a local file 赵丽颖3_tieba.html ...
Work done!
------------------------------
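
The request URLs in the output above come from two pieces: urllib.parse.urlencode percent-encodes the tieba's name as UTF-8, and each page adds 50 to the pn offset. The following is a minimal sketch that only builds the URLs and sends no requests. The names build_page_urls and page_size are made up for illustration, and the step of 50 is taken from the code in this post, not from any Baidu documentation.

```python
import urllib.parse

BASE_URL = 'http://tieba.baidu.com/f?ie=utf-8&'


def build_page_urls(keyword, begin_page, end_page, page_size=50):
    """Return the list-page URLs for begin_page..end_page (inclusive).

    keyword: the tieba's name, percent-encoded by urlencode
    page_size: assumed pn step per page (50 in the code above)
    """
    query = urllib.parse.urlencode({'kw': keyword})
    urls = []
    for page in range(begin_page, end_page + 1):
        pn = (page - 1) * page_size  # offset of the first post on this page
        urls.append(BASE_URL + query + '&pn=' + str(pn))
    return urls


if __name__ == '__main__':
    # '赵丽颖' encodes to %E8%B5%B5%E4%B8%BD%E9%A2%96, as seen in the request URLs
    for url in build_page_urls('赵丽颖', 1, 3):
        print(url)
```

Keeping the URL construction in its own function makes it possible to check the paging logic without touching the network.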

Lessons learned

  1. Separate the functions as far as possible. Each function should do only one thing, with its input and output clearly controlled; this is how the code reflects the object-oriented idea.
  2. Every function should be documented (as in the code above), and two things should be made clear: (a) what the function does, and (b) what each of its parameters means. Explaining these clearly not only makes the code far more readable, it also pushes you to write code to a consistent standard.
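
One way to take the object-oriented idea in point 1 a step further is to group the three functions into a class, with a docstring on every method as point 2 suggests. This is a rough sketch, not the teacher's code: the class name TiebaCrawler, its method names, and the fetch and out_dir parameters are all assumptions made for illustration.

```python
import os
import urllib.parse
import urllib.request


class TiebaCrawler:
    """Download list pages of one tieba and save them as local files."""

    BASE_URL = 'http://tieba.baidu.com/f?ie=utf-8&'
    PAGE_SIZE = 50  # pn step per page, same as the function version

    def __init__(self, keyword, fetch=None, out_dir='.'):
        """
        keyword: the tieba's name
        fetch: optional callable(url) -> str used instead of a real
               HTTP request (hypothetical hook for offline testing)
        out_dir: directory where the html files are written
        """
        self.keyword = keyword
        self.fetch = fetch or self.load_page
        self.out_dir = out_dir

    def page_url(self, page):
        """Return the URL of the given page number (counted from 1)."""
        pn = (page - 1) * self.PAGE_SIZE
        return self.BASE_URL + urllib.parse.urlencode({'kw': self.keyword, 'pn': pn})

    @staticmethod
    def load_page(url):
        """Send the HTTP request and return the page decoded as UTF-8."""
        request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read().decode('utf-8')

    def write_page(self, html, filename):
        """Write html to filename inside out_dir and return the full path."""
        path = os.path.join(self.out_dir, filename)
        with open(path, 'w', encoding='utf-8') as f:
            f.write(html)
        return path

    def run(self, begin_page, end_page):
        """Fetch and save pages begin_page..end_page; return the saved paths."""
        saved = []
        for page in range(begin_page, end_page + 1):
            html = self.fetch(self.page_url(page))
            filename = self.keyword + str(page) + '_tieba.html'
            saved.append(self.write_page(html, filename))
        return saved
```

With this layout, TiebaCrawler('赵丽颖').run(1, 3) would play the role of the tiebaCrawler call in the main block, and passing a fake fetch function lets the scheduling and file naming be checked without sending any request.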

Posted by maciek4 on Thu, 31 Jan 2019 10:03:15 -0800