Crawler [1]: an introductory Python crawler example - crawling pictures

Keywords: Python Linux Android nexus

Preface:

Having learned the basics of Python, let's write a simple crawler that uses the urllib and re libraries.

How the crawler in this example works:

First, we use the urllib library to imitate a browser visiting a website: given a link (url), it fetches the source code (html) of the corresponding web page. The response arrives as bytes, which we decode into a string.
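
As a minimal sketch of this fetching step (separate from the article's own code), the helper below sends a browser-like User-Agent and decodes the response with the charset the server declares, falling back to UTF-8. The name fetch_html, the short User-Agent string and the 10-second timeout are assumptions made for illustration.

```python
import urllib.request


def fetch_html(url, user_agent="Mozilla/5.0", timeout=10):
    """Return the page at `url` as a str (hypothetical helper, not from the article)."""
    # Attach a User-Agent header so the request looks like it comes from a browser
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        raw = response.read()  # the body is returned as bytes
        # Use the charset from the Content-Type header if there is one, else assume UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
    # errors="replace" keeps going if a few bytes do not decode cleanly
    return raw.decode(charset, errors="replace")
```

Calling fetch_html(url) with the link of a page returns its html as one string, provided the site is reachable.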

Then we use the re (regular expression) library to match the substrings of that string (the web page source code) that are picture links, which gives us a list. Finally, we loop over the list and save each picture locally using its link.
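
To see what the matching step returns before touching a real page, here is a small sketch that runs the same pattern as the code further down against a made-up html fragment. The example.com links and the name sample_html are assumptions used only for illustration.

```python
import re

# Made-up html fragment (not from a real page)
sample_html = (
    '<img src="http://example.com/a.jpg">'
    '<img src="http://example.com/b.png">'
    '<a href="http://example.com/page.html">link</a>'
)

# Same pattern as in the article: group 1 is the whole link, group 2 is the extension
pattern = r'(http:[^\s]*?(jpg|png|gif))"'

# With two groups in the pattern, findall() returns a list of 2-tuples
matches = re.findall(pattern, sample_html)
for url, ext in matches:
    print(url, ext)
```

Only the two picture links are matched; the link to page.html is skipped because it does not end in jpg, png or gif.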

Note that the urllib library is used quite differently in Python 2.x and Python 3.x. This example uses Python 3.x.
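
One visible part of that difference is where the functions live: in Python 3.x, Request and urlopen are in urllib.request, while Python 2.x provided them in the urllib2 module. A script that has to run on both could import them as sketched below; the fallback pattern is only an illustration and is not needed for this example.

```python
import sys

try:
    # Python 3.x: Request and urlopen live in urllib.request
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2.x: the same names were provided by urllib2
    from urllib2 import Request, urlopen

print("Running on Python %d.%d" % sys.version_info[:2])
```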

 

Main text:

Here is the code:

'''
    A first simple crawler, using Python 3.x with the urllib and re libraries
'''
 
import urllib.request
import re
 
def getHtmlCode(url):  # Takes a url and returns the html source code of that page
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Mobile Safari/537.36'
    }
    request = urllib.request.Request(url, headers=headers)  # Request attaches the headers to the url so that the visit looks like it comes from a browser
    page = urllib.request.urlopen(request).read()  # Read the source code of the page as bytes
    page = page.decode('UTF-8')  # Decode the bytes into a str (assumes the page is UTF-8 encoded)
    return page
 
def getImg(page):  # Takes the html source code, finds the picture links in it, and saves the pictures to the local machine

    # re.findall(pattern, string) returns every substring of the page source that matches the regular expression
    # Because the pattern has two groups, findall() returns a list of tuples: the first element of each tuple is the url of the picture, and the second is its file extension
    # The list looks like this: [('http://www.zhangzishi.cc/732x120.gif', 'gif'), ('http://ww2.sinaimg.cn/qomayo.jpg', 'jpg')]
    imgList = re.findall(r'(http:[^\s]*?(jpg|png|gif))"', page)
    x = 0
    for imgUrl in imgList:  # Loop over the list
        print('Downloading:%s' % imgUrl[0])
        # urlretrieve(url, filename) downloads the url and saves it to the local file; the target directory must already exist
        # The matched extension is used in the file name so that png and gif pictures are not saved as .jpg
        urllib.request.urlretrieve(imgUrl[0], 'E:/pythonSpiderFile/img/%d.%s' % (x, imgUrl[1]))
        x += 1
 
if __name__ == '__main__':  
    url = 'http://www.zhangzishi.cc/20151004mt.html'
    page = getHtmlCode(url)
    getImg(page)

We used re (regular expressions) to find the urls of the images. A well-written pattern works very well, but a poorly written one misses links or matches the wrong text.
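
As a small illustration of how much the pattern matters, the sketch below compares the article's pattern with a slightly stricter variant on a made-up fragment. The variant (the name stricter, https support, a required dot before the extension, case-insensitive matching) is an assumption for illustration, not part of the original code.

```python
import re

# Made-up fragment: one https link with an upper-case extension, one plain http link
sample_html = (
    '<img src="https://example.com/a.JPG">'
    '<img src="http://example.com/b.png">'
)

# Pattern from the article: only matches links starting with "http:" and lower-case extensions
original = r'(http:[^\s]*?(jpg|png|gif))"'

# Hypothetical stricter variant: allows https, stops at a double quote,
# and requires a dot before the extension
stricter = r'(https?://[^\s"]+?\.(jpg|png|gif))"'

original_hits = re.findall(original, sample_html)
stricter_hits = re.findall(stricter, sample_html, flags=re.IGNORECASE)

print(len(original_hits), "link(s) found by the original pattern")
print(len(stricter_hits), "link(s) found by the stricter pattern")
```

On this fragment the original pattern finds only the http link, while the variant finds both.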

Since we already have the source code of the web page, why not get the content by the name of the tag instead?

The BeautifulSoup library, to be introduced next, can get the content of a tag according to the name of the tag.
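
BeautifulSoup itself is left for the next article. To show the idea of selecting by tag name with only the standard library, here is a rough sketch built on html.parser.HTMLParser. This is a stand-in for illustration, not BeautifulSoup's API, and the class name ImgSrcParser and the sample html are assumptions.

```python
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collect the src attribute of every img tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names;
        # attrs is a list of (name, value) pairs
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


# Made-up html fragment for illustration
sample_html = (
    '<p>text</p>'
    '<img src="http://example.com/a.jpg">'
    '<IMG SRC="http://example.com/b.png">'
)

parser = ImgSrcParser()
parser.feed(sample_html)
print(parser.sources)
```

Because the parser works on tags and attributes rather than raw text, it does not depend on the file extension or on how the link is quoted.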

Posted by fabiuz on Fri, 29 Nov 2019 12:24:11 -0800