I. Introduction
I have been learning Python for a while, and I had heard how powerful Python crawlers can be. Having just gotten this far, I followed Little Turtle's Python video course and wrote a crawler program that can download simple web pictures.
II. Code
__author__ = "JentZhang"

import os
import random
import re
import urllib.error
import urllib.request


def url_open(url):
    '''
    Open a web page.
    :param url: page address
    :return: raw response bytes
    '''
    req = urllib.request.Request(url)
    req.add_header('User-Agent',
                   'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36')
    # Optional proxy support
    '''
    proxyies = ["111.155.116.237:8123", "101.236.23.202:8866", "122.114.31.177:808"]
    proxy = random.choice(proxyies)
    proxy_support = urllib.request.ProxyHandler({"http": proxy})
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)
    '''
    # Open the Request object (not the bare URL) so the User-Agent header is actually sent
    response = urllib.request.urlopen(req)
    html = response.read()
    return html


def save_img(folder, img_addrs):
    '''
    Save pictures.
    :param folder: folder to save into
    :param img_addrs: list of picture addresses
    :return:
    '''
    # Create the folder for the pictures
    if not os.path.exists(folder):
        os.mkdir(folder)
    for each in img_addrs:
        filename = each.split('/')[-1]
        try:
            with open(os.path.join(folder, filename), 'wb') as f:
                img = url_open("http:" + each)
                f.write(img)
        except urllib.error.HTTPError as e:
            # print(e.reason)
            pass
    print('Complete!')


def find_imgs(url):
    '''
    Get all picture links on a page.
    :param url: page address
    :return: list of picture addresses
    '''
    html = url_open(url).decode("utf-8")
    img_addrs = re.findall(r'src="(.+?\.gif)', html)
    return img_addrs


def get_page(url):
    '''
    Find the current (latest) page number.
    :param url: site address
    :return: page number as a string
    '''
    html = url_open(url).decode('utf-8')
    a = html.find("current-comment-page") + 23
    b = html.find("]</span>", a)
    return html[a:b]


def download_mm(url="http://jandan.net/ooxx/", folder="OOXX", pages=1):
    '''
    Main program (download pictures).
    :param folder: default folder
    :param pages: number of pages to download
    :return:
    '''
    page_num = int(get_page(url))
    for i in range(pages):
        # Step back one page at a time from the latest page
        page_url = url + "page-" + str(page_num - i) + "#comments"
        img_addrs = find_imgs(page_url)
        save_img(folder, img_addrs)


if __name__ == "__main__":
    download_mm()
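Before running download_mm() in full, it can help to try the two helpers on a single page first. This is only a minimal sketch; it assumes the functions above are already defined and that jandan.net still serves the "current-comment-page" marker and the src="..." gif links the regex expects.

# Quick sanity check of the helpers above (run after defining them)
latest = get_page("http://jandan.net/ooxx/")      # latest page number as a string
print("latest page:", latest)
addrs = find_imgs("http://jandan.net/ooxx/page-" + latest + "#comments")
print("found %d gif links" % len(addrs))
for a in addrs[:5]:
    print(a)                                      # protocol-relative links, downloaded as "http:" + a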
III. Summary
Because the URL the code accesses is now protected by an anti-crawler mechanism, I can't get the pictures I actually wanted, so I'm just keeping this crawler as a note. It's only for learning and reference.
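One thing left commented out in url_open() is the proxy rotation. If the site is blocking by IP, that block can be enabled like this; a minimal sketch, where the proxy addresses are just the examples from the code above and have almost certainly expired by now.

import random
import urllib.request

# Route all later requests through a randomly chosen HTTP proxy
proxyies = ["111.155.116.237:8123", "101.236.23.202:8866", "122.114.31.177:808"]  # example addresses only
proxy = random.choice(proxyies)
proxy_support = urllib.request.ProxyHandler({"http": proxy})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)  # every later urllib.request.urlopen() call now goes through the proxy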
Finally: by changing the jpg in the regular expression to gif, I was still able to crawl a few poor-quality gif images:
The first one is just the placeholder picture returned by the anti-crawler mechanism, with nothing in it at all.
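For reference, switching between jpg and gif only means changing the pattern in find_imgs(); everything else stays the same. A minimal sketch of the two variants:

# In find_imgs(): pick one of the two patterns
img_addrs = re.findall(r'src="(.+?\.jpg)', html)   # jpg pictures (blocked for me by the anti-crawler mechanism)
img_addrs = re.findall(r'src="(.+?\.gif)', html)   # gif pictures (what the script above collects)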