Preface
Recently, while working on support tooling for monitoring, I noticed that many of the scripts involved are written in Python. I had long heard the community saying "Life is short, I use Python," and it is no joke. With the rise of artificial intelligence, machine learning, and deep learning, most AI code today is written in Python. So in the age of artificial intelligence, it's time to learn Python.
Basic environment configuration
- Python 3
- PyCharm
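Before writing any code, it helps to confirm the interpreter and the third-party packages the script depends on are in place. A minimal check, assuming `python3` and `pip3` are on your PATH:

```shell
# Confirm a Python 3 interpreter is available
python3 --version

# Install the third-party packages the script below relies on
pip3 install requests beautifulsoup4 lxml
```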
Implementation steps
Taking a model's photo gallery as an example, the crawl is simple and breaks down into the following four steps:
- Get the page listing on the home page and create a folder for each entry
- Get the address of each column (album) on the page
- Enter the column and get its page count (each column contains multiple pictures, displayed across pages)
- Get the picture under the corresponding tag in the column and download it
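To make the second step concrete, here is a minimal, self-contained sketch of collecting the column links from a listing page, using only the standard library's `html.parser`. The real script below uses BeautifulSoup for the same job, and the sample HTML here is made up for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect every href from <a> tags fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Made-up markup mimicking the album listing structure
sample = '<div class="all"><a href="/album/1">one</a><a href="/album/2">two</a></div>'
parser = LinkCollector()
parser.feed(sample)
print(parser.links)  # ['/album/1', '/album/2']
```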
Points to note
During crawling, pay attention to the following points; they may help you:
1) Importing libraries is similar to using frameworks or utility classes in Java: the low-level details are already encapsulated
2) Define functions for each step; a crawler can run to hundreds of lines, so don't write it as one block
3) Define global variables for shared state
4) Handle anti-hotlinking (the Referer header)
5) Rotate browser versions (the User-Agent header)
6) Catch exceptions
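Points 4 and 5 deserve a small example: the image server rejects requests whose Referer does not match one of its own pages, and rotating the User-Agent makes requests look like they come from different browsers. This is an illustrative sketch only; the `build_headers` helper and the UA strings are my own, not part of the original script.

```python
import random

# Example User-Agent strings (arbitrary; any realistic browser UAs work)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def build_headers(referer):
    """Build request headers with a random User-Agent and the page as Referer."""
    return {"User-Agent": random.choice(USER_AGENTS), "Referer": referer}

headers = build_headers("http://www.mzitu.com/12345/6")
print(sorted(headers))  # ['Referer', 'User-Agent']
```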
Code implementation
```python
import os
import random

import requests
from bs4 import BeautifulSoup

# Global base directory for downloads; adjust to suit your machine
BASE_DIR = "C:dmzitu"


class Mzitu:

    def all_url(self, url):
        """Entry point: walk every album linked from the archive page."""
        resp = self.request(url)
        all_a = BeautifulSoup(resp.text, 'lxml').find('div', class_='all').find_all('a')
        for a in all_a:
            title = a.get_text()
            print('Start saving:', title)
            path = str(title).replace("?", '_')
            if not self.mkdir(path):  # skip folders that already exist
                print('Skipped:', title)
                continue
            self.html(a['href'])

    def html(self, href):
        """Read how many pages an album has, then visit each page."""
        resp = self.request(href)
        max_span = (BeautifulSoup(resp.text, 'lxml')
                    .find('div', class_='pagenavi')
                    .find_all('span')[-2].get_text())
        for page in range(1, int(max_span) + 1):
            self.img(href + '/' + str(page))

    def img(self, page_url):
        """Extract the image URL from one album page and save it."""
        resp = self.request(page_url)
        img_url = (BeautifulSoup(resp.text, 'lxml')
                   .find('div', class_='main-image')
                   .find('img')['src'])
        self.save(img_url, page_url)

    def save(self, img_url, page_url):
        """Download a single image; catch the exception and move on if it fails."""
        name = img_url[-9:-4]
        try:
            img = self.requestpic(img_url, page_url)
            with open(name + '.jpg', 'ab') as f:
                f.write(img.content)
        except FileNotFoundError:
            print('Picture does not exist, skipped:', img_url)
            return False

    def mkdir(self, path):
        """Create a folder for an album and switch into it."""
        path = path.strip()
        full_path = os.path.join(BASE_DIR, path)
        if os.path.exists(full_path):
            print('A folder named', path, 'already exists!')
            return False
        print('Created a folder named', path)
        os.makedirs(full_path)
        os.chdir(full_path)  # switch into the new directory
        return True

    def requestpic(self, url, referer):
        """Fetch an image with a random User-Agent and the album page as Referer."""
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        ]
        # The Referer header is the key anti-hotlink parameter for image requests
        headers = {'User-Agent': random.choice(user_agent_list), 'Referer': referer}
        return requests.get(url, headers=headers)

    def request(self, url):
        """Fetch a page and return the response."""
        headers = {'User-Agent': ("Mozilla/5.0 (Windows NT 6.1; WOW64) "
                                  "AppleWebKit/537.1 (KHTML, like Gecko) "
                                  "Chrome/22.0.1207.1 Safari/537.1")}
        return requests.get(url, headers=headers)


mzitu = Mzitu()  # instantiate the crawler
# The URL passed to all_url is the entry point of the whole crawl
mzitu.all_url('http://www.mzitu.com/all')
print('Congratulations, the download is finished!')
```
Now feast your eyes: here is a sample of the results.
Summary
In fact, the script is quite simple. From configuring the environment and installing the IDE to writing the script and getting it to run end to end took about four or five hours, and the final run is fully hands-off. Limited by server bandwidth and machine configuration, downloading the first 17 GB of images took another three or four hours; the remaining 83 GB is left for you to download yourself.