Web Crawlers
Author: ychh_
Preliminaries
Crawler classification
- Universal crawler:
    - An important component of a search engine's page-grabbing system
- Focused crawler:
    - Built on top of the universal crawler
    - Captures only a specified portion of a page's content
- Incremental crawler:
    - Detects updates to the data on a website and crawls the newly updated data
Anti-crawling mechanisms
- Portal websites can prevent crawlers from scraping their data by deploying corresponding policies
- Anti-anti-crawling strategy: the crawler cracks the site's anti-crawling policies in order to obtain the data anyway
Related protocols
- robots.txt protocol:
    - A gentlemen's agreement that states which data on a site may be crawled and which may not
- http protocol:
    - The common communication protocol between client and server
    - Common request headers:
        - User-Agent: identifies the request carrier
        - Connection: whether to close the connection or keep it alive after the request completes
    - Common response headers:
        - Content-Type: the type of the data the server returns to the client
- https protocol:
    - Secure Hypertext Transfer Protocol
    - Encryption methods:
        - Symmetric key encryption:
            - Both the ciphertext and the key are sent from the client to the server
            - Defect: the key and the ciphertext can both be intercepted by a man in the middle
        - Asymmetric key encryption:
            - The ciphertext is sent from the client to the server
            - The key is sent from the server to the client
            - Defect: there is no guarantee that the key the client receives was really provided by the server
        - Certificate key encryption (https):
            - A third-party certification authority is used to authenticate the key and prevent forgery
requests module
Function of requests
- Simulates a browser sending a request
- Response return types:
    - text: the response body as a string (page source)
    - json(): the response body parsed into a json object
    - content: the raw binary body (e.g. for images)
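A minimal sketch of the three return types (httpbin.org is used here purely as a convenient echo service and is not part of the notes above):

```python
import requests

headers = {"User-Agent": "Mozilla/5.0"}  # UA camouflage, explained in the next section

resp = requests.get("https://httpbin.org/get", headers=headers)
print(resp.text)          # str: the body as text (page source)
print(resp.json())        # dict: the body parsed as json (only works if the body is json)
print(len(resp.content))  # bytes: the raw body, used for images, audio, video, ...
```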
UA camouflage (counters an anti-crawling mechanism)
- If the portal site detects that the request carrier is the requests module rather than a browser, it refuses access, so the crawler disguises its User-Agent as a browser's
Focused crawler
Data parsing methods
- regular expressions
- bs4
- xpath
bs4
- Data parsing principle:
    1. Tag positioning
    2. Extract the data from the tag's text or attributes
- bs4 data parsing workflow:
    1. Instantiate a BeautifulSoup object and load the page source data into it
    2. Call the BeautifulSoup object's methods to locate tags and extract data
- Tag positioning:
    - soup.tagName: returns the first tag with that name
    - soup.find():
        1. find(tagName): equivalent to soup.tagName
        2. find(tagName, class_/attrs/id...): locate a tag by its attributes
    - soup.find_all(): returns a list of all matching tags; attribute filters work the same way
    - soup.select():
        1. tag/CSS selectors
        2. hierarchy selectors:
            - parent > child means exactly one level down
            - a space means any number of levels down
        - Note: find and select do not return the same kind of result (find returns a single tag, select returns a list)
- Getting the content inside a tag:
    - soup.text: all text of the tag and its descendants
    - soup.string: only the tag's own direct string (None if the tag has several children)
    - soup.get_text(): same as soup.text
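A minimal sketch of the difference between these accessors (the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div class="content"><span>hello</span><span>world</span></div>'
soup = BeautifulSoup(html, 'lxml')

div = soup.find("div", class_="content")
print(div.text)          # "helloworld" -- all descendant text
print(div.get_text())    # same as .text
print(div.string)        # None -- the div has more than one child
print(soup.span.string)  # "hello" -- the first span holds a single string
print([s.text for s in soup.select("div.content > span")])  # ['hello', 'world']
```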
- Code example (crawling the chapters of Romance of the Three Kingdoms):
```python
import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    url = "https://www.shicimingju.com/book/sanguoyanyi.html"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }
    response = requests.get(url=url, headers=headers)
    response.encoding = response.apparent_encoding  # set the encoding format
    """
    r.encoding is taken from the charset in the response header; if it is not set,
    iso-8859-1 is assumed by default. r.apparent_encoding is guessed from the page
    content, so r.encoding = r.apparent_encoding avoids garbled text.
    """
    html = response.text
    # print(html)
    soup = BeautifulSoup(html, 'lxml')
    muluList = soup.select(".book-mulu a")
    muluRecord = []
    for mulu in muluList:
        muluRecord.append(mulu.text)
    pageNum = len(muluRecord)
    dataTotalUrl = "https://www.shicimingju.com/book/sanguoyanyi/%d.html"
    for i, title in enumerate(muluRecord):
        dataUrl = dataTotalUrl % (i + 1)
        response = requests.get(url=dataUrl, headers=headers)
        response.encoding = response.apparent_encoding
        dataHtml = response.text
        dataSoup = BeautifulSoup(dataHtml, 'lxml')
        data = dataSoup.find("div", class_="chapter_content").text
        data = data.replace(" ", "\n")
        path = r"C:\Users\Y_ch\Desktop\spider_test\data\text\sanguo\\" + title + ".txt"
        with open(path, 'w', encoding="utf-8") as fp:
            fp.write(data)
        print("Chapter %d downloaded" % (i + 1))
```
xpath
- Data parsing principle:
    - Instantiate an etree object and load the page source data into it
    - Call etree's xpath method to locate tags and extract data
- Ways to load source data:
    - Load local html source data into etree:
        - etree.parse(filepath)
    - Load source data obtained from the Internet into etree:
        - etree.HTML(text)
- xpath usage:
    - Absolute path: /xx/xx/x
    - Abbreviated path: //xx
    - Attribute positioning: //tagName[@attr="value"]
    - Index positioning: //tagName[@attr="value"]/xx[pos] (pos starts at 1)
    - Indexing over all matches: //tagName[@attr]//p[pos]
    - Text extraction: //tagName/text()
    - Attribute extraction: //tagName/@attr
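A minimal sketch exercising these expressions (the HTML snippet is made up for illustration):

```python
from lxml import etree

html = """
<html><body>
  <div class="song">
    <p>first</p>
    <p>second</p>
    <a href="http://example.com">link</a>
  </div>
</body></html>
"""
tree = etree.HTML(html)

print(tree.xpath('/html/body/div/p/text()'))           # absolute path   -> ['first', 'second']
print(tree.xpath('//div[@class="song"]/p[2]/text()'))  # attr + index    -> ['second']
print(tree.xpath('//a/@href'))                         # attr extraction -> ['http://example.com']
```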
- Code example (4K image crawling):
```python
import requests
from lxml import etree

if __name__ == "__main__":
    url = "https://pic.netbian.com/4kdongman/index_%d.html"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }
    pageNum = 2
    for page in range(pageNum):
        if page == 0:
            new_url = "https://pic.netbian.com/4kdongman/"
        else:
            new_url = url % (page + 1)
        response = requests.get(url=new_url, headers=headers)
        html_code = response.text
        tree = etree.HTML(html_code)
        urlList = tree.xpath("//ul[@class=\"clearfix\"]//img//@src")
        urlHead = "https://pic.netbian.com"
        path = r"C:\Users\Y_ch\Desktop\spider_test\data\pic\4K\\"
        picUrlList = []
        for index, eachUrl in enumerate(urlList):
            picUrl = urlHead + eachUrl
            picUrlList.append(picUrl)
        for index, picUrl in enumerate(picUrlList):
            picReq = requests.get(url=picUrl, headers=headers)
            pic = picReq.content
            picPath = path + str(page) + "." + str(index) + ".jpg"
            with open(picPath, 'wb') as fp:
                fp.write(pic)
            print("Page %d picture %d downloaded successfully!" % (page + 1, index + 1))
```
Verification code recognition
- Verification codes (CAPTCHAs) are an anti-crawling mechanism used by portal websites
- Crawl the captcha image, then recognize it with third-party captcha-recognition software
- Code example:
```python
# spider script
import requests
from lxml import etree
from verication import vercation

if __name__ == "__main__":
    url = "https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }
    response = requests.get(url=url, headers=headers)
    tree = etree.HTML(response.text)
    varication_path = tree.xpath("//img[@id=\"imgCode\"]/@src")
    picUrl = "https://so.gushiwen.cn" + varication_path[0]
    pic = requests.get(url=picUrl, headers=headers).content
    print(vercation(pic=pic))
```

```python
#!/usr/bin/env python
# coding:utf-8
# verication.py
import requests
from hashlib import md5


class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: picture bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php',
                          data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: id of the picture whose recognition result was wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php',
                          data=params, headers=self.headers)
        return r.json()


def vercation(pic, picCode=1902, picMoudle=None):
    chaojiying = Chaojiying_Client('1325650083', 'ych3362632', '94271a5f53dc7b0e34efdb06a88692c1')
    if picMoudle is None:
        return chaojiying.PostPic(pic, picCode)["pic_str"]
    else:
        im = open(pic, 'rb').read()  # local image file path
        return chaojiying.PostPic(im, picCode)["pic_str"]

# if __name__ == '__main__':
#     chaojiying = Chaojiying_Client('1325650083', 'ych3362632', '94271a5f53dc7b0e34efdb06a88692c1')  # user center >> software ID
#     im = open('a.jpg', 'rb').read()  # local image file path
#     print(chaojiying.PostPic(im, 1902))  # 1902 = captcha type, see the official price page
```
Proxies
- What a proxy is:
    - A proxy server
- What proxies are used for:
    - Break through the request limits placed on your own IP
    - Hide your real IP
- Proxy-related websites:
    - Kuaidaili
    - Xici proxy
    - www.goubanjia.com
- Proxy anonymity levels:
    - Transparent: the server knows both the proxy IP and the real IP
    - Anonymous: the server knows a proxy IP is being used, but does not know the real IP
    - Elite (high anonymity): the server knows neither that a proxy is being used nor the real IP
- In Python, the proxy IP is passed to requests through the proxies keyword argument
- An http proxy can only be used for http URLs, and an https proxy only for https URLs
- Code example:
```python
import requests

if __name__ == "__main__":
    url = "https://www.baidu.com/s?wd=ip"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }
    proxies = {
        "https": "221.110.147.50:3128"
    }
    response = requests.get(url=url, headers=headers, proxies=proxies)
    with open(r"C:\Users\Y_ch\Desktop\spider_test\dd.html", 'w') as fp:
        fp.write(response.text)
```
Asynchronous crawlers
- Purpose: use asynchrony in the crawler to achieve high-performance data crawling
Asynchronous crawler approaches
- Multithreading / multiprocessing (not recommended):
    - Benefit: a separate thread or process can be opened for blocking operations, so they execute asynchronously
    - Drawback: threads and processes cannot be opened without limit
- Thread pool:
    - Benefit: reduces the frequency of creating and destroying threads or processes, lowering system overhead
    - Drawback: the number of threads in the pool has an upper limit
- Single thread + asynchronous coroutines (see the sketch below)
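A minimal sketch of the single-thread + coroutine pattern, assuming aiohttp is installed; the URLs are placeholders, not from the examples above:

```python
import asyncio
import aiohttp

urls = [
    "https://www.example.com/page1",
    "https://www.example.com/page2",
]

async def fetch(session, url):
    # While this coroutine awaits network IO, the event loop runs the others
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch(session, u)) for u in urls]
        pages = await asyncio.gather(*tasks)
        print([len(page) for page in pages])

if __name__ == "__main__":
    asyncio.run(main())
```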
selenium module
- Browser driver (Chrome):
    - http://chromedriver.storage.googleapis.com/index.html
- Relationship between selenium and crawlers:
    - Conveniently obtains data that a website loads dynamically (content rendered by js that etree and BeautifulSoup cannot parse)
    - Conveniently simulates login
- Sample code (crawling Pear Video):
```python
import time
from multiprocessing.dummy import Pool

import requests
from lxml import etree
from selenium import webdriver

"""
Crawling with a thread pool is easily intercepted by anti-crawling measures!
"""
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
}


def getUrlList():
    url = "https://www.pearvideo.com/category_5"
    response = requests.get(url=url, headers=headers)
    htmlHead = 'https://www.pearvideo.com/'
    initHtml = etree.HTML(response.text)
    videoUrlList = initHtml.xpath("//ul[@class=\"category-list clearfix\"]//a/@href")
    print(videoUrlList)
    videoHtml = []
    for each in videoUrlList:
        videoHtml.append(htmlHead + each)
    return videoHtml


def get_video(url):
    if url is None:
        return
    bro.get(url=url)
    page_text = bro.page_source
    tree = etree.HTML(page_text)
    try:
        videoUrl = tree.xpath("//div[@class=\"main-video-box\"]/div//@src")[0]
        name = tree.xpath("//h1[@class=\"video-tt\"]/text()")[0]
        video = requests.get(url=videoUrl, headers=headers).content
        path = r"C:\Users\Y_ch\Desktop\spider_test\data\video\pear\\" + name + ".mp4"
        with open(path, 'wb') as fp:
            fp.write(video)
        print(name + " downloaded successfully!")
    except IndexError:
        print(url)


bro = webdriver.Chrome('./chromedriver.exe')
url = getUrlList()
get_video(url[1])
pool = Pool(len(url))
print(len(url))
pool.map(get_video, url)
pool.close()
pool.join()
time.sleep(10)
bro.close()
```
- Initiate a request:
    - get(url)
- Tag positioning:
    - the find_element... family of functions returns the specified tag element
- Tag interaction:
    - send_keys("xxx")
- Execute js code:
    - execute_script("jsCode")
- Page back and forward:
    - back()
    - forward()
- Close the browser:
    - close()
- Save a screenshot of the page:
    - save_screenshot("./filepath")
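A minimal sketch tying these calls together (Baidu's home page and its search-box id "kw" are assumed for illustration; the chromedriver path is a placeholder):

```python
import time

from selenium import webdriver

bro = webdriver.Chrome("./chromedriver.exe")  # placeholder driver path
bro.get("https://www.baidu.com")              # initiate a request
search_box = bro.find_element_by_id("kw")     # tag positioning ("kw" is assumed to be the search box)
search_box.send_keys("selenium")              # tag interaction
search_box.submit()
bro.execute_script("window.scrollTo(0, document.body.scrollHeight)")  # execute js: scroll down
time.sleep(2)
bro.back()                                    # page back
bro.forward()                                 # page forward
bro.save_screenshot("./baidu.png")            # save a screenshot
bro.close()                                   # close the browser
```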
iframe processing
- If the target tag is nested inside an iframe sub-page, the bro.find... functions cannot locate it directly; switch into the frame first:
  bro.switch_to.frame("iframeResult")  # switch into the iframe first
  bro.find_element_by_id("1")          # then tags inside it can be located
Action chains
- When mouse actions (click-and-hold, drag, etc.) need to be performed in the browser, use webdriver's ActionChains
- Code example:
```python
import time

from selenium import webdriver
from selenium.webdriver import ActionChains


def drop_test():
    bro = webdriver.Chrome("chromedriver.exe")
    bro.get("https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable")
    bro.switch_to.frame("iframeResult")
    div = bro.find_element_by_id("draggable")
    # Build the action chain
    action = ActionChains(bro)  # action chain instance
    action.click_and_hold(div)  # click and hold
    for i in range(5):
        # move_by_offset(xoffset, yoffset): pixels to move in each direction
        # perform(): execute immediately
        action.move_by_offset(xoffset=18, yoffset=0).perform()
        time.sleep(0.3)
    # Release the action chain
    action.release()
    time.sleep(5)
    bro.close()
    print(div)
```
Headless browser
- Makes the browser run without a visible interface
- Code to add:
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_option = Options()
chrome_option.add_argument("--headless")
chrome_option.add_argument("--disable-gpu")

# Pass chrome_options when instantiating the driver
bro = webdriver.Chrome("./chromedriver.exe", chrome_options=chrome_option)
```
Evading selenium detection
- Some websites reject requests coming from selenium, so selenium cannot get a connection to the server; add the corresponding code to evade detection
- Code to add:
```python
# For Chrome versions before 79
from selenium import webdriver
from selenium.webdriver import ChromeOptions


def evade():
    option = ChromeOptions()
    option.add_experimental_option("excludeSwitches", ["enable-automation"])
    bro = webdriver.Chrome("./chromedriver.exe", options=option)
```
```python
# For Chrome 79 and later
from selenium import webdriver

driver = webdriver.Chrome()
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        })
    """
})
```
Scrapy framework
First look at Scrapy
- What is a framework:
    - A project template that integrates many functions and is highly general-purpose
- What is Scrapy:
    - A framework that encapsulates crawler functionality
    - Features:
        - High-performance persistent storage
        - Asynchronous data download
        - High-performance data parsing
        - Distributed crawling
- Scrapy installation:
    - pip install wheel
    - At https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted, find the Twisted wheel matching your Python version, put it in a chosen directory, and run:
      pip install Twisted-20.3.0-cp36-cp36m-win_amd64.whl
    - pip install pywin32
    - pip install scrapy
- Creating a Scrapy project:
    - Enter the Python environment in which Scrapy is installed
    - Run: scrapy startproject xxxPro
    - The project is created
    - Create the spider source file in the project's spiders subdirectory:
      scrapy genspider <spiderName> <initial URL>
- Running the project:
    - scrapy crawl <spiderName>
- Explanation of the spider's initial parameters:
```python
# The name of the crawler file: the unique identifier of the spider source file
name = 'test'
# Domains the spider is allowed to request; URLs outside the list are not requested
# (this parameter is usually commented out)
allowed_domains = ['www.xxx.com']
# The starting URLs, which the spider requests automatically
start_urls = ['http://www.xxx.com/']
```
- Set the robots.txt parameter in settings.py to False:
  ROBOTSTXT_OBEY = False  # must be set to False, otherwise requests disallowed by robots.txt are filtered out
- Hide the log output when running:
    - scrapy crawl <spiderName> --nolog
    - Defect: if the request goes wrong, no error message is shown
    - Better fix for the --nolog defect: set LOG_LEVEL = "ERROR" in the settings file
- In parse, the response for each requested URL is received; it is parsed with response.xpath, and the matched data is pulled out with extract()
```python
def parse(self, response):
    div_list = response.xpath("//div[@class=\"content\"]/span/text()").extract()
    print(''.join(div_list))
```
Persistent storage of Scrapy data
- Scrapy persistent storage:
    - Terminal (command-line) based storage:
        - scrapy crawl <spiderName> -o <filePath>
        - Notes:
            1. Only the **return value of parse** can be stored, and only to a **local file** (not directly to a database)
            2. Only these file types can be stored: ['json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle']
    - Pipeline based storage:
        - Coding process:
            - Parse the data
            - Define the relevant fields in the item class to encapsulate the data
            - Encapsulate the parsed data into an item object
            - Submit the item object to the pipeline for persistent storage
            - Save the data in the pipeline class's process_item
            - Enable the pipeline in the configuration file
        - Example: save locally
```python
# items.py
class QiuabaiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    content = scrapy.Field()  # data encapsulation
```

```python
# pipelines.py
class QiuabaiproPipeline(object):
    fp = None

    def open_spider(self, spider):
        print("start")
        self.fp = open("./qiubi.txt", 'w', encoding='utf-8')

    def process_item(self, item, spider):
        content = item["content"]
        self.fp.write(content)
        return item

    def close_spider(self, spider):
        print("finish")
        self.fp.close()
```

```python
# settings.py -- enable the pipeline
ITEM_PIPELINES = {
    # the value is the priority: the smaller the value, the higher the priority
    'qiuabaiPro.pipelines.QiuabaiproPipeline': 300,
}
```

```python
# parse.py -- submit to the pipeline with the yield keyword
def parse(self, response):
    div_list = response.xpath("//div[@class=\"content\"]/span/text()").extract()
    yield div_list[0]
    return div_list
```
        - Example: save to a database
```python
# pipelines.py
import pymysql


class MysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='localhost', port=3307, user="root",
                                    passwd="ych3362632", db="test",
                                    charset="utf8")  # must be "utf8" here, not "utf-8"

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            print(len(item["name"]))
            self.cursor.execute("insert into spider (`name`) values (\"%s\")" % item["name"])
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```
```python
# settings.py
ITEM_PIPELINES = {
    'qiuabaiPro.pipelines.QiuabaiproPipeline': 300,
    'qiuabaiPro.pipelines.MysqlPipeline': 301,
    # Add the newly created pipeline class to the settings file. If a pipeline class
    # has lower priority, the higher-priority class's process_item must return the
    # item so that the lower-priority class can receive it.
}
```
Storage summary
- There are two ways to implement persistent storage:
    - Command-line form (parse must return the data, the storage file type is fixed, and it cannot go into a database)
    - Pipeline form: all advantages, except that the configuration is slightly troublesome
- Interview question: how to store one copy of the crawled data locally and one copy in the database:
    - Create two pipeline classes and register both in the configuration file
    - When several pipeline classes store the data, the higher-priority process_item must return the item so that the lower-priority pipeline classes can receive the item data
Crawling all pages of a site
- The initial URL is usually the home page of the site to crawl; the remaining URLs are built from the index or page-number pattern
- The remaining pages are fetched recursively with scrapy.Request
```python
class BeautiSpider(scrapy.Spider):
    name = 'beauti'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.duodia.com/daxuexiaohua/']
    url = "https://www.duodia.com/daxuexiaohua/list_%d.html"
    pageNum = 1

    def parse(self, response):
        div_list = response.xpath("//*[@id=\"main\"]/div[2]/article")
        for div in div_list:
            name = div.xpath(".//a/@title").extract()
            print("".join(name))
        if self.pageNum <= 5:
            new_url = self.url % self.pageNum
            self.pageNum += 1
            # Recursive call; callback designates the parsing function
            yield scrapy.Request(url=new_url, callback=self.parse)
```
Five core components
- Engine (Scrapy):
    - Handles the data flow of the whole system and triggers events (the core of the framework)
- Scheduler:
    - Receives requests from the engine, pushes them into a queue, and returns them when the engine asks again. It can be thought of as a priority queue of URLs; it decides which URL to crawl next and removes duplicate URLs
- Downloader:
    - Downloads web page content and returns it to the spider (the Scrapy downloader is built on twisted, an efficient asynchronous model)
- Item pipeline:
    - Processes the entities (items) the spider extracts from pages: mainly persisting them, validating them, and stripping unneeded information. Once a page has been parsed by the spider, its items are sent to the pipeline and processed in a specific order
- Spider:
    - Extracts the needed information, i.e. entities (items), from specific web pages; it can also extract links and let Scrapy continue crawling the next page
Passing request parameters
- When crawling a whole site, crawling the detail pages requires passing request parameters, i.e. the item object is handed from one parse function to another via meta
- Code implementation:
```python
# allowed_domains = ['www.xxx.com']
start_urls = ['https://www.zhipin.com/job_detail/?query=python']
home_url = "https://www.zhipin.com/"

def detail_parse(self, response):
    item = response.meta["item"]  # take the item passed in through meta
    content = response.xpath("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/div//text()").extract()
    item["content"] = content
    yield item

def parse(self, response):
    print(response)
    li_list = response.xpath("//*[@id=\"main\"]/div/div[3]//li")
    for li in li_list:
        name_div = li.xpath("//span[@class=\"job-title\"]")
        title = name_div.xpath("/span[@class=\"job-name\"]/a/@title").extract()
        name = name_div.xpath("/span[@class=\"job-area-wrapper\"]/span/text()").extract()
        li_name = title + " " + name
        detail = li.xpath("//div[@class=\"primary-wrapper\"]/div/@href").extract()
        new_url = "https://www.zhipin.com/" + detail
        item = BoproItem()
        item["name"] = li_name
        # meta passes the item into the other parse function
        yield scrapy.Request(url=new_url, callback=self.detail_parse, meta={"item": item})
```
Using the image pipeline
- The ImagesPipeline class in scrapy.pipelines.images automatically requests and downloads the extracted image addresses
- The relevant methods of ImagesPipeline need to be overridden
- Set the image storage path in the settings file
```python
# pipelines.py
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class ImageLine(ImagesPipeline):

    # Issue a request for each picture address
    def get_media_requests(self, item, info):
        yield scrapy.Request(item["src"][0])  # remember: scrapy.Request, not requests!

    # Specify the picture's storage file name
    def file_path(self, request, response=None, info=None, *, item=None):
        return item["name"][0] + ".jpg"

    def item_completed(self, results, item, info):
        return item  # pass the item on to the next pipeline to be executed
```
```python
# settings.py
ITEM_PIPELINES = {
    'imagePro.pipelines.ImageLine': 300,
}
IMAGES_STORE = "./data/pic/beauty"  # picture storage directory
```

Middleware usage
- Intercepting requests:
    - UA camouflage: process_request
```python
def process_request(self, request, spider):
    # UA camouflage
    request.headers["User-Agent"] = xxx
    return None
```
    - Proxy IP: process_exception
```python
def process_exception(self, request, exception, spider):
    # Switch to a proxy IP
    request.meta["proxy"] = xxx
    return request  # resend the corrected request
```
- Intercepting responses:
    - Tamper with the response data in process_response
```python
def process_response(self, request, response, spider):
    # Pick out the specific responses to modify:
    #   identify the request via its url, and the response through that request
    if request.url in spider.href_list:
        # Get the dynamically loaded page with the spider's selenium browser
        bro = spider.bro
        bro.get(request.url)
        page_text = bro.page_source
        new_response = HtmlResponse(url=request.url, body=page_text,
                                    encoding="utf-8", request=request)
        return new_response
    else:
        return response
```