Python crawler series [from requests to the Scrapy framework]: summary

Keywords: Python, crawler

Author: ychh_

Preliminaries

Crawler classification

  • General-purpose crawler:

    • An important part of a search engine's crawling system; it fetches entire pages
  • Focused crawler:

    • Built on top of the general-purpose crawler
    • Captures only specific, local parts of a page
  • Incremental crawler:

    • Detects data updates on a website and crawls only new or changed content

Anti-crawling mechanisms

  • Portal websites can set corresponding policies to prevent crawlers from scraping their data
  • Anti-anti-crawling strategy: the crawler breaks the site's anti-crawling measures to obtain the data

Related protocols

  • robots.txt protocol:
    • A gentlemen's agreement that specifies which data on the site may be crawled and which may not (see the sketch at the end of this list)
  • http protocol:
    • The common communication protocol between client and server
  • Common request header fields:
    • User-Agent: identifies the request carrier
    • Connection: whether to keep the connection alive or drop it after the request completes
  • Common response header fields:
    • Content-Type: the type of the data the server returns to the client
  • https protocol:
    • Secure Hypertext Transfer Protocol
  • Encryption methods:
    • Symmetric key encryption:
      The client sends both the ciphertext and the key to the server
      Defect: both the key and the ciphertext may be intercepted by a man in the middle

    • Asymmetric key encryption:
      The server sends the public key to the client
      The client encrypts with it and sends the ciphertext to the server
      Defect: the client cannot be sure that the key it received really came from the server

    • Certificate-based key encryption (https):
      A third-party certificate authority authenticates the key to prevent forgery
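  • Code sketch (checking robots.txt): a minimal example using Python's standard urllib.robotparser module; the example.com URL is only an illustration.

    from urllib.robotparser import RobotFileParser
    
    # Load and parse the site's robots.txt
    rp = RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()
    
    # Ask whether a given user agent may fetch a given path
    print(rp.can_fetch("*", "https://www.example.com/some/page"))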

requests module

Purpose of requests

Simulates a browser sending a request

  • Response return types (see the sketch below):
    • text: text format (str)
    • json(): a JSON object (dict)
    • content: binary format, e.g. images (bytes)
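  • Code sketch (return types): a minimal illustration of the three access methods; httpbin.org is used here only as a demo endpoint.

    import requests
    
    response = requests.get("https://httpbin.org/get")
    
    text_data = response.text      # str: the body decoded as text
    json_data = response.json()    # dict: the body parsed as JSON (raises an error if it is not JSON)
    byte_data = response.content   # bytes: the raw body, e.g. for images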

UA spoofing (countering an anti-crawling mechanism)

If the portal site detects that the request carrier is the requests library instead of a browser, it will refuse access, so the crawler sends a browser User-Agent header

Focused crawler

Data parsing methods

  • Regular expressions
  • bs4
  • xpath

bs4

  • Data parsing principle
    1. Locate the tag
    2. Extract the data from the tag's text or attribute values

  • bs4 parsing workflow:

     1. Instantiate a BeautifulSoup object and load the page source data into it
     2. Call the BeautifulSoup object's methods to locate tags and extract data
    
  • Tag positioning:

    • soup.tagName: returns the first occurrence of that tag
    • soup.find():
      1. find(tagName): equivalent to soup.tagName
      2. find(tagName, class_ / attr / id...): locate by attribute
    • soup.find_all(): returns a list of all matching tags; attribute positioning is also supported
    • soup.select():
      1. Tag (CSS) selectors
      2. Hierarchy selectors:
      - parent > child selects exactly one level down
      - a space between selectors spans any number of levels
    • Note: find and select do not return the same kind of object (a selector sketch follows below, before the code example)
  • Get the content in the tag:

    • soup.text
    • soup.string
    • soup.get_text()
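  • Code sketch (selector comparison): a minimal, self-contained illustration of the positioning methods above, run on an inline HTML string rather than a real site.

    from bs4 import BeautifulSoup
    
    html = """
    <div class="book-mulu">
        <a href="/1.html">Chapter 1</a>
        <a href="/2.html">Chapter 2</a>
    </div>
    """
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.a)                                 # first <a> tag
    print(soup.find("a"))                         # same as soup.a
    print(soup.find("div", class_="book-mulu"))   # locate by attribute
    print(soup.find_all("a"))                     # list of all matching <a> tags
    print(soup.select(".book-mulu > a"))          # CSS selector, exactly one level down
    print(soup.select(".book-mulu a")[0].text)    # space = any depth; .text gets the content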
  • Code example (crawling Romance of the Three Kingdoms)

    import requests
    import json
    from bs4 import BeautifulSoup
    
    if __name__ == "__main__":
    
        url = "https://www.shicimingju.com/book/sanguoyanyi.html"
    
        headers = {
            "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
        }
    
        response = requests.get(url=url,headers=headers)
        response.encoding = response.apparent_encoding #Set the encoding format
        """
        response.encoding is taken from the charset in the response headers; if it is absent,
        requests falls back to iso-8859-1. response.apparent_encoding is guessed from the page
        content, so setting response.encoding = response.apparent_encoding avoids garbled text.
        """
    
        html = response.text
    
        # print(html)
        soup = BeautifulSoup(html,'lxml')
        muluList = soup.select(".book-mulu a")
        muluRecord = []
        for mulu in muluList:
            muluRecord.append(mulu.text)
        pageNum = len(muluRecord)
        dataTotalUrl = "https://www.shicimingju.com/book/sanguoyanyi/%d.html"
        for i,title in enumerate(muluRecord):
            dataUrl = dataTotalUrl%(i + 1)
            response = requests.get(url=dataUrl,headers=headers)
            response.encoding = response.apparent_encoding
            dataHtml = response.text
    
            dataSoup = BeautifulSoup(dataHtml,'lxml')
    
    
            data = dataSoup.find("div",class_="chapter_content").text
            data = data.replace("  ","\n")
            path = r"C:\Users\Y_ch\Desktop\spider_test\data\text\sanguo\\" + title + ".txt"
            with open(path,'w',encoding="utf-8") as fp:
                fp.write(data)
                print("The first%d Download completed"%(i + 1)
    
    
    

xpath

  • Data parsing principle:

    1. Instantiate an etree object and load the page source data into it
    2. Call etree's xpath method on the object to locate tags and extract data
  • Ways to instantiate etree:

    1. Load local html source data into etree
      • etree.parse(filepath)
    2. Load source data fetched from the Internet into etree
      • etree.HTML(text)
  • xpath usage (see the sketch below):

    • Absolute path: /xx/xx/x
    • Shortened path: //xx
    • Attribute positioning: //tagName[@attr="value"]
    • Index positioning: //tagName[@attr="value"]/xx[pos]
    • Nested index: //tagName[@attr]//p[pos]
    • Text extraction: //tagName/text()
    • Attribute extraction: //tagName/@attr
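  • Code sketch (xpath expressions): a minimal example run on an inline HTML string, illustrating the expressions listed above.

    from lxml import etree
    
    html = """
    <div id="content">
        <ul class="clearfix">
            <li><a href="/1.html">first</a></li>
            <li><a href="/2.html">second</a></li>
        </ul>
    </div>
    """
    tree = etree.HTML(html)
    
    print(tree.xpath("/html/body/div/ul/li"))           # absolute path (etree.HTML wraps the fragment in html/body)
    print(tree.xpath("//li"))                            # shortened path
    print(tree.xpath("//ul[@class='clearfix']/li[1]"))   # attribute + index positioning (1-based)
    print(tree.xpath("//a/text()"))                      # text extraction
    print(tree.xpath("//a/@href"))                       # attribute extraction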
  • Code example (4K image crawling)

    import json
    from lxml import etree
    import requests
    
    if __name__ == "__main__":
        url = "https://pic.netbian.com/4kdongman/index_%d.html"
    
        headers = {
            "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
        }
        pageNum = 2
    
        for page in range(pageNum):
            if page == 0:
                new_url = "https://pic.netbian.com/4kdongman/"
            else:
                new_url = url % (page + 1)
    
            response = requests.get(url=new_url, headers=headers)
    
            html_code = response.text
    
            tree = etree.HTML(html_code)
    
            urlList = tree.xpath("//ul[@class=\"clearfix\"]//img//@src")
    
            urlHead = "https://pic.netbian.com"
            path = r"C:\Users\Y_ch\Desktop\spider_test\data\pic\4K\\"
            picUrlList = []
            for index, eachUrl in enumerate(urlList):
                picUrl = urlHead + eachUrl
                picUrlList.append(picUrl)
    
            for index, picUrl in enumerate(picUrlList):
                picReq = requests.get(url=picUrl, headers=headers)
                pic = picReq.content
    
                picPath = path + str(page)+ "." +str(index) + ".jpg"
                with open(picPath, 'wb') as fp:
                    fp.write(pic)
                    print("The first%d Page%d Pictures downloaded successfully!" % ((page + 1),index + 1))
    
    
    
    

Verification code recognition

  • Verification codes (captchas) are an anti-crawling mechanism used by portal websites

  • Crawl the captcha image, then recognize it with a third-party captcha recognition service

  • Code example

    import json
    import  requests
    from lxml import etree
    from verication import vercation
    
    if __name__ == "__main__":
        url = "https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
        }
        response = requests.get(url=url,headers=headers)
    
        tree = etree.HTML(response.text)
    
        varication_path = tree.xpath("//img[@id=\"imgCode\"]/@src")
        picUrl = "https://so.gushiwen.cn" + varication_path[0]
    
        pic = requests.get(url=picUrl,headers=headers).content
        print(vercation(pic=pic))
    
    
    
    
    
    #!/usr/bin/env python
    # coding:utf-8
    
    import requests
    from hashlib import md5
    
    class Chaojiying_Client(object):
    
        def __init__(self, username, password, soft_id):
            self.username = username
            password =  password.encode('utf8')
            self.password = md5(password).hexdigest()
            self.soft_id = soft_id
            self.base_params = {
                'user': self.username,
                'pass2': self.password,
                'softid': self.soft_id,
            }
            self.headers = {
                'Connection': 'Keep-Alive',
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
            }
    
        def PostPic(self, im, codetype):
            """
            im: Picture byte
            codetype: Topic type reference http://www.chaojiying.com/price.html
            """
            params = {
                'codetype': codetype,
            }
            params.update(self.base_params)
            files = {'userfile': ('ccc.jpg', im)}
            r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
            return r.json()
    
        def ReportError(self, im_id):
            """
            im_id: the image ID of a mis-recognized captcha
            """
            params = {
                'id': im_id,
            }
            params.update(self.base_params)
            r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
            return r.json()
    
    
    def vercation(pic,picCode=1902,picMoudle=None):
        chaojiying = Chaojiying_Client('1325650083', 'ych3362632', '94271a5f53dc7b0e34efdb06a88692c1')
        if picMoudle == None:
            return chaojiying.PostPic(pic, picCode)["pic_str"]
        else :
            im = open(pic, 'rb').read()  # pic is a local image file path; on Windows the path may need double backslashes
            return chaojiying.PostPic(im, picCode)["pic_str"]
    # if __name__ == '__main__':
    # 	chaojiying = Chaojiying_Client('1325650083', 'ych3362632', '94271a5f53dc7b0e34efdb06a88692c1')	#Generate a software ID in the user center and substitute it here
    # 	im = open('a.jpg', 'rb').read()													#Local image file path; replace a.jpg. On Windows the path may need double backslashes
    # 	print (chaojiying.PostPic(im, 1902))												#1902 is the captcha type (see the official price page); in Python 3, print needs parentheses
    
    
    

Proxies

  • What is a proxy:

    • A proxy server
  • What proxies are for:

    • Break through the limits placed on your own IP
    • Hide your real IP
  • Proxy-related websites:

    • Kuaidaili
    • Xici proxy
    • www.goubanjia.com
  • Proxy anonymity levels:

    • Transparent: the server knows both the proxy IP and the real IP
    • Anonymous: the server knows the proxy IP but not the real IP
    • Elite (high anonymity): the server knows neither the proxy IP nor the real IP
  • In Python, the proxy IP is passed to requests through the proxies parameter

  • An http proxy can only be used for http servers, and an https proxy only for https servers

  • Code example

    from lxml import etree
    import requests
    
    if __name__ == "__main__":
        url = "https://www.baidu.com/s?wd=ip"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
        }
        proxies = {
            "https":"221.110.147.50:3128"
        }
    
        response = requests.get(url=url,headers=headers,proxies=proxies)
    
        with open(r"C:\Users\Y_ch\Desktop\spider_test\dd.html",'w') as fp:
            fp.write(response.text)
    

Asynchronous crawler

  • Purpose: use asynchrony in the crawler to achieve high-performance data crawling

Asynchronous crawler modes

  • Multithreading / multiprocessing (not recommended):
    • Benefit: a thread or process can be opened for each blocking operation so it runs asynchronously
    • Drawback: threads and processes cannot be opened without limit
  • Thread pool:
    • Benefit: reduces the overhead of creating and destroying threads or processes, lowering the load on the system
    • Drawback: the number of threads in the pool has an upper limit
  • Single thread + asynchronous coroutines (see the sketch after this list)
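  • Code sketch (single thread + coroutines): a minimal asyncio + aiohttp sketch of the third mode; aiohttp is an extra dependency and the urls are placeholders.

    import asyncio
    import aiohttp
    
    urls = [
        "https://www.example.com/page1",
        "https://www.example.com/page2",
    ]
    
    async def fetch(session, url):
        # Non-blocking request: while waiting for this response,
        # the event loop can run other coroutines
        async with session.get(url) as response:
            return await response.text()
    
    async def main():
        async with aiohttp.ClientSession() as session:
            tasks = [asyncio.ensure_future(fetch(session, url)) for url in urls]
            pages = await asyncio.gather(*tasks)
            print([len(page) for page in pages])
    
    asyncio.run(main())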

selenium module

  • Browser driver (Chrome):

    • http://chromedriver.storage.googleapis.com/index.html
  • Relationship between selenium and crawlers:

    • Conveniently fetches dynamically loaded data from a website (including js-rendered content that etree and BeautifulSoup cannot parse from the static source)
    • Convenient for simulating login
  • Sample code (crawling pear video):

    from selenium import webdriver
    from lxml import etree
    import requests
    import time
    from multiprocessing.dummy import Pool
    """
        Crawling with a thread pool is easily intercepted by anti-crawler measures!!!
    
    
    """
    headers = {
            "Useer-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
        }
    
    def getUrlList():
        url = "https://www.pearvideo.com/category_5"
    
        response = requests.get(url=url, headers=headers)
        htmlHead = 'https://www.pearvideo.com/'
        initHtml = etree.HTML(response.text)
        videoUrlList = initHtml.xpath("//ul[@class=\"category-list clearfix\"]//a/@href")
        print((videoUrlList))
    
        videoHtml = []
        for each in videoUrlList:
            videoHtml.append(htmlHead + each)
    
        return videoHtml
    
    
    def get_video(url):
        if url == None:
            return
        bro.get(url=url)
        page_text = bro.page_source
        tree = etree.HTML(page_text)
    
        try:
            videoUrl = tree.xpath("//div[@class=\"main-video-box\"]/div//@src")[0]
            name = tree.xpath("//h1[@class=\"video-tt\"]/text()")[0]
            video = requests.get(url=videoUrl, headers=headers).content
            path = r"C:\Users\Y_ch\Desktop\spider_test\data\video\pear\\" + name + ".mp4"
            with open(path, 'wb') as fp:
                fp.write(video)
                print(name + " Video download succeeded!")
        except IndexError as e:
            print(url)
    
    
    
    bro = webdriver.Chrome('./chromedriver.exe')
    
    url = getUrlList()
    get_video(url[1])
    pool = Pool(len(url))
    print(len(url))
    pool.map(get_video,url)
    pool.close()
    pool.join()
    
    time.sleep(10)
    bro.close()
    
  • Initiating a request:

    • Request a url with the get() method (a combined sketch of these calls follows this list)
  • Tag positioning:

    • Get the specified tag element with the find_element family of functions
  • Tag interaction:

    • Interact with a tag via send_keys("xxx")
  • Executing js code:

    • Run js on the page with execute_script("")
  • Page back and forward:

    • back()
    • forward()
  • Closing the browser:

    • close()
  • Saving a screenshot of the page:

    • save_screenshot("./filepath")
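  • Code sketch (basic operations): a minimal walk-through of the calls above, written against the same selenium 3 style API used elsewhere in this article; the Baidu url and the "kw" element id are only illustrative.

    from selenium import webdriver
    import time
    
    bro = webdriver.Chrome("./chromedriver.exe")
    
    bro.get("https://www.baidu.com")        # initiate a request
    box = bro.find_element_by_id("kw")      # locate a tag
    box.send_keys("python")                 # interact with the tag
    bro.execute_script("window.scrollTo(0, document.body.scrollHeight)")  # execute js
    bro.save_screenshot("./page.png")       # save a screenshot of the page
    bro.back()                              # go back
    bro.forward()                           # go forward
    
    time.sleep(2)
    bro.close()                             # close the browser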

iframe handling

  • If the target tag is nested inside an iframe sub-page, the bro.find_element family cannot locate it directly; switch into the frame first:

    • bro.switch_to.frame("iframeResult") #Switch into the iframe
      bro.find_element_by_id("1")	
      

Action chains

  • When mouse or drag actions need to be performed in the browser, use webdriver's ActionChains

  • Code example:

    def drop_test():
        bro = webdriver.Chrome("chromedriver.exe")
        bro.get("https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable")
    
    
        bro.switch_to.frame("iframeResult")
        div = bro.find_element_by_id("draggable")
    
        #Construct action chain
        action = ActionChains(bro) #Construct action chain instance
        action.click_and_hold(div) #Click and hold
    
        for i in range(5):
            #move_by_offset():
                #xoffset,yoffset: pixels moving in two directions
            #perform():
                #Execute now
            action.move_by_offset(xoffset=18,yoffset=0).perform()
            time.sleep(0.3)
    
        #Release action chain
        action.release().perform()
        time.sleep(5)
        bro.close()
        print(div)
    

Headless browser

  • Runs the browser without a visible interface

  • Add code:

    from selenium.webdriver.chrome.options import  Options
    
    chrome_option = Options()
    chrome_option.add_argument("--headless")
    chrome_option.add_argument("--disable-gpu")
    bro = webdriver.Chrome("./chromedriver.exe",chrome_options=chrome_option) #Add the attribute of chrome_options to the instantiation of driver
    
    

Evading selenium detection

  • Some websites detect and refuse requests coming from selenium, so selenium cannot get a connection to the server; the following code is needed to evade detection

  • Add code:

    # Chrome versions before 79
    def evade():
        option = ChromeOptions()
        option.add_experimental_option("excludeSwitches",["enable-automation"])
        bro = webdriver.Chrome("./chromedriver.exe",options=option)
    
    
    # Chrome versions 79 and later
    from selenium import webdriver
    driver = webdriver.Chrome()
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            })
        """
    })
    

Scrapy framework

Getting started with Scrapy

  • What is a framework:

    • A project template that integrates many functions and is highly general-purpose
  • What is Scrapy:

    • An encapsulated crawler framework
    • Features:
      1. High-performance persistent storage
      2. Asynchronous data download
      3. High-performance data parsing
      4. Distributed crawling
  • Scrapy installation (Windows):

    • pip install wheel
    • At https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted, download the Twisted wheel matching your Python version, put it in a directory, and run

      pip install <Twisted-20.3.0-cp36-cp36m-win_amd64.whl>

  • pip install pywin32
  • pip install scrapy
  • Creating a Scrapy project:

    1. Enter a Python environment that has Scrapy installed

    2. Run scrapy startproject xxxPro

    3. The project is created

  • Create the spider source file in the project's spiders subdirectory:

    scrapy genspider <spiderName> <start url>

  • Running the project:

    scrapy crawl <spiderName>

  • Explanation of relevant initial parameters:

    #The name of the crawler file, which is the unique identifier of the crawler source file
        name = 'test'
    
        #The urls the spider is allowed to request; urls outside this list will not be requested (this parameter is usually not used)
        allowed_domains = ['www.xxx.com'] #In general, this list should be commented out
    
        #The starting url of the crawler file, that is, the url that the crawler automatically accesses
        start_urls = ['http://www.xxx.com/']
    
  • Set the robots.txt compliance parameter to False

    ROBOTSTXT_OBEY = False #Must be set to False, otherwise requests disallowed by robots.txt will be refused
    
  • Hide the log output when running the spider:

    scrapy crawl <spiderName> --nolog

    Defect: if a response errors out, there is no prompt at all

    To fix this defect of --nolog, use the following instead:

    Set LOG_LEVEL = "ERROR" in settings.py

  • The response for each requested url is passed to parse(); parse it with response.xpath and pull the values out with extract()

        def parse(self, response):
            div_list = response.xpath("//div[@class=\"content\"]/span/text()").extract()
            print(''.join(div_list))
    

Persistent storage of Scrapy data

  • Scrapy persistent storage:

    • Terminal-based (command line) storage:

      scrapy crawl <spiderName> -o <filePath>

      Note:

      1. Only the **return value** of the parse function can be stored, and only to **local files (not directly to a database)**
      2. Only the following file types can be stored: ['json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle']
      
    • Pipeline based storage:

      • Coding process:

        1. Data parsing
        2. Define the relevant fields in the item class for data encapsulation
        3. Encapsulate the parsed data into the item
        4. Submit the item object to the pipeline for persistent storage
        5. Save the data in the pipeline class's process_item method
        6. Enable the pipeline in the configuration file

Example: saving to a local file

  ```python
  #item.py
  class QiuabaiproItem(scrapy.Item):
      # define the fields for your item here like:
      # name = scrapy.Field()
      content = scrapy.Field() #Data encapsulation
  
  ```
  
  ```python
  #pipelines.py
  class QiuabaiproPipeline(object):
      fp = None
      def open_spider(self,spider):
          print("start")
          self.fp = open("./qiubi.txt",'w',encoding='utf-8')
  
      def process_item(self, item, spider):
          content = item["content"]
          self.fp.write(content)
          return item
  
      def close_spider(self,spider):
          print("finsih")
          self.fp.close()
  
  ```
  
  ```python
  #Open the pipe
  #settings.py
  ITEM_PIPELINES = {
     'qiuabaiPro.pipelines.QiuabaiproPipeline': 300, #The following value is the priority. The smaller the value, the higher the priority
  }
  
  ```
  
  ```python
  #parse.py (method of the spider class)
  #Submit items to the pipeline with the yield keyword
  #Requires: from qiuabaiPro.items import QiuabaiproItem
  def parse(self, response):
      div_list = response.xpath("//div[@class=\"content\"]/span/text()").extract()
      for content in div_list:
          item = QiuabaiproItem()
          item["content"] = content
          yield item  #each yielded item is handed to the pipeline
  ```

Example: saving to a database

```python
#pipelines.py
import pymysql

class MysqlPipeline(object):
    conn = None
    cursor = None
    def open_spider(self,spider):
        self.conn = pymysql.Connect(host='localhost',port=3307,user="root",passwd="ych3362632",db="test",charset="utf8") #Must be utf8 here, not utf-8
    def process_item(self,item,spider):
        self.cursor = self.conn.cursor()
        try:
            print(len(item["name"]))
            self.cursor.execute("insert into spider (`name`) values (\"%s\")" % item["name"])
            self.conn.commit()

        except Exception as e:
            print(e)
            self.conn.rollback()

        return item

    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()
```

```python
#settings.py
ITEM_PIPELINES = {
   'qiuabaiPro.pipelines.QiuabaiproPipeline': 300,
   'qiuabaiPro.pipelines.MysqlPipeline': 301,  #Register the new pipeline class in settings; the higher-priority pipeline's process_item must return the item so the lower-priority pipeline receives it
}
```

Storage summary

  • There are two ways to implement persistent storage:
    • Command-line form (parse must return the data, the storage file type is fixed, and it cannot go into a database)
    • Pipeline form: has every advantage except the extra configuration work
  • Interview question: how to store one copy of the crawled data locally and another copy in a database:
    • Create two pipeline classes and register both in the configuration file
    • When multiple pipeline classes store data, the higher-priority process_item must return the item so the lower-priority pipeline class can receive the item data

Whole-site data crawling

  • The start url is usually the url of the site's home page; the url list is built from the index or page-number pattern of the site

  • Recursively request the remaining pages with the scrapy.Request method

    class BeautiSpider(scrapy.Spider):
        name = 'beauti'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.duodia.com/daxuexiaohua/']
        url = "https://www.duodia.com/daxuexiaohua/list_%d.html"
        pageNum = 1
        def parse(self, response):
            div_list = response.xpath("//*[@id=\"main\"]/div[2]/article")
            for div in div_list:
                name = div.xpath(".//a/@title").extract()
                print("".join(name))
    
            if self.pageNum <= 5:
                new_url = self.url % self.pageNum
                self.pageNum += 1
                yield scrapy.Request(url=new_url,callback=self.parse) #Recursive call, and callback is specially used for data parsing
    
    

Five core components

(Diagram of the five Scrapy core components: the original image link is broken and the picture could not be recovered.)

  • Engine (Scrapy):
    • Handles the data flow of the whole system and triggers events (the core of the framework)
  • Scheduler:
    • Receives requests from the engine, pushes them into a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs: it decides which URL to crawl next and removes duplicate URLs
  • Downloader:
    • Downloads web page content and returns it to the spider (the Scrapy downloader is built on Twisted, an efficient asynchronous model)
  • Item pipeline:
    • Processes the entities (items) extracted from pages by the spider: mainly persisting them, validating them, and stripping unneeded information. When a page has been parsed by the spider, it is sent to the item pipeline and processed in several specific steps in order
  • Spider:
    • Extracts the needed information, the so-called entity (item), from specific web pages. It can also extract links from them and let Scrapy continue to crawl the next page

Passing request parameters (meta)

  • When crawling a whole site, crawling detail pages requires passing parameters along with the request, i.e. passing the item object between different callback functions

  • Code implementation:

     # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.zhipin.com/job_detail/?query=python']
        home_url = "https://www.zhipin.com/"
    
    
        def detail_parse(self,response):
            item = response.meta["item"]
            content = response.xpath("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/div//text()").extract()
            item["content"] = content
            yield item
    
    
    def parse(self, response):
        print(response)
        li_list = response.xpath("//*[@id=\"main\"]/div/div[3]//li")
        for li in li_list:
            name_div = li.xpath(".//span[@class=\"job-title\"]")
            title = name_div.xpath("./span[@class=\"job-name\"]/a/@title").extract_first()
            name = name_div.xpath("./span[@class=\"job-area-wrapper\"]/span/text()").extract_first()
            li_name = title + " " + name


            detail = li.xpath(".//div[@class=\"primary-wrapper\"]/div/@href").extract_first()
            new_url = "https://www.zhipin.com/" + detail
            item = BoproItem()
            item["name"] = li_name

            yield scrapy.Request(url=new_url,callback=self.detail_parse,meta={"item":item}) #item is passed to the other callback via meta
    

Using the image pipeline

  • Use the ImagesPipeline class from scrapy.pipelines.images to request and download images from their urls automatically

  • You need to subclass ImagesPipeline and override its methods

  • Set the image storage path in settings

    #pipelines.py
    from scrapy.pipelines.images import ImagesPipeline
    import scrapy
    class ImageLine(ImagesPipeline):
    
        #Request according to picture address
        def get_media_requests(self, item, info):
            yield scrapy.Request(item["src"][0]) #Note: this must be scrapy.Request, not requests!!!
    
        #Specify picture storage path
        def file_path(self, request, response=None, info=None, *, item=None):
            return item["name"][0] + ".jpg"
    
        def item_completed(self, results, item, info):
            return item #Pass the item on to the next pipeline to process
    
    #settings.py
    ITEM_PIPELINES = {
       'imagePro.pipelines.ImageLine': 300,
    }
    IMAGES_STORE = "./data/pic/beauty"
    

Middleware usage:

  • Intercept request:

    • UA spoofing: process_request

       def process_request(self, request, spider): #Conduct UA camouflage
              request.headers["User-Agent"] = xxx
              return None
      
    • Proxy IP: process_exception

       def process_exception(self, request, exception, spider): #IP replacement
             request.meta["proxy"] = xxx
             return request  #Resend the corrected request
      
  • Intercept response:

    • Tamper with the response data in process_response

       def process_response(self, request, response, spider):
              #Pick out the responses that need modifying:
              #identify the target request by its url,
              #then return a replacement response for it
      
              if request.url in spider.href_list: #Get dynamically loaded pages
                  bro = spider.bro
                  bro.get(request.url)
                  page_text = bro.page_source
                  new_response = HtmlResponse(url=request.url,body=page_text,encoding="utf-8",request=request)
                  return new_response
              else:
                  return response
      
      
