Python crawler series [from requests to the Scrapy framework]: summary

Keywords: Python, crawler

Author: ychh_

Preliminaries

Crawler classification

  • General-purpose crawler:

    • An important part of a search engine's crawling system; it fetches entire pages
  • Focused crawler:

    • Built on top of the general-purpose crawler
    • Captures only specific, local parts of a page
  • Incremental crawler:

    • Detects data updates on a website and crawls only new or changed content

Anti-crawling mechanisms

  • Portal websites can set corresponding policies to prevent crawlers from scraping their data
  • Anti-anti-crawling strategy: the crawler breaks the site's anti-crawling measures to obtain the data

Related protocols

  • robots.txt protocol:
    • A gentlemen's agreement that specifies which data on the site may be crawled and which may not (see the sketch at the end of this list)
  • http protocol:
    • The common communication protocol between client and server
  • Common request header fields:
    • User-Agent: identifies the request carrier
    • Connection: whether to keep the connection alive or drop it after the request completes
  • Common response header fields:
    • Content-Type: the type of the data the server returns to the client
  • https protocol:
    • Secure Hypertext Transfer Protocol
  • Encryption methods:
    • Symmetric key encryption:
      The client sends both the ciphertext and the key to the server
      Defect: both the key and the ciphertext may be intercepted by a man in the middle

    • Asymmetric key encryption:
      The server sends the public key to the client
      The client encrypts with it and sends the ciphertext to the server
      Defect: the client cannot be sure that the key it received really came from the server

    • Certificate-based key encryption (https):
      A third-party certificate authority authenticates the key to prevent forgery
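  • Code sketch (checking robots.txt): a minimal example using Python's standard urllib.robotparser module; the example.com URL is only an illustration.

    from urllib.robotparser import RobotFileParser
    
    # Load and parse the site's robots.txt
    rp = RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()
    
    # Ask whether a given user agent may fetch a given path
    print(rp.can_fetch("*", "https://www.example.com/some/page"))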

requests module

Purpose of requests

Simulates a browser sending a request

  • Response return types (see the sketch below):
    • text: text format (str)
    • json(): a JSON object (dict)
    • content: binary format, e.g. images (bytes)
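  • Code sketch (return types): a minimal illustration of the three access methods; httpbin.org is used here only as a demo endpoint.

    import requests
    
    response = requests.get("https://httpbin.org/get")
    
    text_data = response.text      # str: the body decoded as text
    json_data = response.json()    # dict: the body parsed as JSON (raises an error if it is not JSON)
    byte_data = response.content   # bytes: the raw body, e.g. for images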

UA spoofing (countering an anti-crawling mechanism)

If the portal site detects that the request carrier is the requests library instead of a browser, it will refuse access, so the crawler sends a browser User-Agent header

Focused crawler

Data parsing methods

  • Regular expressions
  • bs4
  • xpath

bs4

  • Data parsing principle
    1. Locate the tag
    2. Extract the data from the tag's text or attribute values

  • bs4 parsing workflow:

     1. Instantiate a BeautifulSoup object and load the page source data into it
     2. Call the BeautifulSoup object's methods to locate tags and extract data
    
  • Tag positioning:

    • soup.tagName: returns the first occurrence of that tag
    • soup.find():
      1. find(tagName): equivalent to soup.tagName
      2. find(tagName, class_ / attr / id...): locate by attribute
    • soup.find_all(): returns a list of all matching tags; attribute positioning is also supported
    • soup.select():
      1. Tag (CSS) selectors
      2. Hierarchy selectors:
      - parent > child selects exactly one level down
      - a space between selectors spans any number of levels
    • Note: find and select do not return the same kind of object (a selector sketch follows below, before the code example)
  • Get the content in the tag:

    • soup.text
    • soup.string
    • soup.get_text()
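  • Code sketch (selector comparison): a minimal, self-contained illustration of the positioning methods above, run on an inline HTML string rather than a real site.

    from bs4 import BeautifulSoup
    
    html = """
    <div class="book-mulu">
        <a href="/1.html">Chapter 1</a>
        <a href="/2.html">Chapter 2</a>
    </div>
    """
    soup = BeautifulSoup(html, "lxml")
    
    print(soup.a)                                 # first <a> tag
    print(soup.find("a"))                         # same as soup.a
    print(soup.find("div", class_="book-mulu"))   # locate by attribute
    print(soup.find_all("a"))                     # list of all matching <a> tags
    print(soup.select(".book-mulu > a"))          # CSS selector, exactly one level down
    print(soup.select(".book-mulu a")[0].text)    # space = any depth; .text gets the content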
  • Code example (crawling Romance of the Three Kingdoms)

    import requests
    import json
    from bs4 import BeautifulSoup
    
    if __name__ == "__main__":
    
        url = "https://www.shicimingju.com/book/sanguoyanyi.html"
    
        headers = {
            "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
        }
    
        response = requests.get(url=url,headers=headers)
        response.encoding = response.apparent_encoding #Set the encoding format
        """
        response.encoding is taken from the charset in the response headers; if it is absent,
        requests falls back to iso-8859-1. response.apparent_encoding is guessed from the page
        content, so setting response.encoding = response.apparent_encoding avoids garbled text.
        """
    
        html = response.text
    
        # print(html)
        soup = BeautifulSoup(html,'lxml')
        muluList = soup.select(".book-mulu a")
        muluRecord = []
        for mulu in muluList:
            muluRecord.append(mulu.text)
        pageNum = len(muluRecord)
        dataTotalUrl = "https://www.shicimingju.com/book/sanguoyanyi/%d.html"
        for i,title in enumerate(muluRecord):
            dataUrl = dataTotalUrl%(i + 1)
            response = requests.get(url=dataUrl,headers=headers)
            response.encoding = response.apparent_encoding
            dataHtml = response.text
    
            dataSoup = BeautifulSoup(dataHtml,'lxml')
    
    
            data = dataSoup.find("div",class_="chapter_content").text
            data = data.replace("  ","\n")
            path = r"C:\Users\Y_ch\Desktop\spider_test\data\text\sanguo\\" + title + ".txt"
            with open(path,'w',encoding="utf-8") as fp:
                fp.write(data)
                print("The first%d Download completed"%(i + 1)
    
    
    

xpath

  • Data parsing principle:

    1. Instantiate an etree object and load the page source data into it
    2. Call etree's xpath method on the object to locate tags and extract data
  • Ways to instantiate etree:

    1. Load local html source data into etree
      • etree.parse(filepath)
    2. Load source data fetched from the Internet into etree
      • etree.HTML(text)
  • xpath usage (see the sketch below):

    • Absolute path: /xx/xx/x
    • Shortened path: //xx
    • Attribute positioning: //tagName[@attr="value"]
    • Index positioning: //tagName[@attr="value"]/xx[pos]
    • Nested index: //tagName[@attr]//p[pos]
    • Text extraction: //tagName/text()
    • Attribute extraction: //tagName/@attr
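  • Code sketch (xpath expressions): a minimal example run on an inline HTML string, illustrating the expressions listed above.

    from lxml import etree
    
    html = """
    <div id="content">
        <ul class="clearfix">
            <li><a href="/1.html">first</a></li>
            <li><a href="/2.html">second</a></li>
        </ul>
    </div>
    """
    tree = etree.HTML(html)
    
    print(tree.xpath("/html/body/div/ul/li"))           # absolute path (etree.HTML wraps the fragment in html/body)
    print(tree.xpath("//li"))                            # shortened path
    print(tree.xpath("//ul[@class='clearfix']/li[1]"))   # attribute + index positioning (1-based)
    print(tree.xpath("//a/text()"))                      # text extraction
    print(tree.xpath("//a/@href"))                       # attribute extraction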
  • Code example (4K image crawling)

    import json
    from lxml import etree
    import requests
    
    if __name__ == "__main__":
        url = "https://pic.netbian.com/4kdongman/index_%d.html"
    
        headers = {
            "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
        }
        pageNum = 2
    
        for page in range(pageNum):
            if page == 0:
                new_url = "https://pic.netbian.com/4kdongman/"
            else:
                new_url = url % (page + 1)
    
            response = requests.get(url=new_url, headers=headers)
    
            html_code = response.text
    
            tree = etree.HTML(html_code)
    
            urlList = tree.xpath("//ul[@class=\"clearfix\"]//img//@src")
    
            urlHead = "https://pic.netbian.com"
            path = r"C:\Users\Y_ch\Desktop\spider_test\data\pic\4K\\"
            picUrlList = []
            for index, eachUrl in enumerate(urlList):
                picUrl = urlHead + eachUrl
                picUrlList.append(picUrl)
    
            for index, picUrl in enumerate(picUrlList):
                picReq = requests.get(url=picUrl, headers=headers)
                pic = picReq.content
    
                picPath = path + str(page)+ "." +str(index) + ".jpg"
                with open(picPath, 'wb') as fp:
                    fp.write(pic)
                    print("The first%d Page%d Pictures downloaded successfully!" % ((page + 1),index + 1))
    
    
    
    

Verification code recognition

  • Verification codes (captchas) are an anti-crawling mechanism used by portal websites

  • Crawl the captcha image, then recognize it with a third-party captcha recognition service

  • Code example

    import json
    import  requests
    from lxml import etree
    from verication import vercation
    
    if __name__ == "__main__":
        url = "https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
        }
        response = requests.get(url=url,headers=headers)
    
        tree = etree.HTML(response.text)
    
        varication_path = tree.xpath("//img[@id=\"imgCode\"]/@src")
        picUrl = "https://so.gushiwen.cn" + varication_path[0]
    
        pic = requests.get(url=picUrl,headers=headers).content
        print(vercation(pic=pic))
    
    
    
    
    
    #!/usr/bin/env python
    # coding:utf-8
    
    import requests
    from hashlib import md5
    
    class Chaojiying_Client(object):
    
        def __init__(self, username, password, soft_id):
            self.username = username
            password =  password.encode('utf8')
            self.password = md5(password).hexdigest()
            self.soft_id = soft_id
            self.base_params = {
                'user': self.username,
                'pass2': self.password,
                'softid': self.soft_id,
            }
            self.headers = {
                'Connection': 'Keep-Alive',
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
            }
    
        def PostPic(self, im, codetype):
            """
            im: Picture byte
            codetype: Topic type reference http://www.chaojiying.com/price.html
            """
            params = {
                'codetype': codetype,
            }
            params.update(self.base_params)
            files = {'userfile': ('ccc.jpg', im)}
            r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
            return r.json()
    
        def ReportError(self, im_id):
            """
            im_id: the image ID of a mis-recognized captcha
            """
            params = {
                'id': im_id,
            }
            params.update(self.base_params)
            r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
            return r.json()
    
    
    def vercation(pic,picCode=1902,picMoudle=None):
        chaojiying = Chaojiying_Client('1325650083', 'ych3362632', '94271a5f53dc7b0e34efdb06a88692c1')
        if picMoudle == None:
            return chaojiying.PostPic(pic, picCode)["pic_str"]
        else :
            im = open(pic, 'rb').read()  # pic is a local image file path; on Windows the path may need double backslashes
            return chaojiying.PostPic(im, picCode)["pic_str"]
    # if __name__ == '__main__':
    # 	chaojiying = Chaojiying_Client('1325650083', 'ych3362632', '94271a5f53dc7b0e34efdb06a88692c1')	#Generate a software ID in the user center and substitute it here
    # 	im = open('a.jpg', 'rb').read()													#Local image file path; replace a.jpg. On Windows the path may need double backslashes
    # 	print (chaojiying.PostPic(im, 1902))												#1902 is the captcha type (see the official price page); in Python 3, print needs parentheses
    
    
    

Proxies

  • What is a proxy:

    • A proxy server
  • What proxies are for:

    • Break through the limits placed on your own IP
    • Hide your real IP
  • Proxy-related websites:

    • Kuaidaili
    • Xici proxy
    • www.goubanjia.com
  • Proxy anonymity levels:

    • Transparent: the server knows both the proxy IP and the real IP
    • Anonymous: the server knows the proxy IP but not the real IP
    • Elite (high anonymity): the server knows neither the proxy IP nor the real IP
  • In Python, the proxy IP is passed to requests through the proxies parameter

  • An http proxy can only be used for http servers, and an https proxy only for https servers

  • Code example

    from lxml import etree
    import requests
    
    if __name__ == "__main__":
        url = "https://www.baidu.com/s?wd=ip"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
        }
        proxies = {
            "https":"221.110.147.50:3128"
        }
    
        response = requests.get(url=url,headers=headers,proxies=proxies)
    
        with open(r"C:\Users\Y_ch\Desktop\spider_test\dd.html",'w') as fp:
            fp.write(response.text)
    

Asynchronous crawler

  • Purpose: use asynchrony in the crawler to achieve high-performance data crawling

Asynchronous crawler modes

  • Multithreading / multiprocessing (not recommended):
    • Benefit: a thread or process can be opened for each blocking operation so it runs asynchronously
    • Drawback: threads and processes cannot be opened without limit
  • Thread pool:
    • Benefit: reduces the overhead of creating and destroying threads or processes, lowering the load on the system
    • Drawback: the number of threads in the pool has an upper limit
  • Single thread + asynchronous coroutines (see the sketch after this list)
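  • Code sketch (single thread + coroutines): a minimal asyncio + aiohttp sketch of the third mode; aiohttp is an extra dependency and the urls are placeholders.

    import asyncio
    import aiohttp
    
    urls = [
        "https://www.example.com/page1",
        "https://www.example.com/page2",
    ]
    
    async def fetch(session, url):
        # Non-blocking request: while waiting for this response,
        # the event loop can run other coroutines
        async with session.get(url) as response:
            return await response.text()
    
    async def main():
        async with aiohttp.ClientSession() as session:
            tasks = [asyncio.ensure_future(fetch(session, url)) for url in urls]
            pages = await asyncio.gather(*tasks)
            print([len(page) for page in pages])
    
    asyncio.run(main())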

selenium module

  • Browser driver (Chrome):

    • http://chromedriver.storage.googleapis.com/index.html
  • Relationship between selenium and crawlers:

    • Conveniently fetches dynamically loaded data from a website (including js-rendered content that etree and BeautifulSoup cannot parse from the static source)
    • Convenient for simulating login
  • Sample code (crawling pear video):

    from selenium import webdriver
    from lxml import etree
    import requests
    import time
    from multiprocessing.dummy import Pool
    """
        Crawling with a thread pool is easily intercepted by anti-crawler measures!!!
    
    
    """
    headers = {
            "Useer-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
        }
    
    def getUrlList():
        url = "https://www.pearvideo.com/category_5"
    
        response = requests.get(url=url, headers=headers)
        htmlHead = 'https://www.pearvideo.com/'
        initHtml = etree.HTML(response.text)
        videoUrlList = initHtml.xpath("//ul[@class=\"category-list clearfix\"]//a/@href")
        print((videoUrlList))
    
        videoHtml = []
        for each in videoUrlList:
            videoHtml.append(htmlHead + each)
    
        return videoHtml
    
    
    def get_video(url):
        if url == None:
            return
        bro.get(url=url)
        page_text = bro.page_source
        tree = etree.HTML(page_text)
    
        try:
            videoUrl = tree.xpath("//div[@class=\"main-video-box\"]/div//@src")[0]
            name = tree.xpath("//h1[@class=\"video-tt\"]/text()")[0]
            video = requests.get(url=videoUrl, headers=headers).content
            path = r"C:\Users\Y_ch\Desktop\spider_test\data\video\pear\\" + name + ".mp4"
            with open(path, 'wb') as fp:
                fp.write(video)
                print(name + " Video download succeeded!")
        except IndexError as e:
            print(url)
    
    
    
    bro = webdriver.Chrome('./chromedriver.exe')
    
    url = getUrlList()
    get_video(url[1])
    pool = Pool(len(url))
    print(len(url))
    pool.map(get_video,url)
    pool.close()
    pool.join()
    
    time.sleep(10)
    bro.close()
    
  • Initiating a request:

    • Request a url with the get() method (a combined sketch of these calls follows this list)
  • Tag positioning:

    • Get the specified tag element with the find_element family of functions
  • Tag interaction:

    • Interact with a tag via send_keys("xxx")
  • Executing js code:

    • Run js on the page with execute_script("")
  • Page back and forward:

    • back()
    • forward()
  • Closing the browser:

    • close()
  • Saving a screenshot of the page:

    • save_screenshot("./filepath")
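  • Code sketch (basic operations): a minimal walk-through of the calls above, written against the same selenium 3 style API used elsewhere in this article; the Baidu url and the "kw" element id are only illustrative.

    from selenium import webdriver
    import time
    
    bro = webdriver.Chrome("./chromedriver.exe")
    
    bro.get("https://www.baidu.com")        # initiate a request
    box = bro.find_element_by_id("kw")      # locate a tag
    box.send_keys("python")                 # interact with the tag
    bro.execute_script("window.scrollTo(0, document.body.scrollHeight)")  # execute js
    bro.save_screenshot("./page.png")       # save a screenshot of the page
    bro.back()                              # go back
    bro.forward()                           # go forward
    
    time.sleep(2)
    bro.close()                             # close the browser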

iframe handling

  • If the target tag is nested inside an iframe sub-page, the bro.find_element family cannot locate it directly; switch into the frame first:

    • bro.switch_to.frame("iframeResult") #Switch into the iframe
      bro.find_element_by_id("1")	
      

Action chains

  • When mouse or drag actions need to be performed in the browser, use webdriver's ActionChains

  • Code example:

    def drop_test():
        bro = webdriver.Chrome("chromedriver.exe")
        bro.get("https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable")
    
    
        bro.switch_to.frame("iframeResult")
        div = bro.find_element_by_id("draggable")
    
        #Construct action chain
        action = ActionChains(bro) #Construct action chain instance
        action.click_and_hold(div) #Click and hold
    
        for i in range(5):
            #move_by_offset():
                #xoffset,yoffset: pixels moving in two directions
            #perform():
                #Execute now
            action.move_by_offset(xoffset=18,yoffset=0).perform()
            time.sleep(0.3)
    
        #Release action chain
        action.release().perform()
        time.sleep(5)
        bro.close()
        print(div)
    

Headless browser

  • Runs the browser without a visible interface

  • Add code:

    from selenium.webdriver.chrome.options import  Options
    
    chrome_option = Options()
    chrome_option.add_argument("--headless")
    chrome_option.add_argument("--disable-gpu")
    bro = webdriver.Chrome("./chromedriver.exe",chrome_options=chrome_option) #Add the attribute of chrome_options to the instantiation of driver
    
    

Evading selenium detection

  • Some websites detect and refuse requests coming from selenium, so selenium cannot get a connection to the server; the following code is needed to evade detection

  • Add code:

    # Chrome versions before 79
    def evade():
        option = ChromeOptions()
        option.add_experimental_option("excludeSwitches",["enable-automation"])
        bro = webdriver.Chrome("./chromedriver.exe",options=option)
    
    
    # Chrome versions 79 and later
    from selenium import webdriver
    driver = webdriver.Chrome()
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            })
        """
    })
    

Scrapy framework

Getting started with Scrapy

  • What is a framework:

    • A project template that integrates many functions and is highly general-purpose
  • What is Scrapy:

    • An encapsulated crawler framework
    • Features:
      1. High-performance persistent storage
      2. Asynchronous data download
      3. High-performance data parsing
      4. Distributed crawling
  • Scrapy installation (Windows):

    • pip install wheel
    • At https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted, download the Twisted wheel matching your Python version, put it in a directory, and run

      pip install <Twisted-20.3.0-cp36-cp36m-win_amd64.whl>

  • pip install pywin32
  • pip install scrapy
  • Creating a Scrapy project:

    1. Enter a Python environment that has Scrapy installed

    2. Run scrapy startproject xxxPro

    3. The project is created

  • Create the spider source file in the project's spiders subdirectory:

    scrapy genspider <spiderName> <start url>

  • Running the project:

    scrapy crawl <spiderName>

  • Explanation of relevant initial parameters:

    #The name of the crawler file, which is the unique identifier of the crawler source file
        name = 'test'
    
        #The urls the spider is allowed to request; urls outside this list will not be requested (this parameter is usually not used)
        allowed_domains = ['www.xxx.com'] #In general, this list should be commented out
    
        #The starting url of the crawler file, that is, the url that the crawler automatically accesses
        start_urls = ['http://www.xxx.com/']
    
  • Set the robots.txt compliance parameter to False

    ROBOTSTXT_OBEY = False #Must be set to False, otherwise requests disallowed by robots.txt will be refused
    
  • Hide the log output when running the spider:

    scrapy crawl <spiderName> --nolog

    Defect: if a response errors out, there is no prompt at all

    To fix this defect of --nolog, use the following instead:

    Set LOG_LEVEL = "ERROR" in settings.py

  • The response for each requested url is passed to parse(); parse it with response.xpath and pull the values out with extract()

        def parse(self, response):
            div_list = response.xpath("//div[@class=\"content\"]/span/text()").extract()
            print(''.join(div_list))
    

Persistent storage of Scrapy data

  • Scrapy persistent storage:

    • Terminal-based (command line) storage:

      scrapy crawl <spiderName> -o <filePath>

      Note:

      1. Only the **return value** of the parse function can be stored, and only to **local files (not directly to a database)**
      2. Only the following file types can be stored: ['json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle']
      
    • Pipeline based storage:

      • Coding process:

        1. Data parsing
        2. Define the relevant fields in the item class for data encapsulation
        3. Encapsulate the parsed data into the item
        4. Submit the item object to the pipeline for persistent storage
        5. Save the data in the pipeline class's process_item method
        6. Enable the pipeline in the configuration file

Example: saving to a local file

  ```python
  #item.py
  class QiuabaiproItem(scrapy.Item):
      # define the fields for your item here like:
      # name = scrapy.Field()
      content = scrapy.Field() #Data encapsulation
  
  ```
  
  ```python
  #pipelines.py
  class QiuabaiproPipeline(object):
      fp = None
      def open_spider(self,spider):
          print("start")
          self.fp = open("./qiubi.txt",'w',encoding='utf-8')
  
      def process_item(self, item, spider):
          content = item["content"]
          self.fp.write(content)
          return item
  
      def close_spider(self,spider):
          print("finsih")
          self.fp.close()
  
  ```
  
  ```python
  #Open the pipe
  #settings.py
  ITEM_PIPELINES = {
     'qiuabaiPro.pipelines.QiuabaiproPipeline': 300, #The following value is the priority. The smaller the value, the higher the priority
  }
  
  ```
  
  ```python
  #parse.py (method of the spider class)
  #Submit items to the pipeline with the yield keyword
  #Requires: from qiuabaiPro.items import QiuabaiproItem
  def parse(self, response):
      div_list = response.xpath("//div[@class=\"content\"]/span/text()").extract()
      for content in div_list:
          item = QiuabaiproItem()
          item["content"] = content
          yield item  #each yielded item is handed to the pipeline
  ```

Example: saving to a database

```python
#pipelines.py
import pymysql

class MysqlPipeline(object):
    conn = None
    cursor = None
    def open_spider(self,spider):
        self.conn = pymysql.Connect(host='localhost',port=3307,user="root",passwd="ych3362632",db="test",charset="utf8") #Must be utf8 here, not utf-8
    def process_item(self,item,spider):
        self.cursor = self.conn.cursor()
        try:
            print(len(item["name"]))
            self.cursor.execute("insert into spider (`name`) values (\"%s\")" % item["name"])
            self.conn.commit()

        except Exception as e:
            print(e)
            self.conn.rollback()

        return item

    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()
```

```python
#settings.py
ITEM_PIPELINES = {
   'qiuabaiPro.pipelines.QiuabaiproPipeline': 300,
   'qiuabaiPro.pipelines.MysqlPipeline': 301,  #Register the new pipeline class in settings; the higher-priority pipeline's process_item must return the item so the lower-priority pipeline receives it
}
```

Storage summary

  • There are two ways to implement persistent storage:
    • Command-line form (parse must return the data, the storage file type is fixed, and it cannot go into a database)
    • Pipeline form: has every advantage except the extra configuration work
  • Interview question: how to store one copy of the crawled data locally and another copy in a database:
    • Create two pipeline classes and register both in the configuration file
    • When multiple pipeline classes store data, the higher-priority process_item must return the item so the lower-priority pipeline class can receive the item data

Whole-site data crawling

  • The start url is usually the url of the site's home page; the url list is built from the index or page-number pattern of the site

  • Recursively request the remaining pages with the scrapy.Request method

    class BeautiSpider(scrapy.Spider):
        name = 'beauti'
        # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.duodia.com/daxuexiaohua/']
        url = "https://www.duodia.com/daxuexiaohua/list_%d.html"
        pageNum = 1
        def parse(self, response):
            div_list = response.xpath("//*[@id=\"main\"]/div[2]/article")
            for div in div_list:
                name = div.xpath(".//a/@title").extract()
                print("".join(name))
    
            if self.pageNum <= 5:
                new_url = self.url % self.pageNum
                self.pageNum += 1
                yield scrapy.Request(url=new_url,callback=self.parse) #Recursive call, and callback is specially used for data parsing
    
    

Five core components

(Diagram of the five Scrapy core components: the original image link is broken and the picture could not be recovered.)

  • Engine (Scrapy):
    • Handles the data flow of the whole system and triggers events (the core of the framework)
  • Scheduler:
    • Receives requests from the engine, pushes them into a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs: it decides which URL to crawl next and removes duplicate URLs
  • Downloader:
    • Downloads web page content and returns it to the spider (the Scrapy downloader is built on Twisted, an efficient asynchronous model)
  • Item pipeline:
    • Processes the entities (items) extracted from pages by the spider: mainly persisting them, validating them, and stripping unneeded information. When a page has been parsed by the spider, it is sent to the item pipeline and processed in several specific steps in order
  • Spider:
    • Extracts the needed information, the so-called entity (item), from specific web pages. It can also extract links from them and let Scrapy continue to crawl the next page

Passing request parameters (meta)

  • When crawling a whole site, crawling detail pages requires passing parameters along with the request, i.e. passing the item object between different callback functions

  • Code implementation:

     # allowed_domains = ['www.xxx.com']
        start_urls = ['https://www.zhipin.com/job_detail/?query=python']
        home_url = "https://www.zhipin.com/"
    
    
        def detail_parse(self,response):
            item = response.meta["item"]
            content = response.xpath("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/div//text()").extract()
            item["content"] = content
            yield item
    
    
    def parse(self, response):
        print(response)
        li_list = response.xpath("//*[@id=\"main\"]/div/div[3]//li")
        for li in li_list:
            name_div = li.xpath(".//span[@class=\"job-title\"]")
            title = name_div.xpath("./span[@class=\"job-name\"]/a/@title").extract_first()
            name = name_div.xpath("./span[@class=\"job-area-wrapper\"]/span/text()").extract_first()
            li_name = title + " " + name


            detail = li.xpath(".//div[@class=\"primary-wrapper\"]/div/@href").extract_first()
            new_url = "https://www.zhipin.com/" + detail
            item = BoproItem()
            item["name"] = li_name

            yield scrapy.Request(url=new_url,callback=self.detail_parse,meta={"item":item}) #item is passed to the other callback via meta
    

Using the image pipeline

  • Use the ImagesPipeline class from scrapy.pipelines.images to request and download images from their urls automatically

  • You need to subclass ImagesPipeline and override its methods

  • Set the image storage path in settings

    #pipelines.py
    from scrapy.pipelines.images import ImagesPipeline
    import scrapy
    class ImageLine(ImagesPipeline):
    
        #Request according to picture address
        def get_media_requests(self, item, info):
            yield scrapy.Request(item["src"][0]) #Note: this must be scrapy.Request, not requests!!!
    
        #Specify picture storage path
        def file_path(self, request, response=None, info=None, *, item=None):
            return item["name"][0] + ".jpg"
    
        def item_completed(self, results, item, info):
            return item #Pass the item on to the next pipeline to process
    
    #settings.py
    ITEM_PIPELINES = {
       'imagePro.pipelines.ImageLine': 300,
    }
    IMAGES_STORE = "./data/pic/beauty"
    

Middleware usage:

  • Intercept request:

    • UA spoofing: process_request

       def process_request(self, request, spider): #Conduct UA camouflage
              request.headers["User-Agent"] = xxx
              return None
      
    • Proxy IP: process_exception

       def process_exception(self, request, exception, spider): #IP replacement
             request.meta["proxy"] = xxx
             return request  #Resend the corrected request
      
  • Intercept response:

    • Tamper with the response data in process_response

       def process_response(self, request, response, spider):
              #Pick out the responses that need modifying:
              #identify the target request by its url,
              #then return a replacement response for it
      
              if request.url in spider.href_list: #Get dynamically loaded pages
                  bro = spider.bro
                  bro.get(request.url)
                  page_text = bro.page_source
                  new_response = HtmlResponse(url=request.url,body=page_text,encoding="utf-8",request=request)
                  return new_response
              else:
                  return response
      
      
