Response in Scrapy and its subclasses (TextResponse, HtmlResponse, XmlResponse)

Keywords: Python Selenium encoding Javascript

Today, while scraping wallpapers (url: http://pic.netbian.com/4kmein... ), I ran into a few problems. I am recording them here so that others can discuss them and learn from my mistakes.

Because the site is rendered dynamically, I chose to pair Scrapy with Selenium. Scrapy fetches pages much like the requests library does, by issuing plain HTTP requests, so on its own it cannot fetch content that is rendered by JavaScript.

So in the Downloader Middleware, process_request has to take a Request and return a Response. The problem is the Response. The official documentation lists class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None]), so I imported it with from scrapy.http import Response.

Running scrapy crawl girl
produces the following error:
results=response.xpath('//*[@id="main"]/div[3]/ul/li/a/img')
raise NotSupported("Response content isn't text")
scrapy.exceptions.NotSupported: Response content isn't text
Here is the relevant code:

# middlewares.py
from scrapy import signals
from scrapy.http import Response
from scrapy.exceptions import IgnoreRequest
import selenium
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
class Pic4KgirlDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        try:
            self.browser = selenium.webdriver.Chrome()
            self.wait = WebDriverWait(self.browser, 10)

            self.browser.get(request.url)
            # Wait until the pagination link is present, i.e. the page has rendered.
            self.wait.until(EC.presence_of_element_located(
                (By.CSS_SELECTOR, '#main > div.page > a:nth-child(10)')))
            # This builds a plain Response, which is what triggers the error above.
            return Response(url=request.url, status=200, request=request,
                            body=self.browser.page_source.encode('utf-8'))
        # except:
        #     raise IgnoreRequest()
        finally:
            self.browser.close()

The problem appears to lie in this line:
return Response(url=request.url,status=200,request=request,body=self.browser.page_source.encode('utf-8'))
Looking at the definition of the Response class, we find:

    @property
    def text(self):
        """For subclasses of TextResponse, this will return the body
        as text (unicode object in Python 2 and str in Python 3)
        """
        raise AttributeError("Response content isn't text")

    def css(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")

    def xpath(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")

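This shows that the base Response class does not support selectors: its css() and xpath() methods always raise NotSupported, and text raises AttributeError. Only TextResponse and its subclasses implement them, so one of those subclasses has to be used instead.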

Response subclasses:

**TextResponse object**
class scrapy.http.TextResponse(url[, encoding[, ...]])
**HtmlResponse object**
class scrapy.http.HtmlResponse(url[, ...])
**XmlResponse object**
class scrapy.http.XmlResponse(url[, ...])

Next, examine the definition of TextResponse. Import it with
from scrapy.http import TextResponse
and look at its source, which contains:

class TextResponse(Response):

    _DEFAULT_ENCODING = 'ascii'

    def __init__(self, *args, **kwargs):
        self._encoding = kwargs.pop('encoding', None)
        self._cached_benc = None
        self._cached_ubody = None
        self._cached_selector = None
        super(TextResponse, self).__init__(*args, **kwargs)

Here the xpath and css methods are overridden:

    @property
    def selector(self):
        from scrapy.selector import Selector
        if self._cached_selector is None:
            self._cached_selector = Selector(self)
        return self._cached_selector

    def xpath(self, query, **kwargs):
        return self.selector.xpath(query, **kwargs)

    def css(self, query):
        return self.selector.css(query)

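So a response that will be queried with xpath() or css() must be built from one of the subclasses (TextResponse, HtmlResponse or XmlResponse), not from the base Response class. These subclasses already override the selector methods, so no further rewriting is needed.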

Scrapy Crawler Introduction Tutorial 11: Request and Response

scrapy document: https://doc.scrapy.org/en/lat...
Chinese Translation Documents: https://blog.csdn.net/Inke88/...

Posted by Cep on Sun, 10 Mar 2019 08:03:25 -0700