While scraping wallpapers today (url: http://pic.netbian.com/4kmein... ) I ran into a problem, which I am recording here so that others can discuss it and learn from it.
Because the website is rendered dynamically, I chose to hook Selenium into Scrapy (Scrapy fetches pages much like the requests library does, by issuing HTTP requests directly, so on its own it cannot fetch content rendered by JavaScript).
So in the Downloader Middleware, process_request needs to take a Request and return a Response. The problem is the Response. The official documentation lists class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None]), so I imported it with from scrapy.http import Response.
Run scrapy crawl girl
The following error is obtained:
    results=response.xpath('//*[@id="main"]/div[3]/ul/li/a/img')
    raise NotSupported("Response content isn't text")
    scrapy.exceptions.NotSupported: Response content isn't text
Check the relevant code:
    # middlewares.py
    from scrapy import signals
    from scrapy.http import Response
    from scrapy.exceptions import IgnoreRequest
    import selenium
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC


    class Pic4KgirlDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.

        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.

            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            try:
                self.browser = selenium.webdriver.Chrome()
                self.wait = WebDriverWait(self.browser, 10)
                self.browser.get(request.url)
                self.wait.until(EC.presence_of_element_located(
                    (By.CSS_SELECTOR, '#main > div.page > a:nth-child(10)')))
                return Response(url=request.url, status=200, request=request,
                                body=self.browser.page_source.encode('utf-8'))
            # except:
            #     raise IgnoreRequest()
            finally:
                self.browser.close()
The problem most likely lies in this line:
    return Response(url=request.url,status=200,request=request,body=self.browser.page_source.encode('utf-8'))
Looking at the definition of Response class, we find that:
    @property
    def text(self):
        """For subclasses of TextResponse, this will return the body
        as text (unicode object in Python 2 and str in Python 3)
        """
        raise AttributeError("Response content isn't text")

    def css(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")

    def xpath(self, *a, **kw):
        """Shortcut method implemented only by responses whose content
        is text (subclasses of TextResponse).
        """
        raise NotSupported("Response content isn't text")
This shows that the base Response class cannot be used directly for parsing: its text, css and xpath members only raise errors, and they work only on a subclass that overrides them.
Response subclasses:
**TextResponse object** class scrapy.http.TextResponse(url[, encoding[, ...]])
**HtmlResponse object** class scrapy.http.HtmlResponse(url[, ...])
**XmlResponse object** class scrapy.http.XmlResponse(url[, ...])
Import TextResponse and examine its definition:
    from scrapy.http import TextResponse
Its constructor reads:
    class TextResponse(Response):

        _DEFAULT_ENCODING = 'ascii'

        def __init__(self, *args, **kwargs):
            self._encoding = kwargs.pop('encoding', None)
            self._cached_benc = None
            self._cached_ubody = None
            self._cached_selector = None
            super(TextResponse, self).__init__(*args, **kwargs)
TextResponse also overrides the xpath and css methods:
    @property
    def selector(self):
        from scrapy.selector import Selector
        if self._cached_selector is None:
            self._cached_selector = Selector(self)
        return self._cached_selector

    def xpath(self, query, **kwargs):
        return self.selector.xpath(query, **kwargs)

    def css(self, query):
        return self.selector.css(query)
So to call xpath() or css() on a response built in a middleware, do not construct the base Response class; construct one of its subclasses (TextResponse, HtmlResponse or XmlResponse), which override these methods.
Scrapy Crawler Introduction Tutorial 11: Request and Response
scrapy document: https://doc.scrapy.org/en/lat...
Chinese Translation Documents: https://blog.csdn.net/Inke88/...