Scrapy Crawler and Case Analysis


With the rapid development of the Internet, information is accumulating on a massive scale. We often need to obtain large amounts of data from the outside world and filter out the useless parts, which means crawling specifically for the data that is useful to us. That is what crawling techniques are for: they let us get the data we need quickly. During the crawling process, however, the information owners will push back with anti-crawling measures, so we have to break through these obstacles one by one.

A while ago I did some crawling-related work; here are some of the experiences I gathered.

The code for this case is available at https://github.com/yangtao9502/ytaoCrawl

Here I use the Scrapy framework for crawling. Development environment versions:

Scrapy       : 1.5.1
lxml         : 4.2.5.0
libxml2      : 2.9.8
cssselect    : 1.0.3
parsel       : 1.5.1
w3lib        : 1.20.0
Twisted      : 18.9.0
Python       : 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
pyOpenSSL    : 18.0.0 (OpenSSL 1.1.1a  20 Nov 2018)
cryptography : 2.4.2
Platform     : Windows-10-10.0.15063-SP0

For the local development environment, I recommend installing the related packages with Anaconda; otherwise you may run into dependency conflicts. I'm sure you have experienced them before and lost all interest in crawling while wrestling with the environment setup.
This article extracts page data mainly with XPath, so before working through the case, make sure you understand the basics of XPath.
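
For a quick refresher, here is a minimal, self-contained sketch of how XPath expressions are used with Scrapy's Selector (the HTML fragment below is made up purely for illustration):

from scrapy.selector import Selector

# A made-up HTML fragment, only used to demonstrate XPath syntax
html = """
<ul class='house-list'>
    <li class='house-cell'><h2><a>Cozy room near the subway</a></h2><b>2300</b></li>
</ul>
"""

sel = Selector(text=html)
# // searches the whole document, [@class='...'] filters by attribute, text() selects the text node
titles = sel.xpath("//li[@class='house-cell']/h2/a/text()").extract()
prices = sel.xpath("//li[@class='house-cell']/b/text()").extract()
print(titles, prices)  # ['Cozy room near the subway'] ['2300']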

Create Scrapy Project

Creating a project with Scrapy is easy and takes only one command. Next, let's create the ytaoCrawl project:

scrapy startproject ytaoCrawl

Note that the project name must start with a letter and contain only letters, numbers, and underscores.
After successful creation, the interface displays:

The files that initialize the project are:

The purpose of each of these files:

  • The spiders directory is used to store crawler files.
  • The items.py file defines the model objects in which the crawled data is stored.
  • The middlewares.py file contains the middleware, where operations such as transforming requests and responses are implemented.
  • The pipelines.py file defines the data pipelines used to process and persist the captured data.
  • The settings.py file is the configuration file where the crawler's settings can be adjusted.
  • The scrapy.cfg file is the configuration file used for deploying the crawler.

With these default generated files understood, take a look at the Scrapy schematic diagram below for a better picture of how they fit together.

That completes the creation of our Scrapy crawler project.

Create Spider

Let's start by creating a Python file, ytaoSpider, whose class must inherit from scrapy.Spider. Next, we will take crawling rental listings on 58.com in Beijing as an example for analysis.

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# 
# @Author  : YangTao
# @blog    : https://ytao.top
# 
import scrapy

class YtaoSpider(scrapy.Spider):
    # Define crawl name
    name = "crawldemo"
    # Allow crawling of domain names without links in start_urls
    allowed_domains = ["58.com"]
    # Start Crawling Links
    start_urls = [
        "https://bj.58.com/chuzu/?PGTID=0d100000-0038-e441-0c8a-adeb346199d8&ClickID=2"
    ]

    def download(self, response, fName):
        with open(fName + ".html", 'wb') as f:
            f.write(response.body)

    # response is the object that returns the capture
    def parse(self, response):
        # Download the Beijing Rental page to your local location for easy analysis
        self.download(response, "Rental in Beijing")

Start the crawl by executing the command, specifying the name of the crawl:

scrapy crawl crawldemo

When we have more than one crawler, we can list all the crawler names with scrapy list.
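
For example, run the following from the project directory; it simply prints the name attribute of every spider in the project:

scrapy list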

The main function can also be used to start the crawler from within the editor during development:

from scrapy import cmdline

if __name__ == '__main__':
    name = YtaoSpider.name
    cmd = 'scrapy crawl {0} '.format(name)
    cmdline.execute(cmd.split())

The crawled page will then be downloaded into the directory from which we started the crawler.

Page Flip Crawl

Above we only crawled the first page, but real crawling inevitably involves pagination. Observing the site, we can see that it displays the number of the last page (58.com only shows the first 70 pages of data), as shown in the figure.

The partial html code for paging is observed in the following figure.

Next, get the number of the last page by combining XPath with regular-expression matching.

def pageNum(self, response):
    # Get the html code block for paging
    page_ele = response.xpath("//li[@id='pager_wrap']/div[@class='pager']")
    # Get text with page number digits regularly
    num_eles = re.findall(r">\d+<", page_ele.extract()[0].strip())
    # Find the largest one
    count = 0
    for num_ele in num_eles:
        num_ele = str(num_ele).replace(">", "").replace("<", "")
        num = int(num_ele)
        if num > count:
            count = num
    return count

By analyzing the rental links, we can see that the URLs of the different pages follow the pattern https://bj.58.com/chuzu/pn + num, where num stands for the page number. To crawl a different page we only need to change that number, so the parse function can be changed to:

# Requires "from scrapy import Request" and "import logging" at the top of the spider file

# Crawler link without the page number
target_url = "https://bj.58.com/chuzu/pn"

def parse(self, response):
    print("url: ", response.url)
    num = self.pageNum(response)
    # The start page is already page one, so skip it while iterating
    p = 1
    while p < num:
        p += 1
        try:
            # Build the link to the next page
            url = self.target_url + str(p)
            # Schedule a request for the next page
            yield Request(url, callback=self.parse)
        except BaseException as e:
            logging.error(e)
            print("Crawl data exception:", url)

After execution, the printed information is as follows:

Because crawler requests are asynchronous, the data is not printed in order.
The approach above traverses the pages by reading the number of the last page, but some websites do not expose a last-page number. In that case we can use the "next page" element to decide whether the current page is the last one; if it is not, we extract the link carried by "next page" and crawl it.
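
A minimal sketch of that approach (the XPath for the "next page" element is an assumption and must be adjusted to the actual site):

def parse(self, response):
    # ... extract the data of the current page here ...

    # Hypothetical selector for the "next page" link; adapt it to the target site
    next_href = response.xpath("//a[@class='next']/@href").extract_first()
    if next_href:
        # Not the last page yet, so follow the link carried by the "next page" element
        yield Request(response.urljoin(next_href), callback=self.parse)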

Get Data

Here we capture the title, area, position, residential quarter, and price, so we first need to define these fields in the item. Without further ado, here is the code.
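
For reference, a minimal sketch of what the YtaocrawlItem definition in items.py might look like (the exact definition is in the repository); the id and url fields are included because the pipeline shown later uses them, and the spider needs from ytaoCrawl.items import YtaocrawlItem:

import scrapy

class YtaocrawlItem(scrapy.Item):
    id = scrapy.Field()        # primary key, generated in the pipeline
    url = scrapy.Field()       # listing link, used for de-duplication in the pipeline
    title = scrapy.Field()     # listing title
    room = scrapy.Field()      # the measure of area
    position = scrapy.Field()  # position
    quarters = scrapy.Field()  # residential quarter
    price = scrapy.Field()     # money + unit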

# Avoid index out of bounds when xpath parses data
def xpath_extract(self, selector, index):
    if len(selector.extract()) > index:
        return selector.extract()[index].strip()
    return ""

def setData(self, response):
    items = []
    houses = response.xpath("//ul[@class='house-list']/li[@class='house-cell']")
    for house in houses:
        item = YtaocrawlItem()
        # Title
        item["title"] = self.xpath_extract(house.xpath("div[@class='des']/h2/a/text()"), 0)
        # The measure of area
        item["room"] = self.xpath_extract(house.xpath("div[@class='des']/p[@class='room']/text()"), 0)
        # position
        item["position"] = self.xpath_extract(house.xpath("div[@class='des']/p[@class='infor']/a/text()"), 0)
        # Residential quarters
        item["quarters"] = self.xpath_extract(house.xpath("div[@class='des']/p[@class='infor']/a/text()"), 1)
        money = self.xpath_extract(house.xpath("div[@class='list-li-right']/div[@class='money']/b/text()"), 0)
        unit = self.xpath_extract(house.xpath("div[@class='list-li-right']/div[@class='money']/text()"), 1)
        # Price
        item["price"] = money+unit
        items.append(item)
    return items

def parse(self, response):
    items = self.setData(response)
    for item in items:
        yield item
    
    # Next to the page flipping operation above...

At this point we have the data we want; print the item in parse to see the result.

Storing the Data

We have captured the data from the page; next we store it in a database. Here we take MySQL as an example; if the data volume is large, it is advisable to use other storage products.
First, set the ITEM_PIPELINES property in the settings.py configuration file to specify the pipeline processing class:

ITEM_PIPELINES = {
    # The smaller the value, the higher the priority call
   'ytaoCrawl.pipelines.YtaocrawlPipeline': 300,
}

Data persistence is handled in the YtaocrawlPipeline class; the code of the MySQL helper class mysqlUtils can be found on GitHub.
Data is passed to YtaocrawlPipeline#process_item for processing via yield item in YtaoSpider#parse.

import uuid

# select, delete_by_id and insert come from the project's mysqlUtils helper (see the repository)
class YtaocrawlPipeline(object):

    def process_item(self, item, spider):
        table = "crawl"
        item["id"] = str(uuid.uuid1())
        # If a link to the current crawl information exists in the library, delete the old one and save the new one
        list = select(str.format("select * from {0} WHERE url = '{1}'", table, item["url"]))
        if len(list) > 0:
            for o in list:
                delete_by_id(o[0], table)
        insert(item, table)
        return item
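
If you want to follow along without opening the repository, here is a minimal pymysql-based sketch of what such helpers might look like; the connection parameters are placeholders and the real implementation lives in the repo:

import pymysql

# Placeholder connection settings; replace with your own
conn = pymysql.connect(host="127.0.0.1", user="root", password="root", db="crawl", charset="utf8mb4")

def select(sql):
    with conn.cursor() as cursor:
        cursor.execute(sql)
        return cursor.fetchall()

def delete_by_id(pk, table):
    with conn.cursor() as cursor:
        cursor.execute("delete from {0} where id = %s".format(table), (pk,))
    conn.commit()

def insert(item, table):
    keys = ", ".join(item.keys())
    placeholders = ", ".join(["%s"] * len(item))
    sql = "insert into {0} ({1}) values ({2})".format(table, keys, placeholders)
    with conn.cursor() as cursor:
        cursor.execute(sql, list(item.values()))
    conn.commit()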

Looking at the database, we can see that the data has been successfully captured and stored.

Responding to Anti-Crawling Mechanisms

Wherever there is demand for crawled data, there will be anti-crawling measures. Let's analyze the ones encountered in the current case.

Font Encryption

From the screenshot of the database data above, you can see that some of the data is garbled. By observing the pattern of the garbled data, we can tell that it is the numbers that are encrypted.

At the same time, you can see characters such as \xa0 in the printed data. Checking against the ASCII encoding, \xa0 corresponds to a blank (non-breaking space) and lies outside the printable range 0x20~0x7e, another hint that the digits are being replaced with special characters.
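
A quick way to see what is really in those strings is to print the code point of each character. The sample string below is made up for illustration, using two of the code points that appear later in the font mapping:

# Two of the cmap code points from the mapping below, plus a non-breaking space
sample = "\u9476\u9f92\xa0"
for ch in sample:
    print(ch, hex(ord(ch)))  # 0x9476 and 0x9f92 are far outside 0x20~0x7e; 0xa0 is the blank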

Since we know the data is encrypted, we inspect the font-family declarations on the downloaded page and see the code shown in the following figure:

The fangchan-secret font looks suspicious: it is generated dynamically in JS and stored as base64, so let's decode the font and take a look.

import base64
from io import BytesIO
from fontTools.ttLib import TTFont

if __name__ == '__main__':
    secret = "AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8p/XQAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQXlvp9AAAA4AAAADZoaGVhCtADIwAAALwAAAAkaG10eC7qAAAAAAHkAAAALGxvY2ED7gSyAAAEQAAAABhtYXhwARgANgAAARgAAAAgbmFtZTd6VP8AAAfMAAACanBvc3QFRAYqAAAKOAAAAEUAAQAABmb+ZgAABLEAAAAABGgAAQAAAAAAAAAAAAAAAAAAAAsAAQAAAAEAAOOjpKBfDzz1AAsIAAAAAADaB9e2AAAAANoH17YAAP/mBGgGLgAAAAgAAgAAAAAAAAABAAAACwAqAAMAAAAAAAIAAAAKAAoAAAD/AAAAAAAAAAEAAAAKADAAPgACREZMVAAObGF0bgAaAAQAAAAAAAAAAQAAAAQAAAAAAAAAAQAAAAFsaWdhAAgAAAABAAAAAQAEAAQAAAABAAgAAQAGAAAAAQAAAAEERAGQAAUAAAUTBZkAAAEeBRMFmQAAA9cAZAIQAAACAAUDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBmRWQAQJR2n6UGZv5mALgGZgGaAAAAAQAAAAAAAAAAAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAAAAAABQAAAAMAAAAsAAAABAAAAaYAAQAAAAAAoAADAAEAAAAsAAMACgAAAaYABAB0AAAAFAAQAAMABJR2lY+ZPJpLnjqeo59kn5Kfpf//AACUdpWPmTyaS546nqOfZJ+Sn6T//wAAAAAAAAAAAAAAAAAAAAAAAAABABQAFAAUABQAFAAUABQAFAAUAAAACAAGAAQAAgAKAAMACQABAAcABQAAAQYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAiAAAAAAAAAAKAACUdgAAlHYAAAAIAACVjwAAlY8AAAAGAACZPAAAmTwAAAAEAACaSwAAmksAAAACAACeOgAAnjoAAAAKAACeowAAnqMAAAADAACfZAAAn2QAAAAJAACfkgAAn5IAAAABAACfpAAAn6QAAAAHAACfpQAAn6UAAAAFAAAAAAAAACgAPgBmAJoAvgDoASQBOAF+AboAAgAA/+YEWQYnAAoAEgAAExAAISAREAAjIgATECEgERAhIFsBEAECAez+6/rs/v3IATkBNP7S/sEC6AGaAaX85v54/mEBigGB/ZcCcwKJAAABAAAAAAQ1Bi4ACQAAKQE1IREFNSURIQQ1/IgBW/6cAicBWqkEmGe0oPp7AAEAAAAABCYGJwAXAAApATUBPgE1NCYjIgc1NjMyFhUUAgcBFSEEGPxSAcK6fpSMz7y389Hym9j+nwLGqgHButl0hI2wx43iv5D+69b+pwQAAQAA/+YEGQYnACEAABMWMzI2NRAhIzUzIBE0ISIHNTYzMhYVEAUVHgEVFAAjIiePn8igu/5bgXsBdf7jo5CYy8bw/sqow/7T+tyHAQN7nYQBJqIBFP9uuVjPpf7QVwQSyZbR/wBSAAACAAAAAARoBg0ACgASAAABIxEjESE1ATMRMyERNDcjBgcBBGjGvv0uAq3jxv58BAQOLf4zAZL+bgGSfwP8/CACiUVaJlH9TwABAAD/5gQhBg0AGAAANxYzMjYQJiMiBxEhFSERNjMyBBUUACEiJ7GcqaDEx71bmgL6/bxXLPUBEv7a/v3Zbu5mswEppA4DE63+SgX42uH+6kAAAAACAAD/5gRbBicAFgAiAAABJiMiAgMzNjMyEhUUACMiABEQACEyFwEUFjMyNjU0JiMiBgP6eYTJ9AIFbvHJ8P7r1+z+8wFhASClXv1Qo4eAoJeLhKQFRj7+ov7R1f762eP+3AFxAVMBmgHjLfwBmdq8lKCytAAAAAABAAAAAARNBg0ABgAACQEjASE1IQRN/aLLAkD8+gPvBcn6NwVgrQAAAwAA/+YESgYnABUAHwApAAABJDU0JDMyFhUQBRUEERQEIyIkNRAlATQmIyIGFRQXNgEEFRQWMzI2NTQBtv7rAQTKufD+3wFT/un6zf7+AUwBnIJvaJLz+P78/uGoh4OkAy+B9avXyqD+/osEev7aweXitAEohwF7aHh9YcJlZ/7qdNhwkI9r4QAAAAACAAD/5gRGBicAFwAjAAA3FjMyEhEGJwYjIgA1NAAzMgAREAAhIicTFBYzMjY1NCYjIga5gJTQ5QICZvHD/wABGN/nAQT+sP7Xo3FxoI16pqWHfaTSSgFIAS4CAsIBDNbkASX+lf6l/lP+MjUEHJy3p3en274AAAAAABAAxgABAAAAAAABAA8AAAABAAAAAAACAAcADwABAAAAAAADAA8AFgABAAAAAAAEAA8AJQABAAAAAAAFAAsANAABAAAAAAAGAA8APwABAAAAAAAKACsATgABAAAAAAALABMAeQADAAEECQABAB4AjAADAAEECQACAA4AqgADAAEECQADAB4AuAADAAEECQAEAB4A1gADAAEECQAFABYA9AADAAEECQAGAB4BCgADAAEECQAKAFYBKAADAAEECQALACYBfmZhbmdjaGFuLXNlY3JldFJlZ3VsYXJmYW5nY2hhbi1zZWNyZXRmYW5nY2hhbi1zZWNyZXRWZXJzaW9uIDEuMGZhbmdjaGFuLXNlY3JldEdlbmVyYXRlZCBieSBzdmcydHRmIGZyb20gRm9udGVsbG8gcHJvamVjdC5odHRwOi8vZm9udGVsbG8uY29tAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AFIAZQBnAHUAbABhAHIAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAVgBlAHIAcwBpAG8AbgAgADEALgAwAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AEcAZQBuAGUAcgBhAHQAZQBkACAAYgB5ACAAcwB2AGcAMgB0AHQAZgAgAGYAcgBvAG0AIABGAG8AbgB0AGUAbABsAG8AIABwAHIAbwBqAGUAYwB0AC4AaAB0AHQAcAA6AC8ALwBmAG8AbgB0AGUAbABsAG8ALgBjAG8AbQAAAAIAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACwECAQMBB
AEFAQYBBwEIAQkBCgELAQwAAAAAAAAAAAAAAAAAAAAA"
    # Convert font file encoding to UTF-8 encoded byte objects
    bytes = secret.encode(encoding='UTF-8')
    # base64-bit decoding
    decodebytes = base64.decodebytes(bytes)
    # Initialize BytesIO with decodebytes, then use TTFont to parse the font library
    font = TTFont(BytesIO(decodebytes))
    # Mapping relationship of fonts
    font_map = font['cmap'].tables[0].ttFont.tables['cmap'].tables[0].cmap

    print(font_map)

Parsing the font with TTFont from the fontTools library yields the following mapping:

{
    38006: 'glyph00007',
    38287: 'glyph00005',
    39228: 'glyph00006',
    39499: 'glyph00003',
    40506: 'glyph00010',
    40611: 'glyph00001',
    40804: 'glyph00009',
    40850: 'glyph00004',
    40868: 'glyph00002',
    40869: 'glyph00008'
}

There are exactly ten entries in the mapping, corresponding to the digits 0~9. Looking for the pattern, though, the glyph names run from 1 to 10 rather than 0 to 9, so what exactly is the rule here? Also, the keys of the mapping are not hexadecimal character codes but plain numbers. Could they be the decimal form of those codes?
Next we verify this hypothesis by converting the hexadecimal codes obtained from the page into decimal and matching them against the mapping. They do match, and we find that the numeric part of the mapped value is exactly 1 greater than the corresponding digit on the page, so the real value is obtained by subtracting 1 from the number in the glyph name.
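A quick sanity check of that rule using values from the mapping above (0x9476 is simply one of the cmap keys written as a hexadecimal code point):

# The character U+9476 appears on the page where a digit should be
assert int("9476", 16) == 38006          # its decimal value is indeed a key of the mapping
# 38006 maps to 'glyph00007'; subtracting 1 from the numeric part gives the real digit 6
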
Putting this into code:

def decrypt(self, response, code):
    secret = re.findall("charset=utf-8;base64,(.*?)'\)", response.text)[0]
    code = self.secretfont(code, secret)
    return code

def secretfont(self, code, secret):
    # Convert font file encoding to UTF-8 encoded byte objects
    bytes = secret.encode(encoding='UTF-8')
    # base64-bit decoding
    decodebytes = base64.decodebytes(bytes)
    # Initialize BytesIO with decodebytes, then use TTFont to parse the font library
    font = TTFont(BytesIO(decodebytes))
    # Mapping relationship of fonts
    font_map = font['cmap'].tables[0].ttFont.tables['cmap'].tables[0].cmap
    chars = []
    for char in code:
        # Convert each character to decimal ASCII code
        decode = ord(char)
        # If there is an ASCII key in the mapping relationship, then this character has a corresponding font
        if decode in font_map:
            # Get the value of the map
            val = font_map[decode]
            # Get the numeric part by the rule, then subtract 1 to get the real value
            char = int(re.findall("\d+", val)[0]) - 1
        chars.append(char)
    return "".join(map(lambda s:str(s), chars))

Now we decrypt all the crawled fields before viewing them:
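
For instance, the price extraction in setData can be wrapped with the decrypt call, roughly like this (a sketch; only fields that actually contain obfuscated digits need it):

money = self.xpath_extract(house.xpath("div[@class='list-li-right']/div[@class='money']/b/text()"), 0)
unit = self.xpath_extract(house.xpath("div[@class='list-li-right']/div[@class='money']/text()"), 1)
# Decrypt the obfuscated digits before storing the price
item["price"] = self.decrypt(response, money + unit)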

As the image above shows, after decryption the garbled data problem is completely solved!

Verification Codes and IP Blocking

Verification codes generally fall into two categories: one type must be entered right at the start before you can proceed at all, while the other only appears after frequent requests and must be passed before the next request goes through.
For the first type, you have to crack the verification code to continue; for the second, besides cracking it, you can use a proxy to bypass the verification.
Anti-crawling that blocks IP addresses can likewise be bypassed with a proxy. For example, when the website above suspects that I might be a crawler, it intercepts the request with a verification code, as shown below:

Next, we bypass this with a random User-Agent and proxy IPs.
First set USER_AGENT in settings.py. Be careful not to mix desktop and mobile User-Agent values, otherwise the crawled data will be inconsistent, because the pages served to different clients are different:

USER_AGENT = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.10 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    # ......
]    

Then set up a middleware that assigns a random User-Agent to each request:

class RandomUserAgentMiddleware(object):
    def __init__(self, agents):
        self.agent = agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            agents=crawler.settings.get('USER_AGENT')
        )

    def process_request(self, request, spider):
        # Random Get Settings for a User-Agent
        request.headers.setdefault('User-Agent', random.choice(self.agent))

Set up the dynamic proxy IP middleware:

# Note: assumes "import random", "import logging" and access to the project settings module
# (e.g. from ytaoCrawl import settings) at the top of middlewares.py
class ProxyIPMiddleware(object):
    def __init__(self, ip=''):
        self.ip = ip

    def process_request(self, request, spider):
        # If the current address is redirected to the Authentication Code address, reapply using proxy ip
        if self.ban_url(request.url):
            # Get the redirected address
            redirect_urls = request.meta.get("redirect_urls")[0]
            # Change the current address redirected to Authentication Code to the original request address
            request._set_url(redirect_urls)
            # Set up dynamic proxies, which are generally generated online using interfaces
            request.meta["proxy"] = "http://%s" % (self.proxy_ip())

    def ban_url(self, url):
        # Authentication codes set in settings or banned page links, when encountered, the crawler will make a detour crawl
        dic = settings.BAN_URLS
        # Verify that the current request address is a verification code address
        for d in dic:
            if url.find(d) != -1:
                return True
        return False

    # Proxy dynamically generated ip:port
    def proxy_ip(self):
        # Simulate dynamic generation of proxy addresses
        ips = [
            "127.0.0.1:8888",
            "127.0.0.1:8889",
        ]
        return random.choice(ips)

    def process_response(self, request, response, spider):
        # Re-crawl if not responding successfully
        if response.status != 200:
            logging.error("Failure response: "+ str(response.status))
            return request
        return response

Finally, enable these middlewares in the settings configuration file:

DOWNLOADER_MIDDLEWARES = {
   'ytaoCrawl.middlewares.RandomUserAgentMiddleware': 500,
   'ytaoCrawl.middlewares.ProxyIPMiddleware': 501,
   'ytaoCrawl.middlewares.YtaocrawlDownloaderMiddleware': 543,
}

Setting up random User-Agent and dynamic IP bypass is now complete.

Deploy

By deploying the crawler project with scrapyd, crawlers can be managed remotely: started, stopped, logs viewed, and so on.
Before deploying, we need to install scrapyd with the command:

pip install scrapyd

After successful installation, you can see that the version is 1.2.1.

After deployment we also need a client to access the service, so install scrapyd-client:

pip install scrapyd-client

Modify the scrapy.cfg file:

[settings]
default = ytaoCrawl.settings

[deploy:localytao]
url = http://localhost:6800/
project = ytaoCrawl

# multiple deploy targets can be configured for batch deployment
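
For example, a sketch with an additional hypothetical target (the host address is a placeholder); each [deploy:name] section becomes a target you can pass to scrapyd-deploy:

[deploy:localytao]
url = http://localhost:6800/
project = ytaoCrawl

[deploy:remoteytao]
url = http://192.168.0.100:6800/
project = ytaoCrawl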

Start scrapyd:

scrapyd

If it is Windows, create a scrapyd-deploy.bat file under X:\xx\Scripts

@echo off
"X:\xx\python.exe" "X:\xx\Scripts\scrapyd-deploy" %1 %2

The project is deployed to the Scrapyd service:

scrapyd-deploy localytao -p ytaoCrawl

Remote start:
curl http://localhost:6800/schedule.json -d project=ytaoCrawl -d spider=ytaoSpider

After execution starts, you can view the crawler execution status and log at http://localhost:6800/

In addition to starting crawlers remotely, Scrapyd provides a rich API:

  • Query the crawler status of the service: curl http://localhost:6800/daemonstatus.json
  • Cancel a crawler: curl http://localhost:6800/cancel.json -d project=projectName -d job=jobId
  • List projects: curl http://localhost:6800/listprojects.json
  • Delete a project: curl http://localhost:6800/delproject.json -d project=projectName
  • List spiders: curl http://localhost:6800/listspiders.json?project=projectName
  • Get all version numbers of a project: curl http://localhost:6800/listversions.json?project=projectName
  • Delete a project version: curl http://localhost:6800/delversion.json -d project=projectName -d version=versionName

More details: https://scrapyd.readthedocs.io/en/stable/api.html

Summary

Space is limited, so this article cannot cover every aspect of the analysis process. Some websites are harder to crawl, but as long as we analyze them carefully we can find a way through. Also, the data you see with your eyes is not necessarily the data you actually receive; for example, some sites render their HTML dynamically, which requires extra handling. Once you step into the world of crawlers, you will find it really interesting. Finally, I hope you never end up crawling your way into prison: of the countless rules of crawling, obeying laws and regulations comes first.


Personal blog: https://ytao.top
My WeChat official account: ytao
