This document is based on Python 2.7; the APIs described here have changed in Python 3.
Basic crawling of web pages
GET method
import urllib2

url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()
POST method
import urllib
import urllib2

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
Using proxy IP
During crawler development your IP often gets blocked, so a proxy IP is needed.
The urllib2 package provides a ProxyHandler class, through which you can access web pages via a proxy, as follows:
import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()
Cookie processing
Cookies are data (usually encrypted) that some websites store on the user's local machine in order to identify users and track sessions. Python provides the cookielib module for handling cookies. Its main job is to provide objects that can store cookies, so that it can be used together with the urllib2 module to access Internet resources.
Code snippet:
import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
The key is CookieJar(), which manages HTTP cookie values: it stores the cookies generated by HTTP requests and adds them to outgoing HTTP requests. The cookies are kept entirely in memory, so once the CookieJar instance is garbage-collected the cookies are lost; none of this requires any manual handling.
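A minimal sketch of the point above: after one request through the opener, the CookieJar already holds whatever cookies the server set, and they can be inspected (they live only in memory). The target URL is just the example site used earlier.
import urllib2, cookielib

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open('http://www.baidu.com')
for ck in cj:                  # a CookieJar is iterable over Cookie objects
    print ck.name, ck.value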
Manually adding cookies
cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg=" request.add_header("Cookie", cookie)
Disguising as a browser
Some websites resent crawler visits and refuse all requests from crawlers, so accessing such websites directly with urllib2 often results in HTTP Error 403: Forbidden.
Pay special attention to certain headers, because the server checks them:
1. User-Agent: some servers or proxies check this value to determine whether the request was initiated by a browser.
2. Content-Type: when calling a REST interface, the server checks this value to determine how the content in the HTTP body should be parsed.
Both can be handled by modifying the headers of the HTTP request. The code snippet is as follows:
import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request(
    url = 'http://my.oschina.net/jhao104/blog?catalog=3463517',
    headers = headers
)
print urllib2.urlopen(request).read()
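For the Content-Type case in point 2 above, the same approach applies; a minimal hedged sketch, where the JSON endpoint and payload are made up for illustration:
import json
import urllib2

# hypothetical REST endpoint, used only to illustrate setting Content-Type
url = 'http://abcde.com/api/items'
body = json.dumps({'name': 'abc'})

request = urllib2.Request(url, data=body)
request.add_header('Content-Type', 'application/json')
print urllib2.urlopen(request).read()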
Form processing
Logging in requires filling in a form. How do you know what to fill in? First, use a tool to capture the form contents you are about to submit.
For example, I usually use Firefox with the HttpFox plug-in to see what packets I actually sent.
Take verycd as an example: first find the POST request you send and the form fields it carries.
You can see that verycd requires username, password, continueURI, fk and login_submit, where fk is generated at random (actually not very random; it looks like the epoch time run through some simple encoding). It has to be obtained from the web page itself: visit the page first, then extract the fk field from the returned data with a regular expression or a similar tool. continueURI, as the name implies, can be anything; login_submit is fixed, as can be seen from the page source; and username and password are obvious.
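A hedged sketch of that fk extraction; the URL and the regular expression are assumptions for illustration only, since the real pattern depends on the page's actual HTML:
import re
import urllib2

# fetch the login page first, then pull the fk field out of the returned HTML
page = urllib2.urlopen('http://secure.verycd.com/signin').read()
match = re.search(r'name="fk"[^>]*value="([^"]+)"', page)
fk = match.group(1) if match else ''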
Okay, with all the data to fill in, we can generate the postdata:
import urllib

postdata = urllib.urlencode({
    'username': 'XXXXX',
    'password': 'XXXXX',
    'continueURI': 'http://www.verycd.com/',
    'fk': fk,
    'login_submit': 'Sign in'
})
Then generate the http request and send the request:
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata
)
result = urllib2.urlopen(req).read()
Page parsing
Regular expressions are of course the most powerful way to parse pages. They differ from user to user and site to site, so there is no need to go into much detail; here are two useful resources:
Introduction to Regular Expressions: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
Regular expression online testing: http://tool.oschina.net/regex/
Next come the parsing libraries. Two are commonly used: lxml and BeautifulSoup. Introductions to both:
lxml: http://my.oschina.net/jhao104/blog/639448
BeautifulSoup: http://cuiqingcai.com/1319.html
My evaluation of the two: both are HTML/XML processing libraries. BeautifulSoup is implemented in pure Python, so it is less efficient, but it has practical features, for example retrieving the source of an HTML node through a search on the result; lxml is implemented in C, is efficient, and supports XPath.
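A small hedged sketch of the difference in flavor, assuming the bs4 flavor of BeautifulSoup and lxml are both installed, extracting all link targets from a page:
import urllib2
from bs4 import BeautifulSoup
from lxml import etree

html = urllib2.urlopen('http://www.baidu.com').read()

# BeautifulSoup: search-style API implemented in Python
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    print a.get('href')

# lxml: C implementation with XPath support
tree = etree.HTML(html)
for href in tree.xpath('//a/@href'):
    print href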
6. Handling verification codes (CAPTCHAs)
Some simple verification codes can be recognized with simple techniques; I have only ever done simple verification code recognition. Some anti-human verification codes, such as 12306's, can be handled manually through a captcha-solving platform, which of course costs money.
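For the "simple recognition" case, the first step is usually just cleaning up the image before whatever recognizer you use. A hedged sketch, assuming PIL is installed and a local file captcha.jpg exists; the threshold value is arbitrary:
from PIL import Image

img = Image.open('captcha.jpg')          # hypothetical local CAPTCHA image
gray = img.convert('L')                  # convert to grayscale
# binarize: pixels darker than the threshold become black, the rest white
binary = gray.point(lambda px: 0 if px < 140 else 255)
binary.save('captcha_clean.png')         # feed this into your recognizer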
7. gzip compression
Have you ever come across web pages that stay garbled no matter how you transcode them? Haha, that means you don't know that many web services can send compressed data, which can cut the amount of data transmitted over the network by more than 60%. This is especially true for XML web services, because XML data compresses very well.
But the server will not send you compressed data unless you tell the server that you can process compressed data.
So you need to modify the code like this:
import urllib2

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)
This is the key: create the Request object and add an Accept-encoding header to tell the server that you can accept gzip-compressed data.
Then decompress the data:
import StringIO
import gzip

compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
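In practice it is safer to check the response's Content-Encoding before decompressing; a minimal sketch combining the two snippets above, with the same placeholder URL:
import StringIO
import gzip
import urllib2

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
f = urllib2.build_opener().open(request)

data = f.read()
if f.info().get('Content-Encoding') == 'gzip':   # only decompress if the server actually gzipped it
    data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
print data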
Multithreaded concurrent crawling
If a single thread is too slow, you need multithreading. Here is a simple thread-pool template. The program simply prints the numbers 0-9, but you can see that they are handled concurrently.
Although Python's multithreading is of limited value (because of the GIL), it can still improve efficiency to some extent for a network-bound workload like crawling.
from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10

# specific processing function that handles a single task
def do_somthing_using(arguments):
    print arguments

# this is the worker loop, which keeps taking data from the queue and processing it
def working():
    while True:
        arguments = q.get()
        do_somthing_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# queue up the JOBS
for i in range(JOBS):
    q.put(i)

# wait for all JOBS to complete
q.join()
Learning about multithreading modules:
Anti-hotlinking
Some websites have so-called anti-hotlinking measures. In fact it is very simple: they check whether the Referer in the headers of your request points to their own site. So we just need to set the Referer in the headers to that website. Take cnbeta as an example:
headers = { 'Referer':'http://www.cnbeta.com/articles' }
headers is a dict, and you can put in any header you like for disguise. For example, some clever websites like to pry into people's privacy: when someone visits them through a proxy, they read X-Forwarded-For from the headers to find the visitor's real IP. In that case, just set X-Forwarded-For yourself; you can change it to anything.
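A small sketch putting both headers into one request; the target URL and the IP value are made up for illustration:
import urllib2

headers = {
    'Referer': 'http://www.cnbeta.com/articles',
    'X-Forwarded-For': '8.8.8.8',        # an arbitrary, made-up address
}
request = urllib2.Request('http://www.cnbeta.com', headers=headers)
print urllib2.urlopen(request).read()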
The ultimate trick
Sometimes access is still blocked. Then there is no other way: honestly copy all the headers you see in httpfox into your request; that is usually enough. If even that fails, there is the ultimate trick: use selenium to drive a real browser. Whatever a browser can access, it can access. Similar tools include pamie, watir, and so on.
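A minimal selenium sketch, assuming selenium and a Firefox driver are installed; the URL is just an example:
from selenium import webdriver

driver = webdriver.Firefox()               # launches a real browser
driver.get('http://www.verycd.com/')       # the browser sends everything a browser normally sends
html = driver.page_source                  # the rendered page source
driver.quit()
print len(html)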
Designing a simple multithreaded fetch class
I still feel more comfortable working with "native" Python modules like urllib. Imagine you had a Fetcher class that you could use like this:
f = Fetcher(threads=10)      # set the number of download threads to 10
for url in urls:
    f.push(url)              # push all URLs into the download queue
while f.taskleft():          # while there are still unfinished downloads
    content = f.pop()        # take a result from the completed queue
    do_with(content)         # process the content
This multithreaded calling style is simple and straightforward, so let's design it that way. First of all, there are two queues, both handled with Queue; the basic multithreaded architecture is the same as in the "Skills Summary" above. The push and pop methods are easy, since both just call Queue directly; taskleft is true when there are either tasks still running or tasks still waiting in a queue. The code is as follows:
import urllib2
from threading import Thread, Lock
from Queue import Queue
import time

class Fetcher:
    def __init__(self, threads):
        self.opener = urllib2.build_opener(urllib2.HTTPHandler)
        self.lock = Lock()       # thread lock
        self.q_req = Queue()     # task queue
        self.q_ans = Queue()     # completion queue
        self.threads = threads
        for i in range(threads):
            t = Thread(target=self.threadget)
            t.setDaemon(True)
            t.start()
        self.running = 0

    def __del__(self):           # the destructor waits for both queues to finish
        time.sleep(0.5)
        self.q_req.join()
        self.q_ans.join()

    def taskleft(self):
        return self.q_req.qsize() + self.q_ans.qsize() + self.running

    def push(self, req):
        self.q_req.put(req)

    def pop(self):
        return self.q_ans.get()

    def threadget(self):
        while True:
            req = self.q_req.get()
            with self.lock:      # ensure the operation is atomic; enter the critical section
                self.running += 1
            try:
                ans = self.opener.open(req).read()
            except Exception, what:
                ans = ''
                print what
            self.q_ans.put((req, ans))
            with self.lock:
                self.running -= 1
            self.q_req.task_done()
            time.sleep(0.1)      # don't spam

if __name__ == "__main__":
    links = ['http://www.verycd.com/topics/%d/' % i for i in range(5420, 5430)]
    f = Fetcher(threads=10)
    for url in links:
        f.push(url)
    while f.taskleft():
        url, content = f.pop()
        print url, len(content)
Some miscellaneous experiences
1. Connection pool:
opener.open, like urllib2.urlopen, creates a new HTTP request every time. Usually this is not a problem, because in a single-threaded environment you might generate one request per second; in a multithreaded environment, however, it can mean tens or hundreds of requests per second, and within a few minutes any normal, sensible server will have blocked you.
In normal HTML browsing, however, keeping dozens of connections to the server open at the same time is perfectly normal, so you can manually maintain a pool of HttpConnection objects and pick a connection from the pool for each fetch.
A clever alternative is to use squid as a proxy server for crawling: squid automatically maintains the connection pool for you and also caches data, and since squid is something I install on every server anyway, why bother writing a connection pool?
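Pointing the crawler at a local squid is then just the ProxyHandler trick from earlier; a sketch assuming squid is listening on its default port 3128:
import urllib2

# squid keeps persistent connections to the target servers and caches responses
proxy = urllib2.ProxyHandler({'http': 'http://127.0.0.1:3128'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
print urllib2.urlopen('http://www.baidu.com').read()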
2. Setting the stack size of threads
The stack size setting significantly affects Python's memory usage. If this value is not set for Python multithreading, the program occupies a large amount of memory, which is fatal for an OpenVZ VPS. stack_size must be greater than 32768; in practice it should be at least 32768*2.
from threading import stack_size
stack_size(32768*16)
3. Automatic retry after failure
def get(self, req, retries=3):
    try:
        response = self.opener.open(req)
        data = response.read()
    except Exception, what:
        print what, req
        if retries > 0:
            return self.get(req, retries-1)
        else:
            print 'GET Failed', req
            return ''
    return data
4. Setting timeout
import socket
socket.setdefaulttimeout(10)   # connections time out after 10 seconds
5. Login
Login is simpler: first add cookie support to build_opener (see the "summary" article); to log in to VeryCD, add an empty login method to Fetcher and call it in __init__(), then subclass Fetcher and override the login method:
def login(self, username, password):
    import urllib
    data = urllib.urlencode({'username': username,
                             'password': password,
                             'continue': 'http://www.verycd.com/',
                             'login_submit': u'Sign in'.encode('utf-8'),
                             'save_cookie': 1})
    url = 'http://www.verycd.com/signin'
    self.opener.open(url, data).read()
This way, Fetcher automatically logs in to VeryCD when it is initialized.
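A hedged sketch of that wiring; only the parts relevant to the login hook are repeated, the subclass name and the hard-coded placeholder credentials are illustrative, and how the real Fetcher receives credentials is left open:
import urllib
import urllib2

class Fetcher(object):
    # stands in for the full Fetcher class above; only the login hook is shown
    def __init__(self, threads):
        self.opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
        self.threads = threads
        self.login()              # empty hook, called during initialization

    def login(self):              # the base class does not log in anywhere
        pass

class VeryCDFetcher(Fetcher):
    def login(self):              # runs automatically from __init__
        data = urllib.urlencode({'username': 'XXXXX',
                                 'password': 'XXXXX',
                                 'continue': 'http://www.verycd.com/',
                                 'login_submit': u'Sign in'.encode('utf-8'),
                                 'save_cookie': 1})
        self.opener.open('http://www.verycd.com/signin', data).read()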
Reference material:
http://www.pythonclub.org/python-network-application/observer-spider