21 - URL crawling, Python asyncio, Supervisor installation and configuration, etc.

Keywords: Python supervisor pip MongoDB

1. Summary

The pseudocode is in the previous post, "20 - Crawling URLs and a first sketch of word-segmentation sentiment analysis". I expected the implementation to be quick, but it took a whole day, and most of that time went into deploying it on the server.

The code is hosted on GitHub (the "large same-sex dating website", as the joke goes).
Writing the program from the pseudocode is simple in principle, but I got stuck in many places and had to learn things on the spot. The following are the points I had to pick up as I went.

2. Key technical points

2.1 Python Class

Python classes are not much different from classes in other scripting languages, but perhaps because I am used to JS closures, I ran into many inexplicable errors when writing Python.
I don't actually need inheritance here, so the class is only a way of organising the code.

2.2 Asynchronous

Asynchronous code in Python is a disaster to work with, so I don't know why so many people like Python. (Is it because it is synchronous by nature? Or just because it has so many packages?)
asyncio is the main way to write asynchronous Python.
What made me most desperate about asyncio is that I had no idea how to queue up the tasks at all! I tried two ways of starting them:

# --------- I do not fully understand the following; this attempt failed (it runs, but handles the items one at a time) ---------
import asyncio
import random
import time


class classSpy():
    """Wrap a plain list so that it can be consumed with `async for`."""

    def __init__(self, arrInProxy):
        self.arrProxy = iter(arrInProxy)

    def __aiter__(self):
        return self

    async def __anext__(self):
        try:
            eleProxy = next(self.arrProxy)
        except StopIteration:
            # Signal the end of the asynchronous iteration
            raise StopAsyncIteration
        return eleProxy


arrTmp = [1, 2, 3, 4, 5, 6]


async def run():
    print('Begin : '+time.strftime('%Y-%m-%d %H:%M:%S'))
    async for eleBe in classSpy(arrTmp):
        # Each sleep is awaited before the next item is fetched,
        # so the loop bodies never overlap
        await asyncio.sleep(random.randint(1, 3))
        print('  Now : ' + str(eleBe) + ' , time: ' +
              time.strftime('%Y-%m-%d %H:%M:%S'))
    print('End : '+time.strftime('%Y-%m-%d %H:%M:%S'))

# asyncio.run() (Python 3.7+) creates the event loop, runs run(), and closes the loop
asyncio.run(run())
# ------------------------------------------


# --------- The following asynchronous attempt succeeded ---------
import asyncio
import random
import time

arrTmp = [1, 2, 3, 4, 5, 6]


async def run(eleBe, inSemaphore):
    # The semaphore caps how many coroutines are inside this block at once
    async with inSemaphore:
        await asyncio.sleep(random.randint(1, 3))
        print('  Now : ' + str(eleBe) + ' , time: ' +
              time.strftime('%Y-%m-%d %H:%M:%S'))


def funDone(waittask):
    # Called once, after every gathered coroutine has finished
    print('Callback End : '+time.strftime('%Y-%m-%d %H:%M:%S'))


print('Begin : '+time.strftime('%Y-%m-%d %H:%M:%S'))

# ------------- Call mode 1: asyncio.run(), Python 3.7+ ----------------


async def main():
    semaphore = asyncio.Semaphore(2)
    waittask = asyncio.gather(*[run(proxy, semaphore) for proxy in arrTmp])
    waittask.add_done_callback(funDone)
    await waittask
asyncio.run(main())
# -------------------------------------


# ------------- Call mode 2: explicit event loop (older style) ----------------
# A fresh loop is created and made current, so this also works after
# call mode 1 has closed its own loop
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
semaphore = asyncio.Semaphore(2)
waittask = asyncio.gather(*[run(proxy, semaphore) for proxy in arrTmp])
waittask.add_done_callback(funDone)
loop.run_until_complete(waittask)
loop.close()
# -------------------------------------


print('Program End : '+time.strftime('%Y-%m-%d %H:%M:%S'))
# ------------------------------------------


I struggled with this for a long time, and in the end I had to fall back on the official asyncio documentation.
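The difference between the two attempts above can be made visible by timing them. The following is a minimal sketch, not the code from the project: it replaces the random 1 to 3 second sleeps with a fixed short delay, and the names `work`, `sequential`, `limited` and `timed` are made up for the illustration. The sequential version awaits one item at a time, as the `async for` attempt does; the limited version lets two items run at once, as the semaphore attempt does.

```python
import asyncio
import time

ITEMS = [1, 2, 3, 4, 5, 6]
DELAY = 0.05  # seconds per item (assumed value, shortened for the demo)


async def work(item):
    # Stand-in for one request
    await asyncio.sleep(DELAY)
    return item


async def sequential():
    # One await at a time: total time is about len(ITEMS) * DELAY
    return [await work(item) for item in ITEMS]


async def limited(limit=2):
    # At most `limit` items are in flight at any moment
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await work(item)

    return await asyncio.gather(*(guarded(item) for item in ITEMS))


def timed(coro_func):
    # Run a coroutine function to completion and measure wall-clock time
    start = time.perf_counter()
    result = asyncio.run(coro_func())
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for func in (sequential, limited):
        result, elapsed = timed(func)
        print(func.__name__, result, round(elapsed, 2))
```

With a limit of 2, the second version should take roughly half as long as the first, because the waits overlap in pairs.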

Getting real asynchronous processing to start at all, as in the second attempt above, was hard enough. Then, when I combined it with multithreading, I found that an event loop cannot be run from several places in the same program at the same time. After a long search I finally found a workaround.
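One common way around this restriction is to give every thread its own event loop. The sketch below is an assumption about what such a workaround can look like, not necessarily the method found at the time: each thread calls `asyncio.run()`, which creates a fresh loop for that thread and closes it afterwards. The names `crawl_batch`, `thread_entry` and `run_in_threads` are hypothetical.

```python
import asyncio
import threading


async def crawl_batch(name, items):
    # Stand-in for the real crawling work of one thread
    results = []
    for item in items:
        await asyncio.sleep(0.01)
        results.append(f"{name}:{item}")
    return results


def thread_entry(name, items, out):
    # asyncio.run() builds a new event loop that belongs to this thread only,
    # so no loop is ever shared between threads
    out[name] = asyncio.run(crawl_batch(name, items))


def run_in_threads(batches):
    # batches maps a thread name to the list of items that thread handles
    out = {}
    threads = [
        threading.Thread(target=thread_entry, args=(name, items, out))
        for name, items in batches.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return out


if __name__ == "__main__":
    print(run_in_threads({"a": [1, 2], "b": [3]}))
```

Each thread writes to its own key in `out`, so the threads do not need a lock for the results in this sketch.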

2.2 Connecting Python to MongoDB

This is a big pitfall, because the Python driver does not use the same names as the mongo shell. For example:

  1. The find() function is still find() in Python
  2. The findOne() function is find_one() in Python
  3. The deleteMany() function is delete_many() in Python
  4. Python has no JavaScript-style object literals, so queries have to be written as dictionaries with quoted keys: {$lt: 10} becomes {'$lt': 10}
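The four points above can be put together in one small function. This is a sketch that assumes the pymongo driver; the function name `purge_slow_proxies`, the field `latency_ms`, and the database and collection names in the comment are all made up for the illustration.

```python
def purge_slow_proxies(collection, max_latency_ms=10):
    """Use the pymongo spellings on a pymongo-style collection object."""
    # mongo shell: {latency_ms: {$lt: 10}}; in Python the keys are quoted strings
    fast_filter = {"latency_ms": {"$lt": max_latency_ms}}
    slow_filter = {"latency_ms": {"$gte": max_latency_ms}}

    fast = list(collection.find(fast_filter))       # find() keeps its name
    first = collection.find_one(fast_filter)        # findOne() -> find_one()
    removed = collection.delete_many(slow_filter)   # deleteMany() -> delete_many()
    return fast, first, removed.deleted_count


# With a real server (assumed URI, database and collection names):
#   from pymongo import MongoClient
#   collection = MongoClient("mongodb://localhost:27017")["spy"]["proxies"]
#   fast, first, removed_count = purge_slow_proxies(collection)
```

Passing the collection in as a parameter keeps the function independent of how the connection is made.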

References are:
Certain information

2.3 Web page requests and parsing

This part is easier: just use the libraries that other people have written. The reference material is as follows:
Information of others

2.4 Retrying requests on timeout

When the proxy currently in use times out, the request should be retried through another proxy server. There does not seem to be a ready-made way to handle this, so, following the reference material, I wrapped the request in a while loop that checks whether it timed out. The reference article is as follows:
Reference articles
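The while loop described above can be sketched as follows. This is an illustration under assumptions, not the project's code: `fetch_with_proxy_retry` is a hypothetical name, and `fetch` is any callable taking a URL and a proxy that raises one of `timeout_errors` when the proxy is too slow.

```python
def fetch_with_proxy_retry(url, proxies, fetch, timeout_errors=(TimeoutError,)):
    """Try each proxy in turn until one answers before the timeout."""
    remaining = list(proxies)
    attempts = 0
    while remaining:
        proxy = remaining.pop(0)
        attempts += 1
        try:
            body = fetch(url, proxy)
        except timeout_errors:
            # This proxy timed out: move on to the next one
            continue
        return body, proxy, attempts
    raise TimeoutError(f"all {attempts} proxies timed out for {url}")


# With the requests library, the callable could look like this (assumed usage):
#   import requests
#   def fetch(url, proxy):
#       return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5).text
#   fetch_with_proxy_retry(url, proxies, fetch,
#                          timeout_errors=(requests.exceptions.Timeout,))
```

The exception types are a parameter because the timeout exception raised by requests does not derive from the built-in TimeoutError.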

3. Deploying the Python program

This is really a huge pit. I didn't want to deal with it, but I had no choice, because the project was already under way.
For NodeJS I use PM2. PM2 has pitfalls of its own: after starting a project you have to run save and then startup so that it comes back after a reboot, and there is an environment configuration step in between. If you skip it, the project will not start after a reboot.
For Python the process manager is Supervisor. I have used it for a long time, but configuring it from scratch is still a lot of trouble.

3.1 Installation

When installing Supervisor, I found that Tencent ECS does not come with pip pre-installed. After installing pip I installed Supervisor, and then found that many things were not configured at all, until I came across this article:
Detailed tutorial of centos7 Installation supervisor
By then I had already installed it, using a method that I don't recommend.

3.2 Configuration

Installation was the easy part. Why is the configuration so troublesome? Because I want to see the output of my program, the output of the managed program has to be written to a directory, and that takes some configuring. The following is the configuration that covers what I need.
Configuration introduction

[program:spylink]
command=python3 /home/Berry/Repositories/SpyTheLink/run.py ; my path; the program is written for Python 3, so calling python3 explicitly matters
directory=/home/Berry/Repositories/SpyTheLink              ; directory to cwd to before exec (def no cwd); this must be set, otherwise the program cannot find its configuration file
autostart=true       ; start at supervisord start (default: true); of course
startsecs=10         ; number of secs prog must stay running (def. 1)
autorestart=true     ; whether/when to restart (default: unexpected)
startretries=0       ; max # of serial start failures (default 3)
user=Berry          ; setuid to this UNIX account to run the program; it would not run as root and I don't know why (something not set up for root, or an environment problem?)
priority=999         ; the relative start priority (default 999)
redirect_stderr=false ; redirect proc stderr to stdout (default false); I keep stderr separate because I mainly look at stdout (errors are normal for this program)
stdout_logfile_maxbytes=10MB  ; max # logfile bytes b4 rotation (default 50MB)
stdout_logfile_backups=10     ; # of stdout logfile backups (default 10)
stdout_logfile=/home/Berry/Repositories/SpyTheLink/logs/spylink.out
stderr_logfile=/home/Berry/Repositories/SpyTheLink/logs/spylink.err       ; stderr log path, NONE for none; default AUTO
stderr_logfile_maxbytes=10MB   ; max # logfile bytes b4 rotation (default 50MB)
stderr_logfile_backups=10     ; # of stderr logfile backups (default 10)
loglevel=info         ; (log level; default info; others: debug,warn,trace); I only need informational output, so the default is fine

4. Postscript

Next I need to build a Django app on the server side for manually judging and recording the results. I should probably think through how to approach that step first.


Posted by vinny69 on Mon, 03 Feb 2020 01:10:29 -0800