Tracking down a timeout

Keywords: Python, Nginx, Excel, supervisor

Recently, some users kept reporting that downloading a large Excel file from Redash failed with a "failure - server problem" error, and the time from clicking download to seeing the error was almost always the same. I first checked the Nginx error log, which showed "upstream prematurely closed connection while sending to client". My first guess was a timeout.

1. Modify the Redash configuration

  • In the .env file, change the log level to debug
  • In the .env file, add a timeout configuration
  • Append a -t 600 parameter to the Redash start command (see the sketch after this list)
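
Roughly, the changes looked like this (a sketch: the start command is reconstructed from the process title captured in step 6, REDASH_LOG_LEVEL is the standard Redash log-level setting, the .env timeout entry is omitted because its exact variable name isn't given here, and "..." stands for the remaining original arguments):

# .env
REDASH_LOG_LEVEL=DEBUG

# start command, short flag first, then the long form tried afterwards
$ gunicorn -b 127.0.0.1:5000 --name redash -w 4 -t 600 ...
$ gunicorn -b 127.0.0.1:5000 --name redash -w 4 --timeout 600 ...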

Ten minutes should be more than enough! After restarting the Redash process, I tried again. No luck! OK, maybe the parameter name was wrong, so I changed it to --timeout 600 and tried again. Still no luck! Continue...

2. Modify Nginx configuration

  • Client request timeouts: keepalive_timeout, client_header_timeout, client_body_timeout
  • Timeouts for the backend server to handle requests: proxy_connect_timeout, proxy_read_timeout (see the sketch after this list)
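
A sketch of the relevant Nginx directives (the values and the upstream address are illustrative, not necessarily the exact ones used):

http {
    keepalive_timeout      600s;
    client_header_timeout  600s;
    client_body_timeout    600s;

    server {
        location / {
            proxy_connect_timeout  60s;
            proxy_read_timeout     600s;
            proxy_pass             http://127.0.0.1:5000;
        }
    }
}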

Restarted Nginx, tried the download, and it failed again!!! So it doesn't seem to be caused by a timeout after all?!

3. View the resources used by the process

First, run:

$ top
Then I tried the download again and found that a process whose COMMAND column showed gunicorn (the Redash server is started with gunicorn) was using 100% of a CPU. After a while that process disappeared, a new one was started, and CPU usage returned to normal. Let's see what this process is doing.
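
To make this easier to watch, top can be limited to just the gunicorn processes (a sketch; -c shows the full command line and -p restricts the view to the PIDs returned by pgrep):

$ top -c -p "$(pgrep -d, -f gunicorn)"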

4. Process tracking

Track the gunicorn process with high CPU usage:

$ strace -T -tt -e trace=all -p <pid>

The trace shows that the process is constantly in read and write system calls, and the last line of output is
+++ killed by SIGKILL +++
after which the trace stops. Did it hit one of the system's ulimit limits and get killed by the system?
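
If the full trace is too noisy, the same check can be repeated while tracing only signal-related activity (a sketch; <pid> is the worker's PID):

$ strace -tt -e trace=signal -p <pid>

This still ends with the "+++ killed by SIGKILL +++" line, just without the flood of read/write calls.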

5. Set ulimit parameter

I raised the ulimit values for the user that runs gunicorn and tried again, but that did not solve the problem. So it doesn't seem to be a ulimit issue... Then who killed it?
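
For completeness, the limits that actually apply to a running worker can be checked directly, which is a quicker sanity check than re-running the shell's ulimit (a sketch; <pid> is the gunicorn worker's PID):

$ cat /proc/<pid>/limits
$ ulimit -a    # limits of the current shell/user
# persistent per-user changes normally go into /etc/security/limits.conf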

6. Capture kill signal

Using auditctl, add a capture rule:

$ auditctl -a exit,always -F arch=b64 -S kill -F a1=9

Download the file, wait for the process to be killed, and then look at what was captured:

$ ausearch -sc kill

Output:

time->Fri Dec  6 16:13:26 2019
type=PROCTITLE msg=audit(1575620006.444:103711): proctitle=2F6F70742F6D6F64756C65732F7265646173682D372E302E302F7265646173682F62696E2F707974686F6E322E37002F6F70742F7265646173682F7265646173682F62696E2F67756E69636F726E002D62003132372E302E302E313A35303030002D2D6E616D6500726564617368002D770034002D2D6D61782D726571756573
type=OBJ_PID msg=audit(1575620006.444:103711): opid=11646 oauid=0 ouid=1001 oses=14406 ocomm="gunicorn"
type=SYSCALL msg=audit(1575620006.444:103711): arch=c000003e syscall=62 success=yes exit=0 a0=2d7e a1=9 a2=0 a3=0 items=0 ppid=11490 pid=11494 auid=0 uid=1001 gid=1001 euid=1001 suid=1001 fsuid=1001 egid=1001 sgid=1001 fsgid=1001 tty=(none) ses=14406 comm="gunicorn" exe="/opt/modules/redash-7.0.0/redash/bin/python2.7" key=(null)

So it was the parent process that killed it...
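
As a side note, the PROCTITLE field is the command line of the process that issued the kill, hex-encoded with NUL separators; it can be decoded with xxd (a sketch, where <hex> stands for the long proctitle string above):

$ echo '<hex>' | xxd -r -p | tr '\0' ' '

Decoding it yields the gunicorn start command, consistent with the exe path shown in the SYSCALL line.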

7. View log

Enter the supervisor console and tail -f the output of the gunicorn program.
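
For example (a sketch; the supervisor program name "redash" is an assumption):

$ supervisorctl
supervisor> tail -f redash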

[CRITICAL] WORKER TIMEOUT (pid:15577)
[INFO] Worker exiting (pid: 15577)
[INFO] Booting worker with pid: 15646

Timeout... So the cause really was the very first guess: a timeout.

gunicorn gives each worker process 30 seconds to handle a request by default; if a request exceeds this limit, the worker is killed by the parent process. But the timeout configuration still does not take effect...
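
For reference, this worker timeout is normally raised either on the gunicorn command line (--timeout 600, as tried above) or in a config file loaded with -c (a sketch; as noted above, in this setup the setting was still not taking effect):

# gunicorn_conf.py
timeout = 600   # seconds a worker may spend on a request before the master kills it

$ gunicorn -c gunicorn_conf.py ...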

Posted by Crayon Violent on Tue, 10 Dec 2019 19:19:45 -0800