Troubleshooting process of excessive consumption of a file handle

Keywords: Linux JSON lsof Docker

Background:

With the business iteration, some projects are reconstructed with nodejs and deployed to k8s environment. In order to facilitate analysis, the last version of the code increases the function of output log.


Phenomenon:

Half a day after the launch, it was found that there was an alarm of too many open files in the R & D feedback, which affected the user experience.  

At the same time, the operation and maintenance view monitoring shows that the use of file handles has increased dramatically in a short period of time, as shown in the following figure:


O & M view file handle usage of k8s node

ulimit - n - view the maximum handle available to the current user

sysctl - a | grep fs.file-max ා view kernel level file handle maximum limit

cat / proc / sys / FS / file NR - view the maximum number of file handles currently in use and the kernel level file handle limit
 As you can see, the number of used file handles of cat / proc / sys / FS / file NR of the k8s node is almost full.



Fast solutions on the operation and maintenance side:

vim / etc/sysctl.conf add a line of configuration
 fs.file-max = 13129438 ා (if this value is not specified manually, linux will automatically set it according to the hardware configuration of each server. You can see the hosts with 64G and 128G memory, which are different.)
sysctl - p causes the above operation to adjust the file handle to take effect immediately

Then, remove this node from the k8s cluster (and rush the pod to other normal nodes). Unable to release the file handle, we can only release it by restarting the server. And guess that the new nodejs project may not log correctly.



The following is the lsof information captured in the k8s worker node13 node (as long as the k8s worker node with this abnormal pod runs, you can run lsof to see. After all, if there is a handle that is not closed, all the pods in this series must have problems):

Lsof > / TMP / lsof ා, the file size is almost 2GB (this process is relatively long, which may take 5-10 minutes)


[root@k8s-worker-node-13 ~]# grep  REG /tmp/lsof | grep -c java

3422


[root@k8s-worker-node-13 ~]# grep  REG /tmp/lsof | egrep -c  '\bnode\b'

14621626


[root@k8s-worker-node-13 ~] (grep REG /tmp/lsof | egrep '\ bnode\b' | less) view the filtered log files

TID Listed as empty
COMMAND   PID         TID  USER   FD       TYPE            DEVICE SIZE/OFF      NODE   NAME
node      16966            root  458w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  459w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  460w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  461w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  462w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  463w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  464w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  465w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  466w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  467w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  468w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  469w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  470w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  471w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  472w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
node      16966            root  473w      REG              253,1 671641730    1572898 /usr/src/app/log/test-app/test-app-2020-03-11.json
. . . . .  Omit parts . . . . . 
node      20927 20955      root  mem       REG             253,16              6823953 /usr/lib/libgcc_s.so.1 (stat: No such file or directory)
node      20927 20955      root  mem       REG             253,16              6823955 /usr/lib/libstdc++.so.6.0.22 (stat: No such file or directory)
node      20927 20955      root  mem       REG             253,16              6030733 /lib/ld-musl-x86_64.so.1 (stat: No such file or directory)
node      20927 20955      root  mem       REG             253,16              6030679 /etc/localtime (path dev=253,1, inode=788097)
node      20927 20955      root   23w      REG              0,460    717345    9178728 /tmp/access-20200311.log
node      20927 20955      root   27w      REG              0,460    209117    9178762 /tmp/tracing-20200311.log
node      20927 20955      root   29w      REG              0,460    209117    9178762 /tmp/tracing-20200311.log
node      20927 20956      root  txt       REG              0,460  40885256    7079234 /usr/local/bin/node
node      20927 20956      root  mem       REG             253,16              7079234 /usr/local/bin/node (stat: No such file or directory)
node      20927 20956      root  mem       REG             253,16              6823953 /usr/lib/libgcc_s.so.1 (stat: No such file or directory)
node      20927 20956      root  mem       REG             253,16              6823955 /usr/lib/libstdc++.so.6.0.22 (stat: No such file or directory)
node      20927 20956      root  mem       REG             253,16              6030733 /lib/ld-musl-x86_64.so.1 (stat: No such file or directory)
node      20927 20956      root  mem       REG             253,16              6030679 /etc/localtime (path dev=253,1, inode=788097)
node      20927 20957      root  txt       REG              0,460  40885256    7079234 /usr/local/bin/node
node      20927 20957      root  mem       REG             253,16              7079234 /usr/local/bin/node (stat: No such file or directory)
node      20927 20957      root  mem       REG             253,16              6823953 /usr/lib/libgcc_s.so.1 (stat: No such file or directory)
node      20927 20957      root  mem       REG             253,16              6823955 /usr/lib/libstdc++.so.6.0.22 (stat: No such file or directory)
node      20927 20957      root  mem       REG             253,16              6030733 /lib/ld-musl-x86_64.so.1 (stat: No such file or directory)
node      20927 20957      root  mem       REG             253,16              6030679 /etc/localtime (path dev=253,1, inode=788097)
node      20927 20957      root   23w      REG              0,460    717345    9178728 /tmp/access-20200311.log
node      20927 20957      root   27w      REG              0,460    209117    9178762 /tmp/tracing-20200311.log
node      20927 20957      root   29w      REG              0,460    209117    9178762 /tmp/tracing-20200311.log
node:Log  20927 21094      root  txt       REG              0,460  40885256    7079234 /usr/local/bin/node
node:Log  20927 21094      root  mem       REG             253,16              7079234 /usr/local/bin/node (stat: No such file or directory)
node:Log  20927 21094      root  mem       REG             253,16              6823953 /usr/lib/libgcc_s.so.1 (stat: No such file or directory)
node:Log  20927 21094      root  mem       REG             253,16              6823955 /usr/lib/libstdc++.so.6.0.22 (stat: No such file or directory)
node:Log  20927 21094      root  mem       REG             253,16              6030733 /lib/ld-musl-x86_64.so.1 (stat: No such file or directory)




Solutions on the development side:

Back off the service, and check whether there is a problem with the log in the code.  
Later, the next day, the development fed back that there was a problem in their previous log writing. They all kept opening the file without closing, resulting in the file handle not being released.



Optimization scheme of operation and maintenance side:

1. Add the related monitoring (node ﹣ exporter is OK)
Monitoring expression: node ﹣ filefd ﹣ allocated / node ﹣ filefd ﹣ maximum * 100 > 70 triggers an alarm, indicating that the file handle occupies more than 70%, which requires operation and maintenance intervention to view and analyze

2. Limit the kernel parameters in docker image (it hasn't been tested to see if it works, to be verified in actual combat)
Reason: docker image is also a thin version of linux. We found that the default fs.file-max and ulimt-n settings in the image of the production environment are very large. We can consider lowering ulimit-n to 65535 and fs.file-max to 655350.
In this way, even if there is a problem with this pod, it can only affect itself, but not other pods running on the host.






Posted by scottreid1974 on Thu, 12 Mar 2020 20:15:05 -0700