Reposted from: Love Open Source
Understanding the OOM killer
Recently one VPS customer complained that MySQL was dying for no apparent reason, and another complained that his VPS kept crashing. Logging in to the terminal and taking a look, both turned out to be the common Out of Memory problem. This usually happens when applications request large amounts of memory at some point and the system runs short, which triggers the Out of Memory (OOM) killer in the Linux kernel: the OOM killer kills a process to free up memory and keep the system from collapsing on the spot. If you check the relevant log file (/var/log/messages), you will see "Out of memory: Kill process" messages similar to the following:
...
Out of memory: Kill process 9682 (mysqld) score 9 or sacrifice child
Killed process 9682, UID 27, (mysqld) total-vm:47388kB, anon-rss:3744kB, file-rss:80kB
httpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
httpd cpuset=/ mems_allowed=0
Pid: 8911, comm: httpd Not tainted 2.6.32-279.1.1.el6.i686 #1
...
21556 total pagecache pages
21049 pages in swap cache
Swap cache stats: add 12819103, delete 12798054, find 3188096/4634617
Free swap = 0kB
Total swap = 524280kB
131071 pages RAM
0 pages HighMem
3673 pages reserved
67960 pages shared
124940 pages non-shared
The Linux kernel allocates memory on demand to applications. Usually an application allocates memory but does not actually use all of it right away, and to improve performance that unused memory can be put to other uses. Since it belongs to each process and is troublesome for the kernel to reclaim directly, the kernel over-commits memory: it promises more memory than is physically available, indirectly making use of this "idle" memory and improving overall memory efficiency. Normally this is not a problem, but trouble starts when most applications begin to actually consume the memory they were promised: their combined demand exceeds the capacity of physical memory (plus swap), and the kernel (the OOM killer) has to kill some processes to keep the system running. A bank is a useful analogy. As long as only some people withdraw money at any one time, the bank has enough reserves to cope; but when the whole country (or the vast majority) wants to withdraw at once, the bank is in trouble, because it does not actually hold that much money.
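You can get a rough sense of how far the kernel has over-committed by looking at the commit counters in /proc/meminfo: Committed_AS is the memory already promised to processes, and CommitLimit is what the kernel could actually back under strict accounting. A quick, read-only inspection sketch:
# sysctl vm.overcommit_memory vm.overcommit_ratio
# grep -E 'CommitLimit|Committed_AS' /proc/meminfo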
The process of selecting and killing a process can be found in the kernel source file linux/mm/oom_kill.c. When the system runs out of memory, out_of_memory() is triggered, which then calls select_bad_process() to pick a "bad" process to kill. How is a "bad" process judged and selected? That is decided by oom_badness(), and the algorithm and the idea behind it are simple and direct: the worst process is the one that occupies the most memory.
/**
 * oom_badness - heuristic function to determine which candidate task to kill
 * @p: task struct of which task we should calculate
 * @totalpages: total present RAM allowed for page allocation
 *
 * The heuristic for determining which task to kill is made to be as simple and
 * predictable as possible. The goal is to return the highest value for the
 * task consuming the most memory to avoid subsequent oom failures.
 */
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
                          const nodemask_t *nodemask, unsigned long totalpages)
{
        long points;
        long adj;

        if (oom_unkillable_task(p, memcg, nodemask))
                return 0;

        p = find_lock_task_mm(p);
        if (!p)
                return 0;

        adj = (long)p->signal->oom_score_adj;
        if (adj == OOM_SCORE_ADJ_MIN) {
                task_unlock(p);
                return 0;
        }

        /*
         * The baseline for the badness score is the proportion of RAM that each
         * task's rss, pagetable and swap space use.
         */
        points = get_mm_rss(p->mm) + p->mm->nr_ptes +
                 get_mm_counter(p->mm, MM_SWAPENTS);
        task_unlock(p);

        /*
         * Root processes get 3% bonus, just like the __vm_enough_memory()
         * implementation used by LSMs.
         */
        if (has_capability_noaudit(p, CAP_SYS_ADMIN))
                adj -= 30;

        /* Normalize to oom_score_adj units */
        adj *= totalpages / 1000;
        points += adj;

        /*
         * Never return 0 for an eligible task regardless of the root bonus and
         * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here).
         */
        return points > 0 ? points : 1;
}
Understanding this algorithm helps explain why MySQL always "gets shot while lying down": being the biggest process (it generally occupies the most memory on the system), it is usually the unlucky first victim whenever the system runs Out of Memory (OOM). The simplest fix is to add more memory, or to find ways to make MySQL use less of it. Besides tuning MySQL, you can also slim down the system itself (Debian 5, CentOS 5.x) so that the OS uses as little memory as possible and applications such as MySQL get more. A more temporary workaround is to adjust kernel parameters so that the MySQL process is less likely to be picked by the OOM killer.
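Before tuning anything, it helps to see which processes the kernel currently rates as the "worst". The result of oom_badness() is exposed as /proc/<pid>/oom_score, so a rough one-liner like the following (just a loop over /proc sorted by score, shown here as a sketch to adapt) lists the most likely victims first:
# for d in /proc/[0-9]*/; do printf "%s %s %s\n" "$(cat ${d}oom_score)" "$d" "$(awk 'NR==1 {print $2}' ${d}status)"; done 2>/dev/null | sort -rn | head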
Configuring the OOM killer
We can adjust the behavior of the OOM killer through a few kernel parameters so that the system does not sit there killing processes. For example, we can make the kernel panic as soon as OOM is triggered, and then reboot the system automatically 10 seconds later:
# sysctl -w vm.panic_on_oom=1
vm.panic_on_oom = 1
# sysctl -w kernel.panic=10
kernel.panic = 10
# echo "vm.panic_on_oom=1" >> /etc/sysctl.conf
# echo "kernel.panic=10" >> /etc/sysctl.conf
As the oom_kill.c code above shows, oom_badness() gives each process a score, and the process with the highest points gets killed; the points can be shifted with adj. Processes running with root privileges are usually considered important and should not be killed lightly, so they get a 3% bonus when being scored (adj -= 30; the lower the score, the less likely the process is to be killed). In user space we can influence which processes are less likely to be picked by the OOM killer by adjusting each process's oom_score_adj parameter (range -1000 to 1000; the older interface is oom_adj). For example, if you do not want the MySQL process to be killed easily, find the PID of the running mysqld and lower its oom_score_adj, say to -15 (remember: the smaller the value, the lower the score and the less likely the process is to be killed):
# ps aux | grep mysqld
mysql 2196 1.6 2.1 623800 44876 ? Ssl 09:42 0:00 /usr/sbin/mysqld
# cat /proc/2196/oom_score_adj
0
# echo -15 > /proc/2196/oom_score_adj
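Keep in mind that oom_score_adj is a per-process setting and is lost when mysqld is restarted, so it has to be reapplied, for example from the script that starts MySQL. A small sketch that reapplies the value to every running mysqld process, assuming pgrep is available:
# for pid in $(pgrep mysqld); do echo -15 > /proc/$pid/oom_score_adj; done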
Of course, the OOM killer can also be effectively switched off by disabling memory overcommit entirely, though this is not recommended for production environments:
# sysctl -w vm.overcommit_memory=2
# echo "vm.overcommit_memory=2" >> /etc/sysctl.conf