Linux page allocation failure problem handling - zone_reclaim_mode

Keywords: github Linux RHEL PostgreSQL

Tags

PostgreSQL, Linux, page allocation failure, memory

Background

The Linux kernel fails to allocate memory pages.

Once a certain amount of memory is in use, the system hangs.

dmesg may show errors like the ones below. The system hangs, cannot be connected to, and only a restart recovers it.

page allocation failure  
  
  
Oct 24 11:27:42  kernel: : [21289.479063] python2.6: page allocation failure. order:1, mode:0x20  
  
kernel: swapper: page allocation failure. order:1, mode:0x20  
kernel: Pid: 0, comm: swapper Not tainted 2.6.32-358.2.1.el6.x86_64 #1  
kernel: Call Trace:  
kernel: <IRQ>  [<ffffffff8112c207>] ? __alloc_pages_nodemask+0x757/0x8d0  
kernel: [<ffffffff81166ab2>] ? kmem_getpages+0x62/0x170  
kernel: [<ffffffff811676ca>] ? fallback_alloc+0x1ba/0x270  
kernel: [<ffffffff8116711f>] ? cache_grow+0x2cf/0x320  
kernel: [<ffffffff81167449>] ? ____cache_alloc_node+0x99/0x160  
kernel: [<ffffffff811683cb>] ? kmem_cache_alloc+0x11b/0x190  
kernel: [<ffffffff81439d58>] ? sk_prot_alloc+0x48/0x1c0  
kernel: [<ffffffff8143ae32>] ? sk_clone+0x22/0x2e0  
kernel: [<ffffffff81489d66>] ? inet_csk_clone+0x16/0xd0  
kernel: [<ffffffff814a2c73>] ? tcp_create_openreq_child+0x23/0x450  
kernel: [<ffffffff814a046d>] ? tcp_v4_syn_recv_sock+0x4d/0x310  
kernel: [<ffffffff814a2a16>] ? tcp_check_req+0x226/0x460  
kernel: [<ffffffff8149ff0b>] ? tcp_v4_do_rcv+0x35b/0x430  
kernel: [<ffffffff81082034>] ? mod_timer+0x144/0x220  
kernel: [<ffffffff814a171e>] ? tcp_v4_rcv+0x4fe/0x8d0  
kernel: [<ffffffff814a171e>] ? tcp_v4_rcv+0x4fe/0x8d0  
kernel: [<ffffffff8147f50d>] ? ip_local_deliver_finish+0xdd/0x2d0  
kernel: [<ffffffff8147f798>] ? ip_local_deliver+0x98/0xa0  
kernel: [<ffffffff8147ec5d>] ? ip_rcv_finish+0x12d/0x440  
kernel: [<ffffffff8147f1e5>] ? ip_rcv+0x275/0x350  
kernel: [<ffffffff814483bb>] ? __netif_receive_skb+0x4ab/0x750  
kernel: [<ffffffff8144a798>] ? netif_receive_skb+0x58/0x60  
kernel: [<ffffffffa008b975>] ? vmxnet3_rq_rx_complete+0x365/0x890 [vmxnet3]  
kernel: [<ffffffff8128d2b0>] ? swiotlb_map_page+0x0/0x100  
kernel: [<ffffffffa008c0f3>] ? vmxnet3_poll_rx_only+0x43/0xc0 [vmxnet3]  
kernel: [<ffffffff8144cf63>] ? net_rx_action+0x103/0x2f0  
kernel: [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0  
kernel: [<ffffffff810e1720>] ? handle_IRQ_event+0x60/0x170  
kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30  
kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0  
kernel: [<ffffffff81076d95>] ? irq_exit+0x85/0x90  
kernel: [<ffffffff81516f15>] ? do_IRQ+0x75/0xf0  
kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11  
kernel: <EOI>  [<ffffffff8103b90b>] ? native_safe_halt+0xb/0x10  
kernel: [<ffffffff8101495d>] ? default_idle+0x4d/0xb0  
kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110  
kernel: [<ffffffff81506d9c>] ? start_secondary+0x2ac/0x2ef  
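The first line of each report carries the useful numbers: the process name, the allocation order, and the GFP mode mask. A minimal sketch for pulling them out of a log line follows. The function name `parse_failure` and the 4 KiB page size are assumptions made for illustration and do not come from the original post.

```python
import re

# Matches e.g. "python2.6: page allocation failure. order:1, mode:0x20"
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual x86_64 size


def parse_failure(line):
    """Return a dict describing one failure line, or None if it does not match."""
    m = FAILURE_RE.search(line)
    if m is None:
        return None
    order = int(m.group("order"))
    return {
        "comm": m.group("comm"),
        "order": order,
        "mode": int(m.group("mode"), 16),
        # An order-N request asks for 2**N physically contiguous pages.
        "bytes": (2 ** order) * PAGE_SIZE,
    }
```

With the lines above, `order:1` means the kernel needed two contiguous pages (8 KiB under the 4 KiB assumption), so the failure is about contiguity as much as about total free memory.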

Solution - Upgrading Kernel Version

Upgrade to kernel-2.6.32-358.el6 or later. (This only alleviates the problem; it does not solve it completely.)

Update to kernel-2.6.32-358.el6 or higher, which contains the enhancement described in the Root Cause section below.  
  
Please note, this update (or newer) does not completely eliminate the possibility of the occurrence of the page allocation failure.  
The below mentioned workaround also works in 2.6.32-358.el6 and newer if the issue still persists even after the update.  
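To check whether a host already runs at least that kernel, the release string can be compared numerically. This is a rough sketch: the helper names `kernel_tuple` and `meets_minimum_kernel` are hypothetical, and the comparison only makes sense for RHEL 6 style release strings such as the one in the trace above.

```python
import re

# kernel-2.6.32-358.el6, the minimum version named in the text above
MINIMUM = (2, 6, 32, 358)


def kernel_tuple(release):
    """Turn '2.6.32-358.2.1.el6.x86_64' into (2, 6, 32, 358)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(g) for g in m.groups())


def meets_minimum_kernel(release, minimum=MINIMUM):
    """True if the release is at or above the minimum build."""
    return kernel_tuple(release) >= minimum
```

On a live system the argument would typically be `platform.release()`. Note that the kernel in the trace above (2.6.32-358.2.1.el6) already passes this check, which matches the remark that the update alone does not rule out the failure.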

Solution - Modifying Kernel Parameters

Edit /etc/sysctl.conf (vi /etc/sysctl.conf) or a file under /etc/sysctl.d/ (vi /etc/sysctl.d/xxx.conf) and add:  
  
vm.zone_reclaim_mode = 1  
vm.min_free_kbytes = 512000  
  
To apply the same values to the running system:  
  
sysctl -w vm.zone_reclaim_mode=1  
sysctl -w vm.min_free_kbytes=512000  
  
The following tunables can be used in an attempt to alleviate or prevent the reported condition:  
  
Increase the vm.min_free_kbytes value, for example to a value higher than a single allocation request.  
Change vm.zone_reclaim_mode to 1 if it is set to zero, so the system can reclaim memory from cached memory.  
Both settings can be set in /etc/sysctl.conf and loaded using sysctl -p /etc/sysctl.conf.  
  
For more information on these tunables, install the kernel-doc package and refer to the file  
/usr/share/doc/kernel-doc-2.6.32/Documentation/sysctl/vm.txt.  
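Before changing anything it can help to see which of the two tunables actually differ from the target values. The sketch below reads them from /proc/sys/vm, where sysctl values are exposed as files. The names `DESIRED`, `read_vm_tunable` and `pending_changes` are made up for this example, and the `root` parameter exists only so the code can be pointed at a test directory.

```python
import os

# Target values taken from the sysctl lines above
DESIRED = {"zone_reclaim_mode": 1, "min_free_kbytes": 512000}


def read_vm_tunable(name, root="/proc/sys/vm"):
    """Read one integer vm.* tunable from its file under root."""
    with open(os.path.join(root, name)) as f:
        return int(f.read().strip())


def pending_changes(desired=DESIRED, root="/proc/sys/vm"):
    """Return {name: (current, wanted)} for every tunable that differs."""
    diff = {}
    for name, wanted in desired.items():
        current = read_vm_tunable(name, root)
        if current != wanted:
            diff[name] = (current, wanted)
    return diff
```

An empty result means both settings are already in place. Keep in mind that 512000 KiB is roughly 500 MB held back as a reserve, so the value from the post may need scaling down on hosts with little memory.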

Root cause

Before RHEL 6.4, kswapd does not try to free contiguous pages.

This can cause GFP_ATOMIC allocation requests to fail repeatedly
when nothing else in the system defragments memory.

With RHEL 6.4 and newer, kswapd will compact (defragment) free memory, when required.

Please note that allocation failures can still happen.

For example, a large burst of GFP_ATOMIC allocations may be more than kswapd can keep up with.

However, these allocations should eventually succeed.
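Fragmentation of this kind can be observed in /proc/buddyinfo, which lists, per node and zone, how many free blocks exist at each order (column 0 is order 0, column 1 is order 1, and so on). The following sketch parses that format. The function names and the sample line are invented for illustration and are not measurements from the affected system.

```python
def parse_buddyinfo(text):
    """Map (node, zone) -> list of free block counts, indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone Normal <count order 0> <count order 1> ...
        if len(parts) < 5 or parts[0] != "Node":
            continue
        node = int(parts[1].rstrip(","))
        zone = parts[3]
        zones[(node, zone)] = [int(n) for n in parts[4:]]
    return zones


def free_blocks_at_or_above(counts, order):
    """Number of free blocks large enough to satisfy a request of this order."""
    return sum(counts[order:])


# Made-up sample: plenty of single free pages, but nothing contiguous.
SAMPLE = "Node 0, zone   Normal   5120      0      0      0      0      0      0      0      0      0      0"
```

For the sample, `free_blocks_at_or_above(parse_buddyinfo(SAMPLE)[(0, "Normal")], 1)` is 0: an `order:1` atomic request would fail even though 5120 single pages are free, which is the situation kswapd compaction is meant to relieve.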

There are also other more specific cases that can result in page allocation failures and cause additional issues.
Please refer to the following articles for more information

Zone_reclaim_mode Interpretation

Zone_reclaim_mode allows someone to set more or less aggressive approaches to  
reclaim memory when a zone runs out of memory. If it is set to zero then no  
zone reclaim occurs. Allocations will be satisfied from other zones / nodes  
in the system.  
  
The value is a bitwise OR of the following flags:  
  
1 = Zone reclaim on  
2 = Zone reclaim writes dirty pages out  
4 = Zone reclaim swaps pages  
  
zone_reclaim_mode is set during bootup to 1 if it is determined that pages  
from remote zones will cause a measurable performance reduction. The  
page allocator will then reclaim easily reusable pages (those page  
cache pages that are currently not used) before allocating off node pages.  
  
0: It may be beneficial to switch off zone reclaim if the system is  
used for a file server and all of memory should be used for caching files  
from disk. In that case the caching effect is more important than  
data locality.  
  
2: Allowing zone reclaim to write out pages stops processes that are  
writing large amounts of data from dirtying pages on other nodes. Zone  
reclaim will write out dirty pages if a zone fills up and so effectively  
throttle the process. This may decrease the performance of a single process  
since it cannot use all of system memory to buffer the outgoing writes  
anymore, but it preserves the memory on other nodes so that the performance  
of other processes running on other nodes will not be affected.  
  
4: Allowing regular swap effectively restricts allocations to the local  
node unless explicitly overridden by memory policies or cpuset  
configurations.  
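Because the setting is a bitmask, a value such as 3 or 7 combines several of the behaviours described above. A small sketch that decodes a value into the flag descriptions from the list follows. The function name `describe_zone_reclaim_mode` and the "zone reclaim off" label for 0 are assumptions of this example.

```python
# Bit values and descriptions as listed in the kernel documentation quoted above
ZONE_RECLAIM_FLAGS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def describe_zone_reclaim_mode(value):
    """Return the descriptions of all flags set in a zone_reclaim_mode value."""
    if value == 0:
        return ["zone reclaim off"]
    return [
        name
        for bit, name in sorted(ZONE_RECLAIM_FLAGS.items())
        if value & bit
    ]
```

For the setting recommended earlier, `describe_zone_reclaim_mode(1)` returns only the first flag: reclaim is enabled, without writing out dirty pages and without swapping.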

Reference resources

http://www.zbuse.com/2014/07/837.html

https://serverfault.com/questions/236170/page-allocation-failure-am-i-running-out-of-memory

https://access.redhat.com/solutions/90883


Posted by kaen on Thu, 07 Feb 2019 20:30:17 -0800