Preface
As you may know, there is a common tool for analyzing Kernel Dump called Crash. When I first started to learn how to analyze Kernel Dump, I spent a lot of time toggling about the use of this tool, but I always forgot its function.Later, one time when I participated in an internal sharing, I was very impressed by a big guy who said, "You don't have to remember how these tools work. You'll guess what they do when you really start using them."
This article tries to string together the knowledge needed to learn Kernel Dump by analyzing a real case. Although some knowledge may only be used in this case, I believe the methods used can be applied everywhere.
rise
There is a VM downtime on the line. Just grabbed dump and got a test machine to start the analysis.The first thing you need is the symbol for the kernel version. If you don't know the version of the kernel beforehand, you can get the current version of the kernel from `strings corefile | grep'Linux version', such as 3.10.0-862.14.4.el7.x86_64
After getting the kernel version, download symbols from the appropriate distribution and system architecture to a specific symbols publishing page, where the distribution is Centos and you can access http://debuginfo.centos.org/ Download.If it's a Ubuntu release, you can go to http://ddebs.ubuntu.com/ Download.It's easy to find the symbols for the specified kernel version by simply searching for them with the kernel version 3.10.0-862.14.4.el7.x86_64. Usually we only need two of the following three symbols, but I can't remember which one, so I'll download and install all three: kernel-debug-debuginfo-3.10.0-862.14.4.7.x86_64.rpm, ke-debuginfo-3.10.Rnel-debuginfo-3.10.0-862.14.4.el7.x86_64.rpm, kernel-debuginfo-common-x86_64-3.10.0-862.14.4.el7.x86_64.rpm.The symbols of common need to be installed before other symbols can be installed due to dependencies. Additionally, if the kernel version on the test machine is newer than that of corefile, the --force option is required to install.
Undertaking
After the symbols are installed, you can load the corefile and symbols through crash.
[root@iZxxx1Z crash]# crash /lib/debug/lib/modules/3.10.0-862.14.4.el7.x86_64/vmlinux i-2xxxpe.236725344.208176-2019-02-27-22\:44\:20 crash 7.2.3-8.el7 Copyright (C) 2002-2017 Red Hat, Inc. Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation Copyright (C) 1999-2006 Hewlett-Packard Co Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited Copyright (C) 2006, 2007 VA Linux Systems Japan K.K. Copyright (C) 2005, 2011 NEC Corporation Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc. Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc. This program is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Enter "help copying" to see the conditions. This program has absolutely no warranty. Enter "help warranty" for details. GNU gdb (GDB) 7.6 Copyright (C) 2013 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-unknown-linux-gnu"... WARNING: kernel relocated [168MB]: patching 82671 gdb minimal_symbol values KERNEL: /lib/debug/lib/modules/3.10.0-862.14.4.el7.x86_64/vmlinux DUMPFILE: i-2xxxpe.236725344.208176-2019-02-27-22:44:20 CPUS: 4 [OFFLINE: 3] DATE: Wed Feb 27 22:44:18 2019 UPTIME: 11:34:59 LOAD AVERAGE: 0.20, 0.07, 0.06 TASKS: 436 NODENAME: iZ2xxxpeZ RELEASE: 3.10.0-862.14.4.el7.x86_64 VERSION: #1 SMP Wed Sep 26 15:12:11 UTC 2018 MACHINE: x86_64 (2500 Mhz) MEMORY: 16 GB PANIC: "kernel BUG at drivers/virtio/virtio_ring.c:278!" PID: 278 COMMAND: "kworker/2:1H" TASK: ffff917c6d3b3f40 [THREAD_INFO: ffff917c64798000] CPU: 2 STATE: TASK_RUNNING (PANIC)
When crash loads a corefile, crash automatically outputs a piece of information that contains some basic information about the system, such as CPU, memory, architecture, and so on.If you want to see this information later in the analysis, you can view it through the sys command.The information we need to focus on here is
PANIC: "kernel BUG at drivers/virtio/virtio_ring.c:278!"
This information tells us that the system triggered a Bug in line 278 of the file drivers/virtio/virtio_ring.c. The reason the system knows it is a Bug is because the big man who wrote this code buried a detection point here, which we will see in the source code later.
After reading the above information, it is my habit to first see what the system was doing at that time, and then see the call stack through the bt command:
crash> bt PID: 278 TASK: ffff917c6d3b3f40 CPU: 2 COMMAND: "kworker/2:1H" #0 [ffff917c6479b8a0] die at ffffffff8b82f96b #1 [ffff917c6479b8d0] do_trap at ffffffff8bf1cea0 #2 [ffff917c6479b920] do_invalid_op at ffffffff8b82c284 #3 [ffff917c6479b9d0] invalid_op at ffffffff8bf28aee [exception RIP: virtqueue_add+1186] RIP: ffffffffc023a382 RSP: ffff917c6479ba80 RFLAGS: 00010097 RAX: 0000000000000081 RBX: ffff917c6c67d000 RCX: 0000000000000002 RDX: 0000000000000081 RSI: ffff917c6479bc00 RDI: ffff917c6c67d000 RBP: ffff917c6479bae8 R8: 0000000000000001 R9: ffff917c6b48a380 R10: ffff917c6b410000 R11: 0000000000000002 R12: ffff917c6479bc18 R13: ffff917c6479bc18 R14: ffff917c6479bc00 R15: 0000000000000001 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #4 [ffff917c6479ba80] __kmalloc at ffffffff8b9fbeb0 #5 [ffff917c6479baf0] virtqueue_add_sgs at ffffffffc023a437 [virtio_ring] #6 [ffff917c6479bb48] __virtblk_add_req at ffffffffc03874b2 [virtio_blk] #7 [ffff917c6479bc68] virtio_queue_rq at ffffffffc03876f9 [virtio_blk] #8 [ffff917c6479bcb8] blk_mq_dispatch_rq_list at ffffffff8bb2a8b6 #9 [ffff917c6479bd50] blk_mq_do_dispatch_sched at ffffffff8bb2f76e #10 [ffff917c6479bd98] blk_mq_sched_dispatch_requests at ffffffff8bb3012e #11 [ffff917c6479bde8] __blk_mq_run_hw_queue at ffffffff8bb29362 #12 [ffff917c6479be10] blk_mq_run_work_fn at ffffffff8bb295c5 #13 [ffff917c6479be20] process_one_work at ffffffff8b8b613f #14 [ffff917c6479be68] worker_thread at ffffffff8b8b71d6 #15 [ffff917c6479bec8] kthread at ffffffff8b8bdf21
The information printed in the figure above contains the function call stack and the values of each register, so here are a few of the more important registers.RIP points to the address of the instruction being executed. Before the downtime occurs, the last function executed by the system is virtqueue_add, resulting in the statement being downloaded at virtqueue_add+1186.Based on the x86_64 Linux system Function Call Convention RDI, RSI, RDX, RCX, R8, R9 are the first six parameters of the incoming function. If there are more than six parameters, the seventh parameter or more will be passed through the stack.Note that the value of the register may change during the actual function execution.
Now let's see what the ffffffffc023a382+1186 line of code is, and you can see the corresponding assembly through the dis command.But when we executed dis virtqueue_add+1186, we found an error, the content of which was a duplicate of symbol s.
crash> dis virtqueue_add+1186 dis: virtqueue_add+1186: duplicate text symbols found: ffffffffc0239ee0 (t) virtqueue_add [virtio_ring] ffffffffc023ab73 (t) virtqueue_add [virtio_ring]
We can calculate which symbol corresponds to the current virtqueue_add by the RIP value. The calculation method is simple, simply subtract the RIP value from the offset 1186.
crash> p/x 0xffffffffc023a382-1186 $1 = 0xffffffffc0239ee0
dis ffffffffc0239ee0 allows you to print out the assembly of virtqueue_add, found virtqueue_add+1186
0xffffffffc023a378 <virtqueue_add+1176>: mov 0x60(%rbx),%eax 0xffffffffc023a37b <virtqueue_add+1179>: jmpq 0xffffffffc023a299 <virtqueue_add+953> 0xffffffffc023a380 <virtqueue_add+1184>: ud2 0xffffffffc023a382 <virtqueue_add+1186>: ud2
This line assembly is actually ud2, which is an undefined statement, and it's the statement that caused the system to go down.At this point, you should usually look up to see what instructions have been executed before.The last instruction here is jmpq, this is an unconditional jump statement, jump to virtqueue_add+953, that is, the last instruction of ud2 directive must not be jmpq, you can guess there should be a jump directly in front of here, look up to find it
0xffffffffc0239ef4 <virtqueue_add+20>: mov %rsi,-0x48(%rbp) 0xffffffffc0239ef8 <virtqueue_add+24>: mov %edx,-0x2c(%rbp) 0xffffffffc0239efb <virtqueue_add+27>: mov %ecx,-0x38(%rbp) 0xffffffffc0239efe <virtqueue_add+30>: mov %r8d,-0x58(%rbp) 0xffffffffc0239f02 <virtqueue_add+34>: mov %r9,-0x60(%rbp) 0xffffffffc0239f06 <virtqueue_add+38>: je 0xffffffffc023a384 <virtqueue_add+1188> 0xffffffffc0239f0c <virtqueue_add+44>: cmpb $0x0,0x59(%rdi) 0xffffffffc0239f10 <virtqueue_add+48>: mov %rdi,%rbx 0xffffffffc0239f13 <virtqueue_add+51>: jne 0xffffffffc023a05d <virtqueue_add+381> 0xffffffffc0239f19 <virtqueue_add+57>: mov -0x2c(%rbp),%eax 0xffffffffc0239f1c <virtqueue_add+60>: cmp %eax,0x38(%rdi) 0xffffffffc0239f1f <virtqueue_add+63>: jb 0xffffffffc023a382 <virtqueue_add+1186>
You can see that the jumped statement is JB 0xffffffffc023a382 <virtqueue_add+1186>, and the jumped condition is that%eax is less than [%rdi+0x38].What are the values stored in%eax and [%rdi + 0x38]?This is the time to look at the source code.You can see the source file of the function through dis-l, but executing dis-l ffffffc0239ee0 directly will find that there is no information about the source file, which is usually the case when the corresponding debug module is not imported.You can see the current module through the mod command, find the module we need, find the directory of the module through mod -s, and import it through the same command, such as:
crash> mod -s virtio_ring MODULE NAME SIZE OBJECT FILE ffffffffc023c0e0 virtio_ring 22746 /usr/lib/debug/usr/lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/drivers/virtio/virtio_ring.ko.debug crash> mod -s virtio_ring /usr/lib/debug/usr/lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/drivers/virtio/virtio_ring.ko.debug MODULE NAME SIZE OBJECT FILE ffffffffc023c0e0 virtio_ring 22746 /usr/lib/debug/usr/lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/drivers/virtio/virtio_ring.ko.debug
After importing all the missing modules, execute dis-l ffffffc0239ee0 again to see the corresponding source files.Part of the virtqueue_add source code is as follows:
241 static inline int virtqueue_add(struct virtqueue *_vq, 242 struct scatterlist *sgs[], 243 unsigned int total_sg, 244 unsigned int out_sgs, 245 unsigned int in_sgs, 246 void *data, 247 gfp_t gfp) 248 { 249 struct vring_virtqueue *vq = to_vvq(_vq); 250 struct scatterlist *sg; 251 struct vring_desc *desc; 252 unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx; 253 int head; 254 bool indirect; 255 256 START_USE(vq); 257 258 BUG_ON(data == NULL); 259 260 if (unlikely(vq->broken)) { 261 END_USE(vq); 262 return -EIO; 263 } 264 265 #ifdef DEBUG 266 { 267 ktime_t now = ktime_get(); 268 269 /* No kick or get, with .1 second between? Warn. */ 270 if (vq->last_add_time_valid) 271 WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time)) 272 > 100); 273 vq->last_add_time = now; 274 vq->last_add_time_valid = true; 275 } 276 #endif 277 278 BUG_ON(total_sg > vq->vring.num); 279 BUG_ON(total_sg == 0); 280 281 head = vq->free_head; 282 283 /* If the host supports indirect descriptor tables, and we have multiple 284 * buffers, then go indirect. FIXME: tune this threshold */ 285 if (vq->indirect && total_sg > 1 && vq->vq.num_free) 286 desc = alloc_indirect(_vq, total_sg, gfp); 287 else 288 desc = NULL; 289 290 if (desc) { 291 /* Use a single buffer which doesn't continue */ 292 indirect = true; 293 /* Set up rest to use this indirect table. */ 294 i = 0; 295 descs_used = 1; 296 } else { 297 indirect = false; 298 desc = vq->vring.desc; 299 i = head; 300 descs_used = total_sg; 301 }
By combining the information printed by dis-l ffffffffc0239ee0 with previous analysis, we can see that: 1. The first five parameters of virtqueue_add are struct virtqueue, struct scatterlist, unsigned int, unsigned int, unsigned int, corresponding to the values of the five registers: RDI, RSI, RDX, RCX, R8.2. The statement triggering the bug is BUG_ON (total_sg > vq->vring.num) on line 278;
The structure of the first parameter can be resolved by struct virtqueue ffff917c6c67d000:
crash> struct virtqueue ffff917c6c67d000 struct virtqueue { list = { next = 0xffff917bbefa82c8, prev = 0xffff917bbefa82c8 }, callback = 0xffffffffc0387000, name = 0xffff917c6c7a1dcc "req.0", vdev = 0xffff917bbefa8000, index = 0, num_free = 1, priv = 0xffffadff81b5e010 }
Back to the%eax and [%rdi + 0x38] we just discussed, [%rdi + 0x38] is essentially the value of offset 0x38 in virtqueue, and the offsets of virtqueue members can be printed out by struct-o virtqueue:
crash> struct -o virtqueue struct virtqueue { [0] struct list_head list; [16] void (*callback)(struct virtqueue *); [24] const char *name; [32] struct virtio_device *vdev; [40] unsigned int index; [44] unsigned int num_free; [48] void *priv; } SIZE: 56
There is another problem here. 0x38 is 56 decimal, and the size of this structure is only 56 bytes in total. Is that an overflow?After reading the code carefully, you can see that there is a sentence struct vring_virtqueue *vq = to_v vq(_vq);, to_vq is a macro, defined as #define to_vq(_vq) container_of (_vq, struct vring_virtqueue, vq), which is actually an encapsulation of container_of. The function of container_of is very close to the literal meaning, where the virtqueue type _vq variable is actually VA member variable of VQ variable of ring_virtqueue type, VQ is calculated by container_of(_vq, struct vring_virtqueue, vq).We look at the structure of vring_virtqueue by struct-o vring_virtqueue:
crash> struct -o vring_virtqueue struct vring_virtqueue { [0] struct virtqueue vq; [56] struct vring vring; [88] bool weak_barriers; [89] bool broken; [90] bool indirect; [91] bool event; [92] unsigned int free_head; [96] unsigned int num_added; [100] u16 last_used_idx; [104] bool (*notify)(struct virtqueue *); [112] bool we_own_ring; [120] size_t queue_size_in_bytes; [128] dma_addr_t queue_dma_addr; [136] struct vring_desc_state desc_state[]; } SIZE: 136
You can see that the virtqueue structure is actually offset to 0 by vring_virtqueue, so it can be resolved directly by struct vring_virtqueue ffff917c6c67d000
vring_virtqueue structure:
crash> struct vring_virtqueue ffff917c6c67d000 struct vring_virtqueue { vq = { list = { next = 0xffff917bbefa82c8, prev = 0xffff917bbefa82c8 }, callback = 0xffffffffc0387000, name = 0xffff917c6c7a1dcc "req.0", vdev = 0xffff917bbefa8000, index = 0, num_free = 1, priv = 0xffffadff81b5e010 }, vring = { num = 128, desc = 0xffff917c6bf68000, avail = 0xffff917c6bf68800, used = 0xffff917c6bf69000 }, weak_barriers = true, broken = false, indirect = true, event = false, free_head = 94, num_added = 0, last_used_idx = 38531, notify = 0xffffffffc0249a50, we_own_ring = true, queue_size_in_bytes = 5126, queue_dma_addr = 17917444096, desc_state = 0xffff917c6c67d088 }
So [%rdi + 0x38] actually gets the value of the first member with an offset of 0 in the vring structure, which is 128.Now we have obtained the value of BUG_ON (total_sg > vq->vring.num) in the statement triggering the bug in this way; vq->vring.num; and total_sg is actually the third parameter of virtqueue_add, which is stored in the RDX register as 0x81, or 129 in decimal.
close
So far, I have only analyzed the relevant variables in these data structures, and have not explained the meaning of these variables or functions. Now we have verified the condition total_sg > vq->vring.num that triggers BUG, but why is this happening?To analyze the values of these variables, you need to go back to the function that called the function, and eventually you may need to go back all the way to the application that originated the IO request, which is obviously a very cumbersome thing.To put it another way, total_sg is the total number of scatterlists required by the vm. The scatterlist is a structure related to physical memory. It can be simply understood as the physical memory required by the vm, while vring is the carrier of data transmission in front and back of virtio. Here, the ability of front and back data transmission can be simply understood.The intuitive feeling is that total_sg really shouldn't be larger than vring.num, but in fact total_sg is only 1 greater than vring.num, and the value of total_sg here is different from what I just thought, since total_sg is of unsigned int type and the comparison here is total_sg > vq->vring.num, so I started thinking that total_sg overflowed.At this point of analysis, I tried BUG_ON (total_sg > vq->vring.num); I threw it into Google and found that this line of code has been patch ed out in a version, BUG_ON has been changed to a lower level WARN_ON_ONCE in the latest kernel, and the condition has been changed to total_sg > vq->vring.num && vq->indirect.That is, in the case of indirect descriptors, it is allowed to occur in total_sg > vq->vring.num, so how can I verify if the VM uses indirect descriptors?In fact, the member indirect in vring_virtqueue indicates whether indirect descriptors are used or not, and the value of indirect was printed when struct vring_virtqueue ffff917c6c67d000 was executed on it. The VM does use indirect descriptors, so this bug is actually triggered unreasonably, just by judging total_sg > vq->vring.numDowntime conditions are too strict.
summary
Even if this downtime problem is analyzed here, the final solution is to upgrade the kernel. Considering that the patch is not currently accepted by Centos authorities, it needs to be manually compiled and repaired or upgraded through a third-party repo.In fact, many strange problems can be solved by upgrading the kernel, but the latest kernels may also encounter strange problems. Who knows if the next time a kernel bug is discovered, the patch is written and finally accepted by the community?I hope you can get some benefit from this article.Where the article was written badly or incorrectly, please give your advice.