Learn to analyze Kernel Dump from a virtualization front-end Bug

Preface

As you may know, there is a common tool for analyzing Kernel Dump called Crash. When I first started to learn how to analyze Kernel Dump, I spent a lot of time toggling about the use of this tool, but I always forgot its function.Later, one time when I participated in an internal sharing, I was very impressed by a big guy who said, "You don't have to remember how these tools work. You'll guess what they do when you really start using them."
This article tries to string together the knowledge needed to learn Kernel Dump by analyzing a real case. Although some knowledge may only be used in this case, I believe the methods used can be applied everywhere.

rise

There is a VM downtime on the line. Just grabbed dump and got a test machine to start the analysis.The first thing you need is the symbol for the kernel version. If you don't know the version of the kernel beforehand, you can get the current version of the kernel from `strings corefile | grep'Linux version', such as 3.10.0-862.14.4.el7.x86_64

After getting the kernel version, download symbols from the appropriate distribution and system architecture to a specific symbols publishing page, where the distribution is Centos and you can access http://debuginfo.centos.org/ Download.If it's a Ubuntu release, you can go to http://ddebs.ubuntu.com/ Download.It's easy to find the symbols for the specified kernel version by simply searching for them with the kernel version 3.10.0-862.14.4.el7.x86_64. Usually we only need two of the following three symbols, but I can't remember which one, so I'll download and install all three: kernel-debug-debuginfo-3.10.0-862.14.4.7.x86_64.rpm, ke-debuginfo-3.10.Rnel-debuginfo-3.10.0-862.14.4.el7.x86_64.rpm, kernel-debuginfo-common-x86_64-3.10.0-862.14.4.el7.x86_64.rpm.The symbols of common need to be installed before other symbols can be installed due to dependencies. Additionally, if the kernel version on the test machine is newer than that of corefile, the --force option is required to install.

Undertaking

After the symbols are installed, you can load the corefile and symbols through crash.

[root@iZxxx1Z crash]# crash /lib/debug/lib/modules/3.10.0-862.14.4.el7.x86_64/vmlinux i-2xxxpe.236725344.208176-2019-02-27-22\:44\:20

crash 7.2.3-8.el7
Copyright (C) 2002-2017  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [168MB]: patching 82671 gdb minimal_symbol values

      KERNEL: /lib/debug/lib/modules/3.10.0-862.14.4.el7.x86_64/vmlinux
    DUMPFILE: i-2xxxpe.236725344.208176-2019-02-27-22:44:20
        CPUS: 4 [OFFLINE: 3]
        DATE: Wed Feb 27 22:44:18 2019
      UPTIME: 11:34:59
LOAD AVERAGE: 0.20, 0.07, 0.06
       TASKS: 436
    NODENAME: iZ2xxxpeZ
     RELEASE: 3.10.0-862.14.4.el7.x86_64
     VERSION: #1 SMP Wed Sep 26 15:12:11 UTC 2018
     MACHINE: x86_64  (2500 Mhz)
      MEMORY: 16 GB
       PANIC: "kernel BUG at drivers/virtio/virtio_ring.c:278!"
         PID: 278
     COMMAND: "kworker/2:1H"
        TASK: ffff917c6d3b3f40  [THREAD_INFO: ffff917c64798000]
         CPU: 2
       STATE: TASK_RUNNING (PANIC)

When crash loads a corefile, crash automatically outputs a piece of information that contains some basic information about the system, such as CPU, memory, architecture, and so on.If you want to see this information later in the analysis, you can view it through the sys command.The information we need to focus on here is
PANIC: "kernel BUG at drivers/virtio/virtio_ring.c:278!"
This information tells us that the system triggered a Bug in line 278 of the file drivers/virtio/virtio_ring.c. The reason the system knows it is a Bug is because the big man who wrote this code buried a detection point here, which we will see in the source code later.
After reading the above information, it is my habit to first see what the system was doing at that time, and then see the call stack through the bt command:

crash> bt
PID: 278    TASK: ffff917c6d3b3f40  CPU: 2   COMMAND: "kworker/2:1H"
 #0 [ffff917c6479b8a0] die at ffffffff8b82f96b
 #1 [ffff917c6479b8d0] do_trap at ffffffff8bf1cea0
 #2 [ffff917c6479b920] do_invalid_op at ffffffff8b82c284
 #3 [ffff917c6479b9d0] invalid_op at ffffffff8bf28aee
    [exception RIP: virtqueue_add+1186]
    RIP: ffffffffc023a382  RSP: ffff917c6479ba80  RFLAGS: 00010097
    RAX: 0000000000000081  RBX: ffff917c6c67d000  RCX: 0000000000000002
    RDX: 0000000000000081  RSI: ffff917c6479bc00  RDI: ffff917c6c67d000
    RBP: ffff917c6479bae8   R8: 0000000000000001   R9: ffff917c6b48a380
    R10: ffff917c6b410000  R11: 0000000000000002  R12: ffff917c6479bc18
    R13: ffff917c6479bc18  R14: ffff917c6479bc00  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #4 [ffff917c6479ba80] __kmalloc at ffffffff8b9fbeb0
 #5 [ffff917c6479baf0] virtqueue_add_sgs at ffffffffc023a437 [virtio_ring]
 #6 [ffff917c6479bb48] __virtblk_add_req at ffffffffc03874b2 [virtio_blk]
 #7 [ffff917c6479bc68] virtio_queue_rq at ffffffffc03876f9 [virtio_blk]
 #8 [ffff917c6479bcb8] blk_mq_dispatch_rq_list at ffffffff8bb2a8b6
 #9 [ffff917c6479bd50] blk_mq_do_dispatch_sched at ffffffff8bb2f76e
#10 [ffff917c6479bd98] blk_mq_sched_dispatch_requests at ffffffff8bb3012e
#11 [ffff917c6479bde8] __blk_mq_run_hw_queue at ffffffff8bb29362
#12 [ffff917c6479be10] blk_mq_run_work_fn at ffffffff8bb295c5
#13 [ffff917c6479be20] process_one_work at ffffffff8b8b613f
#14 [ffff917c6479be68] worker_thread at ffffffff8b8b71d6
#15 [ffff917c6479bec8] kthread at ffffffff8b8bdf21

The information printed in the figure above contains the function call stack and the values of each register, so here are a few of the more important registers.RIP points to the address of the instruction being executed. Before the downtime occurs, the last function executed by the system is virtqueue_add, resulting in the statement being downloaded at virtqueue_add+1186.Based on the x86_64 Linux system Function Call Convention RDI, RSI, RDX, RCX, R8, R9 are the first six parameters of the incoming function. If there are more than six parameters, the seventh parameter or more will be passed through the stack.Note that the value of the register may change during the actual function execution.
Now let's see what the ffffffffc023a382+1186 line of code is, and you can see the corresponding assembly through the dis command.But when we executed dis virtqueue_add+1186, we found an error, the content of which was a duplicate of symbol s.

crash> dis virtqueue_add+1186
dis: virtqueue_add+1186: duplicate text symbols found:
ffffffffc0239ee0 (t) virtqueue_add [virtio_ring]
ffffffffc023ab73 (t) virtqueue_add [virtio_ring]

We can calculate which symbol corresponds to the current virtqueue_add by the RIP value. The calculation method is simple, simply subtract the RIP value from the offset 1186.

crash> p/x 0xffffffffc023a382-1186
$1 = 0xffffffffc0239ee0

dis ffffffffc0239ee0 allows you to print out the assembly of virtqueue_add, found virtqueue_add+1186

0xffffffffc023a378 <virtqueue_add+1176>:        mov    0x60(%rbx),%eax
0xffffffffc023a37b <virtqueue_add+1179>:        jmpq   0xffffffffc023a299 <virtqueue_add+953>
0xffffffffc023a380 <virtqueue_add+1184>:        ud2
0xffffffffc023a382 <virtqueue_add+1186>:        ud2

This line assembly is actually ud2, which is an undefined statement, and it's the statement that caused the system to go down.At this point, you should usually look up to see what instructions have been executed before.The last instruction here is jmpq, this is an unconditional jump statement, jump to virtqueue_add+953, that is, the last instruction of ud2 directive must not be jmpq, you can guess there should be a jump directly in front of here, look up to find it

0xffffffffc0239ef4 <virtqueue_add+20>:  mov    %rsi,-0x48(%rbp)
0xffffffffc0239ef8 <virtqueue_add+24>:  mov    %edx,-0x2c(%rbp)
0xffffffffc0239efb <virtqueue_add+27>:  mov    %ecx,-0x38(%rbp)
0xffffffffc0239efe <virtqueue_add+30>:  mov    %r8d,-0x58(%rbp)
0xffffffffc0239f02 <virtqueue_add+34>:  mov    %r9,-0x60(%rbp)
0xffffffffc0239f06 <virtqueue_add+38>:  je     0xffffffffc023a384 <virtqueue_add+1188>
0xffffffffc0239f0c <virtqueue_add+44>:  cmpb   $0x0,0x59(%rdi)
0xffffffffc0239f10 <virtqueue_add+48>:  mov    %rdi,%rbx
0xffffffffc0239f13 <virtqueue_add+51>:  jne    0xffffffffc023a05d <virtqueue_add+381>
0xffffffffc0239f19 <virtqueue_add+57>:  mov    -0x2c(%rbp),%eax
0xffffffffc0239f1c <virtqueue_add+60>:  cmp    %eax,0x38(%rdi)
0xffffffffc0239f1f <virtqueue_add+63>:  jb     0xffffffffc023a382 <virtqueue_add+1186>

You can see that the jumped statement is JB 0xffffffffc023a382 <virtqueue_add+1186>, and the jumped condition is that%eax is less than [%rdi+0x38].What are the values stored in%eax and [%rdi + 0x38]?This is the time to look at the source code.You can see the source file of the function through dis-l, but executing dis-l ffffffc0239ee0 directly will find that there is no information about the source file, which is usually the case when the corresponding debug module is not imported.You can see the current module through the mod command, find the module we need, find the directory of the module through mod -s, and import it through the same command, such as:

crash> mod -s virtio_ring
     MODULE       NAME                      SIZE  OBJECT FILE
ffffffffc023c0e0  virtio_ring              22746  /usr/lib/debug/usr/lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/drivers/virtio/virtio_ring.ko.debug
crash> mod -s virtio_ring /usr/lib/debug/usr/lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/drivers/virtio/virtio_ring.ko.debug
     MODULE       NAME                      SIZE  OBJECT FILE
ffffffffc023c0e0  virtio_ring              22746  /usr/lib/debug/usr/lib/modules/3.10.0-862.14.4.el7.x86_64/kernel/drivers/virtio/virtio_ring.ko.debug

After importing all the missing modules, execute dis-l ffffffc0239ee0 again to see the corresponding source files.Part of the virtqueue_add source code is as follows:

    241 static inline int virtqueue_add(struct virtqueue *_vq,
    242                                 struct scatterlist *sgs[],
    243                                 unsigned int total_sg,
    244                                 unsigned int out_sgs,
    245                                 unsigned int in_sgs,
    246                                 void *data,
    247                                 gfp_t gfp)
    248 {
    249         struct vring_virtqueue *vq = to_vvq(_vq);
    250         struct scatterlist *sg;
    251         struct vring_desc *desc;
    252         unsigned int i, n, avail, descs_used, uninitialized_var(prev), err_idx;
    253         int head;
    254         bool indirect;
    255
    256         START_USE(vq);
    257
    258         BUG_ON(data == NULL);
    259
    260         if (unlikely(vq->broken)) {
    261                 END_USE(vq);
    262                 return -EIO;
    263         }
    264
    265 #ifdef DEBUG
    266         {
    267                 ktime_t now = ktime_get();
    268
    269                 /* No kick or get, with .1 second between?  Warn. */
    270                 if (vq->last_add_time_valid)
    271                         WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
    272                                             > 100);
    273                 vq->last_add_time = now;
    274                 vq->last_add_time_valid = true;
    275         }
    276 #endif
    277
    278         BUG_ON(total_sg > vq->vring.num);
    279         BUG_ON(total_sg == 0);
    280
    281         head = vq->free_head;
    282
    283         /* If the host supports indirect descriptor tables, and we have multiple
    284          * buffers, then go indirect. FIXME: tune this threshold */
    285         if (vq->indirect && total_sg > 1 && vq->vq.num_free)
    286                 desc = alloc_indirect(_vq, total_sg, gfp);
    287         else
    288                 desc = NULL;
    289
    290         if (desc) {
    291                 /* Use a single buffer which doesn't continue */
    292                 indirect = true;
    293                 /* Set up rest to use this indirect table. */
    294                 i = 0;
    295                 descs_used = 1;
    296         } else {
    297                 indirect = false;
    298                 desc = vq->vring.desc;
    299                 i = head;
    300                 descs_used = total_sg;
    301         }

By combining the information printed by dis-l ffffffffc0239ee0 with previous analysis, we can see that: 1. The first five parameters of virtqueue_add are struct virtqueue, struct scatterlist, unsigned int, unsigned int, unsigned int, corresponding to the values of the five registers: RDI, RSI, RDX, RCX, R8.2. The statement triggering the bug is BUG_ON (total_sg > vq->vring.num) on line 278;
The structure of the first parameter can be resolved by struct virtqueue ffff917c6c67d000:

crash> struct virtqueue ffff917c6c67d000
struct virtqueue {
  list = {
    next = 0xffff917bbefa82c8,
    prev = 0xffff917bbefa82c8
  },
  callback = 0xffffffffc0387000,
  name = 0xffff917c6c7a1dcc "req.0",
  vdev = 0xffff917bbefa8000,
  index = 0,
  num_free = 1,
  priv = 0xffffadff81b5e010
}

Back to the%eax and [%rdi + 0x38] we just discussed, [%rdi + 0x38] is essentially the value of offset 0x38 in virtqueue, and the offsets of virtqueue members can be printed out by struct-o virtqueue:

crash> struct -o virtqueue
struct virtqueue {
   [0] struct list_head list;
  [16] void (*callback)(struct virtqueue *);
  [24] const char *name;
  [32] struct virtio_device *vdev;
  [40] unsigned int index;
  [44] unsigned int num_free;
  [48] void *priv;
}
SIZE: 56

There is another problem here. 0x38 is 56 decimal, and the size of this structure is only 56 bytes in total. Is that an overflow?After reading the code carefully, you can see that there is a sentence struct vring_virtqueue *vq = to_v vq(_vq);, to_vq is a macro, defined as #define to_vq(_vq) container_of (_vq, struct vring_virtqueue, vq), which is actually an encapsulation of container_of. The function of container_of is very close to the literal meaning, where the virtqueue type _vq variable is actually VA member variable of VQ variable of ring_virtqueue type, VQ is calculated by container_of(_vq, struct vring_virtqueue, vq).We look at the structure of vring_virtqueue by struct-o vring_virtqueue:

crash> struct -o vring_virtqueue
struct vring_virtqueue {
    [0] struct virtqueue vq;
   [56] struct vring vring;
   [88] bool weak_barriers;
   [89] bool broken;
   [90] bool indirect;
   [91] bool event;
   [92] unsigned int free_head;
   [96] unsigned int num_added;
  [100] u16 last_used_idx;
  [104] bool (*notify)(struct virtqueue *);
  [112] bool we_own_ring;
  [120] size_t queue_size_in_bytes;
  [128] dma_addr_t queue_dma_addr;
  [136] struct vring_desc_state desc_state[];
}
SIZE: 136

You can see that the virtqueue structure is actually offset to 0 by vring_virtqueue, so it can be resolved directly by struct vring_virtqueue ffff917c6c67d000
vring_virtqueue structure:

crash> struct vring_virtqueue ffff917c6c67d000
struct vring_virtqueue {
  vq = {
    list = {
      next = 0xffff917bbefa82c8,
      prev = 0xffff917bbefa82c8
    },
    callback = 0xffffffffc0387000,
    name = 0xffff917c6c7a1dcc "req.0",
    vdev = 0xffff917bbefa8000,
    index = 0,
    num_free = 1,
    priv = 0xffffadff81b5e010
  },
  vring = {
    num = 128,
    desc = 0xffff917c6bf68000,
    avail = 0xffff917c6bf68800,
    used = 0xffff917c6bf69000
  },
  weak_barriers = true,
  broken = false,
  indirect = true,
  event = false,
  free_head = 94,
  num_added = 0,
  last_used_idx = 38531,
  notify = 0xffffffffc0249a50,
  we_own_ring = true,
  queue_size_in_bytes = 5126,
  queue_dma_addr = 17917444096,
  desc_state = 0xffff917c6c67d088
}

So [%rdi + 0x38] actually gets the value of the first member with an offset of 0 in the vring structure, which is 128.Now we have obtained the value of BUG_ON (total_sg > vq->vring.num) in the statement triggering the bug in this way; vq->vring.num; and total_sg is actually the third parameter of virtqueue_add, which is stored in the RDX register as 0x81, or 129 in decimal.

close

So far, I have only analyzed the relevant variables in these data structures, and have not explained the meaning of these variables or functions. Now we have verified the condition total_sg > vq->vring.num that triggers BUG, but why is this happening?To analyze the values of these variables, you need to go back to the function that called the function, and eventually you may need to go back all the way to the application that originated the IO request, which is obviously a very cumbersome thing.To put it another way, total_sg is the total number of scatterlists required by the vm. The scatterlist is a structure related to physical memory. It can be simply understood as the physical memory required by the vm, while vring is the carrier of data transmission in front and back of virtio. Here, the ability of front and back data transmission can be simply understood.The intuitive feeling is that total_sg really shouldn't be larger than vring.num, but in fact total_sg is only 1 greater than vring.num, and the value of total_sg here is different from what I just thought, since total_sg is of unsigned int type and the comparison here is total_sg > vq->vring.num, so I started thinking that total_sg overflowed.At this point of analysis, I tried BUG_ON (total_sg > vq->vring.num); I threw it into Google and found that this line of code has been patch ed out in a version, BUG_ON has been changed to a lower level WARN_ON_ONCE in the latest kernel, and the condition has been changed to total_sg > vq->vring.num && vq->indirect.That is, in the case of indirect descriptors, it is allowed to occur in total_sg > vq->vring.num, so how can I verify if the VM uses indirect descriptors?In fact, the member indirect in vring_virtqueue indicates whether indirect descriptors are used or not, and the value of indirect was printed when struct vring_virtqueue ffff917c6c67d000 was executed on it. The VM does use indirect descriptors, so this bug is actually triggered unreasonably, just by judging total_sg > vq->vring.numDowntime conditions are too strict.

summary

Even if this downtime problem is analyzed here, the final solution is to upgrade the kernel. Considering that the patch is not currently accepted by Centos authorities, it needs to be manually compiled and repaired or upgraded through a third-party repo.In fact, many strange problems can be solved by upgrading the kernel, but the latest kernels may also encounter strange problems. Who knows if the next time a kernel bug is discovered, the patch is written and finally accepted by the community?I hope you can get some benefit from this article.Where the article was written badly or incorrectly, please give your advice.

Posted by NickG21 on Thu, 16 May 2019 02:04:55 -0700

Programmer Group