Debugging the kernel using Ftrace

preface

"Ftrace is a tracking tool built into the Linux kernel. Many Linux distributions have enabled various configurations of ftrace in the latest versions. One of the benefits of ftrace to Linux is that it can view what happens inside the kernel, better find the exact cause of the crash, and help developers create the right solution. This article is divided into two parts: the first part The second part discusses how to interact with ftrace in user space, the method to stop the tracker faster, debugging crash and finding out which kernel functions occupy the largest stack area. "

Kernel configuration options for Ftrace

    CONFIG_FUNCTION_TRACER
    CONFIG_FUNCTION_GRAPH_TRACER
    CONFIG_STACK_TRACER
    CONFIG_DYNAMIC_FTRACE

One of the most powerful trackers of ftrace is the function tracker. It uses the - pg option of GCC to call a special function "mcount()" for each function in the kernel. When config is configured_ DYNAMIC_ During ftrace, when the system is started, calling the "mcount()" function will be converted into NOP null instructions, so as to ensure that the system runs with 100% performance. During compilation, the "mcount()" call sites are recorded. The list converts these call sites into NOP instructions when the system starts; When function or function graph tracker is enabled, convert all call points recorded in the list into tracking call functions. In addition, CONFIG_DYNAMIC_FTRACE provides the ability to filter which function should be tracked.

debugfs file interface

After the compilation option of Ftrace is configured in the system, users can work through the debugfs file system interface provided by Ftrace. Usually, the directory is "/ sys/kernel/debug/tracing" (you can also mount manually).

[root@VM-centos ~]# cd /sys/kernel/debug/tracing

[root@VM-centos tracing]# ls
available_events            dyn_ftrace_total_info     kprobe_profile       saved_tgids         snapshot            trace_marker_raw     uprobe_events
available_filter_functions  enabled_functions         max_graph_depth      set_event           stack_max_size      trace_options        uprobe_profile
available_tracers           error_log                 options              set_event_pid       stack_trace         trace_pipe
buffer_percent              events                    per_cpu              set_ftrace_filter   stack_trace_filter  trace_stat
buffer_size_kb              free_buffer               printk_formats       set_ftrace_notrace  timestamp_mode      tracing_cpumask
buffer_total_size_kb        function_profile_enabled  README               set_ftrace_pid      trace               tracing_max_latency
current_tracer              instances                 saved_cmdlines       set_graph_function  trace_clock         tracing_on
dynamic_events              kprobe_events             saved_cmdlines_size  set_graph_notrace   trace_marker        tracing_thresh

View available trackers

available_ The tracker file can be used to view the system. When money is available, which trackers are available:

[root@VM-centos tracing]# cat available_tracers
blk function_graph wakeup_dl wakeup_rt wakeup function nop

[root@VM-centos tracing]# pwd
/sys/kernel/debug/tracing

[root@VM-centos tracing]# cat available_tracers
blk function_graph wakeup_dl wakeup_rt wakeup function nop

Activate function tracker

echo the tracker name to be activated to current_ In the tracer file, you can activate this tracker.

[root@VM-centos tracing]# echo function > current_tracer 

[root@VM-centos tracing]# cat current_tracer 
function

[root@VM-centos tracing]# cat trace | head -20
# tracer: function
#
# entries-in-buffer/entries-written: 409933/78187643   #P:8
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
           sleep-430870  [003] .... 128241.811737: lookup_fast <-walk_component
           sleep-430870  [003] .... 128241.811737: __d_lookup_rcu <-lookup_fast
           sleep-430870  [003] .... 128241.811737: __follow_mount_rcu.isra.53.part.54 <-lookup_fast
           sleep-430870  [003] .... 128241.811737: inode_permission <-link_path_walk.part.60
           sleep-430870  [003] .... 128241.811737: generic_permission <-inode_permission
           sleep-430870  [003] .... 128241.811737: walk_component <-link_path_walk.part.60

Activate the function graph tracker

The function graph tracker created by Frederic Weisbecker can track function entry and exit, which enables the tracker to understand the depth of the called function. The function graph tracker can make it easier for people to track the execution control flow in the kernel. The function graph tracker hijacks the return address of the function and inserts the tracking callback function when the function exits. This destroys the branch prediction of the CPU and leads to more overhead than the function tracker. The closest real time overhead only appears in the leaf function.

[root@VM-centos tracing]# cat available_tracers 
blk function_graph wakeup_dl wakeup_rt wakeup function nop

[root@VM-centos tracing]# echo function_graph > current_tracer
 
[root@VM-centos tracing]# cat trace | head -50
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 1)   0.121 us    |                    __tlb_remove_page_size();
 1)   0.110 us    |                    vm_normal_page();
 1)   0.110 us    |                    mark_page_accessed();
 1)               |                    page_remove_rmap() {
 1)               |                      lock_page_memcg() {
 1)   0.110 us    |                        lock_page_memcg.part.42();
 1)   0.320 us    |                      }    //lock_ page_ Execution time of memcg
 1)   0.110 us    |                      PageHuge();
 1)               |                      unlock_page_memcg() {
 1)   0.110 us    |                        __unlock_page_memcg();
 1)   0.321 us    |                      }    //unlock_ page_ Execution time of memcg
 1)   1.182 us    |                    }      //page_ remove_ Execution time of rmap

Using trace_printk function

printk() function can provide good debugging log function, but there are often the following problems:

Insert print log into frequently called functions, which seriously delays the system response. In addition, the system starts and prints logs to the terminal, resulting in slow startup.
The debugging information is mixed with the logs of other subsystems, which is inconvenient to view.
If you are debugging high-capacity areas such as timer interrupts, schedulers, or networks, printk() may cause the system to get into trouble, and may even create real-time locks.

Ftrace introduces a new form of printk (), called trace_printk(), which can be used like printk (), can also be used in any context (interrupt control program, NMI program, and scheduler). trace_ The advantage of printk () is that it does not output to the terminal. On the contrary, it writes to the ring buffer of ftrace and can be read through the trace file. Using trace_printk() only takes about one tenth of a microsecond to write to the ring buffer, but with printk(), especially when writing to the serial console, each write may take several milliseconds. trace_ The performance advantages of printk () make it possible to record the most sensitive areas in the kernel with little impact. In addition, in the case of nop tracer, you can also see the output in trace, so that it will not be disturbed by the function call information.

For example, insert trace into the kernel subsystem or inner kernel module_ Printk function:

trace_printk("read foo %d out of bar %p\n", bar->foo, bar);

By viewing the trace file, you can see the trace log information.

[tracing]# cat trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <...>-10690 [003] 17279.332920: : read foo 10 out of bar ffff880013a5bef8

trace_ The output of printk () appears in any tracker, even the function and function graph trackers.

    [tracing]# echo function_graph > current_tracer
    [tracing]# insmod ~/modules/foo.ko
    [tracing]# cat trace
    # tracer: function_graph
    #
    # CPU  DURATION                  FUNCTION CALLS
    # |     |   |                     |   |   |   |
     3) + 16.283 us   |      }
     3) + 17.364 us   |    }
     3)               |    do_one_initcall() {
     3)               |      /* read foo 10 out of bar ffff88001191bef8 */
     3)   4.221 us    |    }
     3)               |    __wake_up() {
     3)   0.633 us    |      _spin_lock_irqsave();
     3)   0.538 us    |      __wake_up_common();
     3)   0.563 us    |      _spin_unlock_irqrestore();

trace_ The output of printk is more like the output annotation of function and function graph.

Start and pause tracking

Sometimes, you just want to chase a specific code path, or you just want to track what happens when you run a specific test_ On is used to prohibit recording data to the ring buffer:

    [tracing]# echo 0 > tracing_on

This prevents Ftrace from flushing records into the ring buffer, but all other calls to the tracker still occur and still incur a lot of overhead. If you need to re enable the ring buffer, you only need to write "1" to the file.

    [tracing]# echo 1 > tracing_on

Note that it is important to leave a space between the number and the greater than sign ">". Otherwise, you may be writing standard input or output to the file.

    [tracing]# echo 0> tracing_on   /* this will not work! */

Recommended process for using tracker

    [tracing]# echo 0 > tracing_on
    [tracing]# echo function_graph > current_tracer
    [tracing]# echo 1 > tracing_on; run_test; echo 0 > tracing_on

The first line prohibits the ring buffer from recording any data, Next, enable the function graph tracker. The overhead of the function graph tracker remains, but nothing is recorded into the trace buffer. The last line enables the ring buffer, runs the test program, and then disables the ring buffer. This narrows the data range that the function graph tracker washes into the ring buffer, including only tracking run_test program data.

The above method of disabling the ring buffer may not be fast enough to run_ Between the execution of the test user process and the closing of the ring buffer, the kernel may also execute other kernel function paths and flush into the ring buffer, which may overflow the ring buffer and lose relevant data. How to close the ring buffer from the user program will be discussed later.

Method of writing Ftrace ring buffer in user space

To help synchronize operations in user space and kernel space, trace is created_ Marker file. It provides a method to write Ftrace ring buffer from user space. The tag will then appear in the trace file to give the location of a specific event in the trace file.

    [tracing]# echo hello world > trace_marker
    [tracing]# cat trace
    # tracer: nop
    #
    #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
    #              | |       |          |         |
               <...>-3718  [001]  5546.183420: 0: hello world

User program reading and writing tracing_on and trace_ Method of marker file

If there is a problem with the user program and you need to find out what happened to the kernel part of the user program at a specific location, you can open these files to prepare the file descriptor when the user program starts:

    int trace_fd = -1;
    int marker_fd = -1;

    int main(int argc, char **argv)
    {
	    char *debugfs;
	    char path[256];
	    [...]

	    debugfs = find_debugfs();
	    if (debugfs) {
		    strcpy(path, debugfs);  /* BEWARE buffer overflow */
		    strcat(path,"/tracing/tracing_on");
		    trace_fd = open(path, O_WRONLY);
		    if (trace_fd >= 0)
			    write(trace_fd, "1", 1);

		    strcpy(path, debugfs);
		    strcat(path,"/tracing/trace_marker");
		    marker_fd = open(path, O_WRONLY);

Then, at a key location in the code, you can place a tag to show where the application is currently located:

    if (marker_fd >= 0)
	    write(marker_fd, "In critical area\n", 17);

    if (critical_function() < 0) {
	    /* we failed! */
	    if (trace_fd >= 0)
		    write(trace_fd, "0", 1);
    }

When critical is detected_ When there is a problem with the execution of the function() function, the ASCII character "0" is immediately written to trace_fd file descriptor will stop tracking. The tracker needs to be disabled only when a problem is detected in order to track and record the history that caused the error. If interval tracking is required in the user program, it can write ASCII "1" to trace_fd enable tracing again.

You can refer to this simple example again. It uses the above initialization process (this example does not add error checking):

    req.tv_sec = 0;
    req.tv_nsec = 1000;
    write(marker_fd, "before nano\n", 12);
    nanosleep(&req, NULL);
    write(marker_fd, "after nano\n", 11);
    write(trace_fd, "0", 1);

The output results are as follows. Please note that for trace_ The writing of the marker file is displayed as a comment in the function graph tracker.

    [...]
     0)               |      __kmalloc() {
     0)   0.528 us    |        get_slab();
     0)   2.271 us    |      }
     0)               |      /* before nano */
     0)               |      kfree() {
     0)   0.475 us    |        __phys_addr();
     0)   2.062 us    |      }
     0)   0.608 us    |      inotify_inode_queue_event();
     0)   0.485 us    |      __fsnotify_parent();
    [...]
     1)   0.523 us    |          _spin_unlock();
     0)   0.495 us    |    current_kernel_time();
     1)               |          it_real_fn() {
     0)   1.602 us    |  }
     1)   0.728 us    |            __rcu_read_lock();
     0)               |  sys_nanosleep() {
     0)               |    hrtimer_nanosleep() {
     0)   0.526 us    |      hrtimer_init();
     1)   0.418 us    |            __rcu_read_lock();
     0)               |      do_nanosleep() {
     1)   1.114 us    |            _spin_lock_irqsave();
    [...]
     0)               |      __kmalloc() {
     1)   2.760 us    |  }
     0)   0.556 us    |        get_slab();
     1)               |  mwait_idle() {
     0)   1.851 us    |      }
     0)               |      /* after nano */
     0)               |      kfree() {
     0)   0.486 us    |        __phys_addr();

The first column of the output result represents the number of the CPU core. When CPU traces are interleaved in this way, it may be difficult to read traces. The grep tool can easily filter it, or you can use per_ trace file of CPU:

    [tracing]# ls /sys/kernel/tracing/per_cpu
    cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7

There is a trace file in each of these CPU directories, which only displays the trace of the CPU. To better understand the records of the function graph tracker without interference from other CPUs, just look at the per_cpu/cpu0/trace.

    [tracing]# cat per_cpu/cpu0/trace
     0)               |      __kmalloc() {
     0)   0.528 us    |        get_slab();
     0)   2.271 us    |      }
     0)               |      /* before nano */
     0)               |      kfree() {
     0)   0.475 us    |        __phys_addr();
     0)   2.062 us    |      }
     0)   0.608 us    |      inotify_inode_queue_event();
     0)   0.485 us    |      __fsnotify_parent();
     0)   0.488 us    |      inotify_dentry_parent_queue_event();
     0)   1.106 us    |      fsnotify();
    [...]
     0)   0.721 us    |    _spin_unlock_irqrestore();
     0)   3.380 us    |  }
     0)               |  audit_syscall_entry() {
     0)   0.495 us    |    current_kernel_time();
     0)   1.602 us    |  }
     0)               |  sys_nanosleep() {
     0)               |    hrtimer_nanosleep() {
     0)   0.526 us    |      hrtimer_init();
     0)               |      do_nanosleep() {
     0)               |        hrtimer_start_range_ns() {
     0)               |          __hrtimer_start_range_ns() {
     0)               |            lock_hrtimer_base() {
     0)   0.866 us    |              _spin_lock_irqsave();
    [...]
     0)               |      __kmalloc() {
     0)               |        get_slab() {
     0)   1.851 us    |      }
     0)               |      /* after nano */
     0)               |      kfree() {
     0)   0.486 us    |        __phys_addr();

During the development of kernel driver, there may be strange errors in the test program. Maybe the driver falls asleep and never wakes up. When kernel events occur, it is difficult to try to disable the tracker from user space, which usually leads to ring buffer overflow and loss of related information, and then the user can stop tracking.

There are two functions in the kernel that work well: tracing_on() and tracing_off(). These two functions are like echo ing "1" or "0" to tracing, respectively_ On file. If there are some conditions in the kernel that can be checked, you can stop the tracker by adding the following:

    if (test_for_error())
	    tracing_off();

Next, add a few traces_ Printk(), recompile and boot the kernel. You can then enable the function or function graph tracker and wait for the error condition to occur. Checking tracing_ The on file will let you know when the error condition occurs. When the kernel calls tracing_ When off(), it switches from "1" to "0". After checking the trace, or saving it in another file:

cat trace > ~/trace.sav

You can continue tracking to check for another hit. To do this, just in tracing_ echo "1" in on, tracking will continue.

ftrace_dump_on_oops

Sometimes the kernel crashes. Using kdump/kexec with the crash utility is a valuable way to check the system state at the crash point, but it can't see what happened before the event that caused the crash. Configure Ftrace in the kernel boot parameters and enable ftrace_dump_on_oops, or through / proc / sys / kernel / Ftrace_ dump_ on_ echo "1" in oops will enable Ftrace to dump the entire tracked ring buffer to the console in ASCII format during oops or panic. Outputting the console to the serial log makes it easier to debug crashes, and you can now trace the events that caused the crash.

Dumping to the console can take a long time because the default Ftrace ring buffer exceeds 1 Megabyte per CPU. To reduce the size of the ring buffer, write the number of kilobytes you want the ring buffer to the buffer_size_kb. Note that this value is the total size of each CPU, not the ring buffer.

    [tracing]# echo 50 > buffer_size_kb

The above will reduce the Ftrace ring buffer to 50 KB per CPU. You can also dump the Ftrace buffer to the console using sysrq-z. To select a specific location for the kernel dump, the kernel can call Ftrace directly_ dump(). Note that this may permanently disable Ftrace and may require a reboot to enable it again. This is because ftrace_dump() reads the ring buffer. The ring buffer is written to all contexts (interrupt, NMI, scheduling), but the reading of the ring buffer needs to be locked. In order to be able to execute ftrace_dump() locking is disabled and the buffer may eventually be broken after output.

    /*
     * The following code will lock up the box, so we dump out the
     * trace before we hit that location.
     */
    ftrace_dump();

    /* code that locks up */

stack tracker

The final topic to discuss is the ability to check the kernel stack size and how much stack space each function uses. Enabling the stack tracker (config_stack_tracker) will show where the stack is most used.

The stack tracker is built from the function tracker as the infrastructure. It does not use the Ftrace ring buffer, but does use the function tracker to stake every function call. Because it uses the function tracker as the infrastructure, there is no overhead when it is not enabled. To enable the stack tracker, echo "1" to / proc/sys/kernel/ stack_tracer_enabled. To see the maximum stack size during startup, add "stacktrace" to the kernel startup parameter.

The stack tracker checks the size of the stack at each function call. If it is greater than the last recorded maximum, it will record it in the stack trace and update the maximum with the new size. To view the current maximum, view the stack_max_size file.

    [tracing]# echo 1 > /proc/sys/kernel/stack_tracer_enabled
    [tracing]# cat stack_max_size
    2928
    [tracing]# cat stack_trace
            Depth    Size   Location    (34 entries)
            -----    ----   --------
      0)     2952      16   mempool_alloc_slab+0x15/0x17
      1)     2936     144   mempool_alloc+0x52/0x104
      2)     2792      16   scsi_sg_alloc+0x4a/0x4c [scsi_mod]
      3)     2776     112   __sg_alloc_table+0x62/0x103
    [...]
     13)     2072      48   __elv_add_request+0x98/0x9f
     14)     2024     112   __make_request+0x43e/0x4bb
     15)     1912     224   generic_make_request+0x424/0x471
     16)     1688      80   submit_bio+0x108/0x115
     17)     1608      48   submit_bh+0xfc/0x11e
     18)     1560     112   __block_write_full_page+0x1ee/0x2e8
     19)     1448      80   block_write_full_page_endio+0xff/0x10e
     20)     1368      16   block_write_full_page+0x15/0x17
     21)     1352      16   blkdev_writepage+0x18/0x1a
     22)     1336      32   __writepage+0x1a/0x40
     23)     1304     304   write_cache_pages+0x241/0x3c1
     24)     1000      16   generic_writepages+0x27/0x29
    [...]
     30)      424      64   bdi_writeback_task+0x3f/0xb0
     31)      360      48   bdi_start_fn+0x76/0xd7
     32)      312     128   kthread+0x7f/0x87
     33)      184     184   child_rip+0xa/0x20

This not only gives you the maximum stack size found, but also shows a breakdown of the stack size used by each function. Note that write_cache_pages has the largest stack, using 304 bytes, followed by generic_make_request, using a 224 byte stack. To reset the maximum value, echo "0" to stack_max_size file.

    [tracing]# echo 0 > stack_max_size

Keeping running for a while will show where the kernel uses too many stacks. But remember that the stack tracker has no overhead only when it is not enabled. When it runs, the performance will decrease.

Note that when the kernel uses a separate stack, the stack tracker does not track the maximum stack size. Because an interrupt has its own stack, it does not track stack usage there. The reason is that when the stack is not the stack of the current task, there is no simple way to quickly see what is at the top of the stack. When using split stack, the process stack may be two pages, while the interrupt stack may be only one page. This may be fixed in the future, but keep this in mind when using the stack tracker.

summary

Ftrace is a very powerful tool and is easy to configure without additional tools. Everything shown in this tutorial can be used on embedded devices with only Busybox installed. Leveraging the ftrace infrastructure should reduce the time required to debug hard to find competitive conditions. Say goodbye to printk () boldly, because connect the function and function graph trackers with trace_printk() and tracing_ Using off () together can become the main tool for debugging Linux kernel.

Reference catalogue

Debugging the kernel using Ftrace - part 1 [LWN.net]https://lwn.net/Articles/365835/Debugging the kernel using Ftrace - part 2 [LWN.net]https://lwn.net/Articles/366796/Secrets of the Ftrace function tracer [LWN.net]https://lwn.net/Articles/370423/linux/ftrace.rst at master · torvalds/linux · GitHubLinux kernel source tree. Contribute to torvalds/linux development by creating an account on GitHub.https://github.com/torvalds/linux/blob/master/Documentation/trace/ftrace.rst

Posted by Rhiknow on Sun, 21 Nov 2021 12:39:28 -0800

Programmer Group