Preface
"Ftrace is a tracking tool built into the Linux kernel. Many Linux distributions have enabled various configurations of ftrace in the latest versions. One of the benefits of ftrace to Linux is that it can view what happens inside the kernel, better find the exact cause of the crash, and help developers create the right solution. This article is divided into two parts: the first part The second part discusses how to interact with ftrace in user space, the method to stop the tracker faster, debugging crash and finding out which kernel functions occupy the largest stack area. "
Kernel configuration options for Ftrace
CONFIG_FUNCTION_TRACER
CONFIG_FUNCTION_GRAPH_TRACER
CONFIG_STACK_TRACER
CONFIG_DYNAMIC_FTRACE
One of the most powerful tracers in ftrace is the function tracer. It uses GCC's -pg option to make every function in the kernel call a special function, mcount(). With CONFIG_DYNAMIC_FTRACE enabled, the mcount() call sites are converted into NOP instructions at boot, so the system runs at 100% performance. At compile time, all of the mcount() call sites are recorded in a list; that list is used to turn the call sites into NOPs when the system boots, and to convert them back into tracing calls when the function or function graph tracer is enabled. In addition, CONFIG_DYNAMIC_FTRACE provides the ability to filter which functions should be traced (an example follows the function tracer demo below).
The debugfs file interface
Once the kernel is built with the Ftrace configuration options, you can work with it through the interface Ftrace exposes in the debugfs file system, usually at /sys/kernel/debug/tracing (it can also be mounted manually).
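If debugfs is not mounted yet, it can be mounted by hand; on recent kernels the same directory is also reachable directly through tracefs at /sys/kernel/tracing:

[root@VM-centos ~]# mount -t debugfs nodev /sys/kernel/debug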
[root@VM-centos ~]# cd /sys/kernel/debug/tracing
[root@VM-centos tracing]# ls
available_events            dyn_ftrace_total_info     kprobe_profile       saved_tgids         snapshot            trace_marker_raw     uprobe_events
available_filter_functions  enabled_functions         max_graph_depth      set_event           stack_max_size      trace_options        uprobe_profile
available_tracers           error_log                 options              set_event_pid       stack_trace         trace_pipe
buffer_percent              events                    per_cpu              set_ftrace_filter   stack_trace_filter  trace_stat
buffer_size_kb              free_buffer               printk_formats       set_ftrace_notrace  timestamp_mode      tracing_cpumask
buffer_total_size_kb        function_profile_enabled  README               set_ftrace_pid      trace               tracing_max_latency
current_tracer              instances                 saved_cmdlines       set_graph_function  trace_clock         tracing_on
dynamic_events              kprobe_events             saved_cmdlines_size  set_graph_notrace   trace_marker        tracing_thresh
View available tracers
The available_tracers file shows which tracers are currently available on the system:
[root@VM-centos tracing]# pwd
/sys/kernel/debug/tracing
[root@VM-centos tracing]# cat available_tracers
blk function_graph wakeup_dl wakeup_rt wakeup function nop
Activate the function tracer
To activate a tracer, echo its name into the current_tracer file:
[root@VM-centos tracing]# echo function > current_tracer
[root@VM-centos tracing]# cat current_tracer
function
[root@VM-centos tracing]# cat trace | head -20
# tracer: function
#
# entries-in-buffer/entries-written: 409933/78187643   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
           sleep-430870 [003] .... 128241.811737: lookup_fast <-walk_component
           sleep-430870 [003] .... 128241.811737: __d_lookup_rcu <-lookup_fast
           sleep-430870 [003] .... 128241.811737: __follow_mount_rcu.isra.53.part.54 <-lookup_fast
           sleep-430870 [003] .... 128241.811737: inode_permission <-link_path_walk.part.60
           sleep-430870 [003] .... 128241.811737: generic_permission <-inode_permission
           sleep-430870 [003] .... 128241.811737: walk_component <-link_path_walk.part.60
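As mentioned earlier, CONFIG_DYNAMIC_FTRACE also lets you restrict tracing to particular functions through the set_ftrace_filter file. A quick illustration (the glob here is just an example; filtering is covered in depth in the LWN articles referenced at the end):

[root@VM-centos tracing]# echo 'vfs_*' > set_ftrace_filter    # trace only functions matching the glob
[root@VM-centos tracing]# cat set_ftrace_filter               # lists the functions actually matched
[root@VM-centos tracing]# echo > set_ftrace_filter            # an empty write clears the filter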
Activate the function graph tracer
The function graph tracer, created by Frederic Weisbecker, traces both the entry and the exit of functions, which lets it report the depth of each called function and makes it much easier to follow the flow of execution inside the kernel. To do this, it hijacks the return address of a function and inserts a tracing callback at function exit. This defeats the CPU's branch prediction and causes more overhead than the function tracer; overhead close to the function tracer's is seen only on leaf functions.
[root@VM-centos tracing]# cat available_tracers
blk function_graph wakeup_dl wakeup_rt wakeup function nop
[root@VM-centos tracing]# echo function_graph > current_tracer
[root@VM-centos tracing]# cat trace | head -50
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 1)   0.121 us    |  __tlb_remove_page_size();
 1)   0.110 us    |  vm_normal_page();
 1)   0.110 us    |  mark_page_accessed();
 1)               |  page_remove_rmap() {
 1)               |    lock_page_memcg() {
 1)   0.110 us    |      lock_page_memcg.part.42();
 1)   0.320 us    |    } /* lock_page_memcg */
 1)   0.110 us    |    PageHuge();
 1)               |    unlock_page_memcg() {
 1)   0.110 us    |      __unlock_page_memcg();
 1)   0.321 us    |    } /* unlock_page_memcg */
 1)   1.182 us    |  } /* page_remove_rmap */
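If deep call chains make the graph hard to read (and add overhead), the max_graph_depth file, visible in the directory listing earlier, bounds how many levels are traced; writing 0 removes the limit. For example:

[root@VM-centos tracing]# echo 3 > max_graph_depth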
Using the trace_printk() function
The printk() function provides a good debugging facility, but it comes with some recurring problems:
- Inserting printk() calls into frequently executed functions can seriously slow the system's response; printing log messages to the console during boot also slows startup noticeably.
- The debugging output gets mixed in with logs from other subsystems, which makes it inconvenient to read.
- When debugging a high-volume area such as the timer interrupt, the scheduler, or the network, printk() can bog the system down, and may even create a live lock.
Ftrace introduces a new form of printk() called trace_printk(). It can be used just like printk(), including in any context (interrupt handlers, NMI handlers, and the scheduler). The advantage of trace_printk() is that it does not print to the console; instead it writes into the ftrace ring buffer, from which the output can be read via the trace file. A trace_printk() write to the ring buffer takes only about a tenth of a microsecond, whereas a printk(), especially one going to the serial console, may take several milliseconds per write. This performance advantage makes it possible to record even the most sensitive areas of the kernel with very little impact. Moreover, with the nop tracer selected, the output still appears in the trace file, undisturbed by function call information.
For example, insert a trace_printk() call into a kernel subsystem or a kernel module:
trace_printk("read foo %d out of bar %p\n", bar->foo, bar);
The message can then be seen by reading the trace file:
[tracing]# cat trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <...>-10690 [003] 17279.332920: : read foo 10 out of bar ffff880013a5bef8
The output of trace_printk() shows up under any tracer, even the function and function graph tracers:
[tracing]# echo function_graph > current_tracer
[tracing]# insmod ~/modules/foo.ko
[tracing]# cat trace
# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 3) + 16.283 us   |  }
 3) + 17.364 us   |  }
 3)               |  do_one_initcall() {
 3)               |    /* read foo 10 out of bar ffff88001191bef8 */
 3)   4.221 us    |  }
 3)               |  __wake_up() {
 3)   0.633 us    |    _spin_lock_irqsave();
 3)   0.538 us    |    __wake_up_common();
 3)   0.563 us    |    _spin_unlock_irqrestore();
Under the function and function graph tracers, trace_printk() output appears as a comment annotation in the trace.
Start and pause tracing
Sometimes you only want to trace a specific code path, or to see what happens while a specific test runs. The tracing_on file is used to disable recording into the ring buffer:
[tracing]# echo 0 > tracing_on
This only prevents Ftrace from recording into the ring buffer; the calls to the tracer still happen and still incur much of the overhead. To re-enable recording, just write "1" into the file:
[tracing]# echo 1 > tracing_on
Note that it is important to leave a space between the number and the greater-than sign ">". Without the space, the shell treats the digit as a file-descriptor number and redirects standard input or standard output instead of writing to the file:
[tracing]# echo 0> tracing_on /* this will not work! */
Recommended process for using a tracer
[tracing]# echo 0 > tracing_on
[tracing]# echo function_graph > current_tracer
[tracing]# echo 1 > tracing_on; run_test; echo 0 > tracing_on
The first line disables recording into the ring buffer; the next enables the function graph tracer. The tracer's overhead is present from this point on, but nothing is recorded. The last line enables the ring buffer, runs the test program, and disables the ring buffer again. This narrows what the function graph tracer flushes into the ring buffer down to just the data from the run_test program.
Disabling the ring buffer this way may not be fast enough: between the end of the run_test user process and the write that turns the ring buffer off, the kernel may execute other code paths and flush them into the ring buffer, which can wrap the buffer and lose the relevant data. How to turn the ring buffer off from within a user program is discussed below.
Writing to the Ftrace ring buffer from user space
To help synchronize actions in user space with events in kernel space, the trace_marker file was created. It provides a way to write into the Ftrace ring buffer from user space; the marker then appears in the trace file, marking the point at which a specific event occurred:
[tracing]# echo hello world > trace_marker
[tracing]# cat trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
           <...>-3718  [001]  5546.183420: 0: hello world
Reading and writing tracing_on and trace_marker from a user program
If an application is misbehaving and you need to find out what happens on the kernel side at specific points in it, you can open these files when the program starts and keep the file descriptors ready:
int trace_fd = -1;
int marker_fd = -1;

int main(int argc, char **argv)
{
	char *debugfs;
	char path[256];
	[...]
	/* find_debugfs() locates the debugfs mount point;
	   its definition is omitted here. */
	debugfs = find_debugfs();
	if (debugfs) {
		strcpy(path, debugfs);  /* BEWARE buffer overflow */
		strcat(path, "/tracing/tracing_on");
		trace_fd = open(path, O_WRONLY);
		if (trace_fd >= 0)
			write(trace_fd, "1", 1);

		strcpy(path, debugfs);
		strcat(path, "/tracing/trace_marker");
		marker_fd = open(path, O_WRONLY);
Then, at key locations in the code, markers can be placed to show where the application currently is:
if (marker_fd >= 0)
	write(marker_fd, "In critical area\n", 17);

if (critical_function() < 0) {
	/* we failed! */
	if (trace_fd >= 0)
		write(trace_fd, "0", 1);
}
If a problem is detected in the execution of critical_function(), writing the ASCII character "0" to the trace_fd file descriptor immediately stops tracing. Disabling the tracer only when the problem is detected preserves the history that led up to the error. If the program needs to trace in intervals, it can write an ASCII "1" to trace_fd to enable tracing again, as in the sketch below.
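A minimal sketch of such interval tracing, assuming trace_fd was opened as above (the traced function name is a placeholder):

/* Record only the window of interest. */
write(trace_fd, "1", 1);     /* resume recording into the ring buffer */
do_interesting_work();       /* placeholder for the region being studied */
write(trace_fd, "0", 1);     /* stop recording immediately afterwards */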
Here is another simple example using the initialization shown above (error checking omitted):
req.tv_sec = 0;
req.tv_nsec = 1000;
write(marker_fd, "before nano\n", 12);
nanosleep(&req, NULL);
write(marker_fd, "after nano\n", 11);
write(trace_fd, "0", 1);
The output is shown below. Note that the writes to the trace_marker file show up as comments in the function graph tracer's output.
[...]
 0)               |  __kmalloc() {
 0)   0.528 us    |    get_slab();
 0)   2.271 us    |  }
 0)               |  /* before nano */
 0)               |  kfree() {
 0)   0.475 us    |    __phys_addr();
 0)   2.062 us    |  }
 0)   0.608 us    |  inotify_inode_queue_event();
 0)   0.485 us    |  __fsnotify_parent();
[...]
 1)   0.523 us    |  _spin_unlock();
 0)   0.495 us    |  current_kernel_time();
 1)               |  it_real_fn() {
 0)   1.602 us    |  }
 1)   0.728 us    |  __rcu_read_lock();
 0)               |  sys_nanosleep() {
 0)               |    hrtimer_nanosleep() {
 0)   0.526 us    |      hrtimer_init();
 1)   0.418 us    |  __rcu_read_lock();
 0)               |      do_nanosleep() {
 1)   1.114 us    |  _spin_lock_irqsave();
[...]
 0)               |  __kmalloc() {
 1)   2.760 us    |  }
 0)   0.556 us    |    get_slab();
 1)               |  mwait_idle() {
 0)   1.851 us    |  }
 0)               |  /* after nano */
 0)               |  kfree() {
 0)   0.486 us    |    __phys_addr();
The first column of the output is the CPU number. When traces from different CPUs are interleaved like this, the trace can be hard to read. grep can filter out a single CPU easily, or you can use the per-CPU trace files:
[tracing]# ls /sys/kernel/tracing/per_cpu
cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  cpu6  cpu7
Each of these CPU directories has its own trace file that shows only that CPU's trace. To study the function graph tracer's records without interference from other CPUs, simply read per_cpu/cpu0/trace:
[tracing]# cat per_cpu/cpu0/trace
 0)               |  __kmalloc() {
 0)   0.528 us    |    get_slab();
 0)   2.271 us    |  }
 0)               |  /* before nano */
 0)               |  kfree() {
 0)   0.475 us    |    __phys_addr();
 0)   2.062 us    |  }
 0)   0.608 us    |  inotify_inode_queue_event();
 0)   0.485 us    |  __fsnotify_parent();
 0)   0.488 us    |  inotify_dentry_parent_queue_event();
 0)   1.106 us    |  fsnotify();
[...]
 0)   0.721 us    |  _spin_unlock_irqrestore();
 0)   3.380 us    |  }
 0)               |  audit_syscall_entry() {
 0)   0.495 us    |    current_kernel_time();
 0)   1.602 us    |  }
 0)               |  sys_nanosleep() {
 0)               |    hrtimer_nanosleep() {
 0)   0.526 us    |      hrtimer_init();
 0)               |      do_nanosleep() {
 0)               |        hrtimer_start_range_ns() {
 0)               |          __hrtimer_start_range_ns() {
 0)               |            lock_hrtimer_base() {
 0)   0.866 us    |              _spin_lock_irqsave();
[...]
 0)               |  __kmalloc() {
 0)               |    get_slab() {
 0)   1.851 us    |  }
 0)               |  /* after nano */
 0)               |  kfree() {
 0)   0.486 us    |    __phys_addr();
Stopping the tracer from within the kernel

While developing a kernel driver, a test program may hit strange errors; perhaps the driver goes to sleep and never wakes up. Trying to disable the tracer from user space when some kernel event occurs is difficult, and usually the ring buffer wraps and the relevant information is lost before the user manages to stop tracing.
There are two kernel functions that work well here: tracing_on() and tracing_off(). They act just like echoing "1" or "0", respectively, into the tracing_on file. If there is some condition in the kernel that can be checked for, the tracer can be stopped by adding code like this:
if (test_for_error()) tracing_off();
Next, add a few trace_printk() calls, recompile, and boot the kernel. You can then enable the function or function graph tracer and wait for the error condition to occur. Checking the tracing_on file lets you know when that happens: it switches from "1" to "0" when the kernel calls tracing_off(). After examining the trace, or saving it off in another file with:
cat trace > ~/trace.sav
you can continue tracing to check for another hit; just echo "1" into tracing_on and tracing resumes.
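Putting the pieces together, the kernel-side check might look like the following sketch; the stuck-state test and the names dev and DEV_STATE_STUCK are illustrative, not a real driver API:

/*
 * Illustrative only: freeze the ring buffer the moment a suspect
 * condition is seen, preserving the events that led up to it.
 */
if (dev->state == DEV_STATE_STUCK) {
	trace_printk("dev stuck in state %d, stopping trace\n", dev->state);
	tracing_off();	/* same effect as echo 0 > tracing_on */
}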
ftrace_dump_on_oops
Sometimes the kernel crashes. Using kdump/kexec with the crash utility is a valuable way to examine the state of the system at the point of the crash, but it cannot show what happened just before the event that triggered it. Enabling ftrace_dump_on_oops in the kernel boot parameters, or echoing "1" into /proc/sys/kernel/ftrace_dump_on_oops, makes Ftrace dump the entire trace ring buffer to the console in ASCII format on an oops or panic. Sending the console output to a serial log makes crashes much easier to debug, since you can now see the events that led up to the crash.
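For reference, the two ways of enabling the dump look like this (the first is a kernel command-line parameter, the second a run-time switch):

ftrace_dump_on_oops                                         # appended to the kernel boot line
[tracing]# echo 1 > /proc/sys/kernel/ftrace_dump_on_oops    # at run time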
Dumping to the console can take a long time, since the default Ftrace ring buffer is over a megabyte per CPU. To shrink the ring buffer, write the desired number of kilobytes into buffer_size_kb. Note that this value is per CPU, not the total size of the ring buffer:
[tracing]# echo 50 > buffer_size_kb
The above reduces the Ftrace ring buffer to 50 kilobytes per CPU. The buffer can also be dumped to the console with sysrq-z. To dump at a specific location, the kernel code can call ftrace_dump() directly. Note that doing so may permanently disable Ftrace, requiring a reboot to enable it again. The reason is that ftrace_dump() reads the ring buffer: the buffer is written to in all contexts (interrupt, NMI, scheduling), but reading it requires locking. To make ftrace_dump() work in a crash, the locking is disabled, and the buffer may end up corrupted after the output.
/*
 * The following code will lock up the box, so we dump out the
 * trace before we hit that location.
 */
ftrace_dump();
/* code that locks up */
Stack tracer
The final topic to discuss is the ability to examine the kernel stack size and how much stack space each function uses. Enabling the stack tracer (CONFIG_STACK_TRACER) shows where the stack is used the most.
The stack tracer is built on the function tracer infrastructure. It does not use the Ftrace ring buffer, but it does use the function tracer to hook into every function call. Because it builds on the function tracer, it adds no overhead when not enabled. To enable the stack tracer, echo "1" into /proc/sys/kernel/stack_tracer_enabled. To see the maximum stack size reached during boot, add "stacktrace" to the kernel boot parameters.
The stack tracer checks the size of the stack at every function call. If it is greater than the last recorded maximum, the stack trace is recorded and the maximum is updated to the new size. To see the current maximum, read the stack_max_size file:
[tracing]# echo 1 > /proc/sys/kernel/stack_tracer_enabled
[tracing]# cat stack_max_size
2928
[tracing]# cat stack_trace
        Depth   Size      Location    (34 entries)
        -----   ----      --------
  0)     2952      16   mempool_alloc_slab+0x15/0x17
  1)     2936     144   mempool_alloc+0x52/0x104
  2)     2792      16   scsi_sg_alloc+0x4a/0x4c [scsi_mod]
  3)     2776     112   __sg_alloc_table+0x62/0x103
[...]
 13)     2072      48   __elv_add_request+0x98/0x9f
 14)     2024     112   __make_request+0x43e/0x4bb
 15)     1912     224   generic_make_request+0x424/0x471
 16)     1688      80   submit_bio+0x108/0x115
 17)     1608      48   submit_bh+0xfc/0x11e
 18)     1560     112   __block_write_full_page+0x1ee/0x2e8
 19)     1448      80   block_write_full_page_endio+0xff/0x10e
 20)     1368      16   block_write_full_page+0x15/0x17
 21)     1352      16   blkdev_writepage+0x18/0x1a
 22)     1336      32   __writepage+0x1a/0x40
 23)     1304     304   write_cache_pages+0x241/0x3c1
 24)     1000      16   generic_writepages+0x27/0x29
[...]
 30)      424      64   bdi_writeback_task+0x3f/0xb0
 31)      360      48   bdi_start_fn+0x76/0xd7
 32)      312     128   kthread+0x7f/0x87
 33)      184     184   child_rip+0xa/0x20
This gives not only the maximum stack size found, but also a breakdown of the stack used by each function. Note that write_cache_pages has the largest footprint at 304 bytes, followed by generic_make_request at 224 bytes. To reset the maximum, echo "0" into the stack_max_size file:
[tracing]# echo 0 > stack_max_size
Leaving this enabled for a while shows where the kernel uses too much stack. But remember that the stack tracer incurs no overhead only when it is disabled; while it runs, performance drops.
Note that the stack tracer does not track the maximum stack size when the kernel switches to a separate stack. Because interrupts have their own stack, usage there is not tracked; the reason is that when the stack is not the current task's stack, there is no easy way to quickly tell what is at the top of the stack. With split stacks, a process stack may be two pages while the interrupt stack is only one. This may be fixed in the future, but keep it in mind when using the stack tracer.
Summary
Ftrace is a very powerful tool and is easy to configure; no extra utilities are needed. Everything shown in this tutorial can be used on embedded devices with nothing but Busybox installed. Leveraging the ftrace infrastructure should cut the time needed to debug hard-to-find race conditions. Feel free to say goodbye to printk(): the function and function graph tracers, combined with trace_printk() and tracing_off(), can become your main tools for debugging the Linux kernel.
References
Debugging the kernel using Ftrace - part 1 [LWN.net]: https://lwn.net/Articles/365835/
Debugging the kernel using Ftrace - part 2 [LWN.net]: https://lwn.net/Articles/366796/
Secrets of the Ftrace function tracer [LWN.net]: https://lwn.net/Articles/370423/
ftrace.rst, Linux kernel documentation: https://github.com/torvalds/linux/blob/master/Documentation/trace/ftrace.rst