Analyzing hard lockup detection with the NMI interrupt in kernel 2.6.24

Keywords: Linux

The CPU's NMI (non-maskable interrupt) is commonly used for hard lockup detection. The hardware is expected to deliver the NMI whether or not the CPU is locked up. By contrast, when a CPU is hard locked, its clock interrupt may never be serviced, so its clock tick count stops changing.

The NMI handler therefore detects a hard lockup by checking whether the current clock tick count is the same as the value seen at the previous NMI. The process is described below:

1. How the clock count is updated

On x86, every clock interrupt increments a per-CPU tick counter in the interrupt handler. In low-resolution mode, the counter is incremented as follows:

//arch/x86/kernel/time_32.c
irqreturn_t timer_interrupt(int irq, void *dev_id)
{
	/* Keep nmi watchdog up to date */
	per_cpu(irq_stat, smp_processor_id()).irq0_irqs++;

        //...
}

Once high-resolution mode is enabled, the counter is incremented in the handler for the local APIC timer interrupt (vector 0xef) instead:

//arch/x86/kernel/apic_32.c
/* The guts of the apic timer interrupt */
static void local_apic_timer_interrupt(void)
{
	int cpu = smp_processor_id();
	struct clock_event_device *evt = &per_cpu(lapic_events, cpu);
	//....
	per_cpu(irq_stat, cpu).apic_timer_irqs++;  //Count value +1
	evt->event_handler(evt);
}

2. NMI interrupt handling

The NMI is vector 2. Its gate is installed in the trap_init function, and its entry point is nmi, an assembly routine defined in arch/x86/kernel/entry_32.S (some content is omitted to highlight the important parts):

/* arch/x86/kernel/entry_32.S */
/* NMI is doubly nasty. It can happen _while_ we're handling
 * a debug fault, and the debug fault hasn't yet been able to
 * clear up the stack. So we first check whether we got  an
 * NMI on the sysenter entry path, but after that we need to
 * check whether we got an NMI on the debug path where the debug
 * fault happened on the sysenter path.*/
KPROBE_ENTRY(nmi)
	RING0_INT_FRAME
	pushl %eax
	CFI_ADJUST_CFA_OFFSET 4
	...
	je nmi_debug_stack_check
nmi_stack_correct:
	/* We have a RING0_INT_FRAME here */
	pushl %eax		# Fill the orig_eax slot of pt_regs
	CFI_ADJUST_CFA_OFFSET 4
	SAVE_ALL		# Save the register context
	xorl %edx,%edx		# Zero error code: second argument (error_code)
	movl %esp,%eax		# pt_regs pointer: first argument (regs)
	call do_nmi		# Enter the C-level handler
	jmp restore_nocheck_notrace
	CFI_ENDPROC

nmi_stack_fixup:
	RING0_INT_FRAME
	FIX_STACK(12,nmi_stack_correct, 1)
	jmp nmi_stack_correct
        ...
1:	INTERRUPT_RETURN
	CFI_ENDPROC
.section __ex_table,"a"
	.align 4
	.long 1b,iret_exc
.previous
KPROBE_END(nmi)

The NMI entry code first performs some segment-register and stack checks and fixes up the stack where necessary. It then pushes %eax to fill the orig_eax slot of the pt_regs frame, saves the registers with SAVE_ALL, loads the function arguments into %edx (a zero error code) and %eax (the pt_regs pointer), and calls do_nmi to start processing. The do_nmi function is defined as follows:

//arch/x86/kernel/traps_32.c
fastcall __kprobes void do_nmi(struct pt_regs * regs, long error_code)
{
	int cpu;
	nmi_enter();
	cpu = smp_processor_id();
	++nmi_count(cpu);
	if (!ignore_nmis)
		default_do_nmi(regs);  //Start processing

	nmi_exit();
}

do_nmi itself does little beyond bookkeeping. The interesting part is the default_do_nmi function, which is defined as follows:

//arch/x86/kernel/traps_32.c
static __kprobes void default_do_nmi(struct pt_regs * regs)
{
	unsigned char reason = 0;
	/* Only the BSP gets external NMIs from the system.  */
	if (!smp_processor_id())
		reason = get_nmi_reason();
	if (!(reason & 0xc0)) {
		if (notify_die(DIE_NMI_IPI, "nmi_ipi", regs, reason, 2, SIGINT)
							== NOTIFY_STOP)
			return;
#ifdef CONFIG_X86_LOCAL_APIC
		/* Ok, so this is none of the documented NMI sources,
		 * so it must be the NMI watchdog. */
		if (nmi_watchdog_tick(regs, reason))  //Check for a hard lockup
			return;
		if (!do_nmi_callback(regs, smp_processor_id()))
#endif
			unknown_nmi_error(reason, regs);
		return;
	}
	if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT) == NOTIFY_STOP)
		return;
	if (reason & 0x80)
		mem_parity_error(reason, regs);
	if (reason & 0x40)
		io_check_error(reason, regs);
	/* Reassert NMI in case it became active meanwhile
	 * as it's edge-triggered. */
	reassert_nmi();
}
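
The branching in default_do_nmi can be summarized as a small classifier over the reason byte. The following is only a user-space sketch: the enum, the macro names and demo_classify_nmi are hypothetical names introduced for illustration, and only the masks 0x80, 0x40 and 0xc0 come from the listing above.

```c
/* Bit masks taken from default_do_nmi() above. */
#define DEMO_NMI_REASON_MEM_PARITY 0x80
#define DEMO_NMI_REASON_IO_CHECK   0x40
#define DEMO_NMI_REASON_MASK       0xc0

/* Hypothetical result flags for this sketch; these are not kernel definitions. */
enum demo_nmi_source {
	DEMO_NMI_WATCHDOG_OR_UNKNOWN = 0,	/* no documented source bit is set */
	DEMO_NMI_MEM_PARITY = 1,
	DEMO_NMI_IO_CHECK = 2
};

/* Returns a bitmask, because both error bits may be set at the same time. */
unsigned int demo_classify_nmi(unsigned char reason)
{
	unsigned int sources = DEMO_NMI_WATCHDOG_OR_UNKNOWN;

	if (!(reason & DEMO_NMI_REASON_MASK))
		return sources;	/* the path that reaches nmi_watchdog_tick() */
	if (reason & DEMO_NMI_REASON_MEM_PARITY)
		sources |= DEMO_NMI_MEM_PARITY;
	if (reason & DEMO_NMI_REASON_IO_CHECK)
		sources |= DEMO_NMI_IO_CHECK;
	return sources;
}
```

Because only the boot processor reads the reason byte in the listing above, every other CPU keeps reason at 0 and always takes the first branch, which is the one that leads to the watchdog check.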

When CONFIG_X86_LOCAL_APIC is enabled, the function calls nmi_watchdog_tick, which is the key function for detecting a hard lockup. It is defined as follows:

__kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
{
	/* Since current_thread_info()-> is always on the stack, and we
	 * always switch the stack NMI-atomically, it's safe to use
	 * smp_processor_id(). */
	unsigned int sum;
	int touched = 0;
	int cpu = smp_processor_id();
	int rc=0;
	...
	if (cpu_isset(cpu, backtrace_mask)) {  //Dump a backtrace if requested for this CPU
		...
	}
	/* Take the local apic timer and PIT/HPET into account. We don't
	 * know which one is active, when we have highres/dyntick on */
	sum = per_cpu(irq_stat, cpu).apic_timer_irqs +
		per_cpu(irq_stat, cpu).irq0_irqs;

	/* if the none of the timers isn't firing, this cpu isn't doing much */
	if (!touched && last_irq_sums[cpu] == sum) {
		/*
		 * Ayiee, looks like this CPU is stuck ...
		 * wait a few IRQs (5 seconds) before doing the oops ...
		 */
		alert_counter[cpu]++;
		if (alert_counter[cpu] == 5*nmi_hz)
			/* die_nmi will return ONLY if NOTIFY_STOP happens..*/
			die_nmi(regs, "BUG: NMI Watchdog detected LOCKUP");
	} else {
		last_irq_sums[cpu] = sum;
		alert_counter[cpu] = 0;
	}
	.....
	return rc;
}

In the omitted part at the top of this function, notify_die(DIE_NMI, ...) is called first so that other NMI users registered on the die notifier chain can handle the event. The function then examines the clock counter. A trick is used here: the low-resolution counter (irq0_irqs) and the high-resolution counter (apic_timer_irqs) are always added together and the sum is used as the clock count. Whether or not high-resolution mode is enabled, the sum changes whenever a clock interrupt is serviced.
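
The effect of summing the two counters can be tried out in user space. This is a minimal sketch under stated assumptions: DEMO_NR_CPUS, struct demo_irq_stat, demo_timer_tick and demo_irq_sum are hypothetical stand-ins for the kernel's per-CPU irq_stat and its timer handlers, and the highres flag is a simplification of how the active timer is chosen.

```c
#include <assert.h>

#define DEMO_NR_CPUS 4	/* hypothetical CPU count for this sketch */

/* Stand-in for the two per-CPU irq_stat fields that the watchdog reads. */
struct demo_irq_stat {
	unsigned int irq0_irqs;		/* low-resolution path: timer_interrupt() */
	unsigned int apic_timer_irqs;	/* high-resolution path: local APIC timer */
};

struct demo_irq_stat demo_stats[DEMO_NR_CPUS];

/* Models one timer tick on cpu; highres selects which counter advances. */
void demo_timer_tick(int cpu, int highres)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	if (highres)
		demo_stats[cpu].apic_timer_irqs++;
	else
		demo_stats[cpu].irq0_irqs++;
}

/* The value the watchdog compares: it advances in either mode. */
unsigned int demo_irq_sum(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	return demo_stats[cpu].apic_timer_irqs + demo_stats[cpu].irq0_irqs;
}
```

Calling demo_timer_tick(0, 0) or demo_timer_tick(0, 1) each raises demo_irq_sum(0) by one, so a caller that only looks at the sum does not need to know which timer is active.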

If the clock count is unchanged between two consecutive NMIs, a hard lockup is suspected. To confirm it, the kernel keeps counting: only when the count has stayed unchanged for 5*nmi_hz consecutive NMIs (about 5 seconds) does it call die_nmi to produce the oops. Otherwise, it records the new clock count, resets the alert counter, and the hard lockup check is complete.
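
The confirmation step can be modeled outside the kernel as follows. This is a sketch, not the kernel implementation: struct demo_wd_state and demo_watchdog_tick are hypothetical names, the per-CPU arrays are reduced to one state struct, and a return value of 1 stands in for the call to die_nmi.

```c
/* Per-CPU watchdog bookkeeping; mirrors last_irq_sums[] and alert_counter[]. */
struct demo_wd_state {
	unsigned int last_irq_sum;
	unsigned int alert_counter;
};

/*
 * Models one NMI watchdog tick for one CPU.
 *   sum     : irq0_irqs + apic_timer_irqs sampled at this NMI
 *   touched : non-zero if the watchdog was reset since the previous NMI
 *   nmi_hz  : NMI watchdog ticks per second
 * Returns 1 when the lockup threshold is reached, 0 otherwise.
 */
int demo_watchdog_tick(struct demo_wd_state *st, unsigned int sum,
		       int touched, unsigned int nmi_hz)
{
	if (!touched && st->last_irq_sum == sum) {
		/* No timer interrupt was serviced since the previous NMI. */
		st->alert_counter++;
		if (st->alert_counter == 5 * nmi_hz)
			return 1;	/* the kernel calls die_nmi() here */
	} else {
		/* Progress was made: remember the new sum and start over. */
		st->last_irq_sum = sum;
		st->alert_counter = 0;
	}
	return 0;
}
```

With nmi_hz set to 1, the function returns 1 on the fifth consecutive NMI that sees the same sum, and any change in the sum before that point resets the count to zero.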

Reference documents:

https://www.cnblogs.com/muahao/p/7595158.html (explains hard lockups and soft lockups)

Professional Linux Kernel Architecture, Chapter 14.1

Intel 64 and IA-32 Architectures Software Developer's Manual

Posted by ngubie on Sun, 08 Mar 2020 00:11:38 -0800