Solution to a problem of abnormal crash of Windows 7 virtual machine (BSOD) in KVM environment

Keywords: Windows network socket xml

Let's start with the environment:

Hardware

8 servers do hyper-integration architecture, software storage pool,
Each server has 96G memory, two Intel (R) Xeon (R) CPU E5-26700 @ 2.60GHz, and 32 threads.
Each server is a 1T SSD plus three 4T SATA, without raid card, two copies of data.
The server integrates two 1G ports to aggregate ports and provide services to the outside world.
A four-port gigabit network card is added to do port aggregation and internal data transmission.

Two. Software

The basic system is Ubuntu 16.04 LTS, a virtualization management platform based on KVM and libvirt.
Virtual Machine: Chinese Professional Version of Windows 7, 32 bits.
Each virtual machine allocates 3071M memory and two CPUs (2 sockets).
libvirt: 1.3.1

Three, phenomenon

windows on the server, occasionally stuck, most of them once or twice a day, a few three or four days stuck once, very few of them will have a longer interval.
Most of them occur in idle computer time.
There are many virtual machines that can access the Internet.
In addition, there are no other obvious rules.

IV. Question Investigation

1. Firstly, the server configuration, terminal configuration and network are checked.

Because this architecture has been deployed in many places and has been running steadily for many years, after checking the customer's on-site network, this reason is basically eliminated.

2. Later doubts about storage

Use iostat to continuously detect two servers for three hours.

iostat -x -m 1

From the logs recorded, it can be seen that the average response time of server storage has been in the normal range, and the load of reading and writing is not high.

3. Check the server logs

The default libvirt log is closed, edit / etc/libvirt/libvirtd.conf, and open the warning level log record.

log_level = 3
log_outputs="3:file:/var/log/libvirt/libvirtd.log"

Look at libvirt logs from/var/log/libvirt/libvirtd.log:

2017-03-15 06:03:08.894+0000: 13919: info : libvirt version: 1.3.1, package: 1ubuntu10.5 (Stefan Bader <stefan.bader@canonical.com> Thu, 06 Oct 2016 13:07:20 +0200)
2017-03-15 06:03:08.894+0000: 13919: info : hostname: sc7
2017-03-15 06:03:08.894+0000: 13919: warning : virStorageBackendVolOpen:1535 : ignoring permission error for '/media/vmpool/win7_test.img'
2017-03-15 06:03:48.180+0000: 13914: error : qemuMonitorIO:689 : internal error: End of file from monitor
2017-03-15 06:03:48.180+0000: 13914: error : virNetClientProgramDispatchError:177 : Cannot open log file: '/var/log/libvirt/qemu/vm105.log': Device or resource busy
2017-03-15 06:04:22.540+0000: 13918: error : qemuMigrationCheckJobStatus:2641 : operation failed: migration job: unexpectedly failed
2017-03-15 06:04:57.546+0000: 13918: warning : virStorageBackendVolOpen:1535 : ignoring permission error for '/media/vmpool/win7_test.img'
2017-03-15 06:06:12.507+0000: 13916: error : qemuMigrationCheckJobStatus:2641 : operation failed: migration job: unexpectedly failed
2017-03-15 06:11:21.182+0000: 13918: warning : virStorageBackendVolOpen:1535 : ignoring permission error for '/media/vmpool/win7_test.img'
2017-03-15 06:13:08.112+0000: 13920: error : qemuMigrationCheckJobStatus:2641 : operation failed: migration job: unexpectedly failed
2017-03-15 06:15:09.467+0000: 13919: warning : virStorageBackendVolOpen:1535 : ignoring permission error for '/media/vmpool/win7_test.img'
2017-03-15 06:35:17.680+0000: 13914: error : qemuMonitorIO:689 : internal error: End of file from monitor
2017-03-15 06:35:17.680+0000: 13914: error : virNetClientProgramDispatchError:177 : Cannot open log file: '/var/log/libvirt/qemu/vm102.log': Device or resource busy

2017-03-16 00:33:08.324+0000: 13914: error : virNetClientProgramDispatchError:177 : Cannot open log file: '/var/log/libvirt/qemu/vm105.log': Device or resource busy
2017-03-16 02:47:26.712+0000: 13919: warning : virStorageBackendVolOpen:1535 : ignoring permission error for '/media/vmpool/win7_test.img'
2017-03-16 02:52:03.184+0000: 13914: error : qemuMonitorIO:689 : internal error: End of file from monitor
2017-03-16 02:52:03.184+0000: 13914: error : virNetClientProgramDispatchError:177 : Cannot open log file: '/var/log/libvirt/qemu/vm113.log': Device or resource busy
2017-03-16 02:52:26.683+0000: 13919: warning : virStorageBackendVolOpen:1535 : ignoring permission error for '/media/vmpool/win7_test.img'

See that there are some virStorage Backend VolOpen, virNetClient Program Dispatch Error, and qemuMonitorIO errors in the log (omitted duplicates), but they are not the key issues and will not cause the virtual machine to crash.

View the logs of virtual machines from / var/log/libvirt/qemu/

2017-03-16 02:55:46.227+0000: starting up libvirt version: 1.3.1, package: 1ubuntu10.5 (Stefan Bader <stefan.bader@canonical.com> Thu, 06 Oct 2016 13:07:20 +0200), qemu version: 2.5.0 (Debian 1:2.5+dfsg-5ubuntu10.8), hostname: sc7
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin QEMU_AUDIO_DRV=spice /usr/bin/kvm-spice -name vm113 -S -machine pc-i440fx-xenial,accel=kvm,usb=off -cpu SandyBridge,+pdpe1gb,+osxsave,+dca,+pcid,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 3072 -realtime mlock=off -smp 2,sockets=1,cores=2,threads=1 -uuid 3fec386e-23ee-4853-b46a-37feb9ee3cf9 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-vm113/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -boot order=cd,menu=on,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/media/vmpool/vm113.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/media/vmdpool/vmd113.qcow2,format=qcow2,if=none,id=drive-virtio-disk1 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk1,id=virtio-disk1 -drive if=none,id=drive-ide0-1-1,readonly=on -device ide-cd,bus=ide.1,unit=1,drive=drive-ide0-1-1,id=ide0-1-1 -netdev tap,fd=34,id=hostnet0,vhost=on,vhostfd=42 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:a6:3a:68,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -spice port=5908,addr=0.0.0.0,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x6 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
char device redirected to /dev/pts/1 (label charserial0)
warning: host doesn't support requested feature: CPUID.01H:EDX.ds [bit 21]
warning: host doesn't support requested feature: CPUID.01H:EDX.acpi [bit 22]
warning: host doesn't support requested feature: CPUID.01H:EDX.ht [bit 28]
warning: host doesn't support requested feature: CPUID.01H:EDX.tm [bit 29]
warning: host doesn't support requested feature: CPUID.01H:EDX.pbe [bit 31]
warning: host doesn't support requested feature: CPUID.01H:ECX.dtes64 [bit 2]
warning: host doesn't support requested feature: CPUID.01H:ECX.monitor [bit 3]
warning: host doesn't support requested feature: CPUID.01H:ECX.ds_cpl [bit 4]
warning: host doesn't support requested feature: CPUID.01H:ECX.smx [bit 6]
warning: host doesn't support requested feature: CPUID.01H:ECX.est [bit 7]
warning: host doesn't support requested feature: CPUID.01H:ECX.tm2 [bit 8]
warning: host doesn't support requested feature: CPUID.01H:ECX.xtpr [bit 14]
warning: host doesn't support requested feature: CPUID.01H:ECX.pdcm [bit 15]
warning: host doesn't support requested feature: CPUID.01H:ECX.dca [bit 18]
warning: host doesn't support requested feature: CPUID.01H:ECX.osxsave [bit 27]
warning: host doesn't support requested feature: CPUID.01H:EDX.ds [bit 21]
warning: host doesn't support requested feature: CPUID.01H:EDX.acpi [bit 22]
warning: host doesn't support requested feature: CPUID.01H:EDX.ht [bit 28]
warning: host doesn't support requested feature: CPUID.01H:EDX.tm [bit 29]
warning: host doesn't support requested feature: CPUID.01H:EDX.pbe [bit 31]
warning: host doesn't support requested feature: CPUID.01H:ECX.dtes64 [bit 2]
warning: host doesn't support requested feature: CPUID.01H:ECX.monitor [bit 3]
warning: host doesn't support requested feature: CPUID.01H:ECX.ds_cpl [bit 4]
warning: host doesn't support requested feature: CPUID.01H:ECX.smx [bit 6]
warning: host doesn't support requested feature: CPUID.01H:ECX.est [bit 7]
warning: host doesn't support requested feature: CPUID.01H:ECX.tm2 [bit 8]
warning: host doesn't support requested feature: CPUID.01H:ECX.xtpr [bit 14]
warning: host doesn't support requested feature: CPUID.01H:ECX.pdcm [bit 15]
warning: host doesn't support requested feature: CPUID.01H:ECX.dca [bit 18]
warning: host doesn't support requested feature: CPUID.01H:ECX.osxsave [bit 27]
main_channel_link: add main channel client
main_channel_handle_parsed: net test: latency 15.188000 ms, bitrate 120548590 bps (114.964094 Mbps)
inputs_connect: inputs channel client create
red_dispatcher_set_cursor_peer: 
red_channel_client_disconnect: rcc=0x55f0084c69c0 (channel=0x55f007faa6f0 type=3 id=0)
red_channel_client_disconnect: rcc=0x7f9120236180 (channel=0x7f912020b990 type=2 id=0)
red_channel_client_disconnect: rcc=0x7f912022ff60 (channel=0x7f912020bf60 type=4 id=0)
red_peer_receive: Connection reset by peer
red_channel_client_disconnect: rcc=0x55f009402710 (channel=0x55f007f9f160 type=1 id=0)
main_channel_client_on_disconnect: rcc=0x55f009402710
snd_receive: Connection reset by peer
red_channel_client_disconnect_dummy: rcc=0x55f00919e910 (channel=0x55f00925ab90 type=6 id=0)
snd_channel_put: SndChannel=0x55f0087d8c50 freed
snd_receive: Connection reset by peer
red_channel_client_disconnect_dummy: rcc=0x55f0092349b0 (channel=0x55f00925a8a0 type=5 id=0)
snd_channel_put: SndChannel=0x55f008af3550 freed
red_client_destroy: destroy client 0x55f009496b80 with #channels=6
red_dispatcher_disconnect_cursor_peer: 
red_dispatcher_disconnect_display_peer: 
main_channel_link: add main channel client
main_channel_handle_parsed: net test: latency 16.994000 ms, bitrate 115386782 bps (110.041410 Mbps)
inputs_connect: inputs channel client create

The warning information in the middle is not supported by the cpu feature added when libvirt migrates virtual machines. Nor are there any critical errors that can cause crashes.

4. Analyzing the problems of virtual machine itself

After user modification, the virtual machine is optimized by several optimization tools. For the purpose of management, some system settings are manually adjusted and some permission modifications are made.

Copy the memory dump file when it crashes from windows. Memor.dmp.
Run windbg and configure symboal address by ctrl+s:

SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols

Open the dmp file:

Microsoft (R) Windows Debugger Version 6.11.0001.404 X86
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [C:\dmp\(98)MEMORY.DMP]
Kernel Summary Dump File: Only kernel address space is available

Symbol search path is: SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols;SRV*c:\mysymbol* http://msdl.microsoft.com/download/symbols
Executable search path is: 
Windows 7 Kernel Version 7601 (Service Pack 1) MP (2 procs) Free x86 compatible
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 7601.18205.x86fre.win7sp1_gdr.130708-1532
Machine Name:
Kernel base = 0x83c13000 PsLoadedModuleList = 0x83d5c4d0
Debug session time: Wed Mar 15 13:11:54.777 2017 (GMT+8)
System Uptime: 0 days 5:00:52.740
Loading Kernel Symbols
...............................................................
................................................................
.........................
Loading User Symbols

Loading unloaded module list
.......

It can be seen that the system crashed after running for five hours. Execute in windbg:

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 00000061, Clock interrupt time out interval in nominal clock ticks.
Arg2: 00000000, 0.
Arg3: 807a3120, The PRCB address of the hung processor.
Arg4: 00000001, 0.

Debugging Details:
------------------

*** ERROR: Module load completed but symbols could not be loaded for Ntfs.sys
*** ERROR: Module load completed but symbols could not be loaded for qutmdrv.sys

BUGCHECK_STR:  CLOCK_WATCHDOG_TIMEOUT_2_PROC

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

PROCESS_NAME:  System

CURRENT_IRQL:  1c

STACK_TEXT:  
807fc5e4 83c8e117 00000101 00000061 00000000 nt!KeBugCheckEx+0x1e
807fc620 83c8d763 00026160 00000000 0011a800 nt!KeAccumulateTicks+0x242
807fc660 83c8d610 83c4a2be 2e50d92a 00000000 nt!KeUpdateRunTime+0x145
807fc6b8 83c8ce13 8736e71b 8736e71b 000000d1 nt!KeUpdateSystemTime+0x613
807fc6b8 83c4a2be 8736e71b 8736e71b 000000d1 nt!KeUpdateSystemTimeAssist+0x13
807fc75c 83c93268 94484000 00000000 807fc7b4 nt!KeFlushMultipleRangeTb+0x2d5
807fc770 83c70ca0 00001000 876424e8 00000000 nt!MiFlushPteList+0x22
807fc838 83c9acdb 94485000 00001000 94272250 nt!MmSetAddressRangeModified+0x205
807fc8cc 8c0ebd3f 8763f998 00000000 00000000 nt!CcFlushCache+0x294
WARNING: Stack unwind information not available. Following frames may be wrong.
807fc91c 8c0f2dfc 87604da8 942670f8 00000000 Ntfs+0x7ed3f
807fc990 8c0ecd92 87604da8 87baa428 0c757330 Ntfs+0x85dfc
807fc9f8 83c49c1e 87361020 87baa428 87baa428 Ntfs+0x7fd92
807fca10 843ce20c 8735a300 87baa428 00000000 nt!IofCallDriver+0x63
807fca34 843ce3cb 807fca54 8735a300 00000000 FLTMGR!FltpLegacyProcessingAfterPreCallbacksCompleted+0x2aa
807fca6c 83c49c1e 8735a300 87baa428 874e9db0 FLTMGR!FltpDispatch+0xc5
807fca84 937684db 807fcaa4 83c49c1e 874e9db0 nt!IofCallDriver+0x63
807fca8c 83c49c1e 874e9db0 87baa428 87baa428 qutmdrv+0x4db
807fcaa4 83e3dbe9 87baa428 87646c98 00000000 nt!IofCallDriver+0x63
807fcac4 83e1c4bb 874e9db0 87646c98 00000000 nt!IopSynchronousServiceTail+0x1f8
807fcb30 83c508c6 87f0a590 807fcbc8 807fcbd0 nt!NtFlushBuffersFile+0x1d7
807fcb30 83c4e7ad 87f0a590 807fcbc8 807fcbd0 nt!KiSystemServicePostCall
807fcbb0 83e22b5d 80000564 807fcbc8 863e7798 nt!ZwFlushBuffersFile+0x11
807fcbd0 83eab328 8ce1a740 00000000 00000000 nt!CmpFileFlush+0x5a
807fcc00 83c9014b 00000000 00000000 863e7798 nt!CmpLazyFlushWorker+0x35
807fcc50 83e1c12b 00000001 b2d8e202 00000000 nt!ExpWorkerThread+0x10d
807fcc90 83cc3559 83c9003e 00000001 00000000 nt!PspSystemThreadStartup+0x9e
00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x19

STACK_COMMAND:  kb

SYMBOL_NAME:  ANALYSIS_INCONCLUSIVE

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: Unknown_Module

IMAGE_NAME:  Unknown_Image

DEBUG_FLR_IMAGE_TIMESTAMP:  0

FAILURE_BUCKET_ID:  CLOCK_WATCHDOG_TIMEOUT_2_PROC_ANALYSIS_INCONCLUSIVE

BUCKET_ID:  CLOCK_WATCHDOG_TIMEOUT_2_PROC_ANALYSIS_INCONCLUSIVE

Followup: MachineOwner

See the key error message:

CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 00000061, Clock interrupt time out interval in nominal clock ticks.
Arg2: 00000000, 0.
Arg3: 807a3120, The PRCB address of the hung processor.
Arg4: 00000001, 0.

CLOCK_WATCHDOG_TIMEOUT That's it. Look at the explanation on msdn:
https://msdn.microsoft.com/en-us/library/windows/hardware/ff557211(v=vs.85).aspx It is in the case of multiprocessors that the second processor fails to respond to the interrupt request of the system in time.
There is no more information.

V. Solving the Problem

The problem has been found.
Attempted solutions:

1. Change Virtual Machine to Single CPU

Observed for a period of time, the failure disappeared, but this is the way to cure the symptoms.

2. Change Virtual Machine to Single socket Double Core Mode

(windows 7 supports a maximum of two-way CPUs by default, and kvm allocates two vcpUs by default)

#libvirt xml <CPU>
 <topology sockets='1' cores='2' threads='1'/>

Observation was invalid.

3. The bugzilla.redhat.com found that:

Bug 784836 - new cpu's flags, to control hyper-v related features > - "hv_relaxed" to depress some ugly Windows watchdog timer, and enable "relaxed" timing

Intuition is associated with this BSOD.

Increase in libvirt xml:


<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

......
.....

  <qemu:commandline>
    <qemu:arg value='-cpu'/>
    <qemu:arg value='host,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,kvm=off,hv_vendor_id=xeon'/>
  </qemu:commandline>

</domain>

Watch for one night and solve the problem.

From the command line, the execution parameters of qemu are as follows:

libvirt+  4573  3.1  3.3 6576684 3363956 ?     Sl   06:53  26:10 qemu-system-x86_64 -enable-kvm -name vm84 -S -machine pc-i440fx-xenial,accel=kvm,usb=off -cpu SandyBridge,+pdpe1gb,+osxsave,+dca,+pcid,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 3072 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 1bae86e4-4e9e-44cb-a8ae-2a8a87142f75 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-vmtest/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -boot order=cd,menu=on,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/media/vmpool/vmtest.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/media/vmdpool/vmdtest.qcow2,format=qcow2,if=none,id=drive-virtio-disk1 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk1,id=virtio-disk1 -drive if=none,id=drive-ide0-1-1,readonly=on -device ide-cd,bus=ide.1,unit=1,drive=drive-ide0-1-1,id=ide0-1-1 -netdev tap,fd=24,id=hostnet0,vhost=on,vhostfd=40 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:6c:0f:62,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -spice port=5906,addr=0.0.0.0,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x6 -device hda-duplex,id=sound0-codec0,bus=sfaxianound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -cpu host,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,kvm=off,hv_vendor_id=xeon -msg timestamp=on

You can see that enabled: hv_relaxed.

Six, supplement

It was later discovered that libvirt-1.1.1 began to support the hyper-v feature, and HV started.

In libvirt XML < feature > add:

    &lt;hyperv&gt;
      &lt;relaxed state='on'/&gt;
      &lt;vapic state='on'/&gt;
      &lt;spinlocks state='on' retries='4096'/&gt;
    &lt;/hyperv&gt;

The same effect can be achieved.

VII. References

https://bugzilla.redhat.com/show_bug.cgi?id=784836 http://www.windowswally.com/how-to-fix-the-clock_watchdog_timeout-error/ https://msdn.microsoft.com/en-us/library/windows/hardware/ff557211(v=vs.85).aspx http://lists.nongnu.org/archive/html/qemu-devel/2014-02/msg00335.html

Posted by iceangel89 on Sun, 21 Apr 2019 15:51:34 -0700