Data plane offloading principle of the virtio network -- vhost-net master

Keywords: virtualization, qemu

Interface usage


  • When configuring the virtio network card, libvirt enables the vhost feature by specifying the driver name as vhost; the two elements below sit inside the domain XML's <interface> element:
 	  <model type='virtio'/>
 	  <driver name='vhost'/>


  • qemu models vhost as a property of the back-end network device. The command line below gives the virtio network card a tap device as its back end and enables vhost on it:
-netdev tap,fd=37,id=hostnet0,vhost=on,vhostfd=38 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=24:42:54:20:50:46,bus=pci.0,addr=0x7

data structures

configuration information

  1. The virtqueue addresses: the host virtual addresses (HVAs) of vring's three key elements, i.e. the descriptor table base address, the available ring address and the used ring address. The slave must know these to reach the data placed on the virtio queue.
  2. The guest memory layout: in qemu this is the set of RAM MemoryRegionSections, which together cover all the memory qemu allocated to the virtual machine. The virtqueue stores only the addresses of the data (metadata); the real data is scattered across the virtual machine's physical address space, so the slave must know the whole memory layout in order to access the buffers the virtqueue points to.
  3. The ioeventfd: the "front-end data available" notification. Traditionally the kernel notifies user-space qemu through this descriptor that a virtqueue has data. Once the slave holds this descriptor instead, it receives the front end's notification and can process the data on the virtqueue.
  4. The irqfd: the "back-end processing done" notification. The kernel injects an interrupt into a virtual machine vCPU through this descriptor. The slave uses it to tell the front end that the data on the virtqueue has been processed.
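Each of the four kinds of configuration above reaches vhost-net as an ioctl on the vhost fd, identified by a command word that encodes the payload struct. A minimal sketch with simplified local stand-ins for the kernel structs (the MY_ names are local; the numeric values mirror include/uapi/linux/vhost.h):

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>   /* _IOW, _IOC_SIZE on Linux */

/* Simplified local copies of the kernel payload structs */
struct vring_addr_cfg { unsigned int index, flags;
                        uint64_t desc_user_addr, used_user_addr,
                                 avail_user_addr, log_guest_addr; };
struct vring_file_cfg { unsigned int index; int fd; };
struct mem_cfg        { uint32_t nregions, padding; /* + region array */ };

#define VHOST_VIRTIO 0xAF
/* 1. virtqueue addresses */
#define MY_VHOST_SET_VRING_ADDR _IOW(VHOST_VIRTIO, 0x11, struct vring_addr_cfg)
/* 2. guest memory layout */
#define MY_VHOST_SET_MEM_TABLE  _IOW(VHOST_VIRTIO, 0x03, struct mem_cfg)
/* 3. ioeventfd: guest -> host "data available" kick */
#define MY_VHOST_SET_VRING_KICK _IOW(VHOST_VIRTIO, 0x20, struct vring_file_cfg)
/* 4. irqfd: host -> guest "data processed" interrupt */
#define MY_VHOST_SET_VRING_CALL _IOW(VHOST_VIRTIO, 0x21, struct vring_file_cfg)

/* The command word itself encodes the size of the payload struct: */
unsigned addr_payload_size(void) { return _IOC_SIZE(MY_VHOST_SET_VRING_ADDR); }
unsigned file_payload_size(void) { return _IOC_SIZE(MY_VHOST_SET_VRING_CALL); }
```

The kernel uses the size encoded in the command word to know how many bytes to copy from user space, which is why each command word is tied to one struct.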
  • qemu defines a data structure for each of the four kinds of information above, as follows:


  • vhost_vring_addr carries the addresses of one virtqueue. It packs the queue index, the descriptor table start address, the used ring address and the available ring address, and is sent to vhost-net with the VHOST_SET_VRING_ADDR command word:
struct vhost_vring_addr {
	unsigned int index;			/* Queue index */
	/* Option flags. */
	unsigned int flags;
	/* Start of array of descriptors (virtually contiguous) */
	uint64_t desc_user_addr;	/* Descriptor table start address (HVA) */
	/* Used structure address. Must be 32 bit aligned */
	uint64_t used_user_addr;	/* Used ring address (HVA) */
	/* Available structure address. Must be 16 bit aligned */
	uint64_t avail_user_addr;	/* Available ring address (HVA) */
	/* Logging support: base address for dirty page logging of the used ring */
	uint64_t log_guest_addr;
};


  • vhost_memory describes the memory layout of the virtual machine. We know that qemu uses MemoryRegion to describe virtual machine memory and MemoryRegionSection to describe a slice of a MemoryRegion; vhost_memory is filled from these two structures. Each vhost_memory_region member of vhost_memory corresponds to one MemoryRegionSection. Once the memory layout has been fully collected, the master sends it to vhost-net with the VHOST_SET_MEM_TABLE command word. vhost_memory serves the same purpose as the kvm_userspace_memory_region structure that qemu uses when registering memory with the kernel for kvm: both are temporary structures passed to the kernel.
struct vhost_memory_region {
    uint64_t guest_phys_addr;				/* Starting guest physical address (GPA) of the memory segment */
    uint64_t memory_size;					/* Memory segment length in bytes */
    uint64_t userspace_addr;				/* Starting host user virtual address (HVA) of the memory segment */
    uint64_t flags_padding;					/* No flags are currently specified. */
};

struct vhost_memory {
    uint32_t nregions;						/* Number of memory segments */
    uint32_t padding;
    struct vhost_memory_region regions[0];	/* Memory segment array (flexible array member) */
};
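As a sketch of how the slave can use this table, the helpers below (hypothetical names, simplified local structs, not kernel code) allocate a region table via the flexible array member and translate a guest physical address found in a vring descriptor into a host virtual address:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct mem_region { uint64_t guest_phys_addr, memory_size, userspace_addr; };
struct mem_table  { uint32_t nregions; struct mem_region regions[]; };

/* One allocation covers the header plus n regions (flexible array member) */
struct mem_table *table_alloc(uint32_t n) {
    struct mem_table *t = calloc(1, sizeof(*t) + n * sizeof(struct mem_region));
    t->nregions = n;
    return t;
}

/* Return the HVA for a GPA, or 0 if it falls outside every region. */
uint64_t gpa_to_hva(const struct mem_table *t, uint64_t gpa) {
    for (uint32_t i = 0; i < t->nregions; i++) {
        const struct mem_region *r = &t->regions[i];
        if (gpa >= r->guest_phys_addr &&
            gpa <  r->guest_phys_addr + r->memory_size)
            return r->userspace_addr + (gpa - r->guest_phys_addr);
    }
    return 0;
}
```

This linear scan is the essence of what vhost-net must do for every buffer address it reads off the virtqueue, which is why the full memory layout must be pushed down before the queues are started.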


  • vhost_vring_file carries an fd. Each virtio queue is associated with one ioeventfd and one irqfd; the fd is wrapped in this structure and passed to vhost-net:
struct vhost_vring_file {
	unsigned int index;		/* virtqueue queue index */
	/* File descriptor.
	 * Pass -1 to unbind from the file. */
	int fd;
};


  • qemu supports several kinds of back ends for a network device, such as a NIC, a tap device, vhost-user, etc. Netdev describes a network device in qemu and is defined in the net.json file as follows:
{ 'union': 'Netdev',
  'base': { 'id': 'str', 'type': 'NetClientDriver' },
  'discriminator': 'type',
  'data': {
    'nic':      'NetLegacyNicOptions',
    'user':     'NetdevUserOptions',
    'tap':      'NetdevTapOptions',
    'vhost-user': 'NetdevVhostUserOptions',
    'vhost-vdpa': 'NetdevVhostVDPAOptions' } }
  • The corresponding data structure is automatically generated during qemu compilation:
struct Netdev {
    char *id;
    NetClientDriver type;
    union { /* union tag is @type */
        NetLegacyNicOptions nic;
        NetdevUserOptions user;
        NetdevTapOptions tap;			/* tap Device as the back-end driver of network device */
        NetdevVhostUserOptions vhost_user;
        NetdevVhostVDPAOptions vhost_vdpa;
    } u;
};
  • When netdev specifies tap as the driver, the following options can be set:
{ 'struct': 'NetdevTapOptions',
  'data': {
    '*ifname':     'str',
    '*fd':         'str',
    '*vhost':      'bool',			/* Enable vhost feature */
    '*vhostfd':    'str',
   ......} }
  • The corresponding data structure is automatically generated during qemu compilation:
struct NetdevTapOptions {
    bool has_ifname;
    char *ifname;
    bool has_fd;
    char *fd;
    bool has_vhost;
    bool vhost;					/* Enable vhost feature */	
    bool has_vhostfd;
    char *vhostfd;
    ......
};
  • Look again at the qemu command line: tap is the driver, the vhost feature is enabled, and both fd and vhostfd are specified. libvirt obtains fd by opening the /dev/net/tun character device and passes it to qemu; it obtains vhostfd by opening the /dev/vhost-net character device and passes that to qemu.
-netdev tap,fd=37,id=hostnet0,vhost=on,vhostfd=38 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=24:42:54:20:50:46,bus=pci.0,addr=0x7

workflow

  • Before analyzing device initialization, here is how the tap device command line parameters passed in by libvirt parse out:
has_fd = true
fd = "37"
has_vhost = true
vhost = true
has_vhostfd = true
vhostfd = "38"

netdev initialization

  • At startup the qemu main process parses -netdev. When the tap driver is specified for sending and receiving network packets, libvirt has already created the tap device in advance; qemu only needs to take over the tap device fd passed in from libvirt, much like fds are handed over from libvirt during migration. After parameter parsing, network device initialization is driven from the qemu_create_late_backends entry:
			qemu_opts_foreach(qemu_find_opts("netdev"),  net_init_netdev, NULL, errp)     
							net_client_init_fun[netdev->type](netdev, netdev->id, peer, errp)	<=>	net_init_tap
  • net_init_tap processes the qemu command line parameters as follows:
	if (tap->has_fd) {
        fd = monitor_fd_param(cur_mon, tap->fd, &err);		/* Get the tap device fd from libvirt */
        qemu_set_nonblock(fd);								/* Set tap fd non blocking */
        vnet_hdr = tap_probe_vnet_hdr(fd);				 	/* Check whether the kernel tap device supports IFF_VNET_HDR characteristics */	
        net_init_tap_one(tap, peer, "tap", name, NULL,
                         script, downscript,
                         vhostfdname, vnet_hdr, fd, &err);
  • What remains is to enable the vhost feature and complete tap device initialization. Continuing into net_init_tap_one:
	if (tap->has_vhost ? tap->vhost :
        vhostfdname || (tap->has_vhostforce && tap->vhostforce)) {		/* If tap device vhost is enabled*/
        options.backend_type = VHOST_BACKEND_TYPE_KERNEL;				/* Set the vhost backend to kernel */
       	if (vhostfdname) {												/* If libvirt passes in vhostfd, the vhostfdname here is not empty. This is our case */
            vhostfd = monitor_fd_param(cur_mon, vhostfdname, &err);		/* Take out the vhostfd passed by libvirt */
            qemu_set_nonblock(vhostfd);									/* Set vhostfd to non blocking */
        } else {														/* If libvirt does not pass in vhostfd, qemu opens the character device to get vhostfd */
            vhostfd = open("/dev/vhost-net", O_RDWR);					/* As you can see from here, vhostfd is obtained by opening the / dev / Vhost net character device */
  • At this point tap device initialization and vhost enablement are basically complete. The process is straightforward: qemu simply takes the fds passed in from libvirt according to the parameters and initializes the relevant data structures. With vhostfd in hand, qemu goes on to configure vhost, which involves the vhost protocol.
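The fd-or-open fallback in net_init_tap_one boils down to a few lines. A sketch with a hypothetical helper, acquire_vhostfd (qemu uses monitor_fd_param to resolve the fd string):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>

/* If libvirt passed a vhostfd number, use it; otherwise open the
 * /dev/vhost-net character device directly, as qemu does. */
int acquire_vhostfd(const char *vhostfdname) {
    if (vhostfdname)
        return atoi(vhostfdname);          /* qemu: monitor_fd_param() */
    return open("/dev/vhost-net", O_RDWR); /* qemu opens the device itself */
}
```

In the command line analyzed above, libvirt passes vhostfd=38, so the first branch is taken and qemu never opens /dev/vhost-net itself.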

vhost api registration

  • After obtaining vhostfd, qemu configures vhost. The core step is registering the vhost api: with the kernel as the vhost back end, kernel_ops is registered as the vhost api. This is implemented in vhost_net_init, as follows:
				dev->vhost_ops = &kernel_ops
			for (i = 0; i < hdev->nvqs; ++i, ++n_initialized_vqs) {
        		vhost_virtqueue_init(hdev, hdev->vqs + i, hdev->vq_index + i);

Configuration information transmission


  • After the vhost api is registered, an eventfd is created for each virtio queue in turn and used as the irqfd. In the traditional flow, whenever qemu needs to notify the virtual machine that data on the virtio queue has been processed, it writes 1 to the write side of the irqfd to notify the kernel; kvm at the other end reads the 1 from the read side of the irqfd and directly injects an interrupt into the virtual machine. With the vhost feature enabled, qemu no longer processes the virtio queue, so the job of notifying kvm must be handed over to vhost-net in the kernel. Therefore, after creating the eventfd, qemu passes its read fd to vhost-net; vhost-net looks up the kernel eventfd_ctx for that fd and uses it to have kvm inject the interrupt, and qemu no longer participates in virtio queue processing. qemu passes the irqfd with the VHOST_SET_VRING_CALL command word. This happens in vhost_virtqueue_init, as follows:
	event_notifier_init(&vq->masked_notifier, 0)						/* Create irqfd */
		file.fd = event_notifier_get_fd(&vq->masked_notifier)			/* Take out the rfd of irqfd and pass it to the kernel */
			dev->vhost_ops->vhost_set_vring_call(dev, &file)	<=>	vhost_kernel_set_vring_call
				vhost_kernel_call(dev, VHOST_SET_VRING_CALL, file)		/* Via VHOST_SET_VRING_CALL command word passing fd */
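The irqfd passed here is an ordinary eventfd, i.e. a kernel-side counter: the writer bumps it, and the reader sees the accumulated count and resets it, which is exactly how vhost-net's signal reaches kvm through the eventfd_ctx. A minimal demonstration of that signaling, assuming Linux:

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>   /* Linux-specific */

uint64_t eventfd_roundtrip(void) {
    int efd = eventfd(0, EFD_NONBLOCK);
    uint64_t v = 1, got = 0;
    (void)!write(efd, &v, sizeof(v));   /* vhost-net: "an interrupt is pending" */
    (void)!write(efd, &v, sizeof(v));   /* repeated signals coalesce into one counter */
    (void)!read(efd, &got, sizeof(got)); /* reader: counter is read and reset */
    close(efd);
    return got;                          /* both signals observed in one read */
}
```

The coalescing is why an eventfd is a good interrupt-style primitive: multiple pending notifications collapse into a single wakeup of the consumer.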

virtqueue addr

  • A virtio queue consists of three elements: the descriptor table, the available ring and the used ring. After the virtual machine creates these three elements, it writes their addresses (GPAs) into the queue_desc, queue_avail and queue_used fields of the VIRTIO_PCI_CAP_COMMON_CFG space to notify qemu. qemu takes the addresses, translates them into HVAs, and creates and maintains the corresponding descriptor table, available ring and used ring data structures on the host side, realizing virtio queue sharing. That is the traditional virtio queue initialization flow. With the vhost feature enabled, in addition to the work above, when the guest loads the virtio device driver it writes the device status into the device_status field of VIRTIO_PCI_CAP_COMMON_CFG (first ACKNOWLEDGE, then DRIVER, finally DRIVER_OK). The back-end qemu sets the virtio device status accordingly and, for a net device with vhost enabled, starts the vhost-net device. One of the core steps there is passing the virtio queue addresses to the kernel. The flow is as follows:
			k->set_status(vdev, val)	<=>	virtio_net_set_status
							for (i = 0; i < nvhosts; i++) {
								vhost_net_start_one(get_vhost_net(peer), dev);
								vhost_set_vring_enable(peer, peer->vring_enable);
									hdev->vhost_ops->vhost_set_mem_table(hdev, hdev->mem)	<=> vhost_kernel_set_mem_table
  • vhost_virtqueue_start collects the virtqueue information maintained by qemu and stores it in the vhost_virtqueue structure of vhost_dev. Whenever virtqueue-related information needs to be transferred, it is taken straight from that structure; the virtqueue address structure vhost_vring_addr passed to the kernel is assembled the same way, from the vhost_virtqueue. The flow is as follows:
	vq->num = state.num = virtio_queue_get_num(vdev, idx);
	vq->desc_size = s = l = virtio_queue_get_desc_size(vdev, idx);
	a = virtio_queue_get_desc_addr(vdev, idx);
	vq->desc_phys = a;
	vq->desc = vhost_memory_map(dev, a, &l, false);
	vq->avail_size = s = l = virtio_queue_get_avail_size(vdev, idx);
    vq->avail_phys = a = virtio_queue_get_avail_addr(vdev, idx);
    vq->avail = vhost_memory_map(dev, a, &l, false);
   	vq->used_size = s = l = virtio_queue_get_used_size(vdev, idx);
    vq->used_phys = a = virtio_queue_get_used_addr(vdev, idx);
    vq->used = vhost_memory_map(dev, a, &l, true);
    vhost_virtqueue_set_addr(dev, vq, vhost_vq_index, dev->log_enabled);
    	dev->vhost_ops->vhost_set_vring_addr(dev, &addr)	<=>	vhost_kernel_set_vring_addr
    		vhost_kernel_call(dev, VHOST_SET_VRING_ADDR, addr)
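The assembly step above can be sketched as follows (pack_vring_addr is a hypothetical helper; qemu's vhost_virtqueue_set_addr does the equivalent): the three HVAs obtained from vhost_memory_map are packed into a vhost_vring_addr-shaped struct before the VHOST_SET_VRING_ADDR ioctl is issued:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified local stand-ins, not qemu/kernel definitions */
struct vring_addr { unsigned int index, flags;
                    uint64_t desc_user_addr, used_user_addr, avail_user_addr; };
struct vq_state   { void *desc, *avail, *used; };  /* HVAs from vhost_memory_map() */

struct vring_addr pack_vring_addr(const struct vq_state *vq, unsigned int idx) {
    struct vring_addr a = {
        .index           = idx,
        .desc_user_addr  = (uint64_t)(uintptr_t)vq->desc,
        .avail_user_addr = (uint64_t)(uintptr_t)vq->avail,
        .used_user_addr  = (uint64_t)(uintptr_t)vq->used,
    };
    return a;   /* qemu then issues ioctl(vhostfd, VHOST_SET_VRING_ADDR, &a) */
}
```

Note the pointers are host virtual addresses: by the time this struct reaches the kernel, the GPA-to-HVA translation has already been done by qemu, and vhost-net accesses the rings through the qemu process's address space.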

guest memory layout

  • While starting vhost-net, before passing the virtio queue addresses, qemu performs another step: passing the memory layout of the virtual machine, also completed in vhost_dev_start:
	hdev->vhost_ops->vhost_set_mem_table(hdev, hdev->mem)	<=> vhost_kernel_set_mem_table
		vhost_kernel_call(dev, VHOST_SET_MEM_TABLE, mem)
  • This step is very simple: the vhost_memory information in vhost_dev is passed to the kernel, and this information is the guest memory layout. Let's see how this information is collected. The flow is as follows:
  • TODO
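Pending that analysis, here is a simplified model (hypothetical names, not qemu code) of what the collection produces: each RAM MemoryRegionSection contributes one (GPA, size, HVA) region, and adjacent sections that are contiguous in both address spaces can be merged into a single region, as qemu's vhost memory listener does:

```c
#include <assert.h>
#include <stdint.h>

/* One RAM section as a memory listener would see it */
struct ram_section { uint64_t gpa, size, hva; };

struct layout { uint32_t nregions; struct ram_section regions[16]; };

/* Append a section, merging with the previous region when it is
 * contiguous in both guest-physical and host-virtual space. */
void layout_add(struct layout *l, struct ram_section s) {
    if (l->nregions > 0) {
        struct ram_section *p = &l->regions[l->nregions - 1];
        if (p->gpa + p->size == s.gpa && p->hva + p->size == s.hva) {
            p->size += s.size;   /* extend the previous region */
            return;
        }
    }
    l->regions[l->nregions++] = s;
}
```

The resulting region array is what gets copied field-for-field into vhost_memory_region entries before the VHOST_SET_MEM_TABLE ioctl.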

Posted by lazersam on Thu, 02 Dec 2021 22:48:28 -0800