OVS DPDK VXLAN Tunnel Treatment

Before looking at how OVS implements VXLAN, let's review how a traditional VTEP device handles VXLAN packets, as shown in the following figure:

After a VXLAN packet enters a switch port, the VXLAN tunnel is terminated according to the packet's header information. After termination, the underlay information is mapped to the overlay to obtain the overlay bridge domain (bd) and vrf. In the figure above, once the tunnel is terminated the overlay packet enters br10 from vxlan10, and br10 is bound to a bdif. br10 performs FDB forwarding within the same subnet. If the overlay packet's destination MAC is the MAC of bdif, the packet enters the corresponding vrf through bdif for layer-3 routing. This is how a VTEP handles a received VXLAN packet.

In the other direction, if the destination bd after overlay routing is br10, the overlay packet enters br10 from bdif and, after the FDB lookup, is output on vxlan10. The vxlan10 interface is responsible for building the VXLAN encapsulation. Once encapsulated, the packet goes through underlay routing and is forwarded out of the VTEP.

Elements of VTEP

tunnel-terminate table

The tunnel termination table is used to strip the underlay header of the vxlan message.

When a VXLAN packet enters a VTEP, the tunnel must be terminated. VXLAN is a P2MP (point-to-multipoint) tunnel, so termination only needs to verify that the destination IP is a local IP (and that the destination MAC is the local MAC). Before terminating, the device must of course also determine whether the packet is a VXLAN packet at all. There are generally two approaches to tunnel termination:

  • Like the Linux kernel: every packet, VXLAN or not, is processed as a regular packet. Because the destination IP of a VXLAN packet is local, the packet is delivered to the local UDP layer. There, based on destination port 4789, it is handed to the vxlan port, which strips the encapsulation; the overlay packet then enters the protocol stack from the vxlan port for a second pass.
  • Like traditional hardware vendors: the parser stage extracts both the inner and outer headers of the whole packet. If it is a VXLAN packet, it goes directly to the tunnel-terminate table for termination.
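
The checks described above can be sketched in C as follows. This is a minimal illustration, not code from any kernel or vendor: the names outer_hdr, vtep_local and vxlan_should_terminate are hypothetical, the outer headers are assumed to be already parsed into host byte order, and only IPv4 is covered.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VXLAN_UDP_PORT  4789  /* IANA-assigned VXLAN destination port */
#define IP_PROTO_UDP    17    /* IP protocol number of UDP */

/* Hypothetical, already-parsed outer header fields (host byte order). */
struct outer_hdr {
    uint8_t  dst_mac[6];
    uint32_t dst_ip;
    uint8_t  ip_proto;
    uint16_t udp_dst_port;
};

/* Hypothetical local identity of the VTEP. */
struct vtep_local {
    uint8_t  mac[6];
    uint32_t ip;
};

/* Returns true if the packet is addressed to this VTEP and looks like
 * VXLAN, i.e. the tunnel should be terminated here. */
bool
vxlan_should_terminate(const struct outer_hdr *h, const struct vtep_local *me)
{
    return memcmp(h->dst_mac, me->mac, sizeof me->mac) == 0
           && h->dst_ip == me->ip
           && h->ip_proto == IP_PROTO_UDP
           && h->udp_dst_port == VXLAN_UDP_PORT;
}
```

Note that the source IP is not checked at all, which is what makes the tunnel point-to-multipoint.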

tunnel-decap-map table

The tunnel decapsulation mapping table determines the layer-2 broadcast domain and layer-3 routing domain of the overlay packet, namely its bd and vrf.

In a VXLAN tunnel, the bd and vrf can be mapped from the vni. The bd is used for FDB forwarding of overlay packets within the same subnet, and the vrf for routing them between subnets.
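
As a rough sketch, the mapping can be thought of as a table keyed by vni. The table contents, the struct decap_map_entry type and the decap_map_lookup function below are made up for illustration; a real device would use a hash table or hardware table rather than a linear scan.

```c
#include <stddef.h>
#include <stdint.h>

/* One hypothetical decap-map entry: vni -> (bridge domain, vrf). */
struct decap_map_entry {
    uint32_t vni;
    uint16_t bd;
    uint16_t vrf;
};

/* Example contents only; these values are not from any real configuration. */
static const struct decap_map_entry decap_map[] = {
    { 10, 10, 1 },
    { 20, 20, 1 },
    { 30, 30, 2 },
};

/* Returns the entry for 'vni', or NULL if the vni is not mapped. */
const struct decap_map_entry *
decap_map_lookup(uint32_t vni)
{
    size_t i;

    for (i = 0; i < sizeof decap_map / sizeof decap_map[0]; i++) {
        if (decap_map[i].vni == vni) {
            return &decap_map[i];
        }
    }
    return NULL;
}
```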

tunnel-encap table

The tunnel encapsulation table is responsible for VXLAN encapsulation of packets that have been FDB-forwarded or routed. For a same-subnet packet, the vni, underlay source IP and underlay destination IP must be determined. Generally, the vni does not change for same-subnet forwarding. Cross-subnet forwarding requires routing, which determines overlay-smac and overlay-dmac. How vni, underlay-sip and underlay-dip are determined differs considerably between forwarding models. In the traditional model, overlay-route is only responsible for routing, and the link-layer encapsulation happens after routing: the packet is output from a bdif, the bdif is attached to a bridge, and the bridge's vni determines the packet's vni. Some vendors instead determine vni, underlay-sip and underlay-dip directly from the route; see the SAI interface design in SONiC for an example.
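
Whichever model chooses the vni, the value ends up in the 8-byte VXLAN header. The sketch below writes that header following the RFC 7348 layout (flags byte 0x08, 24-bit vni in bytes 4 to 6, all reserved bits zero). The function name vxlan_build_header is hypothetical, and the outer Ethernet, IP and UDP headers that carry underlay-sip and underlay-dip are left out.

```c
#include <stdint.h>

#define VXLAN_HDR_LEN        8
#define VXLAN_FLAG_VNI_VALID 0x08  /* "I" flag: the VNI field is valid */

/* Writes an 8-byte VXLAN header for 'vni' into 'buf'.
 * Only the low 24 bits of 'vni' are used. */
void
vxlan_build_header(uint8_t buf[VXLAN_HDR_LEN], uint32_t vni)
{
    buf[0] = VXLAN_FLAG_VNI_VALID;  /* flags */
    buf[1] = 0;                     /* reserved */
    buf[2] = 0;
    buf[3] = 0;
    buf[4] = (vni >> 16) & 0xff;    /* VNI, most significant byte first */
    buf[5] = (vni >> 8) & 0xff;
    buf[6] = vni & 0xff;
    buf[7] = 0;                     /* reserved */
}
```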

underlay route table

After the overlay packet is encapsulated, it is routed and forwarded in the underlay, so an underlay route table is needed. Traditional network devices attach an underlay rif to the vxlan tunnel, which specifies the underlay vrf.

underlay neighbor

After underlay routing, a neighbor entry is needed to build the underlay link-layer header, so an underlay neighbor table is required.

OVS DPDK VXLAN

For OVS to act as a VTEP, it must implement the elements above.

Main data structures

struct tnl_match {
    ovs_be64 in_key;            // vni
    struct in6_addr ipv6_src;   // Source IP
    struct in6_addr ipv6_dst;   // Destination IP
    odp_port_t odp_port;        // Corresponding datapath port number
    bool in_key_flow;           // false: the vni is configured on the port and must match exactly;
                                // true: the vni is handled by the flow table.
    bool ip_src_flow;           // false: the source IP must match the configured one (a configured source IP
                                // of 0 wildcards all source IPs); true: the source IP is handled by the
                                // OpenFlow flow table.
    bool ip_dst_flow;           // false: the destination IP must match the configured one;
                                // true: the destination IP is set by the OpenFlow flow table.
};

struct tnl_port {
    struct hmap_node ofport_node;
    struct hmap_node match_node;

    const struct ofport_dpif *ofport;
    uint64_t change_seq;
    struct netdev *netdev;

    struct tnl_match match;//Tunnel match fields; they uniquely identify a tunnel
};

// OVS keeps vxlan ports in hash tables:
/* Each hmap contains "struct tnl_port"s.
 * The index is a combination of how each of the fields listed under "Tunnel
 * matches" above matches, see the final paragraph for ordering.
 * This is the vxlan port mapping table. The three parameters in_key_flow,
 * ip_dst_flow and ip_src give 12 priority levels.
 */
static struct hmap *tnl_match_maps[N_MATCH_TYPES] OVS_GUARDED_BY(rwlock);

Detailed description

struct tnl_match is the core structure of a vxlan port. in_key, ipv6_src and ipv6_dst are the core fields of the VXLAN encapsulation, the same fields a vxlan interface on a traditional VTEP must have. They are used for both tunnel termination and tunnel encapsulation. In OVS, however, the designers added three important flags:

  • in_key_flow: two values. false means the tunnel vni is fixed when the vxlan port is created; that value is matched on receive, and packets leaving through the port are encapsulated with it. Ports with false therefore have higher priority. true means the vni is not matched on receive; the flow table handles the vni, and on encapsulation the vni is set by the flow table.
  • ip_src_flow: together with the configured address this gives three cases. If true, the local IP is handled by the flow table. If false, there are two cases: when a source IP is configured, the destination IP of a received vxlan packet must match it, and it is used as the tunnel source IP for packets sent out of the vxlan port; when the configured source IP is 0, any local IP is accepted (wildcard).
  • ip_dst_flow: two values. false means the source IP of a received packet must match the remote IP configured on the port, and that IP is used as the tunnel destination IP when a packet is output from the port. true means the remote IP is handled entirely by the flow table.

vxlan port creation example:

admin@ubuntu:$ sudo ovs-vsctl add-port br0 vxlan1 -- set interface vxlan1 type=vxlan options:remote_ip=flow options:key=flow options:dst_port=8472 options:local_ip=flow
admin@ubuntu:$ sudo ovs-vsctl add-port br0 vxlan2 -- set interface vxlan2 type=vxlan options:remote_ip=flow options:key=flow options:dst_port=8472
admin@ubuntu:/var/log/openvswitch$ sudo ovs-vsctl add-port br0 vxlan13 -- set interface vxlan13 type=vxlan options:remote_ip=flow options:key=191 options:dst_port=8472

OVS is an SDN switch whose core operation is the OpenFlow flow table. These three flags were introduced precisely to remove the limitations of traditional vxlan devices. Depending on their values, the three elements give 2*2*3=12 combinations, which is the size of the tnl_match_maps array.
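
To make the 12 combinations concrete, the sketch below computes the array index for a given combination, using the same expression that tnl_match_map() uses in the OVS source quoted in this article. The helper name match_map_index is hypothetical, and the enum is assumed to be ordered IP_SRC_CFG, IP_SRC_ANY, IP_SRC_FLOW. A lower index is searched first and therefore has higher priority.

```c
#include <stdbool.h>

/* Assumed ordering: configured address, any address, handled by flow. */
enum ip_src_type {
    IP_SRC_CFG,
    IP_SRC_ANY,
    IP_SRC_FLOW
};

#define N_MATCH_TYPES (2 * 2 * 3)

/* Returns the index into the 12-entry map array for one combination of
 * the three match criteria. in_key_flow has the largest weight, then
 * ip_dst_flow, then ip_src. */
int
match_map_index(bool in_key_flow, bool ip_dst_flow, enum ip_src_type ip_src)
{
    return 6 * in_key_flow + 3 * ip_dst_flow + ip_src;
}
```

Under these assumptions, the three example ports above land in different maps: vxlan13 (key=191, remote_ip=flow, no local_ip) gets index 4, vxlan2 (key=flow, remote_ip=flow, no local_ip) gets index 10, and vxlan1 (everything set to flow) gets index 11, so vxlan13 is tried first on receive.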

Tunnel Port Search Process

/* Returns a pointer to the 'tnl_match_maps' element corresponding to 'm''s
 * matching criteria.
 * The priority, i.e. the map index, is determined by the three flags and the
 * configuration. This function is called when a vxlan interface is added and
 * decides which map the interface goes into.
 */
static struct hmap **
tnl_match_map(const struct tnl_match *m)
{
    enum ip_src_type ip_src;

    ip_src = (m->ip_src_flow ? IP_SRC_FLOW
              : ipv6_addr_is_set(&m->ipv6_src) ? IP_SRC_CFG
              : IP_SRC_ANY);

    return &tnl_match_maps[6 * m->in_key_flow + 3 * m->ip_dst_flow + ip_src];
}

/* Returns the tnl_port that is the best match for the tunnel data in 'flow',
 * or NULL if no tnl_port matches 'flow'.
 * During receive processing, the matching vxlan port is looked up from the
 * packet's tunnel metadata.
 */
static struct tnl_port *
tnl_find(const struct flow *flow) OVS_REQ_RDLOCK(rwlock)
{
    enum ip_src_type ip_src;
    int in_key_flow;
    int ip_dst_flow;
    int i;

    i = 0;
    for (in_key_flow = 0; in_key_flow < 2; in_key_flow++) {//in_key_flow has the highest weight; 0 is searched before 1, so it has higher priority.
        for (ip_dst_flow = 0; ip_dst_flow < 2; ip_dst_flow++) {//ip_dst_flow comes second; it is checked against the source IP of the received vxlan packet.
            for (ip_src = 0; ip_src < 3; ip_src++) {//ip_src has the lowest weight; for its possible values see enum ip_src_type (IP_SRC_CFG etc.).
                struct hmap *map = tnl_match_maps[i];

                if (map) {
                    struct tnl_port *tnl_port;
                    struct tnl_match match;

                    memset(&match, 0, sizeof match);

                    /* The apparent mix-up of 'ip_dst' and 'ip_src' below is
                     * correct, because "struct tnl_match" is expressed in
                     * terms of packets being sent out, but we are using it
                     * here as a description of how to treat received
                     * packets. 
                     * When in_key_flow is true, the vni does not need to match.
                     */
                    match.in_key = in_key_flow ? 0 : flow->tunnel.tun_id;
                    if (ip_src == IP_SRC_CFG) {
                        match.ipv6_src = flow_tnl_dst(&flow->tunnel);
                    }
                    if (!ip_dst_flow) {/* Remote IP is configured: compare it with the packet's source IP */
                        match.ipv6_dst = flow_tnl_src(&flow->tunnel);
                    }
                    match.odp_port = flow->in_port.odp_port;
                    match.in_key_flow = in_key_flow;
                    match.ip_dst_flow = ip_dst_flow;
                    match.ip_src_flow = ip_src == IP_SRC_FLOW;
                    //Exact-match lookup
                    tnl_port = tnl_find_exact(&match, map);
                    if (tnl_port) {
                        return tnl_port;
                    }
                }

                i++;
            }
        }
    }

    return NULL;
}

advantage

By adding these three flags, OVS greatly simplifies vxlan port configuration: a single global vxlan port is enough for an application. All other parameters are handled by flow tables, which plays to the strengths of SDN and scales to large deployments.

terminate table

The OVS tunnel termination table is built when a vxlan port is created.

//Global tunnel initialization. OVS uses a classifier, cls, as the tunnel termination table. The global addr_list holds all local underlay IP addresses.
//An underlay IP serves as the destination address for tunnel termination and as the source IP address for tunnel encapsulation.
//port_list holds the tunnels that use transport-layer ports, such as vxlan tunnels.
void
tnl_port_map_init(void)
{
    classifier_init(&cls, flow_segment_u64s);//Tunnel Termination Table
    ovs_list_init(&addr_list);//underlay ip list
    ovs_list_init(&port_list);//tnl_port control block list
    unixctl_command_register("tnl/ports/show", "-v", 0, 1, tnl_port_show, NULL);
}

Tunnel Port Addition

/* Adds 'ofport' to the module with datapath port number 'odp_port'. 'ofport's
 * must be added before they can be used by the module. 'ofport' must be a
 * tunnel.
 *
 * Returns 0 if successful, otherwise a positive errno value. 
 * native_tnl indicates whether native (userspace) tunneling is enabled.
 */
int
tnl_port_add(const struct ofport_dpif *ofport, const struct netdev *netdev,
             odp_port_t odp_port, bool native_tnl, const char name[]) OVS_EXCLUDED(rwlock)
{
    bool ok;

    fat_rwlock_wrlock(&rwlock);
    ok = tnl_port_add__(ofport, netdev, odp_port, true, native_tnl, name);
    fat_rwlock_unlock(&rwlock);

    return ok ? 0 : EEXIST;
}

//Adding tunnel ports
static bool
tnl_port_add__(const struct ofport_dpif *ofport, const struct netdev *netdev,
               odp_port_t odp_port, bool warn, bool native_tnl, const char name[])
    OVS_REQ_WRLOCK(rwlock)
{
    const struct netdev_tunnel_config *cfg;
    struct tnl_port *existing_port;
    struct tnl_port *tnl_port;
    struct hmap **map;

    cfg = netdev_get_tunnel_config(netdev);
    ovs_assert(cfg);

    tnl_port = xzalloc(sizeof *tnl_port);
    tnl_port->ofport = ofport;
    tnl_port->netdev = netdev_ref(netdev);
    tnl_port->change_seq = netdev_get_change_seq(tnl_port->netdev);
    //These match fields do not affect tunnel termination.
    tnl_port->match.in_key = cfg->in_key;
    tnl_port->match.ipv6_src = cfg->ipv6_src;
    tnl_port->match.ipv6_dst = cfg->ipv6_dst;
    tnl_port->match.ip_src_flow = cfg->ip_src_flow;
    tnl_port->match.ip_dst_flow = cfg->ip_dst_flow;
    tnl_port->match.in_key_flow = cfg->in_key_flow;
    tnl_port->match.odp_port = odp_port;
    //Find the map for this tunnel according to its match criteria
    map = tnl_match_map(&tnl_port->match);
    //Check whether a port with the same match already exists
    existing_port = tnl_find_exact(&tnl_port->match, *map);
    if (existing_port) {
        if (warn) {
            struct ds ds = DS_EMPTY_INITIALIZER;
            tnl_match_fmt(&tnl_port->match, &ds);
            VLOG_WARN("%s: attempting to add tunnel port with same config as "
                      "port '%s' (%s)", tnl_port_get_name(tnl_port),
                      tnl_port_get_name(existing_port), ds_cstr(&ds));
            ds_destroy(&ds);
        }
        netdev_close(tnl_port->netdev);
        free(tnl_port);
        return false;
    }

    hmap_insert(ofport_map, &tnl_port->ofport_node, hash_pointer(ofport, 0));

    if (!*map) {
        *map = xmalloc(sizeof **map);
        hmap_init(*map);
    }
    hmap_insert(*map, &tnl_port->match_node, tnl_hash(&tnl_port->match));
    tnl_port_mod_log(tnl_port, "adding");

    if (native_tnl) {//If native tunneling is supported, build the tunnel termination table. Generally this is enabled in DPDK (userspace) mode and not in kernel mode.
        const char *type;

        type = netdev_get_type(netdev);
        tnl_port_map_insert(odp_port, cfg->dst_port, name, type);

    }
    return true;
}

//For tunnels that use a transport layer, register the destination port
void
tnl_port_map_insert(odp_port_t port, ovs_be16 tp_port,
                    const char dev_name[], const char type[])
{
    struct tnl_port *p;
    struct ip_device *ip_dev;
    uint8_t nw_proto;

    nw_proto = tnl_type_to_nw_proto(type);
    if (!nw_proto) {//No transport layer needed: return directly
        return;
    }

    //Add the tunnel port to the list. The author considers the check below a bug: rather than comparing
    //tp_port == p->tp_port, it should compare ports, i.e. (p->port == port && p->nw_proto == nw_proto).
    ovs_mutex_lock(&mutex);
    LIST_FOR_EACH(p, node, &port_list) {
        if (tp_port == p->tp_port && p->nw_proto == nw_proto) {
             goto out;
        }
    }

    p = xzalloc(sizeof *p);
    p->port = port;
    p->tp_port = tp_port;
    p->nw_proto = nw_proto;
    ovs_strlcpy(p->dev_name, dev_name, sizeof p->dev_name);
    ovs_list_insert(&port_list, &p->node);
    //Walk the IP addresses of every local device and build a tunnel termination entry for each address.
    //Termination matches on the transport-layer protocol, the transport-layer destination port and the local IP (the destination IP of the vxlan packet).
    LIST_FOR_EACH(ip_dev, node, &addr_list) {
        map_insert_ipdev__(ip_dev, p->dev_name, p->port, p->nw_proto, p->tp_port);
    }

out:
    ovs_mutex_unlock(&mutex);
}
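//ip_dev: local device that owns the addresses
//dev_name: name of the tunnel device
//port: tunnel port number
//nw_proto, tp_port: transport protocol and destination port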
static void
map_insert_ipdev__(struct ip_device *ip_dev, char dev_name[],
                   odp_port_t port, uint8_t nw_proto, ovs_be16 tp_port)
{
    if (ip_dev->n_addr) {//Walk every address of the device
        int i;

        for (i = 0; i < ip_dev->n_addr; i++) {
            //The destination MAC of the message must be ip_dev->mac
            map_insert(port, ip_dev->mac, &ip_dev->addr[i],
                       nw_proto, tp_port, dev_name);
        }
    }
}

Tunnel termination process

We saw above how the tunnel termination table is built; here we look at how OVS performs tunnel termination.
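
As a recap of the construction step, the sketch below models the termination table as one entry per (local address, tunnel port) pair, plus the lookup that termination performs. It is a simplified stand-in, not the OVS classifier: struct term_entry, struct term_table and both functions are hypothetical, addresses are IPv4 only, and the destination MAC that OVS also matches is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 64
#define NO_PORT     UINT32_MAX  /* stands in for "no tunnel port found" */

/* One termination entry: packets to local_ip/nw_proto/tp_port belong to
 * the tunnel port 'odp_port'. */
struct term_entry {
    uint32_t local_ip;
    uint8_t  nw_proto;
    uint16_t tp_port;
    uint32_t odp_port;
};

struct term_table {
    struct term_entry e[MAX_ENTRIES];
    size_t n;
};

/* Adds one entry per local address for a new tunnel port.
 * Returns the number of entries added. */
size_t
term_table_add_port(struct term_table *t, const uint32_t *local_ips,
                    size_t n_ips, uint8_t nw_proto, uint16_t tp_port,
                    uint32_t odp_port)
{
    size_t added = 0;
    size_t i;

    for (i = 0; i < n_ips && t->n < MAX_ENTRIES; i++) {
        struct term_entry *e = &t->e[t->n++];

        e->local_ip = local_ips[i];
        e->nw_proto = nw_proto;
        e->tp_port = tp_port;
        e->odp_port = odp_port;
        added++;
    }
    return added;
}

/* Returns the tunnel port that should terminate a packet with the given
 * outer destination IP, protocol and destination port, or NO_PORT. */
uint32_t
term_table_lookup(const struct term_table *t, uint32_t dst_ip,
                  uint8_t nw_proto, uint16_t tp_dst)
{
    size_t i;

    for (i = 0; i < t->n; i++) {
        const struct term_entry *e = &t->e[i];

        if (e->local_ip == dst_ip && e->nw_proto == nw_proto
            && e->tp_port == tp_dst) {
            return e->odp_port;
        }
    }
    return NO_PORT;
}
```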

By design, OVS parses only one layer of headers at a time: it parses up to the transport layer and cannot see the packet's overlay information. When a vxlan packet arrives at OVS on a dpdk interface, its destination MAC is the MAC of the internal interface and its destination IP is the IP of the internal interface. With normal forwarding rules the packet is forwarded to the internal interface, and tunnel termination is performed when the OUTPUT action is composed.

/* Compose the packet output action */
static void
compose_output_action(struct xlate_ctx *ctx, ofp_port_t ofp_port,
                      const struct xlate_bond_recirc *xr)
{
    /* The STP state must be checked */
    compose_output_action__(ctx, ofp_port, xr, true);
}


/* Compose the packet output action */
static void
compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
                        const struct xlate_bond_recirc *xr, bool check_stp)
{
    const struct xport *xport = get_ofp_port(ctx->xbridge, ofp_port);/* Get xport */
    struct flow_wildcards *wc = ctx->wc;/* Wildcards of the flow */
    struct flow *flow = &ctx->xin->flow;/* Input flow */
    struct flow_tnl flow_tnl;
    ovs_be16 flow_vlan_tci;
    uint32_t flow_pkt_mark;
    uint8_t flow_nw_tos;
    odp_port_t out_port, odp_port;
    bool tnl_push_pop_send = false;
    uint8_t dscp;

    ......
        
    if (out_port != ODPP_NONE) {/* Translate the output */
        xlate_commit_actions(ctx);/* Commit pending actions */

        if (xr) {/* Bond recirculation is required */
            struct ovs_action_hash *act_hash;

            /* Hash action. */
            act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
                                                OVS_ACTION_ATTR_HASH,
                                                sizeof *act_hash);
            act_hash->hash_alg = xr->hash_alg;
            act_hash->hash_basis = xr->hash_basis;

            /* Recirc action. Add the recirculation action and set the recirculation id */
            nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
                           xr->recirc_id);
        } else {

            if (tnl_push_pop_send) {/* Does a tunnel header need to be pushed? */
                build_tunnel_send(ctx, xport, flow, odp_port);
                flow->tunnel = flow_tnl; /* Restore the packet's tunnel metadata */
            } else {
                odp_port_t odp_tnl_port = ODPP_NONE;

                /* XXX: Write better Filter for tunnel port. We can use inport
                * int tunnel-port flow to avoid these checks completely.
                * If the packet is sent to the local port, check whether native
                * tunneling is on and, if so, look up tunnel termination.
                */
                if (ofp_port == OFPP_LOCAL &&
                    ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) {

                    odp_tnl_port = tnl_port_map_lookup(flow, wc);
                }

                if (odp_tnl_port != ODPP_NONE) {
                    nl_msg_put_odp_port(ctx->odp_actions,
                                        OVS_ACTION_ATTR_TUNNEL_POP,
                                        odp_tnl_port);
                } else {
                    /* Tunnel push-pop action is not compatible with
                     * IPFIX action. */
                    compose_ipfix_action(ctx, out_port);

                    /* Handle truncation of the mirrored packet. */
                    if (ctx->mirror_snaplen > 0 &&
                        ctx->mirror_snaplen < UINT16_MAX) {
                        struct ovs_action_trunc *trunc;

                        trunc = nl_msg_put_unspec_uninit(ctx->odp_actions,
                                                         OVS_ACTION_ATTR_TRUNC,
                                                         sizeof *trunc);
                        trunc->max_len = ctx->mirror_snaplen;
                        if (!ctx->xbridge->support.trunc) {
                            ctx->xout->slow |= SLOW_ACTION;
                        }
                    }

                    nl_msg_put_odp_port(ctx->odp_actions,
                                        OVS_ACTION_ATTR_OUTPUT,
                                        out_port);
                }
            }
        }

        ctx->sflow_odp_port = odp_port;
        ctx->sflow_n_outputs++;
        /* Record the output interface */
        ctx->nf_output_iface = ofp_port;
    }

    /* Output-port mirroring; this handles egress mirrors. */
    if (mbridge_has_mirrors(ctx->xbridge->mbridge) && xport->xbundle) {/* The bridge has mirrors and the port belongs to a bundle */
    /* Process the mirror */
        mirror_packet(ctx, xport->xbundle,
                      xbundle_mirror_dst(xport->xbundle->xbridge,
                                         xport->xbundle));/* Getting the mirror policy for this port */
    }

 out:
    /* Restore flow: values overwritten while composing the action must be restored */
    flow->vlan_tci = flow_vlan_tci;
    flow->pkt_mark = flow_pkt_mark;
    flow->nw_tos = flow_nw_tos;
}

/* 'flow' is non-const to allow for temporary modifications during the lookup.
 * Any changes are restored before returning.
 */
odp_port_t
tnl_port_map_lookup(struct flow *flow, struct flow_wildcards *wc)
{
    //Look up the classifier rule that terminates the tunnel
    const struct cls_rule *cr = classifier_lookup(&cls, OVS_VERSION_MAX, flow,
                                                  wc);
    //Return the tunnel port number.
    return (cr) ? tnl_port_cast(cr)->portno : ODPP_NONE;
}

The above is what the slow path does once the classifier lookup completes. After the packet has been processed there, the fast-path flow is installed and the datapath actions are executed.

/* Action execution callback. The parameter may_steal indicates whether the callback may take ownership of (and free) the packets. */
static void
dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
              const struct nlattr *a, bool may_steal)
{
    struct dp_netdev_execute_aux *aux = aux_;/* Auxiliary data for action execution */
    uint32_t *depth = recirc_depth_get();/* Recirculation depth */
    struct dp_netdev_pmd_thread *pmd = aux->pmd;/* PMD (poll mode driver) thread */
    struct dp_netdev *dp = pmd->dp;/* Datapath */
    int type = nl_attr_type(a);/* Get the action type */
    long long now = aux->now;/* Current time */
    struct tx_port *p;/* Transmit port */

    switch ((enum ovs_action_attr)type) {/* Action type */
    ......

    case OVS_ACTION_ATTR_TUNNEL_POP:/* Strip the outer header; the inner packet is then recirculated for processing */
        if (*depth < MAX_RECIRC_DEPTH) {
            struct dp_packet_batch *orig_packets_ = packets_;
            odp_port_t portno = nl_attr_get_odp_port(a);
            //Check whether the port exists
            p = pmd_tnl_port_cache_lookup(pmd, portno);
            if (p) {
                struct dp_packet_batch tnl_pkt;
                int i;

                if (!may_steal) {
                    dp_packet_batch_clone(&tnl_pkt, packets_);
                    packets_ = &tnl_pkt;
                    dp_packet_batch_reset_cutlen(orig_packets_);
                }
                
                dp_packet_batch_apply_cutlen(packets_);
                //Tunnel decap: p->port->netdev selects the decapsulation routine. The port number itself does not matter here.
                netdev_pop_header(p->port->netdev, packets_);
                if (!packets_->count) {
                    return;
                }

                for (i = 0; i < packets_->count; i++) {
                    //Set the input port of the overlay packet to portno.
                    packets_->packets[i]->md.in_port.odp_port = portno;
                }
                 
                (*depth)++;
                dp_netdev_recirculate(pmd, packets_);
                (*depth)--;
                return;
            }
        }
        break;
        ......
}
    
    
/* Pop the vxlan header */
struct dp_packet *
netdev_vxlan_pop_header(struct dp_packet *packet)
{
    struct pkt_metadata *md = &packet->md;/* Packet metadata */
    struct flow_tnl *tnl = &md->tunnel;/* Tunnel information in the metadata */
    struct vxlanhdr *vxh;
    unsigned int hlen;

    pkt_metadata_init_tnl(md);/* Initialize the packet's tunnel metadata */
    if (VXLAN_HLEN > dp_packet_l4_size(packet)) {/* The L4 data is smaller than a vxlan header: drop the packet */
        goto err;
    }

    vxh = udp_extract_tnl_md(packet, tnl, &hlen);/* Extract the tunnel metadata from the outer headers */
    if (!vxh) {
        goto err;
    }
    /* vxlan header verification */
    if (get_16aligned_be32(&vxh->vx_flags) != htonl(VXLAN_FLAGS) ||
       (get_16aligned_be32(&vxh->vx_vni) & htonl(0xff))) {
        VLOG_WARN_RL(&err_rl, "invalid vxlan flags=%#x vni=%#x\n",
                     ntohl(get_16aligned_be32(&vxh->vx_flags)),
                     ntohl(get_16aligned_be32(&vxh->vx_vni)));
        goto err;
    }
    //Extract the vni and store it as the tunnel id
    tnl->tun_id = htonll(ntohl(get_16aligned_be32(&vxh->vx_vni)) >> 8);
    tnl->flags |= FLOW_TNL_F_KEY;

    /* Skip past the tunnel headers */
    dp_packet_reset_packet(packet, hlen + VXLAN_HLEN);

    return packet;
err:
    dp_packet_delete(packet);
    return NULL;
}

The extracted tunnel metadata is filled into the flow, the tunnel header is skipped, tunnel termination is complete, and the inner packet is ready to be recirculated.
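
The header validation and vni extraction can be reproduced in a few lines of standalone C. This sketch works on a raw byte buffer instead of OVS's dp_packet, and vxlan_parse_vni is a hypothetical name; the checks mirror the ones in the listing above (flags word must be exactly 0x08000000 and the low 8 bits of the vni word must be zero).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VXLAN_HDR_LEN 8
#define VXLAN_FLAGS   0x08000000u  /* only the "VNI valid" flag set */

/* Reads a big-endian 32-bit value from 'p'. */
static uint32_t
read_be32(const uint8_t *p)
{
    return (uint32_t) p[0] << 24 | (uint32_t) p[1] << 16
           | (uint32_t) p[2] << 8 | (uint32_t) p[3];
}

/* Parses the VXLAN header at 'p', of which 'len' bytes are available.
 * On success stores the 24-bit vni in '*vni' and returns true.
 * Returns false for a truncated or malformed header. */
bool
vxlan_parse_vni(const uint8_t *p, size_t len, uint32_t *vni)
{
    uint32_t flags, vni_word;

    if (len < VXLAN_HDR_LEN) {
        return false;
    }
    flags = read_be32(p);
    vni_word = read_be32(p + 4);
    if (flags != VXLAN_FLAGS || (vni_word & 0xff)) {
        return false;
    }
    *vni = vni_word >> 8;  /* the vni occupies the upper 24 bits */
    return true;
}
```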

tunnel-decap-map table

Tunnel termination has been described above. Next, decap mapping is needed to find the input vxlan port of the overlay packet, after which processing of the overlay packet begins. OVS implements the decap-map function through the vxlan tunnel port control block.

vxlan interface description control block lookup

After tunnel termination, the packet is recirculated through dp_netdev_recirculate() carrying its tunnel metadata, and then goes through the slow-path classifier lookup.
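
The pop-then-recirculate pattern can be illustrated with a toy model. Everything here is hypothetical (struct pkt, tunnel_pop, datapath_process, and the depth limit of 5 are chosen for illustration); it only shows the three steps from the datapath listing: strip the outer header, rewrite the input port to the tunnel port, and re-run processing under a depth guard.

```c
#include <stdbool.h>

#define MAX_RECIRC_DEPTH 5   /* illustrative limit, not taken from OVS */
#define DROPPED          (-1)

/* Toy packet: only what the sketch needs. */
struct pkt {
    int  in_port;    /* port the packet is considered to have arrived on */
    bool has_outer;  /* true while the tunnel header is still present */
};

static int recirc_depth;

int datapath_process(struct pkt *p, int tnl_port);

/* Strips the outer header, makes the tunnel port the new input port and
 * sends the inner packet through the datapath again. */
int
tunnel_pop(struct pkt *p, int tnl_port)
{
    int out;

    if (recirc_depth >= MAX_RECIRC_DEPTH) {
        return DROPPED;          /* refuse to recurse any deeper */
    }
    p->has_outer = false;        /* decapsulate */
    p->in_port = tnl_port;       /* overlay packet "arrives" on the tunnel */

    recirc_depth++;
    out = datapath_process(p, tnl_port);
    recirc_depth--;
    return out;
}

/* Returns the input port the packet ends up being processed with. */
int
datapath_process(struct pkt *p, int tnl_port)
{
    return p->has_outer ? tunnel_pop(p, tnl_port) : p->in_port;
}
```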

static struct ofproto_dpif *
xlate_lookup_ofproto_(const struct dpif_backer *backer, const struct flow *flow,
                      ofp_port_t *ofp_in_port, const struct xport **xportp)
{
    struct xlate_cfg *xcfg = ovsrcu_get(struct xlate_cfg *, &xcfgp);/* Get the currently active xlate configuration */
    const struct xport *xport;

    xport = xport_lookup(xcfg, tnl_port_should_receive(flow)
                         ? tnl_port_receive(flow)
                         : odp_port_to_ofport(backer, flow->in_port.odp_port));
    if (OVS_UNLIKELY(!xport)) {
        return NULL;
    }
    *xportp = xport;
    if (ofp_in_port) {
        *ofp_in_port = xport->ofp_port;
    }
    return xport->xbridge->ofproto;/* Return the OpenFlow switch control block (ofproto) of the xport's bridge */
}

The vxlan port control block is looked up with the function tnl_port_receive(flow).

/* Looks in the table of tunnels for a tunnel matching the metadata in 'flow'.
 * Returns the 'ofport' corresponding to the new in_port, or a null pointer if
 * none is found.
 *
 * Callers should verify that 'flow' needs to be received by calling
 * tnl_port_should_receive() before this function. */
const struct ofport_dpif *
tnl_port_receive(const struct flow *flow) OVS_EXCLUDED(rwlock)
{
    char *pre_flow_str = NULL;
    const struct ofport_dpif *ofport;
    struct tnl_port *tnl_port;

    fat_rwlock_rdlock(&rwlock);
    //Find the matching tunnel port
    tnl_port = tnl_find(flow);
    //Use the tunnel port as the new input port
    ofport = tnl_port ? tnl_port->ofport : NULL;
    if (!tnl_port) {
        char *flow_str = flow_to_string(flow);

        VLOG_WARN_RL(&rl, "receive tunnel port not found (%s)", flow_str);
        free(flow_str);
        goto out;
    }

    if (!VLOG_DROP_DBG(&dbg_rl)) {
        pre_flow_str = flow_to_string(flow);
    }

    if (pre_flow_str) {
        char *post_flow_str = flow_to_string(flow);
        char *tnl_str = tnl_port_fmt(tnl_port);
        VLOG_DBG("flow received\n"
                 "%s"
                 " pre: %s\n"
                 "post: %s",
                 tnl_str, pre_flow_str, post_flow_str);
        free(tnl_str);
        free(pre_flow_str);
        free(post_flow_str);
    }

out:
    fat_rwlock_unlock(&rwlock);
    return ofport;
}

/* Returns the tnl_port that is the best match for the tunnel data in 'flow',
 * or NULL if no tnl_port matches 'flow'. */
static struct tnl_port *
tnl_find(const struct flow *flow) OVS_REQ_RDLOCK(rwlock)
{
    enum ip_src_type ip_src;
    int in_key_flow;
    int ip_dst_flow;
    int i;

    i = 0;
    for (in_key_flow = 0; in_key_flow < 2; in_key_flow++) {
        for (ip_dst_flow = 0; ip_dst_flow < 2; ip_dst_flow++) {
            for (ip_src = 0; ip_src < 3; ip_src++) {
                struct hmap *map = tnl_match_maps[i];

                if (map) {
                    struct tnl_port *tnl_port;
                    struct tnl_match match;

                    memset(&match, 0, sizeof match);

                    /* The apparent mix-up of 'ip_dst' and 'ip_src' below is
                     * correct, because "struct tnl_match" is expressed in
                     * terms of packets being sent out, but we are using it
                     * here as a description of how to treat received
                     * packets. 
                     * When in_key_flow is true, the vni does not need to match.
                     */
                    match.in_key = in_key_flow ? 0 : flow->tunnel.tun_id;
                    if (ip_src == IP_SRC_CFG) {
                        match.ipv6_src = flow_tnl_dst(&flow->tunnel);
                    }
                    if (!ip_dst_flow) {/* Remote IP is configured: compare it with the packet's source IP */
                        match.ipv6_dst = flow_tnl_src(&flow->tunnel);
                    }
                    match.odp_port = flow->in_port.odp_port;
                    match.in_key_flow = in_key_flow;
                    match.ip_dst_flow = ip_dst_flow;
                    match.ip_src_flow = ip_src == IP_SRC_FLOW;
                    //Exact-match lookup
                    tnl_port = tnl_find_exact(&match, map);
                    if (tnl_port) {
                        return tnl_port;
                    }
                }

                i++;
            }
        }
    }

    return NULL;
}
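The three nested loops above visit tnl_match_maps in a fixed priority order, with the most specific combination (key, remote IP and local IP all configured on the port) searched first. The following is a minimal sketch of how the counter i relates to the three loop variables. The helper name tnl_match_map_index is hypothetical, and the ordering of enum ip_src_type (in particular the IP_SRC_ANY member) is an assumption, since the excerpt only names IP_SRC_CFG and IP_SRC_FLOW.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed ordering of the source-IP match types; the excerpt only shows
 * that there are three of them and names IP_SRC_CFG and IP_SRC_FLOW. */
enum ip_src_type { IP_SRC_CFG, IP_SRC_ANY, IP_SRC_FLOW };

#define N_IP_SRC_TYPES 3
#define N_MATCH_TYPES (2 * 2 * N_IP_SRC_TYPES)   /* 12 maps in total */

/* Hypothetical helper: returns the value that the counter 'i' in tnl_find()
 * has when the loops reach this combination.  A lower index is searched
 * earlier, so it has higher priority. */
static int
tnl_match_map_index(bool in_key_flow, bool ip_dst_flow,
                    enum ip_src_type ip_src)
{
    assert((int) ip_src < N_IP_SRC_TYPES);
    return ((int) in_key_flow * 2 + (int) ip_dst_flow) * N_IP_SRC_TYPES
           + (int) ip_src;
}
```

Under these assumptions a port with a configured key, remote IP and local IP lives in map 0, while a port that leaves all three to the flow table lives in map 11 and is only consulted when nothing more specific matched.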

tunnel-encap table

This section analyzes VXLAN encapsulation. Once an overlay packet has been processed, a packet destined for another VTEP eventually leaves through a tunnel port. The tunnel-specific handling happens on the slow path, when the output action is composed.

/* Compose the output action for a packet. */
static void
compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
                        const struct xlate_bond_recirc *xr, bool check_stp)
{
    const struct xport *xport = get_ofp_port(ctx->xbridge, ofp_port);/* Get xport */
    ......

    if (xport->is_tunnel) {/* If the port is a tunnel interface */
        struct in6_addr dst;
         /* Save tunnel metadata so that changes made due to
          * the Logical (tunnel) Port are not visible for any further
          * matches, while explicit set actions on tunnel metadata are.
          */
        flow_tnl = flow->tunnel;/* Save tunnel metadata first */
        /* Fill in the tunnel metadata and get the datapath port to output on. */
        odp_port = tnl_port_send(xport->ofport, flow, ctx->wc);
        if (odp_port == ODPP_NONE) {
            xlate_report(ctx, OFT_WARN, "Tunneling decided against output");
            goto out; /* restore flow_nw_tos */
        }
        dst = flow_tnl_dst(&flow->tunnel); /* Outer destination IP set by tnl_port_send(). */
        if (ipv6_addr_equals(&dst, &ctx->orig_tunnel_ipv6_dst)) {
            xlate_report(ctx, OFT_WARN, "Not tunneling to our own address");
            goto out; /* restore flow_nw_tos */
        }
        if (ctx->xin->resubmit_stats) { /* Credit the statistics to the tunnel port's TX counters. */
            netdev_vport_inc_tx(xport->netdev, ctx->xin->resubmit_stats);
        }
        if (ctx->xin->xcache) { /* Cache the netdev so later statistics can be credited to it. */
            struct xc_entry *entry;

            entry = xlate_cache_add_entry(ctx->xin->xcache, XC_NETDEV);
            entry->dev.tx = netdev_ref(xport->netdev);
        }
        out_port = odp_port;
        /* With native (userspace) tunneling the encapsulation is deferred to
         * build_tunnel_send(); otherwise the kernel tunnel action is committed. */
        if (ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) {
            xlate_report(ctx, OFT_DETAIL, "output to native tunnel");
            tnl_push_pop_send = true;
        } else {
            xlate_report(ctx, OFT_DETAIL, "output to kernel tunnel");
            commit_odp_tunnel_action(flow, &ctx->base_flow, ctx->odp_actions); /* Commit the tunnel action */
            flow->tunnel = flow_tnl; /* Restore tunnel metadata */
        }
        }
    } else {
        odp_port = xport->odp_port;
        out_port = odp_port;
    }

    if (out_port != ODPP_NONE) { /* There is a valid output port. */
        xlate_commit_actions(ctx); /* Commit pending flow modifications as datapath actions. */

        if (xr) { /* Bond recirculation is required. */
            struct ovs_action_hash *act_hash;

            /* Hash action. */
            act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
                                                OVS_ACTION_ATTR_HASH,
                                                sizeof *act_hash);
            act_hash->hash_alg = xr->hash_alg;
            act_hash->hash_basis = xr->hash_basis;

            /* Recirc action: add the recirculation action with its recirc id. */
            nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
                           xr->recirc_id);
        } else {

            if (tnl_push_pop_send) { /* Native tunneling: emit a tunnel push action. */
                build_tunnel_send(ctx, xport, flow, odp_port);
                flow->tunnel = flow_tnl; /* Restore tunnel metadata */
            } else {
                odp_port_t odp_tnl_port = ODPP_NONE;

                /* XXX: Write better Filter for tunnel port. We can use inport
                * int tunnel-port flow to avoid these checks completely. 
                * For output to the LOCAL port, check whether a tunnel termination entry matches the packet.
                */
                if (ofp_port == OFPP_LOCAL &&
                    ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) {

                    odp_tnl_port = tnl_port_map_lookup(flow, wc);
                }
                /* A tunnel termination entry matched: add a tunnel pop action. */
                if (odp_tnl_port != ODPP_NONE) {
                    nl_msg_put_odp_port(ctx->odp_actions,
                                        OVS_ACTION_ATTR_TUNNEL_POP,
                                        odp_tnl_port);
                } else {
                    /* Tunnel push-pop action is not compatible with
                     * IPFIX action. */
                    compose_ipfix_action(ctx, out_port);

                    /* Handle truncation of the mirrored packet. */
                    if (ctx->mirror_snaplen > 0 &&
                        ctx->mirror_snaplen < UINT16_MAX) {
                        struct ovs_action_trunc *trunc;

                        trunc = nl_msg_put_unspec_uninit(ctx->odp_actions,
                                                         OVS_ACTION_ATTR_TRUNC,
                                                         sizeof *trunc);
                        trunc->max_len = ctx->mirror_snaplen;
                        if (!ctx->xbridge->support.trunc) {
                            ctx->xout->slow |= SLOW_ACTION;
                        }
                    }

                    nl_msg_put_odp_port(ctx->odp_actions,
                                        OVS_ACTION_ATTR_OUTPUT,
                                        out_port);
                }
            }
        }

        ctx->sflow_odp_port = odp_port;
        ctx->sflow_n_outputs++;
        /* Record the output interface for NetFlow. */
        ctx->nf_output_iface = ofp_port;
    }

    /* Egress mirroring for the output port. */
    if (mbridge_has_mirrors(ctx->xbridge->mbridge) && xport->xbundle) { /* The bridge has mirrors and the port belongs to a bundle. */
        /* Mirror the packet. */
        mirror_packet(ctx, xport->xbundle,
                      xbundle_mirror_dst(xport->xbundle->xbridge,
                                         xport->xbundle)); /* Mirrors that select this bundle as output. */
    }

 out:
    /* Restore the flow fields that were modified while composing the action. */
    flow->vlan_tci = flow_vlan_tci;
    flow->pkt_mark = flow_pkt_mark;
    flow->nw_tos = flow_nw_tos;
}


/* Given that 'flow' should be output to the ofport corresponding to
 * 'tnl_port', updates 'flow''s tunnel headers and returns the actual datapath
 * port that the output should happen on.  May return ODPP_NONE if the output
 * shouldn't occur. 
 * */
odp_port_t
tnl_port_send(const struct ofport_dpif *ofport, struct flow *flow,
              struct flow_wildcards *wc) OVS_EXCLUDED(rwlock)
{
    const struct netdev_tunnel_config *cfg;/* Tunnel configuration information */
    struct tnl_port *tnl_port;
    char *pre_flow_str = NULL;
    odp_port_t out_port;

    fat_rwlock_rdlock(&rwlock); /* Take the read side of the rwlock. */
    tnl_port = tnl_find_ofport(ofport); /* Find the tunnel port for this ofport. */
    out_port = tnl_port ? tnl_port->match.odp_port : ODPP_NONE;
    if (!tnl_port) {
        goto out;
    }

    cfg = netdev_get_tunnel_config(tnl_port->netdev); /* Get the tunnel configuration of the port. */
    ovs_assert(cfg);

    if (!VLOG_DROP_DBG(&dbg_rl)) {
        pre_flow_str = flow_to_string(flow);
    }

    if (!cfg->ip_src_flow) {/* Source IP is specified */
        flow->tunnel.ip_src = in6_addr_get_mapped_ipv4(&tnl_port->match.ipv6_src);
        if (!flow->tunnel.ip_src) {
            flow->tunnel.ipv6_src = tnl_port->match.ipv6_src;
        } else {
            flow->tunnel.ipv6_src = in6addr_any;
        }
    }
    if (!cfg->ip_dst_flow) {/* Destination IP is specified */
        flow->tunnel.ip_dst = in6_addr_get_mapped_ipv4(&tnl_port->match.ipv6_dst);
        if (!flow->tunnel.ip_dst) {
            flow->tunnel.ipv6_dst = tnl_port->match.ipv6_dst;
        } else {
            flow->tunnel.ipv6_dst = in6addr_any;
        }
    }
    flow->tunnel.tp_dst = cfg->dst_port;/* Destination port number */
    if (!cfg->out_key_flow) {
        flow->tunnel.tun_id = cfg->out_key;
    }

    if (cfg->ttl_inherit && is_ip_any(flow)) {
        wc->masks.nw_ttl = 0xff;/* Need to match ttl */
        flow->tunnel.ip_ttl = flow->nw_ttl;
    } else {
        flow->tunnel.ip_ttl = cfg->ttl;
    }

    if (cfg->tos_inherit && is_ip_any(flow)) {
        wc->masks.nw_tos |= IP_DSCP_MASK;
        flow->tunnel.ip_tos = flow->nw_tos & IP_DSCP_MASK;
    } else {
        flow->tunnel.ip_tos = cfg->tos;
    }

    /* ECN fields are always inherited. */
    if (is_ip_any(flow)) {
        wc->masks.nw_tos |= IP_ECN_MASK;

        if (IP_ECN_is_ce(flow->nw_tos)) {
            flow->tunnel.ip_tos |= IP_ECN_ECT_0;
        } else {
            flow->tunnel.ip_tos |= flow->nw_tos & IP_ECN_MASK;
        }
    }

    flow->tunnel.flags |= (cfg->dont_fragment ? FLOW_TNL_F_DONT_FRAGMENT : 0)
        | (cfg->csum ? FLOW_TNL_F_CSUM : 0)
        | (cfg->out_key_present ? FLOW_TNL_F_KEY : 0);

    if (pre_flow_str) {
        char *post_flow_str = flow_to_string(flow);
        char *tnl_str = tnl_port_fmt(tnl_port);
        VLOG_DBG("flow sent\n"
                 "%s"
                 " pre: %s\n"
                 "post: %s",
                 tnl_str, pre_flow_str, post_flow_str);
        free(tnl_str);
        free(pre_flow_str);
        free(post_flow_str);
    }

out:
    fat_rwlock_unlock(&rwlock);
    return out_port;
}
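The TOS handling in tnl_port_send() is easy to misread, so here is a minimal sketch of just that step for an IP inner packet. The helper name outer_tos is hypothetical. The mask values follow the standard layout of the IPv4 TOS byte (six DSCP bits, two ECN bits) and are assumed to match the OVS constants of the same name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard IPv4 TOS byte layout: upper six bits DSCP, lower two bits ECN.
 * Assumed to match the OVS constants with the same names. */
#define IP_DSCP_MASK 0xfc
#define IP_ECN_MASK  0x03
#define IP_ECN_ECT_0 0x02
#define IP_ECN_CE    0x03

/* Hypothetical helper mirroring the TOS/ECN steps of tnl_port_send() for an
 * inner packet that is IP.  'tos_inherit' and 'cfg_tos' stand for the
 * corresponding tunnel configuration fields. */
static uint8_t
outer_tos(uint8_t inner_tos, bool tos_inherit, uint8_t cfg_tos)
{
    /* DSCP: either copied from the inner header or taken from the config. */
    uint8_t tos = tos_inherit ? (inner_tos & IP_DSCP_MASK) : cfg_tos;

    /* ECN is always inherited, except that an inner "congestion
     * experienced" mark is sent as ECT(0) on the outer header. */
    if ((inner_tos & IP_ECN_MASK) == IP_ECN_CE) {
        tos |= IP_ECN_ECT_0;
    } else {
        tos |= inner_tos & IP_ECN_MASK;
    }
    return tos;
}
```

For example, an inner TOS of 0xbb (DSCP EF with CE set) and tos_inherit enabled yields an outer TOS of 0xba in this sketch.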

Building the outer header

This step only prepares the outer encapsulation data and stores it in tnl_push_data, so that it can be written into the packet later.

static int
build_tunnel_send(struct xlate_ctx *ctx, const struct xport *xport,
                  const struct flow *flow, odp_port_t tunnel_odp_port)
{
    struct netdev_tnl_build_header_params tnl_params;
    struct ovs_action_push_tnl tnl_push_data;
    struct xport *out_dev = NULL;
    ovs_be32 s_ip = 0, d_ip = 0;
    struct in6_addr s_ip6 = in6addr_any;
    struct in6_addr d_ip6 = in6addr_any;
    struct eth_addr smac;
    struct eth_addr dmac;
    int err;
    char buf_sip6[INET6_ADDRSTRLEN];
    char buf_dip6[INET6_ADDRSTRLEN];
    /* Underlay route lookup; the tunnel's remote IP is already known from the vxlan port. */
    err = tnl_route_lookup_flow(flow, &d_ip6, &s_ip6, &out_dev);
    if (err) {
        xlate_report(ctx, OFT_WARN, "native tunnel routing failed");
        return err;
    }

    xlate_report(ctx, OFT_DETAIL, "tunneling to %s via %s",
                 ipv6_string_mapped(buf_dip6, &d_ip6),
                 netdev_get_name(out_dev->netdev));

    /* Use mac addr of bridge port of the peer as the source MAC address. */
    err = netdev_get_etheraddr(out_dev->netdev, &smac);
    if (err) {
        xlate_report(ctx, OFT_WARN,
                     "tunnel output device lacks Ethernet address");
        return err;
    }

    d_ip = in6_addr_get_mapped_ipv4(&d_ip6);
    if (d_ip) {
        s_ip = in6_addr_get_mapped_ipv4(&s_ip6);
    }
    /* Look up the next hop's MAC address in the tunnel neighbor cache. */
    err = tnl_neigh_lookup(out_dev->xbridge->name, &d_ip6, &dmac);
    if (err) {
        xlate_report(ctx, OFT_DETAIL,
                     "neighbor cache miss for %s on bridge %s, "
                     "sending %s request",
                     buf_dip6, out_dev->xbridge->name, d_ip ? "ARP" : "ND");
        if (d_ip) { /* IPv4: send an ARP request. */
            tnl_send_arp_request(ctx, out_dev, smac, s_ip, d_ip);
        } else {
            tnl_send_nd_request(ctx, out_dev, smac, &s_ip6, &d_ip6);
        }
        return err;
    }

    if (ctx->xin->xcache) {
        struct xc_entry *entry;

        entry = xlate_cache_add_entry(ctx->xin->xcache, XC_TNL_NEIGH);
        ovs_strlcpy(entry->tnl_neigh_cache.br_name, out_dev->xbridge->name,
                    sizeof entry->tnl_neigh_cache.br_name);
        entry->tnl_neigh_cache.d_ipv6 = d_ip6;
    }

    xlate_report(ctx, OFT_DETAIL, "tunneling from "ETH_ADDR_FMT" %s"
                 " to "ETH_ADDR_FMT" %s",
                 ETH_ADDR_ARGS(smac), ipv6_string_mapped(buf_sip6, &s_ip6),
                 ETH_ADDR_ARGS(dmac), buf_dip6);
    /* Fill in the underlay link-layer parameters. */
    netdev_init_tnl_build_header_params(&tnl_params, flow, &s_ip6, dmac, smac);
    /* Build the outer Ethernet, IP and UDP headers and store them in tnl_push_data. */
    err = tnl_port_build_header(xport->ofport, &tnl_push_data, &tnl_params);
    if (err) {
        return err;
    }
    /* Record the tunnel port and the underlay output port. */
    tnl_push_data.tnl_port = odp_to_u32(tunnel_odp_port);
    tnl_push_data.out_port = odp_to_u32(out_dev->odp_port);
    /* Add a tunnel push action; the packet itself is encapsulated when the action is executed. */
    odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);
    return 0;
}
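Both tnl_port_send() and build_tunnel_send() store every tunnel endpoint as a struct in6_addr and call in6_addr_get_mapped_ipv4() to decide between the IPv4 and IPv6 paths (ARP versus ND, for instance). A minimal sketch of that idea follows, using IPv4-mapped IPv6 addresses (::ffff:a.b.c.d). The demo_ names are hypothetical stand-ins and are not the OVS implementation.

```c
#include <arpa/inet.h>   /* inet_pton(), used in the usage note below */
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* The 96-bit prefix of an IPv4-mapped IPv6 address, ::ffff:0:0/96. */
static const uint8_t demo_v4mapped_prefix[12] = {
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff
};

/* Hypothetical helper: returns the embedded IPv4 address (network byte
 * order) if 'addr' is IPv4-mapped, otherwise 0. */
static uint32_t
demo_get_mapped_ipv4(const struct in6_addr *addr)
{
    uint32_t ip4;

    if (memcmp(addr->s6_addr, demo_v4mapped_prefix,
               sizeof demo_v4mapped_prefix)) {
        return 0;
    }
    memcpy(&ip4, &addr->s6_addr[12], sizeof ip4);
    return ip4;
}

/* Hypothetical helper: wraps an IPv4 address (network byte order) into its
 * IPv4-mapped IPv6 form. */
static struct in6_addr
demo_map_ipv4(uint32_t ip4)
{
    struct in6_addr addr;

    memcpy(addr.s6_addr, demo_v4mapped_prefix, sizeof demo_v4mapped_prefix);
    memcpy(&addr.s6_addr[12], &ip4, sizeof ip4);
    return addr;
}
```

Parsing "::ffff:10.255.9.193" with inet_pton(AF_INET6, ...) and passing the result to demo_get_mapped_ipv4() gives back the IPv4 address, while a native IPv6 address such as fe80::1 gives 0. That zero result is what makes the code above fall through to the IPv6 fields.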
void
odp_put_tnl_push_action(struct ofpbuf *odp_actions,
                        struct ovs_action_push_tnl *data)
{
    int size = offsetof(struct ovs_action_push_tnl, header);

    size += data->header_len;
    nl_msg_put_unspec(odp_actions, OVS_ACTION_ATTR_TUNNEL_PUSH, data, size);
}
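odp_put_tnl_push_action() copies only the part of the structure that is actually used: the fixed fields plus header_len bytes of the prebuilt header. A minimal sketch of this variable-length trick follows. struct demo_push_tnl and its capacity are illustrative assumptions, not the real struct ovs_action_push_tnl.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_HEADER 512   /* Illustrative capacity, not the OVS value. */

/* Hypothetical stand-in for the push action payload. */
struct demo_push_tnl {
    uint32_t tnl_port;
    uint32_t out_port;
    uint32_t header_len;              /* Bytes of 'header' actually used. */
    uint8_t header[DEMO_MAX_HEADER];  /* Prebuilt outer header. */
};

/* Size of the fixed fields plus the used part of the header. */
static size_t
demo_push_tnl_size(const struct demo_push_tnl *data)
{
    return offsetof(struct demo_push_tnl, header) + data->header_len;
}

/* Copies only the used part into 'out'.  Returns the number of bytes
 * written, or 0 if the header length is invalid or 'out' is too small. */
static size_t
demo_push_tnl_serialize(const struct demo_push_tnl *data,
                        void *out, size_t out_len)
{
    size_t size = demo_push_tnl_size(data);

    if (data->header_len > DEMO_MAX_HEADER || size > out_len) {
        return 0;
    }
    memcpy(out, data, size);
    return size;
}
```

For VXLAN over IPv4 the outer header is 50 bytes (14 Ethernet + 20 IP + 8 UDP + 8 VXLAN), so in this sketch only 62 bytes are copied instead of the whole structure.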

Executing the push action

/* Callback that executes a datapath action. 'may_steal' tells whether the callback may take ownership of (and free) the packets. */
static void
dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
              const struct nlattr *a, bool may_steal)
{
    ......
    case OVS_ACTION_ATTR_TUNNEL_PUSH: /* Add the outer tunnel header (e.g. vxlan); the packet must then be recirculated. */
        if (*depth < MAX_RECIRC_DEPTH) { /* Recirculate only while below the maximum nesting depth. */
            struct dp_packet_batch tnl_pkt;
            struct dp_packet_batch *orig_packets_ = packets_;
            int err;

            if (!may_steal) { /* The caller keeps ownership of the packets, so work on a clone. */
                dp_packet_batch_clone(&tnl_pkt, packets_);
                packets_ = &tnl_pkt;
                dp_packet_batch_reset_cutlen(orig_packets_);
            }

            dp_packet_batch_apply_cutlen(packets_);
            /* Push the tunnel header. */
            err = push_tnl_action(pmd, a, packets_);
            if (!err) { /* After encapsulation the packets are recirculated. */
                (*depth)++;
                dp_netdev_recirculate(pmd, packets_);
                (*depth)--;
            }
            return;
        }
        break;
        ......
    dp_packet_delete_batch(packets_, may_steal);
}

/* Push the tunnel header onto a batch of packets. */
static int
push_tnl_action(const struct dp_netdev_pmd_thread *pmd, /* Current PMD thread */
                const struct nlattr *attr, /* Push action attribute */
                struct dp_packet_batch *batch) /* Batch of packets */
{
    struct tx_port *tun_port;
    const struct ovs_action_push_tnl *data;
    int err;

    data = nl_attr_get(attr);

    /* Look up the tunnel port in the PMD's port cache. */
    tun_port = pmd_tnl_port_cache_lookup(pmd, u32_to_odp(data->tnl_port));
    if (!tun_port) {
        err = -EINVAL;
        goto error;
    }

    /* Push the tunnel header. */
    err = netdev_push_header(tun_port->port->netdev, batch, data);
    if (!err) {
        return 0;
    }
error:
    dp_packet_delete_batch(batch, true);
    return err;
}
/* Push tunnel header (reading from tunnel metadata) and resize
 * 'batch->packets' for further processing.
 *
 * The caller must make sure that 'netdev' support this operation by checking
 * that netdev_has_tunnel_push_pop() returns true. */
int
netdev_push_header(const struct netdev *netdev,
                   struct dp_packet_batch *batch,
                   const struct ovs_action_push_tnl *data)
{
    int i;

    for (i = 0; i < batch->count; i++) { /* Process each packet in turn. */
        netdev->netdev_class->push_header(batch->packets[i], data);
        /* Key step: reinitialize the metadata of the encapsulated packet for
         * recirculation. The new in_port is data->out_port, the port found
         * by the underlay route lookup. */
        pkt_metadata_init(&batch->packets[i]->md, u32_to_odp(data->out_port));
    }

    return 0;
}
/* For vxlan, netdev->netdev_class->push_header is the function below,
 * which pushes the tunnel header onto a packet. */
void
netdev_tnl_push_udp_header(struct dp_packet *packet,
                           const struct ovs_action_push_tnl *data)
{
    struct udp_header *udp;
    int ip_tot_size;

    /* Push the prebuilt outer header (starting with Ethernet and IP) first. */
    udp = netdev_tnl_push_ip_header(packet, data->header, data->header_len, &ip_tot_size);

    /* Set the UDP source port, derived from the packet's flow hash. */
    udp->udp_src = netdev_tnl_get_src_port(packet);
    udp->udp_len = htons(ip_tot_size); /* UDP length (header plus payload) */

    if (udp->udp_csum) { /* Compute the UDP checksum if one was requested. */
        uint32_t csum;
        if (netdev_tnl_is_header_ipv6(dp_packet_data(packet))) {
            csum = packet_csum_pseudoheader6(netdev_tnl_ipv6_hdr(dp_packet_data(packet)));
        } else {
            csum = packet_csum_pseudoheader(netdev_tnl_ip_hdr(dp_packet_data(packet)));
        }

        csum = csum_continue(csum, udp, ip_tot_size);
        udp->udp_csum = csum_finish(csum);

        if (!udp->udp_csum) {
            udp->udp_csum = htons(0xffff);
        }
    }
}
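The last lines of netdev_tnl_push_udp_header() replace a computed checksum of zero with 0xffff, because in UDP a transmitted checksum of zero means "no checksum". A minimal sketch of a ones' complement checksum with that substitution follows. The function names are hypothetical and the pseudo-header is left out for brevity, so this is not the OVS csum_* API.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds 'len' bytes to a running ones' complement sum, treating the data as
 * 16-bit big-endian words; an odd trailing byte is padded with zero.
 * Intended for short buffers, so the 32-bit accumulator cannot overflow. */
static uint32_t
demo_csum_add(uint32_t sum, const uint8_t *data, size_t len)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        sum += ((uint32_t) data[i] << 8) | data[i + 1];
    }
    if (len & 1) {
        sum += (uint32_t) data[len - 1] << 8;
    }
    return sum;
}

/* Folds the carries back in and returns the complement of the sum. */
static uint16_t
demo_csum_finish(uint32_t sum)
{
    while (sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return (uint16_t) ~sum;
}

/* UDP variant: a result of 0 is sent as 0xffff, which is the same value in
 * ones' complement arithmetic but is not mistaken for "no checksum". */
static uint16_t
demo_udp_csum_finish(uint32_t sum)
{
    uint16_t csum = demo_csum_finish(sum);

    return csum ? csum : 0xffff;
}
```

With the eight bytes 00 01 f2 03 f4 f5 f6 f7 (the worked example from RFC 1071) the folded sum is 0xddf2 and the checksum is 0x220d.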

/* The underlay output port becomes the in_port of the encapsulated packet,
 * i.e. the L3 egress port turns into an L2 ingress port. */
static inline void
pkt_metadata_init(struct pkt_metadata *md, odp_port_t port)
{
    /* It can be expensive to zero out all of the tunnel metadata. However,
     * we can just zero out ip_dst and the rest of the data will never be
     * looked at. */
    memset(md, 0, offsetof(struct pkt_metadata, in_port)); /* Zero every field that precedes in_port. */
    md->tunnel.ip_dst = 0;
    md->tunnel.ipv6_dst = in6addr_any;

    md->in_port.odp_port = port;
}
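pkt_metadata_init() avoids clearing the large tunnel metadata by zeroing only the fields that precede in_port and then invalidating the tunnel through its destination address. A minimal sketch of the same pattern on a made-up structure follows. struct demo_md and its field layout are assumptions for illustration, not the real struct pkt_metadata.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical metadata layout: cheap fields first, then in_port, then a
 * large tunnel area that is expensive to clear. */
struct demo_md {
    uint32_t recirc_id;
    uint32_t dp_hash;
    uint32_t skb_priority;
    uint32_t pkt_mark;
    uint32_t in_port;          /* First field not covered by the memset. */
    uint32_t tunnel_ip_dst;    /* 0 means "no tunnel metadata". */
    uint8_t tunnel_rest[64];   /* Deliberately left untouched. */
};

static void
demo_md_init(struct demo_md *md, uint32_t port)
{
    /* Clear only the prefix of the structure that precedes in_port. */
    memset(md, 0, offsetof(struct demo_md, in_port));

    /* Invalidate the tunnel data without touching the rest of it. */
    md->tunnel_ip_dst = 0;

    md->in_port = port;
}
```

The design relies on every reader checking the tunnel destination before looking at the remaining tunnel fields, so stale bytes there are never interpreted.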

At this point dp_netdev_recirculate() re-injects the encapsulated packet. The packet is then forwarded by FDB lookup and leaves the server through the DPDK physical port.

underlay route table

OVS routes come from two sources. The first is routes synchronized from the kernel, which are marked as Cached. The second is routes added manually with commands such as ovs-appctl ovs/route/add.

[root@ ~]# ovs-appctl ovs/route/show
Route Table:
Cached: 1.1.1.1/32 dev tun0 SRC 1.1.1.1
Cached: 10.226.137.204/32 dev eth2 SRC 10.226.137.204
Cached: 10.255.9.204/32 dev br-phy SRC 10.255.9.204
Cached: 127.0.0.1/32 dev lo SRC 127.0.0.1
Cached: 169.254.169.110/32 dev tap_metadata SRC 169.254.169.110
Cached: 169.254.169.240/32 dev tap_proxy SRC 169.254.169.240
Cached: 169.254.169.241/32 dev tap_proxy SRC 169.254.169.241
Cached: 169.254.169.250/32 dev tap_metadata SRC 169.254.169.250
Cached: 169.254.169.254/32 dev tap_metadata SRC 169.254.169.254
Cached: 172.17.0.1/32 dev docker0 SRC 172.17.0.1
Cached: ::1/128 dev lo SRC ::1
Cached: 10.226.137.192/27 dev eth2 SRC 10.226.137.204
Cached: 10.226.137.224/27 dev br-phy GW 10.255.9.193 SRC 10.255.9.204
Cached: 10.254.225.0/27 dev br-phy GW 10.255.9.193 SRC 10.255.9.204
Cached: 10.254.225.224/27 dev br-phy GW 10.255.9.193 SRC 10.255.9.204
Cached: 10.255.8.192/27 dev br-phy GW 10.255.9.193 SRC 10.255.9.204
Cached: 10.255.9.192/27 dev br-phy SRC 10.255.9.204
Cached: 1.1.1.0/24 dev br-phy GW 10.255.9.193 SRC 10.255.9.204
Cached: 10.226.0.0/16 dev eth2 GW 10.226.137.193 SRC 10.226.137.204
Cached: 172.17.0.0/16 dev docker0 SRC 172.17.0.1
Cached: 127.0.0.0/8 dev lo SRC 127.0.0.1
Cached: 0.0.0.0/0 dev eth2 GW 10.226.137.193 SRC 10.226.137.204
Cached: fe80::/64 dev port-r6kxee6d3t SRC fe80::80a0:94ff:fedc:43b
[root@A04-R08-I137-204-9320C72 ~]# 
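The table above is consulted with longest-prefix matching: among all entries that cover the destination, the one with the longest prefix wins. OVS implements this with its classifier. The sketch below swaps in a plain linear scan to show the rule itself, and struct demo_route, the IPV4() macro and demo_route_lookup() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Builds an IPv4 address in host byte order from its four octets. */
#define IPV4(a, b, c, d) \
    (((uint32_t) (a) << 24) | ((uint32_t) (b) << 16) \
     | ((uint32_t) (c) << 8) | (uint32_t) (d))

/* Hypothetical route entry; addresses are in host byte order. */
struct demo_route {
    uint32_t prefix;
    uint8_t plen;        /* Prefix length, 0..32. */
    const char *dev;     /* Output device name. */
};

static uint32_t
demo_plen_to_mask(uint8_t plen)
{
    return plen ? ~(uint32_t) 0 << (32 - plen) : 0;
}

/* Linear-scan longest-prefix match.  Returns the best route or NULL. */
static const struct demo_route *
demo_route_lookup(const struct demo_route *routes, size_t n, uint32_t ip)
{
    const struct demo_route *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        uint32_t mask = demo_plen_to_mask(routes[i].plen);

        if ((ip & mask) == (routes[i].prefix & mask)
            && (!best || routes[i].plen > best->plen)) {
            best = &routes[i];
        }
    }
    return best;
}
```

Loaded with a few entries from the listing (10.255.9.192/27 on br-phy, 10.226.0.0/16 on eth2 and the default route on eth2), a lookup of 10.255.9.193 selects the /27 route on br-phy, and 8.8.8.8 falls back to the default route.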

Routing module initialization

/* Users of the route_table module should register themselves with this
 * function before making any other route_table function calls. */
void
route_table_init(void)
    OVS_EXCLUDED(route_table_mutex)
{
    ovs_mutex_lock(&route_table_mutex);
    ovs_assert(!nln);
    ovs_assert(!route_notifier);
    ovs_assert(!route6_notifier);

    ovs_router_init();
    nln = nln_create(NETLINK_ROUTE, (nln_parse_func *) route_table_parse,
                     &rtmsg);

    route_notifier =
        nln_notifier_create(nln, RTNLGRP_IPV4_ROUTE,
                            (nln_notify_func *) route_table_change, NULL);
    route6_notifier =
        nln_notifier_create(nln, RTNLGRP_IPV6_ROUTE,
                            (nln_notify_func *) route_table_change, NULL);

    route_table_reset();
    name_table_init();

    ovs_mutex_unlock(&route_table_mutex);
}

/* May not be called more than once. */
void
ovs_router_init(void)
{
    classifier_init(&cls, NULL); /* The classifier implements the route lookup. */
    unixctl_command_register("ovs/route/add", "ip_addr/prefix_len out_br_name gw", 2, 3,
                             ovs_router_add, NULL);
    unixctl_command_register("ovs/route/show", "", 0, 0, ovs_router_show, NULL);
    unixctl_command_register("ovs/route/del", "ip_addr/prefix_len", 1, 1, ovs_router_del,
                             NULL);
    unixctl_command_register("ovs/route/lookup", "ip_addr", 1, 1,
                             ovs_router_lookup_cmd, NULL);
}

Monitoring Kernel Routing Events with netlink

/* Netlink callback: marks the cached routing table as stale. */
static void
route_table_change(const struct route_table_msg *change OVS_UNUSED,
                   void *aux OVS_UNUSED)
{
    route_table_valid = false;
}

/* Run periodically to update the locally maintained routing table. */
void
route_table_run(void)
    OVS_EXCLUDED(route_table_mutex)
{
    ovs_mutex_lock(&route_table_mutex);
    if (nln) {
        rtnetlink_run();
        nln_run(nln);

        if (!route_table_valid) {
            route_table_reset();
        }
    }
    ovs_mutex_unlock(&route_table_mutex);
}

static int
route_table_reset(void)
{
    struct nl_dump dump;
    struct rtgenmsg *rtmsg;
    uint64_t reply_stub[NL_DUMP_BUFSIZE / 8];
    struct ofpbuf request, reply, buf;

    route_map_clear(); /* Delete all cached routes. */
    netdev_get_addrs_list_flush();
    route_table_valid = true;
    rt_change_seq++;

    ofpbuf_init(&request, 0);

    nl_msg_put_nlmsghdr(&request, sizeof *rtmsg, RTM_GETROUTE, NLM_F_REQUEST);

    rtmsg = ofpbuf_put_zeros(&request, sizeof *rtmsg);
    rtmsg->rtgen_family = AF_UNSPEC;
    /* Dump the kernel routing table and re-add every route. */
    nl_dump_start(&dump, NETLINK_ROUTE, &request);
    ofpbuf_uninit(&request);

    ofpbuf_use_stub(&buf, reply_stub, sizeof reply_stub);
    while (nl_dump_next(&dump, &reply, &buf)) {
        struct route_table_msg msg;

        if (route_table_parse(&reply, &msg)) {
            route_table_handle_msg(&msg);
        }
    }
    ofpbuf_uninit(&buf);

    return nl_dump_done(&dump);
}
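The three functions above form an invalidate-then-rebuild pattern: the netlink callback only sets a flag, and the periodic run function performs one full dump no matter how many change notifications arrived in between. A minimal sketch of that pattern follows, with the netlink parts left out. All demo_ names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the module state of the route table. */
struct demo_table {
    bool valid;                /* False once any change was notified. */
    unsigned int change_seq;   /* Bumped on every rebuild. */
    unsigned int n_rebuilds;   /* Counts full rebuilds, for illustration. */
};

/* Notification callback: cheap, it only marks the table as stale. */
static void
demo_table_change(struct demo_table *t)
{
    t->valid = false;
}

/* Full rebuild; the real code clears the table and dumps all routes. */
static void
demo_table_reset(struct demo_table *t)
{
    t->valid = true;
    t->change_seq++;
    t->n_rebuilds++;
}

/* Periodic run: rebuilds at most once per call, and only if needed. */
static void
demo_table_run(struct demo_table *t)
{
    if (!t->valid) {
        demo_table_reset(t);
    }
}
```

Three notifications followed by one call to demo_table_run() cause a single rebuild in this sketch, which keeps a burst of kernel route updates from triggering a dump per update.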

underlay neighbor

OVS-DPDK maintains an underlay neighbor cache for its tunnels.

static void
dp_initialize(void)
{
    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;

    if (ovsthread_once_start(&once)) {
        int i;

        tnl_conf_seq = seq_create();
        dpctl_unixctl_register();
        tnl_port_map_init();
        tnl_neigh_cache_init();
        route_table_init();

        for (i = 0; i < ARRAY_SIZE(base_dpif_classes); i++) {
            dp_register_provider(base_dpif_classes[i]);
        }

        ovsthread_once_done(&once);
    }
}

void
tnl_neigh_cache_init(void)
{
    unixctl_command_register("tnl/arp/show", "", 0, 0, tnl_neigh_cache_show, NULL);
    unixctl_command_register("tnl/arp/set", "BRIDGE IP MAC", 3, 3, tnl_neigh_cache_add, NULL);
    unixctl_command_register("tnl/arp/flush", "", 0, 0, tnl_neigh_cache_flush, NULL);
    unixctl_command_register("tnl/neigh/show", "", 0, 0, tnl_neigh_cache_show, NULL);
    unixctl_command_register("tnl/neigh/set", "BRIDGE IP MAC", 3, 3, tnl_neigh_cache_add, NULL);
    unixctl_command_register("tnl/neigh/flush", "", 0, 0, tnl_neigh_cache_flush, NULL);
}

View the neighbors with ovs-appctl

[root@ ~]# ovs-appctl tnl/arp/show
IP                                            MAC                 Bridge
==========================================================================
10.255.9.193                                  9c:e8:95:0f:49:16   br-phy
[root@ ~]# 

OVS-DPDK learns its neighbors by snooping the ARP and IPv6 neighbor discovery packets it sees on the data plane.

/* Translate OpenFlow actions into datapath actions. */
static void
do_xlate_actions(const struct ofpact *ofpacts, size_t ofpacts_len,
                 struct xlate_ctx *ctx)
{
    struct flow_wildcards *wc = ctx->wc; /* Wildcards */
    struct flow *flow = &ctx->xin->flow; /* Flow being translated */
    const struct ofpact *a;

    /* Neighbor snooping is enabled only with native tunneling; neighbors are learned from ARP and ICMPv6 ND packets. */
    if (ovs_native_tunneling_is_on(ctx->xbridge->ofproto)) {
        tnl_neigh_snoop(flow, wc, ctx->xbridge->name);
    }
    /* dl_type already in the mask, not set below. */
    ......
}
/* Neighbor learning: try ARP first, then IPv6 neighbor discovery. */
int
tnl_neigh_snoop(const struct flow *flow, struct flow_wildcards *wc,
                const char name[IFNAMSIZ])
{
    int res;
    res = tnl_arp_snoop(flow, wc, name);
    if (res != EINVAL) {
        return res;
    }
    return tnl_nd_snoop(flow, wc, name);
}
static int
tnl_arp_snoop(const struct flow *flow, struct flow_wildcards *wc,
              const char name[IFNAMSIZ])
{
    if (flow->dl_type != htons(ETH_TYPE_ARP)
        || FLOW_WC_GET_AND_MASK_WC(flow, wc, nw_proto) != ARP_OP_REPLY
        || eth_addr_is_zero(FLOW_WC_GET_AND_MASK_WC(flow, wc, arp_sha))) {
        return EINVAL;
    }

    tnl_arp_set(name, FLOW_WC_GET_AND_MASK_WC(flow, wc, nw_src), flow->arp_sha);
    return 0;
}
static int
tnl_nd_snoop(const struct flow *flow, struct flow_wildcards *wc,
             const char name[IFNAMSIZ])
{
    if (!is_nd(flow, wc) || flow->tp_src != htons(ND_NEIGHBOR_ADVERT)) {
        return EINVAL;
    }
    /* - RFC4861 says Neighbor Advertisements sent in response to unicast Neighbor
     *   Solicitations SHOULD include the Target link-layer address. However, Linux
     *   doesn't. So, the response to Solicitations sent by OVS will include the
     *   TLL address and other Advertisements not including it can be ignored.
     * - OVS flow extract can set this field to zero in case of packet parsing errors.
     *   For details refer miniflow_extract()*/
    if (eth_addr_is_zero(FLOW_WC_GET_AND_MASK_WC(flow, wc, arp_tha))) {
        return EINVAL;
    }

    memset(&wc->masks.ipv6_src, 0xff, sizeof wc->masks.ipv6_src);
    memset(&wc->masks.ipv6_dst, 0xff, sizeof wc->masks.ipv6_dst);
    memset(&wc->masks.nd_target, 0xff, sizeof wc->masks.nd_target);

    tnl_neigh_set__(name, &flow->nd_target, flow->arp_tha);
    return 0;
}

Periodic neighbor aging

void
tnl_neigh_cache_run(void)
{
    struct tnl_neigh_entry *neigh;
    bool changed = false;

    ovs_mutex_lock(&mutex);
    CMAP_FOR_EACH(neigh, cmap_node, &table) {
        if (neigh->expires <= time_now()) {
            tnl_neigh_delete(neigh);
            changed = true;
        }
    }
    ovs_mutex_unlock(&mutex);

    if (changed) {
        seq_change(tnl_conf_seq);
    }
}
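Learning and aging together behave like a small cache with a per-entry expiry time: snooping an ARP reply or neighbor advertisement inserts or refreshes an entry, and the periodic run removes entries whose time has passed. A minimal sketch using a fixed-size array instead of the OVS cmap follows. The demo_ names and the timeout value are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define DEMO_NEIGH_MAX 8
#define DEMO_NEIGH_TIMEOUT 900   /* Seconds; illustrative value. */

struct demo_neigh {
    bool used;
    uint32_t ip;
    uint8_t mac[6];
    time_t expires;
};

struct demo_neigh_table {
    struct demo_neigh entries[DEMO_NEIGH_MAX];
};

/* Inserts or refreshes an entry.  Returns false if the table is full. */
static bool
demo_neigh_set(struct demo_neigh_table *t, uint32_t ip,
               const uint8_t mac[6], time_t now)
{
    struct demo_neigh *slot = NULL;
    int i;

    for (i = 0; i < DEMO_NEIGH_MAX; i++) {
        struct demo_neigh *e = &t->entries[i];

        if (e->used && e->ip == ip) {
            slot = e;            /* Existing entry: refresh it. */
            break;
        }
        if (!e->used && !slot) {
            slot = e;            /* Remember the first free slot. */
        }
    }
    if (!slot) {
        return false;
    }
    slot->used = true;
    slot->ip = ip;
    memcpy(slot->mac, mac, 6);
    slot->expires = now + DEMO_NEIGH_TIMEOUT;
    return true;
}

/* Copies the MAC for 'ip' into 'mac'.  Returns false on a cache miss. */
static bool
demo_neigh_lookup(const struct demo_neigh_table *t, uint32_t ip,
                  uint8_t mac[6])
{
    int i;

    for (i = 0; i < DEMO_NEIGH_MAX; i++) {
        if (t->entries[i].used && t->entries[i].ip == ip) {
            memcpy(mac, t->entries[i].mac, 6);
            return true;
        }
    }
    return false;
}

/* Removes expired entries.  Returns true if anything was removed. */
static bool
demo_neigh_run(struct demo_neigh_table *t, time_t now)
{
    bool changed = false;
    int i;

    for (i = 0; i < DEMO_NEIGH_MAX; i++) {
        if (t->entries[i].used && t->entries[i].expires <= now) {
            t->entries[i].used = false;
            changed = true;
        }
    }
    return changed;
}
```

As in tnl_neigh_cache_run(), the return value of demo_neigh_run() is what a caller would use to decide whether dependent state has to be revalidated.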

Posted by neilybod on Mon, 09 Sep 2019 21:31:56 -0700