IPv4 route lookup

Keywords: route

The route lookup function fib_table_lookup takes the routing table tb, the flow information flp, and the lookup flags fib_flags, and returns the result in res. The destination address daddr in the flow structure flp is the key used for the lookup. fib_flags supports two lookup flags: FIB_LOOKUP_NOREF and FIB_LOOKUP_IGNORE_LINKSTATE.

First, take the root key vector pn of the trie, then start the traversal at its first child node n (cindex = 0). If that child is NULL, the trie is empty and -EAGAIN is returned.

/* should be called with rcu_read_lock */
int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp, struct fib_result *res, int fib_flags)
{
    struct trie *t = (struct trie *) tb->tb_data;
#ifdef CONFIG_IP_FIB_TRIE_STATS
    struct trie_use_stats __percpu *stats = t->stats;
#endif
    const t_key key = ntohl(flp->daddr);
    struct key_vector *n, *pn;
    struct fib_alias *fa;
    unsigned long index;
    t_key cindex;

    pn = t->kv;
    cindex = 0;

    n = get_child_rcu(pn, cindex);
    if (!n) {
        trace_fib_table_lookup(tb->tb_id, flp, NULL, -EAGAIN);
        return -EAGAIN;
    }
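
The key used throughout the lookup is the destination address converted to host byte order. The conversion can be tried in userspace with the minimal sketch below. key_from_dotted is a hypothetical helper introduced only for illustration (it is not a kernel function), and it relies on the POSIX inet_pton call.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical helper: parse a dotted-quad string and return the
 * host-order key, mirroring key = ntohl(flp->daddr) in the kernel.
 * Returns 0 on parse failure (which is also the key of 0.0.0.0).
 */
uint32_t key_from_dotted(const char *s)
{
    struct in_addr a;

    if (inet_pton(AF_INET, s, &a) != 1)
        return 0;
    return ntohl(a.s_addr);
}
```

With this helper, "1.1.2.3" becomes 0x01010203, so the most significant bit of the key is the first bit of the address.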

Step 1: first, compute the index of the key (the destination address) within node n. While computing the index, get_cindex implicitly compares the prefixes as well. If the index is greater than or equal to 1 << n->bits, the number of child slots of n, the prefixes differ (see the XOR in get_cindex) and the loop is left. Otherwise the key matches the prefix of n and the index is the slot of the child to visit next. If n is a leaf at this point, it is the node we are looking for and the code jumps to found.

    /* Step 1: Travel to the longest prefix match in the trie */
    for (;;) {
        index = get_cindex(key, n);

        /* This bit of code is a bit tricky but it combines multiple
         * checks into a single check.  The prefix consists of the
         * prefix plus zeros for the "bits" in the prefix. The index
         * is the difference between the key and this value.  From
         * this we can actually derive several pieces of data.
         *   if (index >= (1ul << bits))
         *     we have a mismatch in skip bits and failed
         *   else
         *     we know the value is cindex
         *
         * This check is safe even if bits == KEYLENGTH due to the
         * fact that we can only allocate a node with 32 bits if a
         * long is greater than 32 bits.
         */
        if (index >= (1ul << n->bits))
            break;

        /* we have found a leaf. Prefixes have already been compared */
        if (IS_LEAF(n))
            goto found;

get_cindex is defined as the macro below. If the key and the node share the same prefix, the result is the child index; otherwise the result is necessarily greater than or equal to the number of child slots of node n.

#define get_cindex(key, kv) (((key) ^ (kv)->key) >> (kv)->pos)
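
To see how one XOR and one shift perform both the prefix comparison and the index extraction, the macro can be reproduced in userspace. This is only a sketch: struct toy_kv, toy_get_cindex and toy_prefix_ok are hypothetical names, and the struct keeps just the fields the macro reads.

```c
#include <stdint.h>

/* Simplified stand-in for struct key_vector: only the fields that
 * get_cindex reads.
 */
struct toy_kv {
    uint32_t key;
    unsigned char pos;
    unsigned char bits;
};

/* Same expression as the kernel macro, computed in 64 bits so that a
 * shift by pos == 32 stays well defined in userspace.
 */
uint64_t toy_get_cindex(uint32_t key, const struct toy_kv *kv)
{
    return (uint64_t)(key ^ kv->key) >> kv->pos;
}

/* Nonzero when the index is a valid child slot, i.e. the prefix matched. */
int toy_prefix_ok(uint32_t key, const struct toy_kv *kv)
{
    return toy_get_cindex(key, kv) < ((uint64_t)1 << kv->bits);
}
```

For a node with key 1.0.0.0, pos 15 and bits 2, the key 1.1.2.3 gives index 2, a valid slot. The key 9.0.0.0 differs in a bit above the index bits, so the XOR leaves a high bit set and the index becomes 4096, far beyond the four slots.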

When node n is an intermediate node whose suffix length slen is greater than its position pos, the subtree below n contains a route whose prefix does not extend over all of the bits that n uses to index its children, so the longest-prefix search may have to come back to this node later. Only in that case are pn and cindex recorded; otherwise recording them would waste cycles. Since the key matches the prefix of n at this point, the child selected by index is fetched and the loop continues with that child.

If that child is NULL, the descent cannot reach a longer match, so the code jumps to backtrace, which walks back toward the root of the trie starting from the recorded parent pn.

        /* only record pn and cindex if we are going to be chopping
         * bits later.  Otherwise we are just wasting cycles.
         */
        if (n->slen > n->pos) {
            pn = n;
            cindex = index;
        }

        n = get_child_rcu(n, index);
        if (unlikely(!n))
            goto backtrace;
    }

As an example, look up the destination address (key) 1.1.2.3 in the routing table below. The key is first compared with node 1.0.0.0/15. The first 15 bits are identical, so the prefix matches, but this is an intermediate node rather than a leaf, so the search continues. Node 1.0.0.0/15 indexes its children with bit[15,16] (counting from 0 at the most significant bit), and for the key these bits are 10b, so child index 2 is selected, which is node 1.1.0.0/23.

Node 1.1.0.0/23 shares only its first 22 bits with key=1.1.2.3. The 23rd bit differs, and that bit still belongs to the 23-bit prefix of the node. With pos = 32 - 23 - 2 = 7, get_cindex yields (1.1.2.3 ^ 1.1.0.0) >> 7 = 0x203 >> 7 = 4, which is not less than 1 << 2 = 4, so the step 1 loop ends with a break while n still points to node 1.1.0.0/23. For reference, this node indexes its children with bit[23,24]: leaf 1.1.0.0 sits at index 0 and leaf 1.1.1.0 at index 2, while indexes 1 and 3 are empty.

In addition, when node 1.0.0.0/15 was traversed, its suffix length slen (the maximum suffix length of the node and its children, here 24 because of the /8 route) was greater than its position pos (32 - 15 - 2 = 15), so the parent node pn was set to node 1.0.0.0/15 and cindex to 2. Node 1.1.0.0/23 has slen 16 and pos 7, but the loop is left before pn and cindex would be updated for it.

# ip route add 1.1.1.0/24 via 192.168.2.2 table 10 
# ip route add 1.1.0.0/16 via 192.168.2.3 table 10    
# ip route add 1.0.0.0/8 via 192.168.2.4 table 10    
#
# cat /proc/net/fib_trie
Id 10:
  +-- 1.0.0.0/15 2 
     |-- 1.0.0.0
        /8 universe UNICAST      192.168.2.4
     +-- 1.1.0.0/23 2 
        |-- 1.1.0.0
           /16 universe UNICAST  192.168.2.3
        |-- 1.1.1.0
           /24 universe UNICAST  192.168.2.2
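
The position pos of an internal node is not printed directly in this output. Assuming the line format of the kernel's fib_trie seq_file code, where the number after the slash is KEYLENGTH - pos - bits and the next number is bits, it can be recovered with a small sketch like the following. tnode_pos and child_slot are hypothetical helper names.

```c
#include <stdint.h>

#define KEYLENGTH 32

/* /proc/net/fib_trie prints an internal node as "prefix/len bits ...",
 * where len = KEYLENGTH - pos - bits. Invert that to recover pos.
 */
unsigned int tnode_pos(unsigned int printed_len, unsigned int bits)
{
    return KEYLENGTH - printed_len - bits;
}

/* Child slot that a key occupies inside a node with the given pos/bits;
 * the caller must already know that the key matches the node's prefix.
 */
unsigned int child_slot(uint32_t key, unsigned int pos, unsigned int bits)
{
    return (key >> pos) & ((1u << bits) - 1);
}
```

For the table above this gives pos 15 for node 1.0.0.0/15 and pos 7 for node 1.1.0.0/23. Inside the latter, leaf 1.1.0.0 occupies slot 0 and leaf 1.1.1.0 occupies slot 2.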

In step 2, cptr first records where the pointer to the next node to examine is stored, initially the child array of node n. If the key does not match the prefix of n, or if the prefix matches but the suffix length slen of n equals its position pos (meaning that no route at or below n has a prefix short enough to match a key that has left the exact path), the search cannot succeed here. By the longest-match principle, both cases require backtracing toward shorter prefixes.

If node n is a leaf node, jump out of the loop.

    /* Step 2: Sort out leaves and begin backtracing for longest prefix */
    for (;;) {
        /* record the pointer where our next node pointer is stored */
        struct key_vector __rcu **cptr = n->tnode;

        /* This test verifies that none of the bits that differ
         * between the key and the prefix exist in the region of
         * the lsb and higher in the prefix.
         */
        if (unlikely(prefix_mismatch(key, n)) || (n->slen == n->pos))
            goto backtrace;

        /* exit out and process leaf */
        if (unlikely(IS_LEAF(n)))
            break;

When node n is an intermediate node, the child stored at cptr is loaded, again into n; on first entry this is child 0 of n. If that child is not NULL, the for loop above starts over and checks whether the child matches the key. If it is NULL, the backtrace code clears the least significant set bit of cindex. For example, when the node has bits = 3, the child index range is [0 - 7]; if cindex is 6 (110b), clearing the lowest set bit gives 100b, that is, the new cindex is 4. The child of pn at the new cindex is then tried, and if it is also NULL the next set bit of cindex is cleared.

If cindex is zero, there are no more bits to strip at this level, so the search has to move up one level in the trie. If pn is the root of the trie, -EAGAIN is returned. Otherwise the index of pn within its own parent (the grandparent of n) is computed, pn is replaced by that parent, and the search continues at that level.

        /* Don't bother recording parent info.  Since we are in
         * prefix match mode we will have to come back to wherever
         * we started this traversal anyway
         */
        while ((n = rcu_dereference(*cptr)) == NULL) {
backtrace:
            /* If we are at cindex 0 there are no more bits for
             * us to strip at this level so we must ascend back
             * up one level to see if there are any more bits to be stripped there.
             */
            while (!cindex) {
                t_key pkey = pn->key;

                /* If we don't have a parent then there is nothing
                 * for us to do as we do not have any further nodes to parse.
                 */
                if (IS_TRIE(pn)) {
                    trace_fib_table_lookup(tb->tb_id, flp, NULL, -EAGAIN);
                    return -EAGAIN;
                }
                /* Get Child's index */
                pn = node_parent_rcu(pn);
                cindex = get_index(pkey, pn);
            }
            /* strip the least significant bit from the cindex */
            cindex &= cindex - 1;

            cptr = &pn->tnode[cindex]; /* grab pointer for next child node */
        }
    }
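
The bit-stripping expression cindex &= cindex - 1 decides which sibling slots are retried at one level. The sketch below isolates it; strip_lsb and backtrace_order are hypothetical names used only for this illustration.

```c
/* Clear the lowest set bit, as done by cindex &= cindex - 1. */
unsigned long strip_lsb(unsigned long cindex)
{
    return cindex & (cindex - 1);
}

/* Fill out[] with the child indexes that backtracing tries at one level
 * when starting from cindex; returns how many were written.
 */
int backtrace_order(unsigned long cindex, unsigned long *out, int max)
{
    int n = 0;

    while (cindex && n < max) {
        cindex = strip_lsb(cindex);
        out[n++] = cindex;
    }
    return n;
}
```

Starting from cindex 7 (111b), the slots tried are 6 (110b), 4 (100b) and 0. Each step zeroes one more low-order index bit, which corresponds to trying a prefix that is one candidate shorter.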

prefix_mismatch returns zero in two cases. In the first, the key is exactly equal to the key of node n, so their XOR is zero. In the second, the two differ, but only in bits below the least significant set bit of n->key; in other words, the key lies inside the range covered by the prefix of n and is simply longer than that prefix. The mask prefix | -prefix selects the lowest set bit of the prefix and every bit above it.

static inline t_key prefix_mismatch(t_key key, struct key_vector *n)
{
    t_key prefix = n->key;

    return (key ^ prefix) & (prefix | -prefix);
}
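
The mask trick can be checked in userspace with plain 32-bit integers. This is a minimal sketch: toy_prefix_mismatch is a hypothetical name, and it takes the prefix value directly instead of a struct key_vector.

```c
#include <stdint.h>

/* Userspace copy of the prefix_mismatch expression. The mask keeps the
 * lowest set bit of prefix and every bit above it; bits below that are
 * ignored when comparing key and prefix.
 */
uint32_t toy_prefix_mismatch(uint32_t key, uint32_t prefix)
{
    uint32_t mask = prefix | (uint32_t)(0u - prefix);

    return (key ^ prefix) & mask;
}
```

For prefix 1.1.0.0 the mask is 255.255.0.0, so key 1.1.2.3 produces 0 (no mismatch). For prefix 1.1.1.0 the mask is 255.255.255.0, and the same key produces a nonzero value because the third octet differs.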

As shown in the routing table below, for key 1.1.2.3 step 1 ends at node 1.1.0.0/23 without finding a matching leaf. In step 2, the key passes prefix_mismatch for this node (the mask for 1.1.0.0 is 255.255.0.0, and the key differs from 1.1.0.0 only in its low 16 bits), and its slen (16) differs from its pos (32 - 23 - 2 = 7), so no backtrace is needed. cptr points at the child array of the node, whose index 0 holds leaf 1.1.0.0, so the body of the while loop is skipped and the for loop of step 2 runs again with this leaf. Its prefix matches the key and it is a leaf, so the loop breaks and step 3 selects the /16 route, whose gateway is 192.168.2.3.

# cat /proc/net/fib_trie
Id 10:
  +-- 1.0.0.0/15 2 
     |-- 1.0.0.0
        /8 universe UNICAST      192.168.2.4
     +-- 1.1.0.0/23 2 
        |-- 1.1.0.0
           /16 universe UNICAST  192.168.2.3
        |-- 1.1.1.0
           /24 universe UNICAST  192.168.2.2
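
The whole walk can be replayed in userspace on a hand-built copy of this table. The sketch below is a heavily simplified model and not the kernel code: struct node, toy_lookup and build_example are hypothetical, each leaf carries exactly one route, and the pos, bits and slen values are assumptions derived from the output above (pos = 32 - printed length - bits).

```c
#include <stddef.h>
#include <stdint.h>

#define KEYLENGTH 32

/* Toy node: an internal node has child != NULL; a leaf has gw != NULL
 * and holds exactly one route whose suffix length is slen.
 */
struct node {
    uint32_t key;
    unsigned char pos, bits, slen;
    struct node *parent;
    struct node **child;
    const char *gw;
};

static uint64_t cidx(uint32_t key, const struct node *n)
{
    return (uint64_t)(key ^ n->key) >> n->pos;
}

/* Returns the gateway of the longest matching route, or NULL. */
const char *toy_lookup(struct node *root, uint32_t key)
{
    struct node *pn = root, *n = root->child[0];
    struct node **cptr;
    uint64_t cindex = 0, index;
    uint32_t pkey;

    if (!n)
        return NULL;
    for (;;) {                          /* step 1: descend */
        index = cidx(key, n);
        if (index >= ((uint64_t)1 << n->bits))
            break;
        if (!n->child)
            goto found;
        if (n->slen > n->pos) {
            pn = n;
            cindex = index;
        }
        n = n->child[index];
        if (!n)
            goto backtrace;
    }
    for (;;) {                          /* step 2: sort out leaves */
        cptr = n->child;
        if (((key ^ n->key) & (n->key | (0u - n->key))) ||
            n->slen == n->pos)
            goto backtrace;
        if (!n->child)
            break;
        while ((n = *cptr) == NULL) {
backtrace:
            while (!cindex) {
                pkey = pn->key;
                if (pn->pos >= KEYLENGTH)   /* reached the root */
                    return NULL;
                pn = pn->parent;
                cindex = cidx(pkey, pn);
            }
            cindex &= cindex - 1;       /* strip lowest set bit */
            cptr = &pn->child[cindex];
        }
    }
found:                                  /* step 3: check the leaf */
    if (((uint64_t)(key ^ n->key) >> n->slen) == 0)
        return n->gw;
    goto backtrace;
}

/* Builds the example table: 1.0.0.0/8, 1.1.0.0/16, 1.1.1.0/24. */
struct node *build_example(void)
{
    static struct node root, t15, t23, l8, l16, l24;
    static struct node *root_ch[1], *t15_ch[4], *t23_ch[4];

    root_ch[0] = &t15;
    t15_ch[0] = &l8;
    t15_ch[2] = &t23;
    t23_ch[0] = &l16;
    t23_ch[2] = &l24;
    root = (struct node){ 0, 32, 0, 32, NULL, root_ch, NULL };
    t15 = (struct node){ 0x01000000u, 15, 2, 24, &root, t15_ch, NULL };
    t23 = (struct node){ 0x01010000u, 7, 2, 16, &t15, t23_ch, NULL };
    l8 = (struct node){ 0x01000000u, 0, 0, 24, &t15, NULL, "192.168.2.4" };
    l16 = (struct node){ 0x01010000u, 0, 0, 16, &t23, NULL, "192.168.2.3" };
    l24 = (struct node){ 0x01010100u, 0, 0, 8, &t23, NULL, "192.168.2.2" };
    return &root;
}
```

Calling toy_lookup(build_example(), 0x01010203) for key 1.1.2.3 returns "192.168.2.3". A key such as 1.1.0.200 selects the empty slot 1 of node 1.1.0.0/23 and reaches the same leaf through the backtrace path, and a key outside 1.0.0.0/8 returns NULL, the toy equivalent of -EAGAIN.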

In step 3, the matching leaf n has been found and its fib_alias list is traversed. First, the XOR of the lookup key and the key of n (index) must be less than 1 << fa_slen, where fa_slen is the suffix length of the current route alias. For example, with an 8-bit suffix (24-bit prefix) the index must be less than 256 (1 << 8), which means that the key of n and the lookup key agree in at least their first 24 bits.

In addition, if the fib_alias specifies a tos value, it must equal the tos of the flow; the fib_info attached to the fib_alias must be valid (fib_dead not set); and the scope of the fib_info must not be smaller than the scope requested in the flow (scope expresses distance, and a larger value means a closer destination). An alias that fails any of these checks is skipped.

If the type of the fib_alias is RTN_BLACKHOLE, RTN_UNREACHABLE, RTN_PROHIBIT or a similar type, err is less than zero and that error is returned.

found:
    /* this line carries forward the xor from earlier in the function */
    index = key ^ n->key;

    /* Step 3: Process the leaf, if that fails fall back to backtracing */
    hlist_for_each_entry_rcu(fa, &n->leaf, fa_list) {
        struct fib_info *fi = fa->fa_info;
        struct fib_nh_common *nhc;
        int nhsel, err;

        if ((BITS_PER_LONG > KEYLENGTH) || (fa->fa_slen < KEYLENGTH)) {
            if (index >= (1ul << fa->fa_slen))
                continue;
        }
        if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos)
            continue;
        if (fi->fib_dead)
            continue;
        if (fa->fa_info->fib_scope < flp->flowi4_scope)
            continue;
        fib_alias_accessed(fa);
        err = fib_props[fa->fa_type].error;
        if (unlikely(err < 0)) {
out_reject:
#ifdef CONFIG_IP_FIB_TRIE_STATS
            this_cpu_inc(stats->semantic_match_passed);
#endif
            trace_fib_table_lookup(tb->tb_id, flp, NULL, err);
            return err;
        }
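
The four per-alias filters above can be condensed into one predicate. The following is a sketch under simplifying assumptions: struct toy_alias and struct toy_flow are hypothetical, trimmed-down stand-ins for fib_alias/fib_info and flowi4, and the scope numbers follow the RT_SCOPE_* convention (0 for universe, 253 for link).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down views of fib_alias/fib_info and flowi4. */
struct toy_alias {
    uint8_t slen;    /* suffix length, 32 - prefix length */
    uint8_t tos;     /* 0 means "any tos" */
    uint8_t scope;   /* scope of the route */
    bool dead;
};

struct toy_flow {
    uint8_t tos;
    uint8_t scope;
};

/* True when the alias passes the same four filters as the loop above. */
bool toy_alias_matches(const struct toy_alias *fa, const struct toy_flow *fl,
                       uint32_t key, uint32_t leaf_key)
{
    uint64_t index = key ^ leaf_key;

    if (index >= ((uint64_t)1 << fa->slen))
        return false;               /* key outside this prefix */
    if (fa->tos && fa->tos != fl->tos)
        return false;               /* alias demands another tos */
    if (fa->dead)
        return false;
    if (fa->scope < fl->scope)
        return false;               /* route is "farther" than requested */
    return true;
}
```

For key 1.1.2.3 and leaf key 1.1.0.0, an alias with slen 16 passes the first check (0x203 < 65536), whereas an alias with slen 8 fails it (0x203 >= 256).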

Next, the fib_info is checked. If it is flagged as dead (RTNH_F_DEAD), the loop moves on to the next entry in the fib_alias list. If the fib_info references a nexthop object (fi->nh), that object is checked for being a blackhole; if it is, the out_reject code above handles it. Otherwise a suitable next hop is looked up in the nexthop object. If none is found, the code jumps to miss, which runs the backtrace process again.

        if (fi->fib_flags & RTNH_F_DEAD)
            continue;

        if (unlikely(fi->nh)) {
            if (nexthop_is_blackhole(fi->nh)) {
                err = fib_props[RTN_BLACKHOLE].error;
                goto out_reject;
            }

            nhc = nexthop_get_nhc_lookup(fi->nh, fib_flags, flp, &nhsel);
            if (nhc)
                goto set_result;
            goto miss;
        }

Multipath routing arises in two ways: the route itself is configured with several next hops, or the nexthop object referenced by the route is a multipath group. For a route with its own next hops, fib_lookup_good_nhc checks the usability of each next hop in turn. Once a usable next hop is found, the lookup result is written to the res structure and returned to the caller. If no suitable next hop is found, the backtrace process is executed again.

        for (nhsel = 0; nhsel < fib_info_num_path(fi); nhsel++) {
            nhc = fib_info_nhc(fi, nhsel);

            if (!fib_lookup_good_nhc(nhc, fib_flags, flp))
                continue;
set_result:
            if (!(fib_flags & FIB_LOOKUP_NOREF))
                refcount_inc(&fi->fib_clntref);

            res->prefix = htonl(n->key);
            res->prefixlen = KEYLENGTH - fa->fa_slen;
            res->nh_sel = nhsel;
            res->nhc = nhc;
            res->type = fa->fa_type;
            res->scope = fi->fib_scope;
            res->fi = fi;
            res->table = tb;
            res->fa_head = &n->leaf;
#ifdef CONFIG_IP_FIB_TRIE_STATS
            this_cpu_inc(stats->semantic_match_passed);
#endif
            trace_fib_table_lookup(tb->tb_id, flp, nhc, err);

            return err;
        }
    }
miss:
#ifdef CONFIG_IP_FIB_TRIE_STATS
    this_cpu_inc(stats->semantic_match_miss);
#endif
    goto backtrace;
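
The next-hop loop can be pictured with the simplified sketch below. It is only loosely modelled on the checks described above: the flag values, struct toy_nh, toy_good_nh and toy_select_nh are all hypothetical, and the real fib_lookup_good_nhc consults more state than this.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this sketch only; not the kernel's values. */
enum { TOY_NH_DEAD = 1 << 0, TOY_NH_LINKDOWN = 1 << 1 };

struct toy_nh {
    unsigned int flags;
    int oif;            /* output interface index */
};

/* A next hop is usable if it is not dead, its link is up (unless the
 * caller asked to ignore link state), and it matches the requested
 * output interface when one is given (want_oif != 0).
 */
static bool toy_good_nh(const struct toy_nh *nh, int want_oif,
                        bool ignore_linkstate)
{
    if (nh->flags & TOY_NH_DEAD)
        return false;
    if ((nh->flags & TOY_NH_LINKDOWN) && !ignore_linkstate)
        return false;
    if (want_oif && want_oif != nh->oif)
        return false;
    return true;
}

/* Returns the index of the first usable next hop, or -1 (a "miss"). */
int toy_select_nh(const struct toy_nh *nh, int count, int want_oif,
                  bool ignore_linkstate)
{
    int nhsel;

    for (nhsel = 0; nhsel < count; nhsel++)
        if (toy_good_nh(&nh[nhsel], want_oif, ignore_linkstate))
            return nhsel;
    return -1;
}
```

The ignore_linkstate parameter plays the role that FIB_LOOKUP_IGNORE_LINKSTATE plays in fib_flags: with it set, a next hop whose link is down is still accepted. A return value of -1 corresponds to falling out of the loop and reaching miss.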

Kernel version 5.10

Posted by PHP Man on Sun, 05 Dec 2021 19:14:06 -0800