The evolution of reuseport in the Linux kernel

Keywords: Linux socket

The SO_REUSEPORT option was introduced in Linux 3.9. Before that there was a similar option, SO_REUSEADDR. If you are not sure about the differences and connections between the two, it is recommended that you read "How do SO_REUSEADDR and SO_REUSEPORT differ?".
If you would rather not read it, the next section gives a short summary.

What are SO_REUSEADDR and SO_REUSEPORT?

TCP/UDP uniquely identifies a connection by a five-tuple. At any time, no two connections may have identical five-tuples; otherwise, when a packet is received, the protocol stack cannot determine which connection it belongs to.

Five-tuple:
{<protocol>, <src addr>, <src port>, <dest addr>, <dest port>}

In the five-tuple, <protocol> is determined when the socket is created, <src addr> and <src port> at bind(), and <dest addr> and <dest port> at connect(). Of course, bind() and connect() do not always have to be called explicitly, but that is not discussed in this article.
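As a rough illustration of when each part of the five-tuple gets fixed, here is a minimal sketch using a UDP socket on the loopback interface. The struct five_tuple and the function make_udp_five_tuple() are names made up for this example, and the peer port is arbitrary (a UDP connect() sends nothing, so no peer has to exist).

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper type for this sketch: what each call fixed. */
struct five_tuple {
    int protocol;            /* fixed by socket()  */
    struct sockaddr_in src;  /* fixed by bind()    */
    struct sockaddr_in dst;  /* fixed by connect() */
};

/* Builds a UDP socket step by step and records the resulting tuple.
 * peer_port is in host byte order. Returns the fd, or -1 on error. */
int make_udp_five_tuple(unsigned short peer_port, struct five_tuple *out)
{
    struct sockaddr_in addr;
    socklen_t len;

    /* socket(): the protocol is chosen here. */
    int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (fd < 0)
        return -1;
    out->protocol = IPPROTO_UDP;

    /* bind(): <src addr> and <src port>; port 0 lets the kernel pick. */
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto fail;

    /* connect(): <dest addr> and <dest port>. */
    addr.sin_port = htons(peer_port);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto fail;

    /* Read back what the kernel now associates with the socket. */
    len = sizeof(out->src);
    if (getsockname(fd, (struct sockaddr *)&out->src, &len) < 0)
        goto fail;
    len = sizeof(out->dst);
    if (getpeername(fd, (struct sockaddr *)&out->dst, &len) < 0)
        goto fail;
    return fd;

fail:
    close(fd);
    return -1;
}
```

Calling make_udp_five_tuple(9, &t) from a small main() and printing t.src and t.dst shows the kernel-chosen source port next to the destination passed to connect().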

So, if the SO_REUSEADDR and SO_REUSEPORT options are set on a socket, when do they take effect? The answer is at bind(), which is when <src addr> and <src port> are determined.

The behavior of SO_REUSEADDR and SO_REUSEPORT differs slightly between operating system kernels, but they all originate from BSD. Therefore, the explanation below takes the BSD implementation as the reference.

SO_REUSEADDR

Suppose I use bind() to bind socketA to A:X and socketB to B:Y (ignoring the cases X=0 or Y=0, since port 0 means the kernel allocates a port automatically, so there can be no conflict).

If X!=Y, both bind() calls succeed regardless of the relationship between A and B. But if X==Y, the results are as follows:

SO_REUSEADDR       socketA        socketB       Result
---------------------------------------------------------------------
  ON/OFF       192.168.0.1:21   192.168.0.1:21    Error (EADDRINUSE)
  ON/OFF       192.168.0.1:21      10.0.0.1:21    OK
  ON/OFF          10.0.0.1:21   192.168.0.1:21    OK
   OFF             0.0.0.0:21   192.168.1.0:21    Error (EADDRINUSE)
   OFF         192.168.1.0:21       0.0.0.0:21    Error (EADDRINUSE)
   ON              0.0.0.0:21   192.168.1.0:21    OK
   ON          192.168.1.0:21       0.0.0.0:21    OK
  ON/OFF           0.0.0.0:21       0.0.0.0:21    Error (EADDRINUSE)

The first column indicates whether the SO_REUSEADDR option is set, and the last column indicates whether the bind() of the socket bound second (socketB) succeeds.

Note: the setting in the first column refers to the socket bound second; whether the first socket has the option set does not matter.

It can be seen that SO_REUSEADDR in the BSD implementation allows one socket to bind a wildcard address (0.0.0.0) while another is bound to a specific address (192.168.1.0) on the same port.
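The wildcard-versus-specific rows of the table can be tried out with a minimal sketch like the one below. It is an illustration rather than a reference test: the helper names are made up, it uses 127.0.0.1 as the specific address and a kernel-chosen port instead of 21, and it sets the option on both sockets or on neither, because Linux (unlike the BSD behavior described above) expects both sockets to have SO_REUSEADDR set.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: creates a TCP socket, sets SO_REUSEADDR when
 * reuseaddr != 0, and binds it to ip:port (port in host byte order,
 * 0 = let the kernel choose). Returns the fd, or -errno on failure. */
int bind_tcp(const char *ip, unsigned short port, int reuseaddr)
{
    struct sockaddr_in addr;
    int err;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -errno;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        errno = EINVAL;
        goto fail;
    }
    /* The option only matters if it is set before bind(). */
    if (reuseaddr && setsockopt(fd, SOL_SOCKET, SO_REUSEADDR,
                                &reuseaddr, sizeof(reuseaddr)) < 0)
        goto fail;
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto fail;
    return fd;

fail:
    err = errno;
    close(fd);
    return -err;
}

/* Returns the local port of a bound socket (host byte order), 0 on error. */
unsigned short local_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        return 0;
    return ntohs(addr.sin_port);
}

/* socketA binds 0.0.0.0:<kernel-chosen port>, then socketB binds
 * 127.0.0.1:<same port>. Returns 0 if socketB's bind() succeeded,
 * otherwise the errno it failed with. */
int wildcard_then_specific(int reuseaddr)
{
    int a, b, result;

    a = bind_tcp("0.0.0.0", 0, reuseaddr);
    if (a < 0)
        return -a;
    b = bind_tcp("127.0.0.1", local_port(a), reuseaddr);
    result = (b < 0) ? -b : 0;
    if (b >= 0)
        close(b);
    close(a);
    return result;
}
```

With the option on, wildcard_then_specific(1) should return 0; with it off, wildcard_then_specific(0) should return EADDRINUSE, matching the OFF and ON rows of the table.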

SO_REUSEADDR has another use case. TCP has a TIME_WAIT state, which is the final state on the side that closes the connection first (the active closer). Suppose socketA is bound to A:X and, after finishing its TCP communication, actively calls close() and enters TIME_WAIT. If socketB then tries to bind A:X, it also gets an EADDRINUSE error; but if socketB has SO_REUSEADDR set, the bind succeeds.

SO_REUSEPORT

Once you understand SO_REUSEADDR, SO_REUSEPORT is easy to understand: it allows two sockets to bind exactly the same <IP:Port>, provided both sockets set the option.

SO_REUSEPORT       socketA        socketB       Result
---------------------------------------------------------------------
    ON         192.168.0.1:21   192.168.0.1:21    OK

Note that the tables above all describe BSD. The Linux kernel differs in a few ways:

  • SO_REUSEPORT is supported from version 3.9. Once a TCP server socket bound to a port has started to LISTEN, SO_REUSEADDR no longer takes effect for that port, even if it was set beforehand. Here Linux is stricter than BSD.
SO_REUSEADDR       socketA        socketB       Result
---------------------------------------------------------------------
    ON/OFF      192.168.0.1:21   0.0.0.0:21    Error (EADDRINUSE)
  • Prior to version 3.9, SO_REUSEADDR on a client socket had the effect that SO_REUSEPORT has in BSD. Here Linux is more relaxed than BSD.
SO_REUSEADDR      socketA            socketB           Result
---------------------------------------------------------------------
    ON        192.168.0.2:55555   192.168.0.2:55555      OK

The evolution of reuseport on Linux

Linux < 3.9

Here is how it was implemented:

Kernel sockets use the skc_reuse field to indicate whether SO_REUSEADDR is set:

struct sock_common {
    /* omitted */
    unsigned char        skc_reuse;
    /* omitted */
};

int sock_setsockopt(struct socket *sock, int level, int optname,...
{
    ......
    case SO_REUSEADDR:
        /* sk_reuse is a macro alias for __sk_common.skc_reuse */
        sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
        break;
}

inet_bind_bucket represents a bound port.

struct inet_bind_bucket {
    /* omitted */
    unsigned short        port;
    signed short        fastreuse;
    int            num_owners;
    struct hlist_node    node;
    struct hlist_head    owners;
};

The fastreuse field in the structure above indicates whether the port supports sharing, and all sockets that share the port are linked onto the owners list. When a user calls bind(), the kernel binds the port with inet_csk_get_port() for TCP and udp_v4_get_port() for UDP.

/* inet_connection_sock.c: inet_csk_get_port() */
tb_found:
    if (!hlist_empty(&tb->owners)) {
        ......
        if (tb->fastreuse > 0 &&
            sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
            smallest_size == -1) {
            goto success;

Therefore, this bind() can succeed when the port supports sharing and the socket also has SO_REUSEADDR set and is not in LISTEN state.

3.9 <= Linux < 4.5

The 3.9 kernel adds support for SO_REUSEPORT, so multiple listeners can be bound to the same <IP:Port>. When the server receives a SYN sent by a client, the kernel selects one of these sockets to respond.

[Fig]

Specifically, version 3.9 extends sock_common, splitting the original skc_reuse field into skc_reuse and skc_reuseport.

struct sock_common {
     unsigned short        skc_family;
     volatile unsigned char    skc_state;
-    unsigned char        skc_reuse;
+    unsigned char        skc_reuse:4;
+    unsigned char        skc_reuseport:4;


@@ int sock_setsockopt(struct socket *sock, int level, int optname,
     case SO_REUSEADDR:
         sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
         break;
+    case SO_REUSEPORT:
+        sk->sk_reuseport = valbool;
+        break;

The inet_bind_bucket is then expanded accordingly

struct inet_bind_bucket {
     /* omitted */
     unsigned short        port;
-    signed short        fastreuse;
+    signed char        fastreuse;
+    signed char        fastreuseport;
+    kuid_t            fastuid;

When binding a port, a reuseport check is added to the pass condition:

/* inet_connection_sock.c: inet_csk_get_port() */
tb_found:
         if (sk->sk_reuse == SK_FORCE_REUSE)
             goto success;
-        if (tb->fastreuse > 0 &&
-            sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+        if (((tb->fastreuse > 0 &&
+              sk->sk_reuse && sk->sk_state != TCP_LISTEN) ||
+             (tb->fastreuseport > 0 &&
+              sk->sk_reuseport && uid_eq(tb->fastuid, uid))) 
             && smallest_size == -1) {
               goto success;
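To make the combined condition easier to read, here is a simplified user-space model of it. This is only a sketch: the struct and function names are made up, the uid is reduced to a plain integer, and the SK_FORCE_REUSE shortcut and the smallest_size check from the excerpt are left out.

```c
#include <stdbool.h>

/* Simplified stand-in for inet_bind_bucket: only the fields the
 * condition reads. Names are local to this sketch. */
struct bucket_model {
    int fastreuse;         /* > 0: port may be shared via SO_REUSEADDR */
    int fastreuseport;     /* > 0: port may be shared via SO_REUSEPORT */
    unsigned int fastuid;  /* uid recorded for the reuseport owners */
};

/* Simplified stand-in for the socket that is calling bind(). */
struct sock_model {
    bool reuse;            /* SO_REUSEADDR set */
    bool reuseport;        /* SO_REUSEPORT set */
    bool listening;        /* sk_state == TCP_LISTEN */
    unsigned int uid;      /* uid of the socket's owner */
};

/* Mirrors the two alternatives in the 3.9 condition: either the old
 * SO_REUSEADDR path, or the new SO_REUSEPORT path with a matching uid. */
bool fast_path_ok(const struct bucket_model *tb, const struct sock_model *sk)
{
    bool by_reuseaddr = tb->fastreuse > 0 && sk->reuse && !sk->listening;
    bool by_reuseport = tb->fastreuseport > 0 && sk->reuseport &&
                        sk->uid == tb->fastuid;

    return by_reuseaddr || by_reuseport;
}
```

The model makes one point of the diff explicit: a listening socket can no longer pass through the SO_REUSEADDR branch, but it can pass through the SO_REUSEPORT branch as long as the uid matches.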

When a client's SYN arrives, the server first locates a hash conflict chain based on the local port (the SYN's <dport>), then traverses all sockets on that chain, scoring each by how well it matches the four-tuple. If reuseport is enabled, several sockets may share the highest score, and the kernel then picks one of them pseudo-randomly for subsequent processing.

/* inet_hashtables.c  */
struct sock *__inet_lookup_listener(struct......)
{
    struct sock *sk, *result;
    unsigned int hash = inet_lhashfn(net, hnum);
    struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash]; // Find hash conflict chain based on local port
    /* code omitted */
    result = NULL;
    hiscore = 0;
    sk_nulls_for_each_rcu(sk, node, &ilb->head) {
        score = compute_score(sk, net, hnum, daddr, dif); // Scoring based on match
        if (score > hiscore) {
            result = sk;
            hiscore = score;
            reuseport = sk->sk_reuseport;
            if (reuseport) {
                phash = inet_ehashfn(net, daddr, hnum,
                             saddr, sport);
                matches = 1;                             // Number of equal-score reuseport sockets seen so far
            }
        } else if (score == hiscore && reuseport) {
            matches++;
            if (reciprocal_scale(phash, matches) == 0)
                result = sk;
            phash = next_pseudo_random32(phash);
        }
    }
    /*
     * if the nulls value we got at the end of this lookup is
     * not the expected one, we must restart lookup.
     * We probably met an item that was moved to another chain.
     */
    return result;
}
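The selection loop can be modeled in user space to see how the tie-breaking works. The sketch below is a simplification under stated assumptions: pick_listener() and its array arguments are made up, the scores are passed in rather than computed, and phash0 stands in for the value inet_ehashfn() would return. The two small helpers use the same arithmetic as the kernel helpers of the same names.

```c
#include <stdint.h>

/* Maps val from [0, 2^32) into [0, ep_ro) without a division. */
static uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
    return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* One step of a linear congruential generator. */
static uint32_t next_pseudo_random32(uint32_t seed)
{
    return seed * 1664525u + 1013904223u;
}

/* Walks scores[0..n-1] in order, as the lookup walks the hash chain.
 * reuseport[i] says whether socket i has SO_REUSEPORT set.
 * Returns the index of the chosen socket, or -1 if nothing scored
 * above 0 (the excerpt starts with hiscore = 0). */
int pick_listener(const int *scores, const int *reuseport, int n,
                  uint32_t phash0)
{
    int result = -1, hiscore = 0, matches = 0, rp = 0;
    uint32_t phash = 0;

    for (int i = 0; i < n; i++) {
        if (scores[i] > hiscore) {
            /* New best match: it becomes the only candidate. */
            result = i;
            hiscore = scores[i];
            rp = reuseport[i];
            if (rp) {
                phash = phash0;
                matches = 1;
            }
        } else if (scores[i] == hiscore && rp) {
            /* Tie: the k-th tied socket replaces the current choice
             * with probability about 1/k, so every tied socket ends
             * up roughly equally likely after the full walk. */
            matches++;
            if (reciprocal_scale(phash, matches) == 0)
                result = i;
            phash = next_pseudo_random32(phash);
        }
    }
    return result;
}
```

Since phash0 is derived from the connection's addresses and ports, the same connection always maps to the same listener for a given chain, while different connections are spread over the tied sockets. Note that the whole chain is still walked for every lookup.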

For example, assume the kernel has four hash conflict chains for listening sockets, and the user starts four servers: A, B, C and D. The addresses and ports they listen on are shown below. A and B enable SO_REUSEPORT. Conflict chains are keyed by port, so A, B and D hang on the same conflict chain. If a SYN for <192.168.10.1, 21> is received at this point, the kernel traverses listening_hash[0] and scores the sockets on it. Since B listens on an exact address, B scores higher than A, and the kernel eventually selects socketB for subsequent processing.

4.5 <= Linux

From the example above, it can be seen that when a SYN is received, the kernel must traverse a complete hash conflict chain and score every socket on it, which is somewhat wasteful. Therefore, in version 4.5, the kernel introduced reuseport groups: sockets that are bound to the same IP and port and have the SO_REUSEPORT option set are organized into one group.

--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -318,6 +318,7 @@ struct cg_proto;
   *    @sk_error_report: callback to indicate errors (e.g. %MSG_ERRQUEUE)
   *    @sk_backlog_rcv: callback to process the backlog
   *    @sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
+  *    @sk_reuseport_cb: reuseport group container
  */
 struct sock {
     /*
@@ -453,6 +454,7 @@ struct sock {
     int            (*sk_backlog_rcv)(struct sock *sk,
                           struct sk_buff *skb);
     void                    (*sk_destruct)(struct sock *sk);
+    struct sock_reuseport __rcu    *sk_reuseport_cb;
 };

This feature supports only UDP in version 4.5; TCP support followed in version 4.6 (patch). When looking up a listening socket, the kernel no longer has to traverse the entire conflict chain: once it finds a matching socket that has SO_REUSEPORT set, it goes straight to the reuseport group that socket belongs to and chooses one member for subsequent processing.

@@ -215,6 +217,7 @@ struct sock *__inet_lookup_listener(struct net *net,
     unsigned int hash = inet_lhashfn(net, hnum);
     struct inet_listen_hashbucket *ilb = &hashinfo->listening_hash[hash];
     int score, hiscore, matches = 0, reuseport = 0;
+    bool select_ok = true;
     u32 phash = 0;
 
     rcu_read_lock();
@@ -230,6 +233,15 @@ begin:
             if (reuseport) {
                 phash = inet_ehashfn(net, daddr, hnum,
                              saddr, sport);
+                if (select_ok) {
+                    struct sock *sk2;
+                    sk2 = reuseport_select_sock(sk, phash,
+                                    skb, doff);
+                    if (sk2) {
+                        result = sk2;
+                        goto found;
+                    }
+                }
                 matches = 1;
             }
         }
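The idea behind the group lookup can be sketched in user space as follows. This is a simplified model, not the kernel's code: struct reuseport_group, group_add() and group_select() are made-up names, sockets are represented by integer ids, and the group has an arbitrary fixed capacity. It only shows the core step of indexing an array with the scaled hash.

```c
#include <stdint.h>

#define GROUP_MAX 8  /* arbitrary capacity for this sketch */

/* Simplified stand-in for a reuseport group: all sockets bound to the
 * same <IP:Port> with SO_REUSEPORT, kept in one array. */
struct reuseport_group {
    unsigned int num_socks;
    int socks[GROUP_MAX];  /* socket ids; the kernel stores sock pointers */
};

/* Maps val from [0, 2^32) into [0, ep_ro) without a division. */
static uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
{
    return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* Adds a socket to the group. Returns 0, or -1 if the group is full. */
int group_add(struct reuseport_group *g, int sock_id)
{
    if (g->num_socks >= GROUP_MAX)
        return -1;
    g->socks[g->num_socks++] = sock_id;
    return 0;
}

/* Picks a member in constant time: the hash of the connection is
 * scaled to an array index. Returns -1 for an empty group. */
int group_select(const struct reuseport_group *g, uint32_t hash)
{
    if (g->num_socks == 0)
        return -1;
    return g->socks[reciprocal_scale(hash, g->num_socks)];
}
```

Compared with walking and scoring the whole chain, the choice among the group's members is a single array access, and a given hash still always maps to the same socket while the group membership is unchanged.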

Posted by Jabop on Sat, 28 Sep 2019 02:08:36 -0700