Analysis of IO Multiplexing and the Event Mechanism in Redis

Keywords: C, socket, Redis, network programming

baiyan

Introduction

Before reading this article, please first read the previous one, Analysis of the Way to Improve the Performance of Server Concurrent IO from Network Programming Foundation to epoll; it will help you better understand the content here.
We know that when we use Redis, the client can send a GET command and receive the data the Redis server returns. Redis is built on the traditional C/S architecture: it listens on a TCP port (6379 by default), accepts connections from clients, executes the commands they send, and returns the execution results to them.

Redis is a qualified server program

Let's start with a question: as a qualified server program, how does the Redis server process a GET command typed on the command line and return the result to the client?
To answer this, let's first review what we covered in the previous article. The client and the server each create a socket that identifies their network address and port, and then communicate over those sockets using the TCP protocol. The socket communication flow of a typical server program looks like this:

int main(int argc, char *argv[]) {
    listenSocket = socket(); //Call the socket() system call to create a listening socket descriptor
    bind(listenSocket);      //Bind the address and port
    listen(listenSocket);    //Convert the default active socket into a passive socket suitable for a server
    while (1) {              //Loop forever, waiting for client connection events
        connSocket = accept(listenSocket); //Accept a client connection
        read(connSocket);    //Read data from the client; only one client can be served at a time
        write(connSocket);   //Write data back to the client; only one client can be served at a time
    }
    return 0;
}
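To make this flow concrete, here is a minimal runnable version: an iterative echo server that handles one client at a time. The port 6379 is chosen only to mirror Redis, and error handling is omitted; this is a sketch of the pattern above, not how Redis itself is written:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    int listenSocket = socket(AF_INET, SOCK_STREAM, 0); //Create the listening socket
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(6379);
    bind(listenSocket, (struct sockaddr *)&addr, sizeof(addr)); //Bind address and port
    listen(listenSocket, 16); //Become a passive server socket
    while (1) {
        int connSocket = accept(listenSocket, NULL, NULL); //Blocks until a client connects
        char buf[1024];
        ssize_t n;
        while ((n = read(connSocket, buf, sizeof(buf))) > 0)
            write(connSocket, buf, n); //Echo the data back
        close(connSocket); //Only now can the next client be served
    }
}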

Redis goes through the same steps: after establishing a connection with a client, it reads the command the client sent, executes it, and finally writes the execution result back to the client via the write system call (in the Redis source these steps correspond roughly to readQueryFromClient(), processCommand() and addReply()).
But this flow can only handle the connection and read-write events of one client at a time. To let a single-process server handle events from multiple clients simultaneously, we adopt an IO multiplexing mechanism; on Linux, the best-performing one is epoll. Recall the server we ended up with using epoll in the last article:

int main(int argc, char *argv[]) {

    listenSocket = socket(AF_INET, SOCK_STREAM, 0); //As above, create a listening socket descriptor
    
    bind(listenSocket);   //As above, bind the address and port
    
    listen(listenSocket); //As above, convert the default active socket into a passive server socket
    
    epfd = epoll_create(EPOLL_SIZE); //Create an epoll instance
    
    ep_events = (epoll_event*)malloc(sizeof(epoll_event) * EPOLL_SIZE); //Allocate an epoll_event array to receive the ready socket set
    event.events = EPOLLIN;
    event.data.fd = listenSocket;
    
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenSocket, &event); //Add the listening socket to the watch list
    
    while (1) {
    
        event_cnt = epoll_wait(epfd, ep_events, EPOLL_SIZE, -1); //Block until some socket descriptors become ready, then return them
        
        for (int i = 0; i < event_cnt; ++i) { //Traverse all ready socket descriptors
            if (ep_events[i].data.fd == listenSocket) { //The listening socket is ready: a new client is connecting
            
                connSocket = accept(listenSocket); //Call accept() to establish the connection
                
                event.events = EPOLLIN;
                event.data.fd = connSocket;
                
                epoll_ctl(epfd, EPOLL_CTL_ADD, connSocket, &event); //Watch the new connection descriptor for subsequent read and write events
                
            } else { //A connection socket is ready: we can read from or write to it
            
                str_len = read(ep_events[i].data.fd, buf, BUF_SIZE); //The data is guaranteed to be readable now, so read() will not block
                if (str_len == 0) { //Nothing more can be read from this connection, so stop watching it
                
                    epoll_ctl(epfd, EPOLL_CTL_DEL, ep_events[i].data.fd, NULL); //Remove this descriptor from the watch list
                    
                    close(ep_events[i].data.fd);
                } else {
                    write(ep_events[i].data.fd, buf, str_len); //Write the data back to the client
                }
            }
        }
    }
    close(listenSocket);
    close(epfd);
    return 0;
}

On top of the underlying select, poll, epoll, kqueue and evport mechanisms, Redis encapsulates its own event-handling library tailored to its business needs, called ae (a simple event-driven programming library). At compile time Redis checks which mechanisms the system supports (epoll on Linux, kqueue on macOS and BSD, evport on Solaris) and picks the best-performing one available:

/* Include the best multiplexing layer supported by this system.
 * The following should be ordered by performances, descending. */
#ifdef HAVE_EVPORT
#include "ae_evport.c"
#else
    #ifdef HAVE_EPOLL
    #include "ae_epoll.c"
    #else
        #ifdef HAVE_KQUEUE
        #include "ae_kqueue.c"
        #else
        #include "ae_select.c"
        #endif
    #endif
#endif
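Whichever backend file gets included, it implements the same small internal API, so the rest of ae.c never touches epoll or kqueue directly. These are the signatures of the three backend functions this article walks through (the backends implement a few more, such as aeApiDelEvent()):

static int aeApiCreate(aeEventLoop *eventLoop);                     //e.g. wraps epoll_create()
static int aeApiAddEvent(aeEventLoop *eventLoop, int fd, int mask); //e.g. wraps epoll_ctl()
static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp);  //e.g. wraps epoll_wait()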

Because select is a system call in the POSIX standard and is available on virtually every operating system, it serves as the fallback when nothing better is supported. For convenience, the rest of this article uses the epoll backend in its explanations.

IO Multiplexing in Redis

When we start a redis-server from the command line, Redis does something very similar to the epoll server we wrote above. There are three key function calls:

int main(int argc, char **argv) {
    ...
    initServerConfig(); //Initialize the structure that stores server-side information
    ...
    initServer(); //Initialize the Redis event loop, calling epoll_create and epoll_ctl; socket, bind and listen are called inside this function, and the resulting listening descriptor is registered with epoll
    ...
    aeMain(); //Run the while(1) event loop, calling epoll_wait to fetch ready descriptors and invoking the corresponding handlers
    ...
}
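For reference, the tail of main() in server.c (abridged) shows how the loop is wired up and torn down; aeSetBeforeSleepProc() installs the beforesleep callback we will meet again in the aeEventLoop structure:

int main(int argc, char **argv) {
    ...
    aeSetBeforeSleepProc(server.el, beforeSleep); //Run on every loop iteration, just before blocking in epoll_wait
    aeMain(server.el);                            //Blocks here until the server shuts down
    aeDeleteEventLoop(server.el);                 //Tear down the event loop on exit
    return 0;
}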

Next, let's look at each of these three calls in turn:

initServerConfig()

All server-side information in Redis is stored in one redisServer structure. It has a great many fields: socket information such as the server's address and port, plus configuration supporting other Redis features such as clustering and persistence. This function initializes every field of the redisServer structure and assigns it an initial value. Since our topic is the event and IO multiplexing mechanism in Redis, we only need to focus on a few fields.
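For orientation, here is an abridged sketch of struct redisServer from server.h, keeping only the fields this article touches (the real structure has many more):

struct redisServer {
    int port;                            //TCP listening port, 6379 by default
    char *bindaddr[CONFIG_BINDADDR_MAX]; //Addresses to bind to
    int bindaddr_count;                  //Number of addresses in bindaddr[]
    int ipfd[CONFIG_BINDADDR_MAX];       //TCP listening socket descriptors
    int ipfd_count;                      //Number of used slots in ipfd[]
    int tcp_backlog;                     //Backlog passed to listen()
    char neterr[ANET_ERR_LEN];           //Buffer for network error messages
    unsigned int maxclients;             //Maximum number of simultaneous clients
    aeEventLoop *el;                     //The event loop created by aeCreateEventLoop()
    ...
};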

initServer()

This function call is our top priority. After the server's information has been initialized, the server must create, bind and listen on its sockets and be ready to establish connections with clients. socket, bind, listen, epoll_create and epoll_ctl are all called within this function (accept is only called later, when a connection event actually fires). We can map it step by step onto the epoll server shown above. The main calls in initServer() are as follows:

void initServer(void) {
    ...
    server.el = aeCreateEventLoop(server.maxclients+CONFIG_FDSET_INCR); 
    ...

    if (server.port != 0 && listenToPort(server.port,server.ipfd,&server.ipfd_count) == C_ERR)
        exit(1);
    ...

    for (j = 0; j < server.ipfd_count; j++) {
        if (aeCreateFileEvent(server.el, server.ipfd[j], AE_READABLE, acceptTcpHandler,NULL) == AE_ERR){
                serverPanic("Unrecoverable error creating server.ipfd file event.");
       }
    }
    ...
}

Let's interpret these key lines from top to bottom:

aeCreateEventLoop()

Redis has an aeEventLoop structure that manages all event-related state: it stores the registered events as well as the ones that have become ready:

typedef struct aeEventLoop {
    int maxfd;   //Highest file descriptor currently registered
    int setsize; //Max number of file descriptors tracked
    long long timeEventNextId; //Id for the next time event
    time_t lastTime;           //Used to detect system clock skew
    int stop;    //Flags whether the event loop (the while(1)) should stop
    
    aeFileEvent *events;  //Registered file events (client connection and read-write events)
    aeFiredEvent *fired;  //File events that are ready to be processed
    aeTimeEvent *timeEventHead; //Time events (not covered in this article)
    
    void *apidata; /* Epoll-specific state: the aeApiState holding epfd and the epoll_event array */
    
    aeBeforeSleepProc *beforesleep; //Function called before blocking to wait for events
    aeBeforeSleepProc *aftersleep;  //Function called after waking up with ready events
} aeEventLoop;
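A fired (ready) event, as stored in the fired array, is simply the descriptor plus a mask saying what it is ready for (from ae.h):

typedef struct aeFiredEvent {
    int fd;   //The ready file descriptor
    int mask; //AE_READABLE and/or AE_WRITABLE
} aeFiredEvent;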

Redis stores all ready descriptors returned by epoll_wait() in the fired array, then traverses the array and calls the corresponding event handler for each entry, processing all events in one pass. The aeCreateEventLoop() function initializes the structure that manages all this event information, including calling epoll_create() to initialize epoll's epfd:

aeEventLoop *aeCreateEventLoop(int setsize) {
    aeEventLoop *eventLoop;
    int i;

    if ((eventLoop = zmalloc(sizeof(*eventLoop))) == NULL) goto err;
    eventLoop->events = zmalloc(sizeof(aeFileEvent)*setsize);
    eventLoop->fired = zmalloc(sizeof(aeFiredEvent)*setsize);
    if (eventLoop->events == NULL || eventLoop->fired == NULL) goto err;
    eventLoop->setsize = setsize;
    eventLoop->lastTime = time(NULL);
    eventLoop->timeEventHead = NULL;
    eventLoop->timeEventNextId = 0;
    eventLoop->stop = 0;
    eventLoop->maxfd = -1;
    eventLoop->beforesleep = NULL;
    eventLoop->aftersleep = NULL;
    if (aeApiCreate(eventLoop) == -1) goto err; //aeApiCreate() internally calls epoll_create()
    for (i = 0; i < setsize; i++)
        eventLoop->events[i].mask = AE_NONE;
    return eventLoop;

err: //The target of the goto err lines above: free whatever was allocated
    if (eventLoop) {
        zfree(eventLoop->events);
        zfree(eventLoop->fired);
        zfree(eventLoop);
    }
    return NULL;
}

In the aeApiCreate() function, epoll_create() is called and the created epfd is stored, inside an aeApiState, in the apidata field of the eventLoop structure:

typedef struct aeApiState {
    int epfd;
    struct epoll_event *events;
} aeApiState;

static int aeApiCreate(aeEventLoop *eventLoop) {
    aeApiState *state = zmalloc(sizeof(aeApiState));

    if (!state) return -1;
    state->events = zmalloc(sizeof(struct epoll_event)*eventLoop->setsize);
    if (!state->events) {
        zfree(state);
        return -1;
    }
    state->epfd = epoll_create(1024); /* Create the epoll instance; since Linux 2.6.8 the size hint is ignored (it just has to be > 0) */
    if (state->epfd == -1) {
        zfree(state->events);
        zfree(state);
        return -1;
    }
    eventLoop->apidata = state; //Keep the created epfd in the apidata field of the eventLoop structure
    return 0;
}

listenToPort()

With epfd created, it is time for socket creation, binding and listening. These steps are carried out in the listenToPort() function:

int listenToPort(int port, int *fds, int *count) {
    if (server.bindaddr_count == 0) server.bindaddr[0] = NULL;
    for (j = 0; j < server.bindaddr_count || j == 0; j++) { //Traverse all configured bind addresses (at least one pass even if none is configured)
        if (server.bindaddr[j] == NULL) { //No bind address configured: bind the wildcard address
           ...
        } else if (strchr(server.bindaddr[j],':')) { //The address contains ':', so bind an IPv6 address
            ...
        } else { //Bind an IPv4 address; this is the branch usually taken
            fds[*count] = anetTcpServer(server.neterr,port,server.bindaddr[j], server.tcp_backlog);  //The actual create/bind/listen logic
        }
        ...
    }
    return C_OK;
}

Redis first determines the type of the address being bound. Ours is usually IPv4, so we generally take the third branch and call anetTcpServer(), a thin wrapper that passes AF_INET through to _anetTcpServer(), where the concrete binding logic lives:

static int _anetTcpServer(char *err, int port, char *bindaddr, int af, int backlog)
{
   ...
    if ((rv = getaddrinfo(bindaddr,_port,&hints,&servinfo)) != 0) {
        anetSetError(err, "%s", gai_strerror(rv));
        return ANET_ERR;
    }
    for (p = servinfo; p != NULL; p = p->ai_next) {
        if ((s = socket(p->ai_family,p->ai_socktype,p->ai_protocol)) == -1) //Call socket() to create a listening socket
            continue;

        if (af == AF_INET6 && anetV6Only(err,s) == ANET_ERR) goto error;
        if (anetSetReuseAddr(err,s) == ANET_ERR) goto error;
        if (anetListen(err,s,p->ai_addr,p->ai_addrlen,backlog) == ANET_ERR) s = ANET_ERR; //Call bind() and listen() to bind ports and convert them into server-side passive sockets
        goto end;
    }
    ... /* the error: and end: cleanup labels are elided here */
}

After socket() has created the socket, bind() and listen() still need to be called. Both steps are implemented inside the anetListen() function:

static int anetListen(char *err, int s, struct sockaddr *sa, socklen_t len, int backlog) {
    if (bind(s,sa,len) == -1) { //Call bind() to bind the address and port
        anetSetError(err, "bind: %s", strerror(errno));
        close(s);
        return ANET_ERR;
    }

    if (listen(s, backlog) == -1) { //Call listen() to convert active socket to passive listening socket
        anetSetError(err, "listen: %s", strerror(errno));
        close(s);
        return ANET_ERR;
    }
    return ANET_OK;
}

At this point it is clear that Redis, just like the epoll server we wrote, goes through the same socket creation, binding and listening process.
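Condensed from the code above, the whole setup of the listening socket boils down to this call chain:

initServer()
└── listenToPort()                 //iterate over the configured bind addresses
    └── anetTcpServer()            //thin wrapper passing AF_INET to _anetTcpServer()
        └── _anetTcpServer()
            ├── socket()           //create the listening descriptor
            ├── anetSetReuseAddr() //setsockopt(SO_REUSEADDR)
            └── anetListen()
                ├── bind()         //bind the address and port
                └── listen()       //become a passive server socket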

aeCreateFileEvent()

In Redis, client connection events and read-write events are collectively called file events. We have just finished the socket create-bind-listen sequence. With a listening descriptor in hand, the next step is to add it to epoll's watch list so that client connection events can be detected. In initServer(), this is done by calling aeCreateFileEvent() and designating acceptTcpHandler() as the handler for connection events:

    for (j = 0; j < server.ipfd_count; j++) {
        if (aeCreateFileEvent(server.el, server.ipfd[j], AE_READABLE, acceptTcpHandler,NULL) == AE_ERR){
                serverPanic("Unrecoverable error creating server.ipfd file event.");
        }
    }

Stepping into aeCreateFileEvent(), we find that it in turn calls aeApiAddEvent():

int aeCreateFileEvent(aeEventLoop *eventLoop, int fd, int mask, aeFileProc *proc, void *clientData) {
    if (fd >= eventLoop->setsize) {
        errno = ERANGE;
        return AE_ERR;
    }
    aeFileEvent *fe = &eventLoop->events[fd];

    if (aeApiAddEvent(eventLoop, fd, mask) == -1)
        return AE_ERR;
    fe->mask |= mask;
    if (mask & AE_READABLE) fe->rfileProc = proc;
    if (mask & AE_WRITABLE) fe->wfileProc = proc;
    fe->clientData = clientData;
    if (fd > eventLoop->maxfd)
        eventLoop->maxfd = fd;
    return AE_OK;
}

static int aeApiAddEvent(aeEventLoop *eventLoop, int fd, int mask) {
    aeApiState *state = eventLoop->apidata;
    struct epoll_event ee = {0}; 
    int op = eventLoop->events[fd].mask == AE_NONE ?
            EPOLL_CTL_ADD : EPOLL_CTL_MOD; //ADD if the fd is not watched yet, MOD if we are merging a new mask into an existing registration

    ee.events = 0;
    mask |= eventLoop->events[fd].mask;
    if (mask & AE_READABLE) ee.events |= EPOLLIN;
    if (mask & AE_WRITABLE) ee.events |= EPOLLOUT;
    ee.data.fd = fd;
    if (epoll_ctl(state->epfd,op,fd,&ee) == -1) return -1; //Call epoll_ctl() to register the event with the kernel
    return 0;
}

So aeApiAddEvent() calls epoll_ctl() to add the client connection event to the watch list. Meanwhile, Redis stores the event handlers in an aeFileEvent structure:

typedef struct aeFileEvent {
    int mask; /* one of AE_(READABLE|WRITABLE|BARRIER) */
    aeFileProc *rfileProc; //Read event handler
    aeFileProc *wfileProc; //Write event handler
    void *clientData;  //Private data passed to the handlers (e.g. the client object)
} aeFileEvent;
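The rfileProc/wfileProc split means one descriptor can have different handlers for reading and writing. For example, when Redis has replies queued for a client, it registers a write handler on that client's descriptor roughly like this (abridged; sendReplyToClient is the real write handler in networking.c):

aeCreateFileEvent(server.el, c->fd, AE_WRITABLE, sendReplyToClient, c); //Fire wfileProc when c->fd becomes writable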

Compared with the epoll server we wrote before, we have now implemented the following steps:

int main(int argc, char *argv[]) {

    listenSocket = socket(AF_INET, SOCK_STREAM, 0); //Create a listening socket descriptor
    
    bind(listenSocket);   //Bind the address and port
    
    listen(listenSocket); //Convert the default active socket into a passive server socket
    
    epfd = epoll_create(EPOLL_SIZE); //Create an epoll instance
    
    ep_events = (epoll_event*)malloc(sizeof(epoll_event) * EPOLL_SIZE); //Allocate an epoll_event array to receive the ready socket set
    event.events = EPOLLIN;
    event.data.fd = listenSocket;
    
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenSocket, &event); //Add the listening socket to the watch list
   ...
}

We have now implemented socket creation, bind, listen, epfd creation via epoll_create(), added the listening socket descriptor to epoll's watch list, and designated its event handler. Next it is time to call epoll_wait() inside a while(1) loop: it blocks until sockets become ready, returns all ready descriptors, the corresponding events fire, and the events are then processed.

aeMain()

Finally, a while(1) loop waits for client connection events to arrive:

void aeMain(aeEventLoop *eventLoop) {
    eventLoop->stop = 0;
    while (!eventLoop->stop) {
        if (eventLoop->beforesleep != NULL)
            eventLoop->beforesleep(eventLoop);
        aeProcessEvents(eventLoop, AE_ALL_EVENTS|AE_CALL_AFTER_SLEEP);
    }
}

The stop flag in the eventLoop decides whether the loop ends. While it is not set, the loop calls aeProcessEvents(). We can guess what happens in there: epoll_wait() is called to block until events arrive, all ready socket descriptors are traversed, and the corresponding handler function is called for each. (The beforesleep callback runs on every iteration just before blocking; it is where Redis does work such as flushing pending replies to clients.)

int aeProcessEvents(aeEventLoop *eventLoop, int flags)
{
        numevents = aeApiPoll(eventLoop, tvp); //Call epoll_wait()
        ...
}

Let's follow aeApiPoll() to see how epoll_wait() is called:

static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) {
    aeApiState *state = eventLoop->apidata; //Fetch the epoll state (epfd and events array) saved by aeApiCreate()
    int retval, numevents = 0;

    retval = epoll_wait(state->epfd,state->events,eventLoop->setsize, tvp ? (tvp->tv_sec*1000 + tvp->tv_usec/1000) : -1);
    if (retval > 0) {
        int j;
        numevents = retval;
        for (j = 0; j < numevents; j++) {
            int mask = 0;
            struct epoll_event *e = state->events+j;

            if (e->events & EPOLLIN) mask |= AE_READABLE;
            if (e->events & EPOLLOUT) mask |= AE_WRITABLE;
            if (e->events & EPOLLERR) mask |= AE_WRITABLE; //Errors and hang-ups are reported as writable,
            if (e->events & EPOLLHUP) mask |= AE_WRITABLE; //so that a handler runs, detects the failure and can close the connection
            eventLoop->fired[j].fd = e->data.fd;
            eventLoop->fired[j].mask = mask;
        }
    }
    return numevents;
}

First, the epfd created in aeApiCreate() and the registered event set are fetched from the eventLoop, and epoll_wait() is called to block until events arrive, returning the set of descriptors with ready events. Each ready descriptor is then examined to see whether it is readable or writable, and stored in the eventLoop's fired array together with a readable/writable mask at the corresponding position.
Back in the caller, every processable event now sits in the fired array, so we can traverse it and call the matching event handler function for each entry:

int aeProcessEvents(aeEventLoop *eventLoop, int flags)
{
    ...
    numevents = aeApiPoll(eventLoop, tvp); //Call epoll_wait()

    for (j = 0; j < numevents; j++) {
        aeFileEvent *fe = &eventLoop->events[eventLoop->fired[j].fd]; //Look up the registered event for each ready descriptor
        int mask = eventLoop->fired[j].mask;
        int fd = eventLoop->fired[j].fd;
        int fired = 0; //Number of handlers already called for this fd

        int invert = fe->mask & AE_BARRIER; //AE_BARRIER asks us to invert the usual read-before-write order

        if (!invert && fe->mask & mask & AE_READABLE) {
            fe->rfileProc(eventLoop,fd,fe->clientData,mask); //A read event: call the read event handler
            fired++;
        }

        if (fe->mask & mask & AE_WRITABLE) {
            if (!fired || fe->wfileProc != fe->rfileProc) {
                fe->wfileProc(eventLoop,fd,fe->clientData,mask); //A write event: call the write event handler
                fired++;
            }
        }
        ...
    }
    ...
}

How does Redis distinguish client connection events from read-write events? It registers a different handler for each kind: the accept event on the listening socket gets acceptTcpHandler(), while read-write events on connection sockets get their own handlers. Thanks to this layer of encapsulation, there is no need to test what type of descriptor fired; the handler registered earlier is simply invoked.
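To make that concrete, here is an abridged sketch of the accept path in networking.c (the names are real, the bodies heavily trimmed): the handler registered on the listening descriptor accepts the connection, then a different handler, readQueryFromClient(), is registered for read events on the new connection descriptor:

void acceptTcpHandler(aeEventLoop *el, int fd, void *privdata, int mask) {
    char cip[NET_IP_STR_LEN];
    int cport;
    int cfd = anetTcpAccept(server.neterr, fd, cip, sizeof(cip), &cport); //Calls accept() underneath
    ...
    acceptCommonHandler(cfd, 0, cip); //Creates the client object via createClient()
}

client *createClient(int fd) {
    ...
    aeCreateFileEvent(server.el, fd, AE_READABLE, readQueryFromClient, c); //Register the read handler for this client
    ...
}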
Looking back at the epoll server we wrote before, isn't this remarkably similar?

    while (1) {
    
        event_cnt = epoll_wait(epfd, ep_events, EPOLL_SIZE, -1); //Block until some socket descriptors become ready, then return them
        
        for (int i = 0; i < event_cnt; ++i) { //Traverse all ready socket descriptors
            if (ep_events[i].data.fd == listenSocket) { //The listening socket is ready: a new client is connecting
            
                connSocket = accept(listenSocket); //Call accept() to establish the connection
                
                event.events = EPOLLIN;
                event.data.fd = connSocket;
                
                epoll_ctl(epfd, EPOLL_CTL_ADD, connSocket, &event); //Watch the new connection descriptor for subsequent read and write events
                
            } else { //A connection socket is ready: we can read from or write to it
            
                str_len = read(ep_events[i].data.fd, buf, BUF_SIZE); //The data is guaranteed to be readable now, so read() will not block
                if (str_len == 0) { //Nothing more can be read from this connection, so stop watching it
                
                    epoll_ctl(epfd, EPOLL_CTL_DEL, ep_events[i].data.fd, NULL); //Remove this descriptor from the watch list
                    
                    close(ep_events[i].data.fd);
                } else {
                    write(ep_events[i].data.fd, buf, str_len); //Write the data back to the client
                }
            }
        }
    }

Summary

So far, we have covered the IO multiplexing scenario in Redis. Redis funnels all client connections with their read-write events, together with the time events we have not discussed, through one event loop, and encapsulates the underlying IO multiplexing mechanism, so that a single process can handle multiple connections and read-write events at once. This is the application of IO multiplexing in Redis.
