baiyan
Introduction
Before reading this article, please read the previous one: Analysis of the Way to Improve the Performance of Server Concurrent IO from Network Programming Foundation to epoll. It will help you better understand what follows.
We know that when we use redis, the client sends a get command and receives the data the redis server returns. Redis follows the traditional C/S architecture: it accepts connections from clients by listening on a TCP port (6379 by default), executes the commands they send, and returns the execution results to the clients.
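To make this concrete, here is a minimal sketch of a hand-rolled client, assuming a local redis listening on 127.0.0.1:6379 and a hypothetical key "mykey". It uses redis's inline command format (command text terminated by CRLF), which is simpler than the protocol redis-cli actually speaks, but which the server accepts:

/* A minimal hand-rolled redis client: a sketch, assuming a local redis on
 * 127.0.0.1:6379 and a hypothetical key "mykey". */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <arpa/inet.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);            //Create the client socket
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(6379);                         //redis's default TCP port
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) return 1;

    const char *cmd = "GET mykey\r\n";                   //Inline command, CRLF-terminated
    write(fd, cmd, strlen(cmd));                         //Send the command to redis-server
    char buf[512];
    ssize_t n = read(fd, buf, sizeof(buf) - 1);          //Read the RESP reply, e.g. "$-1\r\n" if mykey is unset
    if (n > 0) { buf[n] = '\0'; printf("reply: %s", buf); }
    close(fd);
    return 0;
}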
redis is a qualified server program
Let's start with a question: as a qualified server program, how does the redis server process a get command typed on the command line and return the result to the client?
To answer this question, let's first review what we covered in the previous article. The client and server each create a socket to identify their own network address and port, and then communicate over TCP through those sockets. The socket communication flow of a typical server program looks like this:
int main(int argc, char *argv[]) {
    listenSocket = socket();                //Create the listening socket descriptor via the socket() system call
    bind(listenSocket);                     //Bind address and port
    listen(listenSocket);                   //Convert the default active socket into a passive socket suitable for a server
    while (1) {                             //Loop forever, waiting for client connection events
        connSocket = accept(listenSocket);  //Accept a client connection
        read(connSocket);                   //Read data from the client; only one client can be served at a time
        write(connSocket);                  //Write data back to the client; only one client can be served at a time
    }
    return 0;
}
Redis follows the same steps. After establishing a connection with the client, it reads the command the client sent, executes it, and finally returns the execution result to the client by calling the write system call.
But such a flow can only handle the connection and read-write events of one client at a time. To enable a single-process server to handle events from multiple clients simultaneously, we adopt an IO multiplexing mechanism. Among these mechanisms, epoll (on Linux) generally performs best. Let's review the server code we ended up with in the last article using epoll:
int main(int argc, char *argv[]) {
    listenSocket = socket(AF_INET, SOCK_STREAM, 0);                 //As above: create the listening socket descriptor
    bind(listenSocket);                                             //As above: bind address and port
    listen(listenSocket);                                           //As above: convert the default active socket into a passive server socket
    epfd = epoll_create(EPOLL_SIZE);                                //Create an epoll instance
    ep_events = (epoll_event*)malloc(sizeof(epoll_event) * EPOLL_SIZE); //Allocate an epoll_event array to receive the ready socket set
    event.events = EPOLLIN;
    event.data.fd = listenSocket;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenSocket, &event);           //Add the listening socket to epoll's interest list
    while (1) {
        event_cnt = epoll_wait(epfd, ep_events, EPOLL_SIZE, -1);    //Block until some socket descriptors are ready
        for (int i = 0; i < event_cnt; ++i) {                       //Traverse all ready socket descriptors
            if (ep_events[i].data.fd == listenSocket) {             //The listening socket is ready: a new client is connecting
                connSocket = accept(listenSocket);                  //Call accept() to establish the connection
                event.events = EPOLLIN;
                event.data.fd = connSocket;
                epoll_ctl(epfd, EPOLL_CTL_ADD, connSocket, &event); //Register the new connection socket to listen for its subsequent read/write events
            } else {                                                //A connection socket is ready: we can read or write
                str_len = read(ep_events[i].data.fd, buf, BUF_SIZE); //Data is guaranteed to be available here, so read() does not block
                if (str_len == 0) {                                 //The peer closed the connection; stop monitoring this socket
                    epoll_ctl(epfd, EPOLL_CTL_DEL, ep_events[i].data.fd, NULL); //Remove it from the interest list
                    close(ep_events[i].data.fd);
                } else {
                    write(ep_events[i].data.fd, buf, str_len);      //Write the data back to the client
                }
            }
        }
    }
    close(listenSocket);
    close(epfd);
    return 0;
}
Redis, building on the underlying select, poll, and epoll mechanisms and its own business requirements, encapsulates its own set of event-handling functions, which it calls ae (a simple event-driven programming library). Whether redis ends up using select, epoll, or kqueue (the equivalent on macOS/BSD) is decided at compile time: it checks which mechanisms the system supports and picks the best-performing one available:
/* Include the best multiplexing layer supported by this system.
 * The following should be ordered by performances, descending. */
#ifdef HAVE_EVPORT
#include "ae_evport.c"
#else
    #ifdef HAVE_EPOLL
    #include "ae_epoll.c"
    #else
        #ifdef HAVE_KQUEUE
        #include "ae_kqueue.c"
        #else
        #include "ae_select.c"
        #endif
    #endif
#endif
Because the select function is part of the POSIX standard and is therefore implemented on virtually every operating system, it serves as the fallback. For convenience, the rest of this article explains everything in terms of the epoll mechanism.
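The pattern is worth noting: every ae_*.c backend implements the same small set of functions (aeApiCreate, aeApiAddEvent, aeApiPoll, and so on), and exactly one backend file is included, so the rest of redis never needs to know which mechanism it is running on. Below is a minimal sketch of the same compile-time dispatch, not redis source; backend_name() is a made-up stand-in for the shared aeApi* interface:

/* A minimal sketch (not redis source) of the same compile-time dispatch:
 * each "backend" defines the same function, and the preprocessor picks one. */
#include <stdio.h>

#ifdef HAVE_EPOLL
static const char *backend_name(void) { return "epoll"; }
#elif defined(HAVE_KQUEUE)
static const char *backend_name(void) { return "kqueue"; }
#else
static const char *backend_name(void) { return "select"; }   //POSIX fallback
#endif

int main(void) {
    printf("multiplexing backend: %s\n", backend_name());    //e.g. compile with -DHAVE_EPOLL
    return 0;
}

On a real server you can check which backend was compiled in: redis-cli INFO server reports it in the multiplexing_api field.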
IO Multiplexing in redis
When we start a redis-server from the command line, redis actually does something very similar to the epoll server we wrote earlier. There are three key function calls:
int main(int argc, char **argv) {
    ...
    initServerConfig(); //Initialize the structure that stores server-side information
    ...
    initServer();       //Initialize the redis event loop; epoll_create and epoll_ctl are called here.
                        //socket, bind, listen and accept are also driven from this function, and the
                        //listening and connection descriptors they return are registered with epoll
    ...
    aeMain();           //Run the while(1) event loop: call epoll_wait to fetch ready descriptors and invoke the corresponding handlers
    ...
}
Next, let's look at each one:
initServerConfig()
All of the redis server's information lives in a redisServer structure, which has a great many fields: the server-side socket information (such as address and port), plus configuration supporting other redis features such as clustering and persistence. This function call initializes every field of the redisServer structure with an initial value. Since we are concerned with events and the IO multiplexing mechanism in redis, we will focus on only a few of those fields.
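For orientation, here is a trimmed sketch of the redisServer fields this article cares about. The field names follow the redis source, but the real structure contains far more fields:

/* A trimmed sketch of redisServer; field names follow the redis source,
 * everything else is omitted. */
struct redisServer {
    aeEventLoop *el;                      //The event loop created by aeCreateEventLoop()
    int port;                             //TCP listening port, 6379 by default
    char *bindaddr[CONFIG_BINDADDR_MAX];  //Addresses to bind
    int bindaddr_count;                   //Number of addresses in bindaddr[]
    int ipfd[CONFIG_BINDADDR_MAX];        //TCP listening socket descriptors
    int ipfd_count;                       //Number of used slots in ipfd[]
    int tcp_backlog;                      //The backlog argument passed to listen()
    /* ... many more fields for clustering, persistence, replication, etc. ... */
};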
initServer()
This function call is our top priority. After the server's information has been initialized, redis must create, bind, and listen on its sockets and establish connections with clients. socket, bind, listen, accept, epoll_create, and epoll_ctl are all called within this function, so we can relate redis's event mechanism step by step to the epoll server shown above. The main calls in initServer() are as follows:
void initServer(void) {
    ...
    server.el = aeCreateEventLoop(server.maxclients+CONFIG_FDSET_INCR);
    ...
    if (server.port != 0 &&
        listenToPort(server.port,server.ipfd,&server.ipfd_count) == C_ERR)
        exit(1);
    ...
    for (j = 0; j < server.ipfd_count; j++) {
        if (aeCreateFileEvent(server.el, server.ipfd[j], AE_READABLE,
            acceptTcpHandler,NULL) == AE_ERR) {
            serverPanic("Unrecoverable error creating server.ipfd file event.");
        }
    }
    ...
}
Let's walk through these key lines from top to bottom:
aeCreateEventLoop()
redis has the concept of an aeEventLoop, a structure that manages all event-related fields: it stores the registered events as well as the ready events:
typedef struct aeEventLoop {
    int stop;                       //Whether the event loop (the while(1)) should stop
    aeFileEvent *events;            //Registered file events (client connection and read/write events)
    aeFiredEvent *fired;            //Ready file events
    aeTimeEvent *timeEventHead;     //Time events (not covered in this article)
    void *apidata;                  /* epoll-related state */
    aeBeforeSleepProc *beforesleep; //Function called before the event loop blocks waiting for events
    aeBeforeSleepProc *aftersleep;  //Function called after the event loop wakes up
} aeEventLoop;
redis stores all ready descriptors returned by epoll_wait() in the fired array, then traverses that array and calls the corresponding event handler for each event, processing them all in one pass. The aeCreateEventLoop() function initializes this structure that manages all event information, which includes calling epoll_create() to initialize epoll's epfd:
aeEventLoop *aeCreateEventLoop(int setsize) {
    aeEventLoop *eventLoop;
    int i;

    if ((eventLoop = zmalloc(sizeof(*eventLoop))) == NULL) goto err;
    eventLoop->events = zmalloc(sizeof(aeFileEvent)*setsize);
    eventLoop->fired = zmalloc(sizeof(aeFiredEvent)*setsize);
    if (eventLoop->events == NULL || eventLoop->fired == NULL) goto err;
    eventLoop->setsize = setsize;
    eventLoop->lastTime = time(NULL);
    eventLoop->timeEventHead = NULL;
    eventLoop->timeEventNextId = 0;
    eventLoop->stop = 0;
    eventLoop->maxfd = -1;
    eventLoop->beforesleep = NULL;
    eventLoop->aftersleep = NULL;
    if (aeApiCreate(eventLoop) == -1) goto err; //aeApiCreate() calls epoll_create() internally
    for (i = 0; i < setsize; i++)
        eventLoop->events[i].mask = AE_NONE;
    return eventLoop;

err:
    if (eventLoop) {
        zfree(eventLoop->events);
        zfree(eventLoop->fired);
        zfree(eventLoop);
    }
    return NULL;
}
In the aeApiCreate() function, epoll_create() is called and the created epfd is stored in the apidata field of the eventLoop structure:
typedef struct aeApiState {
    int epfd;
    struct epoll_event *events;
} aeApiState;

static int aeApiCreate(aeEventLoop *eventLoop) {
    aeApiState *state = zmalloc(sizeof(aeApiState));

    if (!state) return -1;
    state->events = zmalloc(sizeof(struct epoll_event)*eventLoop->setsize);
    if (!state->events) {
        zfree(state);
        return -1;
    }
    state->epfd = epoll_create(1024); /* Initialize epoll's epfd by calling epoll_create; 1024 is just a hint for the kernel */
    if (state->epfd == -1) {
        zfree(state->events);
        zfree(state);
        return -1;
    }
    eventLoop->apidata = state; //Keep the created epfd in the apidata field of the eventLoop structure
    return 0;
}
listenToPort()
After creating the epfd, we move on to socket creation, binding, and listening. These steps are carried out in the listenToPort() function:
int listenToPort(int port, int *fds, int *count) {
    if (server.bindaddr_count == 0) server.bindaddr[0] = NULL;
    for (j = 0; j < server.bindaddr_count || j == 0; j++) { //Traverse all configured bind addresses
        if (server.bindaddr[j] == NULL) {                   //No bind address configured
            ...
        } else if (strchr(server.bindaddr[j],':')) {        //Bind an IPv6 address
            ...
        } else {                                            //Bind an IPv4 address; usually we take this branch
            fds[*count] = anetTcpServer(server.neterr,port,server.bindaddr[j],
                server.tcp_backlog);                        //The actual binding logic
        }
        ...
    }
    return C_OK;
}
redis first determines the type of the address to bind. We generally use IPv4, so we usually take the third branch, calling the anetTcpServer() function for the concrete binding logic:
static int _anetTcpServer(char *err, int port, char *bindaddr, int af, int backlog) {
    ...
    if ((rv = getaddrinfo(bindaddr,_port,&hints,&servinfo)) != 0) {
        anetSetError(err, "%s", gai_strerror(rv));
        return ANET_ERR;
    }
    for (p = servinfo; p != NULL; p = p->ai_next) {
        if ((s = socket(p->ai_family,p->ai_socktype,p->ai_protocol)) == -1) //Call socket() to create the listening socket
            continue;
        if (af == AF_INET6 && anetV6Only(err,s) == ANET_ERR) goto error;
        if (anetSetReuseAddr(err,s) == ANET_ERR) goto error;
        if (anetListen(err,s,p->ai_addr,p->ai_addrlen,backlog) == ANET_ERR) s = ANET_ERR; //Call bind() and listen() to bind the port and convert to a passive server socket
        goto end;
    }
    ...
}
After the socket() system call creates the socket, bind() and listen() still need to be called. These two steps are implemented inside the anetListen() function:
static int anetListen(char *err, int s, struct sockaddr *sa, socklen_t len, int backlog) {
    if (bind(s,sa,len) == -1) {     //Call bind() to bind the port
        anetSetError(err, "bind: %s", strerror(errno));
        close(s);
        return ANET_ERR;
    }
    if (listen(s, backlog) == -1) { //Call listen() to convert the active socket into a passive listening socket
        anetSetError(err, "listen: %s", strerror(errno));
        close(s);
        return ANET_ERR;
    }
    return ANET_OK;
}
Seeing this, we know that redis, like the epoll server we wrote, goes through the same socket creation, binding, and listening process.
aeCreateFileEvent()
In redis, client connection events and read/write events are collectively called file events. We have just completed the socket creation, bind, and listen steps. Now that we have a listening descriptor, we first need to add it to epoll's interest list so it can watch for client connection events. In initServer(), this is done by calling aeCreateFileEvent() and registering acceptTcpHandler() as the event handler for connection events:
for (j = 0; j < server.ipfd_count; j++) {
    if (aeCreateFileEvent(server.el, server.ipfd[j], AE_READABLE,
        acceptTcpHandler,NULL) == AE_ERR) {
        serverPanic("Unrecoverable error creating server.ipfd file event.");
    }
}
Following aeCreateFileEvent() inward, we find that it in turn calls the aeApiAddEvent() function:
int aeCreateFileEvent(aeEventLoop *eventLoop, int fd, int mask,
        aeFileProc *proc, void *clientData) {
    if (fd >= eventLoop->setsize) {
        errno = ERANGE;
        return AE_ERR;
    }
    aeFileEvent *fe = &eventLoop->events[fd];

    if (aeApiAddEvent(eventLoop, fd, mask) == -1)
        return AE_ERR;
    fe->mask |= mask;
    if (mask & AE_READABLE) fe->rfileProc = proc;
    if (mask & AE_WRITABLE) fe->wfileProc = proc;
    fe->clientData = clientData;
    if (fd > eventLoop->maxfd)
        eventLoop->maxfd = fd;
    return AE_OK;
}
static int aeApiAddEvent(aeEventLoop *eventLoop, int fd, int mask) {
    aeApiState *state = eventLoop->apidata;
    struct epoll_event ee = {0};
    int op = eventLoop->events[fd].mask == AE_NONE ?
            EPOLL_CTL_ADD : EPOLL_CTL_MOD;

    ee.events = 0;
    mask |= eventLoop->events[fd].mask;
    if (mask & AE_READABLE) ee.events |= EPOLLIN;
    if (mask & AE_WRITABLE) ee.events |= EPOLLOUT;
    ee.data.fd = fd;
    if (epoll_ctl(state->epfd,op,fd,&ee) == -1) return -1; //Call epoll_ctl() to add the client connection event
    return 0;
}
The aeApiAddEvent() function calls epoll_ctl() to add the client connection event to the interest list. At the same time, redis stores the event's handler in an aeFileEvent structure:
typedef struct aeFileEvent {
    int mask;              /* one of AE_(READABLE|WRITABLE|BARRIER) */
    aeFileProc *rfileProc; //Read event handler
    aeFileProc *wfileProc; //Write event handler
    void *clientData;      //Client data
} aeFileEvent;
Comparing against the epoll server we wrote earlier, we have now implemented the following steps:
int main(int argc, char *argv[]) {
    listenSocket = socket(AF_INET, SOCK_STREAM, 0);                 //Create the listening socket descriptor
    bind(listenSocket);                                             //Bind address and port
    listen(listenSocket);                                           //Convert the default active socket into a passive server socket
    epfd = epoll_create(EPOLL_SIZE);                                //Create an epoll instance
    ep_events = (epoll_event*)malloc(sizeof(epoll_event) * EPOLL_SIZE); //Allocate an epoll_event array to receive the ready socket set
    event.events = EPOLLIN;
    event.data.fd = listenSocket;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenSocket, &event);           //Add the listening socket to epoll's interest list
    ...
}
We have now covered socket creation, bind, listen, creating the epfd via epoll_create(), adding the initial listening socket descriptor to epoll's interest list, and registering an event handler for it. Next it is time to call epoll_wait() inside the while(1) loop: the call blocks until some socket descriptors are ready, returns them all, the corresponding events are triggered, and the events are then processed.
aeMain()
Finally, a while(1) loop waits for client events to arrive:
void aeMain(aeEventLoop *eventLoop) {
    eventLoop->stop = 0;
    while (!eventLoop->stop) {
        if (eventLoop->beforesleep != NULL)
            eventLoop->beforesleep(eventLoop);
        aeProcessEvents(eventLoop, AE_ALL_EVENTS|AE_CALL_AFTER_SLEEP);
    }
}
The eventLoop's stop flag determines whether the loop ends. While it keeps running, the loop calls aeProcessEvents(). We can guess that epoll_wait() is called in there, blocking until events arrive, and that all ready socket descriptors are then traversed and their corresponding event handler functions called:
int aeProcessEvents(aeEventLoop *eventLoop, int flags) {
    ...
    numevents = aeApiPoll(eventLoop, tvp); //Calls epoll_wait() internally
    ...
}
Let's follow aeApiPoll() to see how epoll_wait() is called:
static int aeApiPoll(aeEventLoop *eventLoop, struct timeval *tvp) {
    aeApiState *state = eventLoop->apidata; //Fetch the epoll state (epfd and event array) created in aeApiCreate()
    int retval, numevents = 0;

    retval = epoll_wait(state->epfd,state->events,eventLoop->setsize,
            tvp ? (tvp->tv_sec*1000 + tvp->tv_usec/1000) : -1);
    if (retval > 0) {
        int j;

        numevents = retval;
        for (j = 0; j < numevents; j++) {
            int mask = 0;
            struct epoll_event *e = state->events+j;

            if (e->events & EPOLLIN) mask |= AE_READABLE;
            if (e->events & EPOLLOUT) mask |= AE_WRITABLE;
            if (e->events & EPOLLERR) mask |= AE_WRITABLE;
            if (e->events & EPOLLHUP) mask |= AE_WRITABLE;
            eventLoop->fired[j].fd = e->data.fd;
            eventLoop->fired[j].mask = mask;
        }
    }
    return numevents;
}
First, the epfd and the registered-event array created in aeApiCreate() are taken out of the eventLoop, and epoll_wait() is called to wait for events; it returns the set of all ready descriptors. That set is then traversed to determine whether each descriptor is readable or writable, and every ready event is stored in the eventLoop's fired array, with the readable/writable flag recorded at the corresponding array position.
Back in the caller, all events that are ready to be handled now sit in the fired array, so we can traverse it and call the corresponding event handler for each:
int aeProcessEvents(aeEventLoop *eventLoop, int flags) {
    ...
    numevents = aeApiPoll(eventLoop, tvp); //Calls epoll_wait() internally
    for (j = 0; j < numevents; j++) {
        aeFileEvent *fe = &eventLoop->events[eventLoop->fired[j].fd]; //Look up the registered event for each ready descriptor
        int mask = eventLoop->fired[j].mask;
        int fd = eventLoop->fired[j].fd;
        int fired = 0;
        int invert = fe->mask & AE_BARRIER; //AE_BARRIER inverts the usual read-before-write order

        if (!invert && fe->mask & mask & AE_READABLE) {
            fe->rfileProc(eventLoop,fd,fe->clientData,mask); //A read event: call the read event handler
            fired++;
        }
        if (fe->mask & mask & AE_WRITABLE) {
            if (!fired || fe->wfileProc != fe->rfileProc) {
                fe->wfileProc(eventLoop,fd,fe->clientData,mask); //A write event: call the write event handler
                fired++;
            }
        }
    }
    ...
}
As for distinguishing client connection events from read/write events: redis registers different event handlers for them (connection events get the acceptTcpHandler handler, while read/write events get others). Thanks to this layer of encapsulation, there is no need to test what kind of socket descriptor fired; the previously registered event handler is simply invoked.
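For instance, here is a simplified sketch of that flow, modeled on redis's acceptTcpHandler()/createClient() with error handling and most details omitted: the handler registered on the listening descriptor accepts the connection and registers a different handler, readQueryFromClient(), on the new connection descriptor, so later dispatch never needs to ask what kind of descriptor fired.

/* A simplified sketch of redis's accept flow; the real code has error
 * handling, client limits, etc. cip/cport receive the peer's address. */
void acceptTcpHandler(aeEventLoop *el, int fd, void *privdata, int mask) {
    char cip[64];
    int cport;
    int cfd = anetTcpAccept(server.neterr, fd, cip, sizeof(cip), &cport); //accept() the new client connection
    createClient(cfd);                                                    //Wrap the descriptor in a client object
}

client *createClient(int fd) {
    client *c = zmalloc(sizeof(client));
    aeCreateFileEvent(server.el, fd, AE_READABLE,
                      readQueryFromClient, c); //From now on, reads on this fd dispatch straight to the command reader
    return c;
}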
Looking back at the epoll server we wrote before, isn't this code very similar?
while (1) {
    event_cnt = epoll_wait(epfd, ep_events, EPOLL_SIZE, -1);        //Block until some socket descriptors are ready
    for (int i = 0; i < event_cnt; ++i) {                           //Traverse all ready socket descriptors
        if (ep_events[i].data.fd == listenSocket) {                 //The listening socket is ready: a new client is connecting
            connSocket = accept(listenSocket);                      //Call accept() to establish the connection
            event.events = EPOLLIN;
            event.data.fd = connSocket;
            epoll_ctl(epfd, EPOLL_CTL_ADD, connSocket, &event);     //Register the new connection socket to listen for its subsequent read/write events
        } else {                                                    //A connection socket is ready: we can read or write
            str_len = read(ep_events[i].data.fd, buf, BUF_SIZE);    //Data is guaranteed to be available here, so read() does not block
            if (str_len == 0) {                                     //The peer closed the connection; stop monitoring this socket
                epoll_ctl(epfd, EPOLL_CTL_DEL, ep_events[i].data.fd, NULL); //Remove it from the interest list
                close(ep_events[i].data.fd);
            } else {
                write(ep_events[i].data.fd, buf, str_len);          //Write the data back to the client
            }
        }
    }
}
Summary
So far, we have a clear picture of the IO multiplexing scenario in redis. Redis brings together all connections, with their read/write events and the time events we have not covered here, and encapsulates the underlying IO multiplexing mechanisms, so that in the end a single process can handle events on many connections. This is the application of IO multiplexing in redis.