How nginx solves the thundering herd problem

Keywords: Programming Nginx socket

To understand nginx's thundering herd problem, first recall how nginx starts up. The master process listens on each port specified in the configuration file, and then calls fork() to create the worker processes. By the way process creation works, each child inherits all of the parent's memory, including its listening sockets, so every worker process ends up listening on every configured port. The "thundering herd" is what happens when a new client connection arrives: the connection event wakes every worker process, but only one of them can actually handle it; the others wake up only to find the event already gone, and go back to waiting. Waking all the workers for an event that only one of them can handle obviously wastes resources. This article focuses on how nginx deals with this problem.

1. Solutions

As mentioned in the previous article, each worker process calls ngx_worker_process_init() when it is created to initialize itself. One very important step there is that each worker calls epoll_create() to create its own private epoll instance. Each listening port has a file descriptor corresponding to it, but that descriptor only triggers events in a worker if the worker has added it to its epoll instance via epoll_ctl() and subscribed to the accept event, which is triggered when a client establishes a connection. Conversely, if a worker process does not add the descriptor of a listening port to its epoll instance, it cannot be woken by connections on that port. Based on this principle, nginx uses a shared (inter-process) lock to control whether the current process has permission to add the listening descriptors to its epoll instance; in other words, only the process holding the lock listens on the target ports. This way, each connection event wakes exactly one worker process. The figure below is a schematic diagram of the working cycle of a worker process:

One point in this flow deserves attention: each worker tries to acquire the shared lock at the top of every loop iteration, and only when acquisition fails does it remove the listening file descriptors from its epoll instance (if they are present there). The main purpose of this arrangement is to prevent the loss of client connection events, even though it allows a small residual thundering herd, which is not serious. Consider the alternative: if a process removed the listening descriptors from its epoll instance at the moment it released the lock, then between that release and the next worker's acquisition of the lock, no epoll instance anywhere would be watching those descriptors, and connection events arriving in that window would be lost. Removing them on a failed acquisition instead, as in the figure, is safe: the failure means some other process currently holds the lock and is listening on those descriptors. It does, however, create a problem. Per the figure, when the current process finishes a cycle it releases the lock and moves on to handle other events, without removing the descriptors it listens on. If another process acquires the lock in the meantime and registers the same descriptors, two processes are briefly listening on them, and a client connection event will then wake both workers. This is considered tolerable for two main reasons:

  • The residual thundering herd wakes at most a few worker processes, which is much better than waking every worker on every event;
  • The root cause of the residual herd is that the current process releases the lock without releasing the descriptors it listens on. But after releasing the lock, a worker mainly handles read/write events on established client connections and checks flag bits, which takes very little time; it then tries to acquire the lock again and, on failure, releases the descriptors it listens on. By contrast, the worker holding the lock typically waits much longer for client connection events, so the window in which two workers listen at once is short and the probability of the residual thundering herd is relatively small.

2. Source code explanation

The worker process's event loop is implemented mainly in the ngx_process_events_and_timers() method. Let's see how this method handles the whole flow. Its source (abridged) is as follows:

void ngx_process_events_and_timers(ngx_cycle_t *cycle) {
  ngx_uint_t flags;
  ngx_msec_t timer, delta;

  // Try to acquire the shared accept lock; on success this also registers
  // the listening file descriptors with this process's epoll instance
  if (ngx_trylock_accept_mutex(cycle) == NGX_ERROR) {
    return;
  }

  // Process the monitored events. For the kqueue model this points to
  // ngx_kqueue_process_events(); for the epoll model, to ngx_epoll_process_events().
  // The method fetches the event list from the underlying event model and posts
  // each event to either the ngx_posted_accept_events queue or the
  // ngx_posted_events queue
  (void) ngx_process_events(cycle, timer, flags);

  // Process the accept events, which are handled by ngx_event_accept()
  // in ngx_event_accept.c
  ngx_event_process_posted(cycle, &ngx_posted_accept_events);

  // Release the lock
  if (ngx_accept_mutex_held) {
    ngx_shmtx_unlock(&ngx_accept_mutex);
  }

  // Process the remaining events (everything except the accept events):
  // accept events are handled by ngx_event_accept() in ngx_event_accept.c;
  // read events by ngx_http_wait_request_handler() in ngx_http_request.c;
  // keep-alive events, finally, by ngx_http_keepalive_handler() in ngx_http_request.c
  ngx_event_process_posted(cycle, &ngx_posted_events);
}

In the code above, most of the checking work has been omitted, leaving only the skeleton. First, the worker process calls ngx_trylock_accept_mutex() to try to acquire the lock; if it succeeds, it also registers the file descriptors of the listening ports with its epoll instance. The ngx_process_events() method is then called to handle the events reported by that epoll instance. The shared lock is then released, and finally the read/write events of clients that have already established connections are processed. Let's take a look at how ngx_trylock_accept_mutex() acquires the shared lock:

ngx_int_t ngx_trylock_accept_mutex(ngx_cycle_t *cycle) {
  // Try to obtain shared lock using CAS algorithm
  if (ngx_shmtx_trylock(&ngx_accept_mutex)) {

    // ngx_accept_mutex_held == 1 means this process already held the lock in the
    // previous cycle, so its listening events are still registered; return directly
    if (ngx_accept_mutex_held && ngx_accept_events == 0) {
      return NGX_OK;
    }

    // Register the listening file descriptors with the current process's event
    // mechanism (e.g. the change list array for the kqueue model).
    // When nginx starts its worker processes, each worker by default inherits the
    // listening sockets of the master process. This causes a problem: when a client
    // event arrives on a port, every process listening on that port is woken up,
    // but only one worker can successfully handle the event, while the others wake
    // up to find the event already gone and go back to waiting. This is the
    // "thundering herd" phenomenon.
    // nginx solves it with the shared lock used here: only the worker holding the
    // lock handles client events. Concretely, in the course of acquiring the lock
    // the worker re-registers the listening events of each port for itself, while
    // the other workers do not listen at all. In other words, at any moment only
    // one worker listens on each port, which avoids the thundering herd.
    // The ngx_enable_accept_events() call below is what re-registers the listening
    // events of each port for the current process.
    if (ngx_enable_accept_events(cycle) == NGX_ERROR) {
      ngx_shmtx_unlock(&ngx_accept_mutex);
      return NGX_ERROR;
    }

    // Mark that the lock has been acquired successfully
    ngx_accept_events = 0;
    ngx_accept_mutex_held = 1;

    return NGX_OK;
  }

  // The lock was not acquired; if we held it previously, reset ngx_accept_mutex_held
  // and remove the listening events of the current process
  if (ngx_accept_mutex_held) {
    // ngx_accept_mutex_held was 1, so reset it to 0 and delete the listening
    // events this process registered on each port
    if (ngx_disable_accept_events(cycle, 0) == NGX_ERROR) {
      return NGX_ERROR;
    }

    ngx_accept_mutex_held = 0;
  }

  return NGX_OK;
}

The code above does essentially three things:

  • Try to acquire the shared lock with a CAS operation via ngx_shmtx_trylock();
  • If the lock is acquired, call ngx_enable_accept_events() to register listening events for the file descriptors of the target ports;
  • If the lock is not acquired, call ngx_disable_accept_events() to remove the listening file descriptors.

3. Summary

This article first explained the cause of the thundering herd phenomenon, then introduced how nginx solves it, and finally walked through nginx's handling of the problem at the source-code level.

Posted by jandrews3 on Fri, 03 Jan 2020 02:43:55 -0800