Lock overhead optimization and a brief description of CAS

Keywords: C++ github Linux Attribute Programming

lock

Mutual exclusion locks (mutexes) protect a critical section, that is, a fragment of code that accesses a shared resource and must not be executed by multiple threads at the same time. While one thread is inside the critical section, other threads or processes that want to enter must wait.

Locks are generally said to be expensive. This article looks at how large that cost actually is, where most of it comes from, and how lock performance can be improved.

Lock cost

Nowadays locks are generally built on futex (fast userspace mutex), a mechanism that combines user mode and kernel mode. How would synchronization and mutual exclusion work without futex? The kernel maintains an object that is visible to all processes and is used to manage the mutex and notify blocked processes. If process A wants to enter the critical section, it must first go into the kernel to check that object and see whether another process already occupies the critical section. When it leaves the critical section, it must again go into the kernel to check whether other processes are waiting, and then wake a waiting process according to some policy. When there is no contention these system calls (kernel traps) are unnecessary, and they cause a lot of performance overhead. Futex was introduced to solve this problem.

Futex is a hybrid user-mode and kernel-mode synchronization mechanism. First, the processes being synchronized share a segment of memory through mmap. The futex variable lives in that shared memory and is operated on atomically. When a process tries to acquire or release the mutex, it first checks the futex variable in shared memory. If there is no contention, it only modifies the futex variable and makes no system call. Only when the futex variable shows that there is contention must the process make a system call to complete the corresponding handling (wait or wake up). Simply put, futex improves efficiency under low contention by checking in user mode first: if you know there is no contention, you do not have to trap into the kernel.

A mutex is implemented on top of futex using a variable in memory. If the variable lives in memory private to one process, the mutex is a thread lock; if it lives in memory shared between processes, it is a process lock. The __lock field in pthread_mutex_t marks whether the mutex is held. CAS is used to test and set __lock: if the mutex is not held, the CAS takes it and the call returns directly. Otherwise, __lll_lock_wait_private invokes the SYS_futex system call to put the thread to sleep. CAS is a user-mode CPU instruction, so when there is no contention the lock state is simply modified and the call returns, which is very efficient. Only when contention is detected does the thread trap into the kernel through a system call. This is why futex is called a hybrid user-mode and kernel-mode mechanism: it keeps lock acquisition efficient under low contention.

So when there is no contention, the processor overhead of each lock acquisition and release is only the overhead of the CAS instructions.
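
The fast path and slow path described above can be sketched as a toy futex-based mutex in C. This is a minimal sketch for Linux only and not glibc's actual code: the names sketch_mutex, sketch_lock, sketch_unlock, futex_call and run_futex_demo are hypothetical, and the three-state scheme (0 unlocked, 1 locked, 2 locked with possible waiters) is an assumption taken from the design in Ulrich Drepper's "Futexes Are Tricky", not something this article specifies.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* State: 0 = unlocked, 1 = locked, 2 = locked and there may be waiters. */
typedef struct { atomic_int state; } sketch_mutex;   /* hypothetical type */

/* Thin wrapper over the raw system call. */
static long futex_call(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void sketch_lock(sketch_mutex *m) {
    int c = 0;
    /* Fast path: one CAS in user space, no system call if uncontended. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;
    /* Slow path: mark the lock as contended, then sleep in the kernel. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        futex_call(&m->state, FUTEX_WAIT, 2);   /* sleeps only if state is still 2 */
        c = atomic_exchange(&m->state, 2);
    }
}

static void sketch_unlock(sketch_mutex *m) {
    /* Old state 1 means nobody is waiting: one atomic operation is enough. */
    if (atomic_fetch_sub(&m->state, 1) != 1) {
        atomic_store(&m->state, 0);
        futex_call(&m->state, FUTEX_WAKE, 1);   /* wake one sleeping thread */
    }
}

static sketch_mutex demo_mutex;   /* zero-initialized: unlocked */
static long demo_counter;

static void *demo_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        sketch_lock(&demo_mutex);
        demo_counter++;
        sketch_unlock(&demo_mutex);
    }
    return NULL;
}

/* Two threads increment a shared counter under the sketch mutex. */
static long run_futex_demo(void) {
    pthread_t t[2];
    demo_counter = 0;
    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&t[i], NULL, demo_worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return demo_counter;
}
```

Calling run_futex_demo() should return 200000 if mutual exclusion holds. When only one thread uses the lock, neither sketch_lock nor sketch_unlock ever reaches futex_call.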

The best way to find out is to measure. Let's write a piece of code to test the cost of an uncontended lock:

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

static inline long long unsigned time_ns(struct timespec* const ts) {
  if (clock_gettime(CLOCK_REALTIME, ts)) {
    exit(1);
  }
  return ((long long unsigned) ts->tv_sec) * 1000000000LLU
    + (long long unsigned) ts->tv_nsec;
}


int main()
{
    int res = -1;
    pthread_mutex_t mutex;

    //Initialize the mutex, using the default mutex attribute
    res = pthread_mutex_init(&mutex, NULL);
    if(res != 0)
    {
        perror("pthread_mutex_init failed\n");
        exit(EXIT_FAILURE);
    }

    long MAX = 1000000000;
    long c = 0;
    struct timespec ts;

    const long long unsigned start_ns = time_ns(&ts);

    while(c < MAX) 
    {
        pthread_mutex_lock(&mutex);
        c = c + 1;
        pthread_mutex_unlock(&mutex);
    }

    const long long unsigned delta = time_ns(&ts) - start_ns;

    printf("%f\n", delta/(double)MAX);

    return 0;
}

Note: the following performance tests were run on a Tencent Cloud instance with one core of an Intel(R) Xeon(R) CPU E5-26xx v4 at 2399.996 MHz.

Over 1 billion iterations, the average cost works out to about 2.2ns per lock/unlock pair (after deducting the 2.7ns cost of the loop itself).
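
One way to obtain the loop cost that is deducted is to time the same loop with and without the lock calls. The sketch below assumes that reading of the measurement; the helper names now_ns, ns_per_iteration and lock_cost_ns are hypothetical, and CLOCK_MONOTONIC is used here instead of the CLOCK_REALTIME of the program above so that wall-clock adjustments cannot disturb the result.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
static double now_ns(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Average nanoseconds per iteration, with or without lock/unlock
   around the increment. */
static double ns_per_iteration(long iterations, int use_lock) {
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    volatile long c = 0;   /* volatile keeps the compiler from deleting the loop */
    double start = now_ns();
    while (c < iterations) {
        if (use_lock) pthread_mutex_lock(&mutex);
        c = c + 1;
        if (use_lock) pthread_mutex_unlock(&mutex);
    }
    return (now_ns() - start) / (double)iterations;
}

/* Estimated cost of one uncontended lock/unlock pair:
   locked loop minus the bare loop. */
static double lock_cost_ns(long iterations) {
    double with_lock = ns_per_iteration(iterations, 1);
    double baseline  = ns_per_iteration(iterations, 0);
    return with_lock - baseline;
}
```

For example, lock_cost_ns(100000000) gives an estimate comparable to the figure quoted above. The absolute numbers depend on the machine, and on a noisy system the difference can fluctuate from run to run.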

When locks are contended, the overhead is much larger.

First, pthread_mutex_lock actually calls sys_futex to enter the kernel and try to lock there. If the lock is still held, the thread goes to sleep, which adds the overhead of context switching and thread scheduling.

Two threads that alternately block and wake each other can be used to measure this overhead:

// Copyright (C) 2010 Benoit Sigoure
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#include <linux/futex.h>

static inline long long unsigned time_ns(struct timespec* const ts) {
  if (clock_gettime(CLOCK_REALTIME, ts)) {
    exit(1);
  }
  return ((long long unsigned) ts->tv_sec) * 1000000000LLU
    + (long long unsigned) ts->tv_nsec;
}

static const int iterations = 500000;

static void* thread(void* restrict ftx) {
  int* futex = (int*) ftx;
  for (int i = 0; i < iterations; i++) {
    sched_yield();
    while (syscall(SYS_futex, futex, FUTEX_WAIT, 0xA, NULL, NULL, 42)) {
      // retry
      sched_yield();
    }
    *futex = 0xB;
    while (!syscall(SYS_futex, futex, FUTEX_WAKE, 1, NULL, NULL, 42)) {
      // retry
      sched_yield();
    }
  }
  return NULL;
}

int main(void) {
  struct timespec ts;
  const int shm_id = shmget(IPC_PRIVATE, sizeof (int), IPC_CREAT | 0666);
  int* futex = shmat(shm_id, NULL, 0);
  pthread_t thd;
  if (pthread_create(&thd, NULL, thread, futex)) {
    return 1;
  }
  *futex = 0xA;

  const long long unsigned start_ns = time_ns(&ts);
  for (int i = 0; i < iterations; i++) {
    *futex = 0xA;
    while (!syscall(SYS_futex, futex, FUTEX_WAKE, 1, NULL, NULL, 42)) {
      // retry
      sched_yield();
    }
    sched_yield();
    while (syscall(SYS_futex, futex, FUTEX_WAIT, 0xB, NULL, NULL, 42)) {
      // retry
      sched_yield();
    }
  }
  const long long unsigned delta = time_ns(&ts) - start_ns;

  const int nswitches = iterations << 2;
  printf("%i thread context switches in %lluns (%.1fns/ctxsw)\n",
         nswitches, delta, (delta / (float) nswitches));
  wait(futex);
  return 0;
}

Compile with gcc -std=gnu99 -pthread context_switch.c.

The result is 2003.4ns/ctxsw, so a contended lock costs about 910 times as much as an uncontended one (2003.4 / 2.2), which is unexpectedly large.

Another C program can be used to measure the overhead of "pure context switching": the threads only call sched_yield to give up the processor and never sleep.

// Copyright (C) 2010 Benoit Sigoure
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program. If not, see <http://www.gnu.org/licenses/>.

#include <sched.h>
#include <pthread.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <errno.h>

static inline long long unsigned time_ns(struct timespec* const ts) {
  if (clock_gettime(CLOCK_REALTIME, ts)) {
    exit(1);
  }
  return ((long long unsigned) ts->tv_sec) * 1000000000LLU
    + (long long unsigned) ts->tv_nsec;
}

static const int iterations = 500000;

static void* thread(void*ctx) {
  (void)ctx;
  for (int i = 0; i < iterations; i++)
      sched_yield();
  return NULL;
}

int main(void) {
  struct sched_param param;
  param.sched_priority = 1;
  if (sched_setscheduler(getpid(), SCHED_FIFO, &param))
    fprintf(stderr, "sched_setscheduler(): %s\n", strerror(errno));

  struct timespec ts;
  pthread_t thd;
  if (pthread_create(&thd, NULL, thread, NULL)) {
    return 1;
  }

  long long unsigned start_ns = time_ns(&ts);
  for (int i = 0; i < iterations; i++)
      sched_yield();
  long long unsigned delta = time_ns(&ts) - start_ns;

  const int nswitches = iterations << 2;
  printf("%i thread context switches in %lluns (%.1fns/ctxsw)\n",
         nswitches, delta, (delta / (float) nswitches));
  return 0;
}

"Pure context switching" consumes about 381.2ns/ctxsw.

In this way, the overhead of lock contention can be roughly divided into three parts: pure context switching, 381.2ns; scheduler overhead (moving threads from sleeping to ready and back), 2003.4 - 381.2 = 1622.2ns; and, on multi-core systems, the overhead of cross-processor scheduling, which is very expensive. In real applications, the cache misses and TLB misses caused by context switching push the cost even higher.

Lock optimization

As the measurements above show, what really costs time is not the number of lock operations but the number of lock conflicts. Reducing conflicts is the key to improving performance. Finer-grained locks reduce conflicts, and granularity here applies to both space and time. For space: a hash table contains a series of buckets; if each bucket has its own lock, accesses whose hash values fall into different buckets never conflict, so the probability of a conflict is much lower than with a single lock for the whole table. For time: the locked region should contain only the code that needs it, which minimizes the time between acquiring and releasing the lock. Most importantly, never perform blocking operations while holding a lock. Read-write locks are another good way to reduce conflicts: read operations do not exclude one another, which greatly reduces conflicts.
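
The per-bucket locking idea can be sketched as follows. This is a minimal illustration under assumptions, not code from any particular library: the names striped_table, table_put, table_get and table_demo, the bucket count of 64 and the modulo hash are all hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NBUCKETS 64   /* hypothetical table size */

struct entry { int key; int value; struct entry *next; };

struct bucket {
    pthread_mutex_t lock;   /* protects only this bucket's chain */
    struct entry *head;
};

struct striped_table { struct bucket buckets[NBUCKETS]; };

static void table_init(struct striped_table *t) {
    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_init(&t->buckets[i].lock, NULL);
        t->buckets[i].head = NULL;
    }
}

static struct bucket *bucket_for(struct striped_table *t, int key) {
    return &t->buckets[(unsigned)key % NBUCKETS];
}

/* Insert or overwrite. Returns 0 on success, -1 if allocation fails. */
static int table_put(struct striped_table *t, int key, int value) {
    struct bucket *b = bucket_for(t, key);
    /* Allocate before locking so the critical section stays short. */
    struct entry *fresh = malloc(sizeof *fresh);
    if (fresh == NULL)
        return -1;
    fresh->key = key;
    fresh->value = value;

    pthread_mutex_lock(&b->lock);
    for (struct entry *e = b->head; e != NULL; e = e->next) {
        if (e->key == key) {            /* key exists: overwrite in place */
            e->value = value;
            pthread_mutex_unlock(&b->lock);
            free(fresh);
            return 0;
        }
    }
    fresh->next = b->head;
    b->head = fresh;
    pthread_mutex_unlock(&b->lock);
    return 0;
}

/* Returns 1 and stores the value in *out if the key is found, else 0. */
static int table_get(struct striped_table *t, int key, int *out) {
    struct bucket *b = bucket_for(t, key);
    int found = 0;
    pthread_mutex_lock(&b->lock);
    for (struct entry *e = b->head; e != NULL; e = e->next) {
        if (e->key == key) { *out = e->value; found = 1; break; }
    }
    pthread_mutex_unlock(&b->lock);
    return found;
}

/* Single-threaded smoke test; entries are not freed in this short demo. */
static int table_demo(void) {
    static struct striped_table t;
    int v = 0;
    table_init(&t);
    table_put(&t, 1, 10);
    table_put(&t, 1 + NBUCKETS, 20);   /* lands in the same bucket as key 1 */
    table_put(&t, 1, 11);              /* overwrites the value for key 1 */
    int found = table_get(&t, 1 + NBUCKETS, &v);
    assert(found && v == 20);
    (void)found;
    return table_get(&t, 1, &v) ? v : -1;
}
```

Two threads calling table_put with keys that map to different buckets take different mutexes and never wait for each other. The allocation is done before the lock is taken, which is a small example of reducing time granularity as well.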

Suppose insertions and deletions on a singly linked list are rare and the main operation is search. A single mutex performs poorly here, and a read-write lock, pthread_rwlock_t, is a better fit because it allows multiple threads to search the list simultaneously, while insert and delete operations still lock the entire list. Now suppose insertions and searches are about equally frequent but deletions are rare. Locking the entire list for every insertion is then inappropriate; it is better to allow concurrent insertions at disjoint points in the list, again building on the read-write lock. Locking is performed at two levels: the list has a read-write lock, and each node contains a mutex. To insert, the writing thread takes the read lock on the list, locks the mutex of the node after which the new data will be added, inserts, unlocks that node, and finally releases the read-write lock. Delete takes the write lock on the list and does not need to acquire any node locks. Because an insertion holds the mutex of only the one node it operates on, the number of lock conflicts is greatly reduced.
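
The first, simpler scheme (search under a read lock, insert and delete under a write lock) might look like the sketch below. The names rw_list, rw_list_contains, rw_list_insert, rw_list_remove and rw_list_demo are hypothetical, and the two-level variant with a mutex per node is deliberately not shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

struct rw_list {
    pthread_rwlock_t lock;   /* one read-write lock for the whole list */
    struct node *head;
};

static void rw_list_init(struct rw_list *l) {
    assert(l != NULL);
    pthread_rwlock_init(&l->lock, NULL);
    l->head = NULL;
}

/* Searches share the read lock, so many threads can run this at once. */
static int rw_list_contains(struct rw_list *l, int value) {
    int found = 0;
    pthread_rwlock_rdlock(&l->lock);
    for (struct node *n = l->head; n != NULL; n = n->next) {
        if (n->value == value) { found = 1; break; }
    }
    pthread_rwlock_unlock(&l->lock);
    return found;
}

/* Insert at the head under the write lock. Returns 0, or -1 on
   allocation failure. */
static int rw_list_insert(struct rw_list *l, int value) {
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->value = value;
    pthread_rwlock_wrlock(&l->lock);
    n->next = l->head;
    l->head = n;
    pthread_rwlock_unlock(&l->lock);
    return 0;
}

/* Remove the first node holding value. Returns 1 if a node was removed. */
static int rw_list_remove(struct rw_list *l, int value) {
    pthread_rwlock_wrlock(&l->lock);
    struct node **pp = &l->head;
    while (*pp != NULL && (*pp)->value != value)
        pp = &(*pp)->next;
    struct node *victim = *pp;
    if (victim != NULL)
        *pp = victim->next;          /* unlink while holding the write lock */
    pthread_rwlock_unlock(&l->lock);
    int removed = (victim != NULL);
    free(victim);                    /* free outside the critical section */
    return removed;
}

/* Encodes four results as decimal digits:
   found, removed, found after removal, removed again. */
static int rw_list_demo(void) {
    struct rw_list l;
    rw_list_init(&l);
    for (int v = 1; v <= 3; v++)
        rw_list_insert(&l, v);
    int before = rw_list_contains(&l, 2);
    int removed = rw_list_remove(&l, 2);
    int after = rw_list_contains(&l, 2);
    int removed_again = rw_list_remove(&l, 2);
    return before * 1000 + removed * 100 + after * 10 + removed_again;
}
```

Any number of threads can be inside rw_list_contains at the same time; only rw_list_insert and rw_list_remove exclude everyone else.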

The purpose of the sys_futex system call is to put the blocked thread to sleep so that the processor can be used by other threads. Because this is expensive, if the lock is held for less time than the call itself costs, there is no point in entering the kernel to block: the processor time that is freed up does not even cover the cost of freeing it. In the time one sys_futex call takes, CAS can be run many times. So for a system with frequent lock conflicts and short average hold times, a worthwhile optimization is to first try to obtain the lock with CAS several times (spinning, as a spinlock does), and to enter the kernel to block only after several failures. Of course, this optimization only works on a multiprocessor system (another processor must be running the thread that will unlock, otherwise spinning is pointless). In glibc's pthread implementation, this mechanism can be enabled by setting the mutex type to PTHREAD_MUTEX_ADAPTIVE_NP.
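
As a sketch of how this could be set up with glibc, the code below creates a mutex of type PTHREAD_MUTEX_ADAPTIVE_NP and also shows a hand-written "try a few times, then block" helper built on pthread_mutex_trylock. The function names adaptive_mutex_init, spin_then_lock and adaptive_mutex_demo and the spin count of 100 are hypothetical choices for illustration.

```c
#define _GNU_SOURCE   /* PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Initialize *mutex as an adaptive mutex.
   Returns 0 on success or a pthread error number. */
static int adaptive_mutex_init(pthread_mutex_t *mutex) {
    pthread_mutexattr_t attr;
    assert(mutex != NULL);
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    if (rc == 0)
        rc = pthread_mutex_init(mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Manual variant of the same idea: try the lock a bounded number of
   times in user space, then fall back to a blocking lock. */
static int spin_then_lock(pthread_mutex_t *mutex, int max_spins) {
    for (int i = 0; i < max_spins; i++) {
        if (pthread_mutex_trylock(mutex) == 0)
            return 0;                    /* acquired without blocking */
    }
    return pthread_mutex_lock(mutex);    /* may sleep in the kernel */
}

/* Returns 0 if every step succeeded. */
static int adaptive_mutex_demo(void) {
    pthread_mutex_t m;
    int rc = adaptive_mutex_init(&m);
    if (rc != 0)
        return rc;
    rc = spin_then_lock(&m, 100);
    if (rc == 0)
        rc = pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}
```

With the adaptive type, the spinning happens inside pthread_mutex_lock itself, so spin_then_lock is only needed where the attribute is unavailable. How many spins pay off depends on the workload and has to be measured.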

CAS

Some problems caused by locks:

  • Waiting for a mutex consumes precious time and is expensive.
  • A low-priority thread can acquire a mutex and thereby block a high-priority thread that needs the same mutex. This problem is called priority inversion.
  • A thread holding a mutex may be descheduled because its time slice ends. This hurts the other threads waiting for the same mutex because their wait is now longer. This problem is called lock convoying.

One benefit of lock-free programming is that a suspended thread does not hold up the execution of other threads, which avoids lock convoying. In systems with frequent lock conflicts and short average hold times, it also avoids the overhead of context switching and scheduling.

CAS (compare and swap, also called check and set) is, as described on Wikipedia, an atomic instruction used for synchronizing data between threads.

The core CAS algorithm involves three parameters: the memory location, the expected value, and the new value. The CAS instruction first checks whether the memory location contains the expected value; if so, it copies the new value to that location and returns true; if not, it returns false.
On x86, CAS corresponds to the CMPXCHG instruction (used with the LOCK prefix on multiprocessor systems) and is therefore atomic.

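// Conceptual semantics only: the hardware executes the whole body as one atomic step.
bool compare_and_swap (int *accum, int *dest, int newval)
{
  if ( *accum == *dest ) {   // *accum holds the expected value
      *dest = newval;
      return true;
  }
  return false;
}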
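
In portable C11 the same semantics are available through <stdatomic.h>. The sketch below wraps atomic_compare_exchange_strong in a hypothetical helper, cas_int, to show both the success and the failure case; cas_demo is likewise a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Atomically: if *dest == expected, store newval and return true;
   otherwise leave *dest unchanged and return false. */
static bool cas_int(atomic_int *dest, int expected, int newval) {
    /* On failure the C11 call writes the value it observed into
       `expected`; this wrapper takes it by value and discards that. */
    return atomic_compare_exchange_strong(dest, &expected, newval);
}

/* Returns the final value of the variable after one successful
   and one failed CAS. */
static int cas_demo(void) {
    atomic_int x;
    atomic_init(&x, 5);

    bool first = cas_int(&x, 5, 7);    /* x is 5 as expected: x becomes 7 */
    bool second = cas_int(&x, 5, 9);   /* x is 7, not 5: x stays 7 */
    assert(first && !second);
    (void)first;
    (void)second;

    return atomic_load(&x);
}
```

Here cas_demo() returns 7: the first CAS succeeds and the second fails without modifying the variable.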

Typically a program uses CAS in a loop to complete a transaction-like operation: copy the shared variable to a local variable, use the local copy to compute a new value, then use CAS, comparing the local copy (the old value) against the current memory value, to try to commit the change. If the attempt fails, read the memory value again, recompute, and retry the CAS, looping until it succeeds. For example:

void LockFreeQueue::push(Node* newHead)
{
    for (;;)
    {
        // Copy the shared variable (m_Head) to a local variable
        Node* oldHead = m_Head;

        // Do the work on the new node without regard to other threads
        newHead->next = oldHead;

        // Next, try to commit the change to the shared variable.
        // If m_Head still equals oldHead (no other thread has modified it), CAS stores newHead in m_Head and we return.
        // Otherwise, loop and retry.
        // _InterlockedCompareExchange returns the initial value of m_Head, so compare it with oldHead.
        if (_InterlockedCompareExchange(&m_Head, newHead, oldHead) == oldHead)
            return;
    }
}

The data structure above has a shared head pointer, m_Head. Pushing a new node places it in front of the current head. Do not assume that the program runs without interruption: the CPU executes multiple threads concurrently. Just before _InterlockedCompareExchange (the CAS), the thread may be descheduled because its time slice is exhausted. A newly scheduled thread may then perform its own push, and since all threads share m_Head, m_Head has been modified by the time the original thread resumes. If the original thread simply overwrote m_Head with newHead, the nodes pushed by the other threads would be lost. So it must compare m_Head with oldHead: if they are equal, newHead can replace m_Head; if not, it must reload oldHead from the latest m_Head and go around the loop again. _InterlockedCompareExchange performs the comparison and the assignment of newHead to m_Head atomically.
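
For comparison, here is the same push loop as a sketch in portable C11 atomics rather than the Windows intrinsic. The names stack_head, push, pusher and push_demo are hypothetical, and the thread and node counts are arbitrary. atomic_compare_exchange_weak refreshes the local copy of the head on failure, so the loop does not need to reload it explicitly.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

static _Atomic(struct node *) stack_head;   /* static storage: starts as NULL */

static void push(struct node *fresh) {
    struct node *old = atomic_load(&stack_head);
    do {
        fresh->next = old;
        /* On failure, `old` is overwritten with the current head,
           and the loop relinks `fresh` and tries again. */
    } while (!atomic_compare_exchange_weak(&stack_head, &old, fresh));
}

#define THREADS 4
#define PER_THREAD 10000

static void *pusher(void *arg) {
    struct node *nodes = arg;          /* this thread's private slice of nodes */
    for (int i = 0; i < PER_THREAD; i++)
        push(&nodes[i]);
    return NULL;
}

/* Several threads push concurrently; returns how many nodes ended up
   reachable from the head. */
static int push_demo(void) {
    pthread_t t[THREADS];
    struct node *pool = calloc((size_t)THREADS * PER_THREAD, sizeof *pool);
    assert(pool != NULL);
    atomic_store(&stack_head, NULL);

    for (int i = 0; i < THREADS; i++) {
        int rc = pthread_create(&t[i], NULL, pusher, pool + (size_t)i * PER_THREAD);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);

    int count = 0;
    for (struct node *n = atomic_load(&stack_head); n != NULL; n = n->next)
        count++;

    atomic_store(&stack_head, NULL);
    free(pool);
    return count;
}
```

If a push were ever lost to a race, the count returned by push_demo() would fall short of THREADS * PER_THREAD, which is 40000 here.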

ABA problem

When committing a change, CAS checks whether the memory value still equals the expected value and updates it only if it does. But if the value changed from A to B and then back to A, the CAS check sees no change even though a series of changes has in fact taken place.
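
The effect can be reproduced deterministically in a single thread by writing out the interleaving by hand. In this sketch the comments mark which hypothetical thread (P1 or P2) each step belongs to, and aba_value_demo is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Returns true if the stale CAS succeeds, which is the ABA symptom. */
static bool aba_value_demo(void) {
    atomic_int shared;
    atomic_init(&shared, 'A');

    int seen = atomic_load(&shared);   /* P1 reads A, then is suspended */

    atomic_store(&shared, 'B');        /* P2 changes A to B ... */
    atomic_store(&shared, 'A');        /* ... and back to A */

    /* P1 resumes. CAS compares only the current value with `seen`,
       so it cannot tell that two writes happened in between. */
    bool succeeded = atomic_compare_exchange_strong(&shared, &seen, 'C');
    assert(atomic_load(&shared) == (succeeded ? 'C' : 'A'));
    return succeeded;
}
```

The function returns true: from the point of view of CAS, nothing happened between P1's read and its commit.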

Memory reuse can cause serious problems for CAS:

T* ptr1 = new T(8, 18);
T* old = ptr1; 
delete ptr1;
T* ptr2 = new T(0, 1);
 
// We can't guarantee that the allocator won't reuse ptr1's memory address; most memory managers do.
if (old == ptr2) {
    // The memory just released through ptr1 has been handed out again for the later ptr2 allocation.
}

The ABA problem is common in implementations of lock-free structures. It can be summarized as:

  • Process P1 reads a value A.
  • P1 is suspended (time slice exhausted, interrupt, etc.), and process P2 begins to execute.
  • P2 changes the value A to B, and then changes it back to A.
  • P1 is woken up, compares, finds the value A unchanged, and continues.

To P1 the value A has not changed, but in fact it has, and continuing may cause problems. With CAS the problem is more serious because the values are usually pointers. Imagine the following:

There is a stack (last in, first out) with a top pointer and a node A. Node A is currently at the top of the stack, so top points to A. Now process P1 wants to pop a node, using the following lock-free operation:

pop()
{
  do{
    ptr = top; // ptr = top = NodeA
    next_ptr = ptr->next; // next_ptr = NodeX
  } while(CAS(top, ptr, next_ptr) != true);
  return ptr;   
}

Before P1 executes its CAS, process P2 interrupts it and performs a series of pop and push operations on the stack:

Process P2 first pops NodeA and then pushes two nodes, NodeB and NodeC. Because memory managers commonly reuse freed memory, NodeC ends up at the same address that NodeA had before, so the stack is now NodeC (at NodeA's old address), then NodeB, then NodeX.

Now P1 resumes. When it performs its CAS, top still holds NodeA's address (which is actually NodeC now), so the CAS succeeds and top is set to NodeX.

After the CAS, top erroneously points to NodeX instead of NodeB, and NodeB is lost.
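
The scenario can be replayed step by step in one thread. The sketch below uses C11 atomics and simulates memory reuse by pushing the popped node back under a new name; the struct layout and the names top, push, pop and aba_stack_demo are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node { char name; struct node *next; };

static _Atomic(struct node *) top;

static void push(struct node *n) {
    struct node *old = atomic_load(&top);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&top, &old, n));
}

static struct node *pop(void) {
    struct node *old = atomic_load(&top);
    /* Retry until the CAS succeeds or the stack is empty. */
    while (old != NULL &&
           !atomic_compare_exchange_weak(&top, &old, old->next)) {
    }
    return old;
}

/* Replays the interleaving and returns how many nodes are still
   reachable from top afterwards. */
static int aba_stack_demo(void) {
    struct node x = {'X', NULL}, a = {'A', NULL}, b = {'B', NULL};
    atomic_store(&top, NULL);
    push(&x);
    push(&a);                               /* top -> A -> X */

    /* P1 begins pop() and is suspended just before its CAS. */
    struct node *ptr = atomic_load(&top);   /* A */
    struct node *next_ptr = ptr->next;      /* X */

    /* P2 runs: pop A, push B, then push C, which reuses A's memory. */
    struct node *reused = pop();
    push(&b);
    reused->name = 'C';
    push(reused);                           /* top -> C -> B -> X */

    /* P1 resumes: top holds A's old address, so the CAS succeeds. */
    int swapped = atomic_compare_exchange_strong(&top, &ptr, next_ptr);
    assert(swapped);
    (void)swapped;

    int reachable = 0;
    for (struct node *n = atomic_load(&top); n != NULL; n = n->next)
        reachable++;

    atomic_store(&top, NULL);               /* do not leave dangling pointers */
    return reachable;
}
```

A correct pop at that point would have left two nodes (B and X) on the stack. The demo returns 1, because top now points straight at X and B has been cut out of the list.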

ABA Solution

A tagged state reference adds extra tag bits that act like a version number. For example, one approach uses the low-order bits of the pointer to record how many times the pointer has been modified; once the pointer has been modified, the next CAS with a stale value fails even if memory reuse has brought back the same address. This mechanism is sometimes called ABA', because the second A is made slightly different from the first. With a 60-bit tag on current CPUs, wraparound is not a concern as long as the program runs for less than 10 years without restarting. x64 CPUs generally support a 128-bit CAS instruction, which makes avoiding the ABA problem more reliable.
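
The idea can be sketched without 128-bit instructions by packing a 32-bit node index and a 32-bit modification counter into one 64-bit atomic word. This layout is an assumption chosen to keep the example portable; it is not the low-order-bits scheme or the 128-bit scheme mentioned above, and the names pack, index_of, counter_of, tagged_cas and tagged_demo are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Word layout: high 32 bits = modification counter (the tag),
   low 32 bits = index of the top node. */
static uint64_t pack(uint32_t counter, uint32_t index) {
    return ((uint64_t)counter << 32) | index;
}
static uint32_t index_of(uint64_t word)   { return (uint32_t)word; }
static uint32_t counter_of(uint64_t word) { return (uint32_t)(word >> 32); }

/* Replace the index only if index AND counter still match `seen`;
   every successful update bumps the counter. */
static bool tagged_cas(_Atomic uint64_t *top, uint64_t seen, uint32_t new_index) {
    uint64_t desired = pack(counter_of(seen) + 1, new_index);
    return atomic_compare_exchange_strong(top, &seen, desired);
}

/* Returns true if the stale update is rejected as intended. */
static bool tagged_demo(void) {
    enum { NODE_A = 1, NODE_B = 2, NODE_X = 3 };
    _Atomic uint64_t top;
    atomic_init(&top, pack(0, NODE_A));

    uint64_t seen = atomic_load(&top);   /* P1 reads (A, counter 0) */

    /* P2 changes A to B and back to A; the counter goes 0, 1, 2. */
    bool ok = tagged_cas(&top, atomic_load(&top), NODE_B);
    assert(ok);
    ok = tagged_cas(&top, atomic_load(&top), NODE_A);
    assert(ok);
    (void)ok;

    /* P1 resumes with its stale (A, 0): the index matches,
       the counter does not, so the CAS fails. */
    bool stale_succeeded = tagged_cas(&top, seen, NODE_X);

    uint64_t now = atomic_load(&top);
    return !stale_succeeded && index_of(now) == NODE_A && counter_of(now) == 2;
}
```

The stale CAS is rejected even though the index is A again, because the counter has moved from 0 to 2. A 32-bit counter wraps far sooner than the 60-bit or 64-bit tags discussed in the text, so this layout only illustrates the mechanism.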

The liblfds library code below illustrates how a tagged state reference is implemented.

One way to avoid the ABA problem is to use a wider pointer, which requires a CAS instruction that operates on a double word. How does liblfds implement a 128-bit CAS across platforms?

In liblfds, the CAS operation is the LFDS710_PAL_ATOMIC_DWCAS macro. Its full form is:

LFDS710_PAL_ATOMIC_DWCAS( pointer_to_destination, pointer_to_compare, pointer_to_new_destination, cas_strength, result)
  • pointer_to_destination: [in, out], a pointer to the destination, which is an array of two 64-bit integers;
  • pointer_to_compare: [in, out], a pointer to the value compared with the destination, also an array of two 64-bit integers;
  • pointer_to_new_destination: [in], a pointer to the new value to be stored in the destination;
  • result: [out], if all 128 bits at pointer_to_compare equal those at pointer_to_destination, the value at pointer_to_new_destination overwrites the destination and result is 1; if not, the destination remains unchanged, and the value at pointer_to_compare is overwritten with the current value of the destination.

As shown above, liblfds represents a 128-bit pointer as a one-dimensional array of two elements.

On x86-64 Linux the 128-bit CAS is implemented with the cmpxchg16b instruction, while on Windows the _InterlockedCompareExchange128 intrinsic is used. The two values are considered equal only when all 128 bits are equal.

See the Windows implementation of the CAS in liblfds/liblfds7.1.0/liblfds710/inc/liblfds710/lfds710_porting_abstraction_layer_compiler.h:

#define LFDS710_PAL_ATOMIC_DWCAS( pointer_to_destination, pointer_to_compare, pointer_to_new_destination, cas_strength, result ) \
{ \
    LFDS710_PAL_BARRIER_COMPILER_FULL; \
    (result) = (char unsigned) _InterlockedCompareExchange128( (__int64 volatile *) (pointer_to_destination), (__int64) (pointer_to_new_destination[1]), (__int64) (pointer_to_new_destination[0]), (__int64 *) (pointer_to_compare) ); \
    LFDS710_PAL_BARRIER_COMPILER_FULL; \
}

Next, look at how new_top is defined and how the modification is committed.

new_top is a one-dimensional array of two struct lfds710_stack_element pointers; the two elements are indexed by POINTER (0) and COUNTER (1). COUNTER plays the role of the tag mentioned earlier, and POINTER holds the real pointer to the node. On x64 a pointer is 64 bits long, so a 64-bit tag is used here to record the pointer's modifications.

liblfds initializes new_top's COUNTER to the original top's COUNTER + 1; that is, COUNTER counts the number of times ss->top has changed, so every change of top also changes the COUNTER in top.

Only when both the POINTER and the COUNTER of ss->top and original_top are exactly equal does new_top overwrite ss->top. Otherwise original_top is overwritten with ss->top, and the next iteration of the loop works with and compares against the updated original_top.

See liblfds/liblfds7.1.0/liblfds710/src/lfds710_stack/lfds710_stack_push.c for the push operation of the lock-free stack:

void lfds710_stack_push( struct lfds710_stack_state *ss,
                         struct lfds710_stack_element *se )
{
  char unsigned
    result;

  lfds710_pal_uint_t
    backoff_iteration = LFDS710_BACKOFF_INITIAL_VALUE;

  struct lfds710_stack_element LFDS710_PAL_ALIGN(LFDS710_PAL_ALIGN_DOUBLE_POINTER)
    *new_top[PAC_SIZE],
    *volatile original_top[PAC_SIZE];

  LFDS710_PAL_ASSERT( ss != NULL );
  LFDS710_PAL_ASSERT( se != NULL );

  new_top[POINTER] = se;

  original_top[COUNTER] = ss->top[COUNTER];
  original_top[POINTER] = ss->top[POINTER];

  do
  {
    se->next = original_top[POINTER];
    LFDS710_MISC_BARRIER_STORE;

    new_top[COUNTER] = original_top[COUNTER] + 1;
    LFDS710_PAL_ATOMIC_DWCAS( ss->top, original_top, new_top, LFDS710_MISC_CAS_STRENGTH_WEAK, result );

    if( result == 0 )
      LFDS710_BACKOFF_EXPONENTIAL_BACKOFF( ss->push_backoff, backoff_iteration );
  }
  while( result == 0 );

  LFDS710_BACKOFF_AUTOTUNE( ss->push_backoff, backoff_iteration );

  return;
}

Application of CAS Principle

  1. Lock-free data structures, see https://github.com/liblfds/li...
  2. CAS in the high-performance in-memory queue Disruptor, see http://ifeve.com/disruptor/
  3. Optimistic locking in databases

References

[wiki Compare-and-swap] https://en.wikipedia.org/wiki...
[wiki ABA problem] https://en.wikipedia.org/wiki...
[CoolShell: implementation of lock-free queues] https://coolshell.cn/articles...
[IBM: designing concurrent data structures without mutexes] https://www.ibm.com/developer...
[ABA problem] https://lumian2015.github.io/...
[_InterlockedCompareExchange128] https://docs.microsoft.com/en...
[Implementation Principle of Linux Mutex Lock (pthread_mutex_t)] https://www.bbsmax.com/A/x9J2...
[Introduction to futex mechanism] https://blog.csdn.net/y339889...
[an-introduction-to-lock-free-programming] https://preshing.com/20120612...
[Performance issues of multi-process, multi-threading and multi-processor computing platforms] https://blog.csdn.net/Jmilk/a...
[Implement Lock-Free Queue] http://citeseerx.ist.psu.edu/...
[Context switching and thread scheduling performance test] https://github.com/tsuna/cont...
[pure context switching performance test] https://github.com/tsuna/cont...
[Lock overhead] http://xbay.github.io/2015/12...
[mutex implementation analysis of pthread package] https://blog.csdn.net/tlxamul...
[IBM Common threads: POSIX threads explained] https://www.ibm.com/developer...

Posted by roby2411 on Tue, 23 Apr 2019 12:51:34 -0700