This series contains my reading notes for *The Art of Multiprocessor Programming*. The notes follow the original book, with implementations based on OpenJDK 11 or later, plus background I gathered while researching the topics, shared for readers who want a deeper understanding.
Spin Locks and Contention
3. Queue locks
Besides its limited generality, the backoff-based lock implemented earlier has two further problems:
- CPU cache-coherence traffic: although backoff generates less traffic than TASLock, all threads still spin on the same lock state, so coherence traffic is still consumed.
- Delayed entry into the critical section: if every thread's backoff delay grows too large, all threads may be sleeping even though the lock has actually been released.
Putting threads into a queue solves both problems:
- In a queue, each thread decides whether the lock has been released by checking whether its predecessor has finished, instead of polling the shared lock state. Each thread spins on a different memory location, which reduces the cache-coherence traffic generated when the releasing thread modifies the state.
- No sleeping is needed: the predecessor tells the next thread that the lock has been released so it can acquire it immediately, improving the latency of entering the critical section.
Finally, the queue provides FIFO fairness.
3.1. Array-based locks
We implement the queue with an array. The design is as follows:
State required:
- A boolean array: if a slot is true, the thread occupying that slot holds the lock; if false, it does not.
- An atomic integer holding the most recently assigned slot. Each lock acquisition increments it by 1 and takes the remainder modulo the array size; the result is the slot the thread occupies in the boolean array, and the value at that slot indicates whether the thread holds the lock. The array capacity therefore bounds how many threads can contend for the lock at the same time.
- A ThreadLocal recording the slot of the boolean array occupied by the current thread.
Locking process:
- Atomically increment the counter and take the remainder modulo the array size to obtain current
- Record current in the ThreadLocal
- Spin-wait while the value at position current of the boolean array is false
Unlocking process:
- Read the slot mine occupied by the current thread from the ThreadLocal
- Set position mine of the boolean array to false
- Set position (mine + 1) modulo the array size (the remainder prevents going out of bounds) to true
Its source code is:
```java
public class ArrayLock implements Lock {

    private final ThreadLocal<Integer> mySlotIndex = ThreadLocal.withInitial(() -> 0);
    private final AtomicInteger tail = new AtomicInteger(0);
    private final boolean[] flags;
    private final int capacity;

    public ArrayLock(int capacity) {
        this.capacity = capacity;
        this.flags = new boolean[capacity];
        // Slot 0 must start as true, otherwise the first thread spins forever
        this.flags[0] = true;
    }

    @Override
    public void lock() {
        int current = this.tail.getAndIncrement() % capacity;
        this.mySlotIndex.set(current);
        while (!this.flags[current]) {
        }
    }

    @Override
    public void unlock() {
        int mine = this.mySlotIndex.get();
        this.flags[mine] = false;
        this.flags[(mine + 1) % capacity] = true;
    }

    // lockInterruptibly, tryLock and newCondition omitted for brevity
}
```
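To see the slot rotation in action, here is a minimal, self-contained sketch (class and field names are mine, not from the book). It uses AtomicIntegerArray so that the spinning threads reliably observe flag updates, and drives the lock from several threads incrementing a shared counter:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

public class ArrayLockDemo {

    // Simplified array lock: AtomicIntegerArray gives volatile semantics per slot
    static class SimpleArrayLock {
        final AtomicIntegerArray flags;
        final AtomicInteger tail = new AtomicInteger(0);
        final ThreadLocal<Integer> mySlot = new ThreadLocal<>();
        final int capacity;

        SimpleArrayLock(int capacity) {
            this.capacity = capacity;
            this.flags = new AtomicIntegerArray(capacity);
            this.flags.set(0, 1); // slot 0 may enter first
        }

        void lock() {
            int slot = tail.getAndIncrement() % capacity;
            mySlot.set(slot);
            while (flags.get(slot) == 0) {
                Thread.onSpinWait();
            }
        }

        void unlock() {
            int mine = mySlot.get();
            flags.set(mine, 0);                  // give up my slot
            flags.set((mine + 1) % capacity, 1); // hand the lock to my successor
        }
    }

    static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        // Capacity bounds the number of simultaneously contending threads
        SimpleArrayLock lock = new SimpleArrayLock(8);
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) {
                    lock.lock();
                    try {
                        counter++; // protected by the lock
                    } finally {
                        lock.unlock();
                    }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println(counter); // prints 40000
    }
}
```

Note that this sketch avoids the false-sharing and remainder costs being discussed only for clarity; the optimized version below addresses them.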
Several optimizations can be applied to this implementation:
- Spin-waiting should not busy-spin at full strength; Thread.onSpinWait() consumes less CPU and maps to architecture-specific spin-wait instructions.
- Each slot of the boolean array needs cache-line padding to prevent false sharing, which would otherwise trigger excessive cache-line invalidation traffic.
- Updates to the boolean array must be volatile. A plain write can become visible late, so the waiting thread observes the release later and spins more times than necessary.
- The remainder operation is comparatively expensive and should be replaced with a bitwise AND: taking the remainder modulo a power of two is equivalent to ANDing with that power of two minus one. We therefore round the requested capacity up to the nearest power of two greater than it.
- The array read this.flags[current] should be hoisted out of the loop to avoid re-reading the array element on every iteration.
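The power-of-two rounding used in the optimized constructor below works by "smearing" the highest set bit into all lower bits and then adding 1. A standalone sketch of the trick (the method name is mine):

```java
public class NextPowerOfTwo {

    // Round n up to the smallest power of two strictly greater than n:
    // the shifts copy the highest set bit into every lower bit position,
    // so n becomes a run of ones, and adding 1 yields a power of two.
    static int nextPowerOfTwo(int n) {
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16; // all bits below the highest set bit are now 1
        return n + 1;
    }

    public static void main(String[] args) {
        System.out.println(nextPowerOfTwo(10)); // 16
        System.out.println(nextPowerOfTwo(16)); // 32 (strictly greater, as in the constructor)
        // With a power-of-two capacity, remainder becomes a bitwise AND:
        System.out.println(37 % 16 == (37 & 15)); // true
    }
}
```

Because the result is strictly greater than the input, passing a capacity that is already a power of two doubles it; this matches the constructor's behavior below.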
The optimized source code is:
```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import jdk.internal.vm.annotation.Contended;

public class ArrayLock implements Lock {

    private final ThreadLocal<Integer> mySlotIndex = ThreadLocal.withInitial(() -> 0);
    private final AtomicInteger tail = new AtomicInteger(0);
    private final ContendedBoolean[] flags;
    private final int capacity;

    private static class ContendedBoolean {
        // Cache-line padding via annotation (requires -XX:-RestrictContended
        // and access to the jdk.internal.vm.annotation package)
        @Contended
        private boolean flag;
    }

    // Volatile access to flag through a VarHandle
    private static final VarHandle FLAG;

    static {
        try {
            // Initialize the handle
            FLAG = MethodHandles.lookup().findVarHandle(ContendedBoolean.class, "flag", boolean.class);
        } catch (Exception e) {
            throw new Error(e);
        }
    }

    public ArrayLock(int capacity) {
        // Round up to the smallest power of two greater than capacity
        capacity |= capacity >>> 1;
        capacity |= capacity >>> 2;
        capacity |= capacity >>> 4;
        capacity |= capacity >>> 8;
        capacity |= capacity >>> 16;
        capacity += 1;
        this.flags = new ContendedBoolean[capacity];
        for (int i = 0; i < this.flags.length; i++) {
            this.flags[i] = new ContendedBoolean();
        }
        this.capacity = capacity;
        // Slot 0 starts as true so the first thread can acquire the lock
        this.flags[0].flag = true;
    }

    @Override
    public void lock() {
        int current = this.tail.getAndIncrement() & (capacity - 1);
        this.mySlotIndex.set(current);
        // Read the slot object out of the array once, then spin on its field;
        // the read must be volatile so the JIT cannot hoist it out of the loop
        ContendedBoolean contendedBoolean = this.flags[current];
        while (!(boolean) FLAG.getVolatile(contendedBoolean)) {
            Thread.onSpinWait();
        }
    }

    @Override
    public void unlock() {
        int mine = this.mySlotIndex.get();
        FLAG.setVolatile(this.flags[mine], false);
        FLAG.setVolatile(this.flags[(mine + 1) & (capacity - 1)], true);
    }

    // lockInterruptibly, tryLock and newCondition omitted for brevity
}
```
However, even with these optimizations, this lock still performs poorly under heavy contention with a large number of lock calls. We will analyze and address that in a later article.