Netty high-performance components: FastThreadLocal source code analysis

Keywords: Java Netty JDK Programming

1. Preface

Netty provides FastThreadLocal as a replacement for the ThreadLocal that ships with the JDK. Used together with Netty's FastThreadLocalThread, it makes reading and updating thread-local variables faster in multithreaded environments.
Below, we compare ThreadLocal with FastThreadLocal and walk through the source code to see why FastThreadLocal and FastThreadLocalThread perform well when they are used together.

2. ThreadLocalMap

ThreadLocalMap is a static nested class defined in ThreadLocal. It stores the values of the ThreadLocal variables used by a thread, keyed by the ThreadLocal object.
In the JDK, each Thread object contains the following two fields:

public class Thread implements Runnable {

    // Some code is omitted here

    // Stores ThreadLocal variables. Each Thread holds its own ThreadLocalMap, which isolates variables between threads.
    ThreadLocal.ThreadLocalMap threadLocals = null;

    ThreadLocal.ThreadLocalMap inheritableThreadLocals = null;
}

In practice, a thread may use several ThreadLocal variables. All of them are stored in that thread's ThreadLocal.ThreadLocalMap threadLocals field (each thread has its own ThreadLocalMap, which avoids contention between threads).
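
The isolation described above can be observed with the JDK's ThreadLocal alone. The following is a minimal sketch; the class name ThreadIsolationSketch and its demo method are made up for illustration and do not appear in the JDK or in Netty.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch: one ThreadLocal, two threads, two independent values.
final class ThreadIsolationSketch {
    private static final ThreadLocal<String> NAME = new ThreadLocal<>();

    /** Returns {value seen by a worker thread, value seen by the calling thread}. */
    static String[] demo() {
        NAME.set("caller");                       // stored in the caller's ThreadLocalMap
        AtomicReference<String> seenByWorker = new AtomicReference<>();
        Thread worker = new Thread(() -> {
            NAME.set("worker");                   // stored in the worker's own ThreadLocalMap
            seenByWorker.set(NAME.get());
        });
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // The worker's write did not touch the caller's value.
        return new String[] {seenByWorker.get(), NAME.get()};
    }

    public static void main(String[] args) {
        String[] seen = demo();
        System.out.println("worker saw " + seen[0] + ", caller saw " + seen[1]);
    }
}
```

Both threads use the same ThreadLocal object as the key, but each one reads and writes its own map.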

 static class ThreadLocalMap {

        // Note that Entry extends WeakReference: the ThreadLocal key is only weakly referenced,
        // so it can be garbage-collected once no strong reference to the ThreadLocal remains.
        static class Entry extends WeakReference<ThreadLocal<?>> {
            /** The value associated with this ThreadLocal. */
            Object value;

            Entry(ThreadLocal<?> k, Object v) {
                super(k);
                value = v;
            }
        }
        
        // Initial capacity of the entry array
        private static final int INITIAL_CAPACITY = 16;
        
        // Array that stores the entries (ThreadLocal key plus value)
        private Entry[] table;
        
        // Initialize the ThreadLocalMap. Entries are stored in an array, and the index is derived from the threadLocalHashCode of the ThreadLocal object.
        // The approach is similar to HashMap. Interested readers can compare it with the HashMap source code.
        ThreadLocalMap(ThreadLocal<?> firstKey, Object firstValue) {
            table = new Entry[INITIAL_CAPACITY];
            int i = firstKey.threadLocalHashCode & (INITIAL_CAPACITY - 1);
            table[i] = new Entry(firstKey, firstValue);
            size = 1;
            setThreshold(INITIAL_CAPACITY);
        }
        
        // Look up the entry for a ThreadLocal. The index is obtained by masking threadLocalHashCode with the table length minus one.
        private Entry getEntry(ThreadLocal<?> key) {
            int i = key.threadLocalHashCode & (table.length - 1);
            Entry e = table[i];
            if (e != null && e.get() == key)
                return e;
            else
                return getEntryAfterMiss(key, i, e);
        }
    }

From the code above, creating a ThreadLocalMap allocates an Entry array.
The initial length of the array is 16. On later expansions the length stays a power of two (2^n), so the index of a ThreadLocal object can be computed with a simple mask.
When a ThreadLocal value is read, the index is the bitwise AND of threadLocalHashCode and the array length minus one (not their sum), which makes the lookup fast.
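
The index calculation can be tried in isolation. The sketch below is not JDK code: the class HashIndexSketch and its methods are hypothetical, and only the constant 0x61c88647 is taken from the JDK, where ThreadLocal uses it as the increment between consecutive threadLocalHashCode values.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: masking a hash code into a power-of-two table.
final class HashIndexSketch {
    // The increment the JDK's ThreadLocal adds for each new threadLocalHashCode.
    static final int HASH_INCREMENT = 0x61c88647;

    /** Index for a hash code in a table whose length is a power of two. */
    static int indexFor(int hash, int tableLength) {
        return hash & (tableLength - 1);
    }

    /** Counts the distinct indexes used by `count` consecutively generated hash codes. */
    static int distinctSlots(int count, int tableLength) {
        Set<Integer> slots = new HashSet<>();
        int hash = 0;
        for (int i = 0; i < count; i++) {
            slots.add(indexFor(hash, tableLength));
            hash += HASH_INCREMENT;
        }
        return slots.size();
    }

    public static void main(String[] args) {
        System.out.println(indexFor(HASH_INCREMENT, 16)); // low 4 bits of 0x47, i.e. 7
        System.out.println(distinctSlots(16, 16));        // 16: no collision yet
        System.out.println(distinctSlots(17, 16));        // still 16: the 17th must collide
    }
}
```

Because the increment is odd, 16 consecutive hash codes land in 16 different slots of a 16-entry table. Collisions only appear once more variables exist than the table has slots, or when variables created at other times happen to share an index.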

Using a hash to determine the array index brings the following costs:

  • hash collisions must be resolved;
  • the array must be rehashed when it is expanded.

ThreadLocal is a general-purpose class provided by the JDK. In most scenarios a thread uses only a few ThreadLocal variables, so hash collisions and rehashing are rare.
Even when they do occur occasionally, they cause no significant performance loss for the application.
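
To make the cost of a collision concrete, here is a simplified open-addressing table that resolves collisions by linear probing, which is the strategy ThreadLocalMap uses. It is only a sketch under simplifying assumptions: LinearProbingSketch is a hypothetical class, keys are plain int hash codes, and there is no resizing, rehashing, or stale-entry cleanup.

```java
// Sketch: a fixed-size table with linear probing.
final class LinearProbingSketch {
    private final int[] hashes;     // hash code stored in each slot
    private final Object[] values;  // value stored in each slot
    private final boolean[] used;   // whether the slot is occupied

    /** Capacity must be a power of two so that masking works. */
    LinearProbingSketch(int capacity) {
        hashes = new int[capacity];
        values = new Object[capacity];
        used = new boolean[capacity];
    }

    /** Stores the value and returns how many extra slots had to be probed. */
    int put(int hash, Object value) {
        int mask = used.length - 1;
        int i = hash & mask;
        int probes = 0;
        while (used[i] && hashes[i] != hash) {
            i = (i + 1) & mask;     // collision: try the next slot, wrapping around
            probes++;
            if (probes > mask) {
                throw new IllegalStateException("table is full");
            }
        }
        used[i] = true;
        hashes[i] = hash;
        values[i] = value;
        return probes;
    }

    /** Returns the value stored for the hash code, or null if there is none. */
    Object get(int hash) {
        int mask = used.length - 1;
        int i = hash & mask;
        for (int probes = 0; probes <= mask && used[i]; probes++) {
            if (hashes[i] == hash) {
                return values[i];
            }
            i = (i + 1) & mask;
        }
        return null;
    }

    public static void main(String[] args) {
        LinearProbingSketch table = new LinearProbingSketch(16);
        System.out.println(table.put(1, "a"));   // 0 extra probes
        System.out.println(table.put(17, "b"));  // 17 & 15 == 1, so 1 extra probe
        System.out.println(table.get(17));
    }
}
```

Every extra probe is an additional comparison on the read path, which is the overhead FastThreadLocal removes by giving each variable a fixed index.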

3. FastThreadLocalThread

Netty reworks ThreadLocal into FastThreadLocal to suit its own workloads, which involve high concurrency and high data throughput.
To go with it, Netty also extends Thread and provides FastThreadLocalThread.
FastThreadLocal delivers its speedup only when it is used together with FastThreadLocalThread.

// Most methods are omitted to save space
public class FastThreadLocalThread extends Thread {

    // Thread stores its thread-local values in ThreadLocal.ThreadLocalMap, whereas FastThreadLocalThread stores them in an InternalThreadLocalMap
    private InternalThreadLocalMap threadLocalMap;

    public final InternalThreadLocalMap threadLocalMap() {
        return threadLocalMap;
    }

    public final void setThreadLocalMap(InternalThreadLocalMap threadLocalMap) {
        this.threadLocalMap = threadLocalMap;
    }
    
    @UnstableApi
    public boolean willCleanupFastThreadLocals() {
        return cleanupFastThreadLocals;
    }

    @UnstableApi
    public static boolean willCleanupFastThreadLocals(Thread thread) {
        return thread instanceof FastThreadLocalThread &&
                ((FastThreadLocalThread) thread).willCleanupFastThreadLocals();
    }
}

As the code above shows, compared with Thread, FastThreadLocalThread adds the threadLocalMap field and functions that report whether its FastThreadLocal variables will be cleaned up.

Even though ThreadLocal uses WeakReference for its keys, memory leaks are still possible.
FastThreadLocalThread and FastThreadLocal are both Netty's own classes, so after a thread's task finishes Netty can enforce cleanup through the removeAll function of FastThreadLocal, which also clears the thread's InternalThreadLocalMap (see below for details).
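
The idea of enforcing cleanup after a task can be sketched with a try/finally wrapper. This is an illustration built on the JDK's ThreadLocal, not Netty's implementation; CleanupRunnableSketch, wrap, and bufferLeftBehind are hypothetical names.

```java
// Sketch: run a cleanup action after every task, even if the task throws.
final class CleanupRunnableSketch {
    static final ThreadLocal<StringBuilder> BUFFER = new ThreadLocal<>();

    /** Wraps a task so that the cleanup action always runs after it. */
    static Runnable wrap(Runnable task, Runnable cleanup) {
        return () -> {
            try {
                task.run();
            } finally {
                cleanup.run();
            }
        };
    }

    /** Runs a task that fills BUFFER and reports whether the value outlived the task. */
    static boolean bufferLeftBehind(boolean useWrapper) {
        Runnable task = () -> BUFFER.set(new StringBuilder("state from task"));
        Runnable toRun = useWrapper ? wrap(task, BUFFER::remove) : task;
        toRun.run();                        // runs on the calling thread, like a pooled worker
        boolean leftBehind = BUFFER.get() != null;
        BUFFER.remove();                    // reset so that calls do not affect each other
        return leftBehind;
    }

    public static void main(String[] args) {
        System.out.println(bufferLeftBehind(false)); // value is still visible after the task
        System.out.println(bufferLeftBehind(true));  // value was removed by the wrapper
    }
}
```

On a long-lived pooled thread, a value that is left behind stays reachable until the thread dies or the value is overwritten, which is the leak the enforced cleanup is meant to prevent.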

4. FastThreadLocal

4.1 InternalThreadLocalMap

As shown in the previous section, FastThreadLocalThread declares a field threadLocalMap of type InternalThreadLocalMap.

public final class InternalThreadLocalMap extends UnpaddedInternalThreadLocalMap{
    
}

From the code above, we can see that InternalThreadLocalMap extends UnpaddedInternalThreadLocalMap.
Therefore, we first need to look at the definition of UnpaddedInternalThreadLocalMap.

class UnpaddedInternalThreadLocalMap {

    // When FastThreadLocal is used from a plain Thread, the InternalThreadLocalMap is stored in a JDK ThreadLocal.
    static final ThreadLocal<InternalThreadLocalMap> slowThreadLocalMap = new ThreadLocal<InternalThreadLocalMap>();
    // Index source. Each FastThreadLocal object gets its own ID by incrementing nextIndex.
    static final AtomicInteger nextIndex = new AtomicInteger();

    // Storage for FastThreadLocal values. ThreadLocal stores values in a ThreadLocalMap and derives the index by hashing the threadLocalHashCode of the ThreadLocal object.
    // FastThreadLocal uses an Object[] array indexed by the value taken from nextIndex, so every lookup is an O(1) array access.
    // Note: to avoid the performance loss caused by false sharing, InternalThreadLocalMap adds padding so that an instance exceeds 128 bytes.
    // With false sharing avoided, the map's data can stay in the CPU cache line without being invalidated by unrelated writes, which improves lookup efficiency.
    Object[] indexedVariables;

    // Core thread-locals
    int futureListenerStackDepth;
    int localChannelReaderStackDepth;
    Map<Class<?>, Boolean> handlerSharableCache;
    IntegerHolder counterHashCode;
    ThreadLocalRandom random;
    Map<Class<?>, TypeParameterMatcher> typeParameterMatcherGetCache;
    Map<Class<?>, Map<String, TypeParameterMatcher>> typeParameterMatcherFindCache;

    // String-related thread-locals
    StringBuilder stringBuilder;
    Map<Charset, CharsetEncoder> charsetEncoderCache;
    Map<Charset, CharsetDecoder> charsetDecoderCache;

    // ArrayList-related thread-locals
    ArrayList<Object> arrayList;

    // Constructor; it is used again later
    UnpaddedInternalThreadLocalMap(Object[] indexedVariables) {
        this.indexedVariables = indexedVariables;
    }
}

In the above code, please note:

    static final ThreadLocal<InternalThreadLocalMap> slowThreadLocalMap = new ThreadLocal<InternalThreadLocalMap>();

slowThreadLocalMap is declared because users may call FastThreadLocal from a plain Thread instead of a FastThreadLocalThread.
To keep such programs working, this variable stores the InternalThreadLocalMap in an ordinary ThreadLocal (its use is explained below).

// Some functions are omitted to save space
public final class InternalThreadLocalMap extends UnpaddedInternalThreadLocalMap {

    private static final int DEFAULT_ARRAY_LIST_INITIAL_CAPACITY = 8;
    
    // Sentinel value marking a slot that has not been assigned
    public static final Object UNSET = new Object();

    // Return the InternalThreadLocalMap of the calling thread if one exists, choosing the storage by thread type. Never creates a map.
    public static InternalThreadLocalMap getIfSet() {
        Thread thread = Thread.currentThread();
        if (thread instanceof FastThreadLocalThread) {
            return ((FastThreadLocalThread) thread).threadLocalMap();
        }
        return slowThreadLocalMap.get();
    }

    // Return the InternalThreadLocalMap of the calling thread, creating it if necessary. The thread type decides between fastGet and slowGet.
    public static InternalThreadLocalMap get() {
        Thread thread = Thread.currentThread();
        if (thread instanceof FastThreadLocalThread) {
            return fastGet((FastThreadLocalThread) thread);
        } else {
            return slowGet();
        }
    }

    // If the caller is a FastThreadLocalThread, read the map directly from the thread's threadLocalMap field.
    private static InternalThreadLocalMap fastGet(FastThreadLocalThread thread) {
        InternalThreadLocalMap threadLocalMap = thread.threadLocalMap();
        if (threadLocalMap == null) {
            thread.setThreadLocalMap(threadLocalMap = new InternalThreadLocalMap());
        }
        return threadLocalMap;
    }

    // If the caller is a plain Thread, read the map from slowThreadLocalMap (which is backed by the JDK's ThreadLocalMap).
    private static InternalThreadLocalMap slowGet() {
        ThreadLocal<InternalThreadLocalMap> slowThreadLocalMap = UnpaddedInternalThreadLocalMap.slowThreadLocalMap;
        InternalThreadLocalMap ret = slowThreadLocalMap.get();
        if (ret == null) {
            ret = new InternalThreadLocalMap();
            slowThreadLocalMap.set(ret);
        }
        return ret;
    }

    // Padding that makes an InternalThreadLocalMap instance larger than 128 bytes to avoid false sharing.
    // Without false sharing, the instance is more likely to stay in the L1 cache, and a higher cache hit rate speeds up lookups (reading the L1 cache is much faster than reading main memory).
    public long rp1, rp2, rp3, rp4, rp5, rp6, rp7, rp8, rp9;

    private InternalThreadLocalMap() {
        super(newIndexedVariableTable());
    }

    // Create the value array. Its initial length is 32 and every element starts as UNSET.
    private static Object[] newIndexedVariableTable() {
        Object[] array = new Object[32];
        Arrays.fill(array, UNSET);
        return array;
    }
}

The code above is the main implementation of InternalThreadLocalMap. The functions to pay attention to are:

  • getIfSet();
  • get();
  • fastGet();
  • slowGet();

There are two situations:

(1) FastThreadLocal is called from a plain Thread;

(2) FastThreadLocal is called from a FastThreadLocalThread.

Because of these two scenarios, instanceof is used to check the thread type when the InternalThreadLocalMap is obtained, as shown below:

        if (thread instanceof FastThreadLocalThread) {
            // Corresponding to fastGet and other operations
        } else {
            // Corresponding operations such as slowGet
        }

Depending on the type of the calling thread:

  • Thread: the slowThreadLocalMap variable in UnpaddedInternalThreadLocalMap is used;
  • FastThreadLocalThread: the threadLocalMap field of FastThreadLocalThread is used.

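Because the constructor of InternalThreadLocalMap is private, instances are created only inside the class. fastGet reads the threadLocalMap field of FastThreadLocalThread and, if it is null, calls the private constructor and assigns the result (slowGet does the same for slowThreadLocalMap). getIfSet only returns the existing map and never creates one.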
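
The two lookup paths can be reproduced without Netty. In the sketch below, SlotThread is a hypothetical stand-in for FastThreadLocalThread and a bare Object[] stands in for InternalThreadLocalMap; none of these names come from Netty.

```java
// Sketch: choose the storage location by thread type.
final class DispatchSketch {
    /** Hypothetical stand-in for FastThreadLocalThread: the thread owns its slot array. */
    static final class SlotThread extends Thread {
        Object[] slots;
        SlotThread(Runnable task) { super(task); }
    }

    // Fallback storage for plain threads, playing the role of slowThreadLocalMap.
    private static final ThreadLocal<Object[]> SLOW = new ThreadLocal<>();

    /** Returns the calling thread's slot array, creating it on first use. */
    static Object[] slots() {
        Thread current = Thread.currentThread();
        if (current instanceof SlotThread) {
            SlotThread fast = (SlotThread) current;
            if (fast.slots == null) {
                fast.slots = new Object[32];
            }
            return fast.slots;              // fast path: a plain field read
        }
        Object[] slots = SLOW.get();        // slow path: a hash lookup in ThreadLocalMap
        if (slots == null) {
            slots = new Object[32];
            SLOW.set(slots);
        }
        return slots;
    }

    /** Reports which path a thread of the given kind takes: "fast" or "slow". */
    static String pathTaken(boolean useSlotThread) {
        String[] result = new String[1];
        Runnable probe = () -> {
            slots();
            Thread current = Thread.currentThread();
            boolean fast = current instanceof SlotThread && ((SlotThread) current).slots != null;
            result[0] = fast ? "fast" : "slow";
        };
        Thread thread = useSlotThread ? new SlotThread(probe) : new Thread(probe);
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(pathTaken(true));
        System.out.println(pathTaken(false));
    }
}
```

The slow path still works correctly, but it pays for one ThreadLocalMap lookup on every access, which is why the speedup depends on running inside the specialized thread class.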

    // Cache line padding (must be public)
    // With CompressedOops enabled, an instance of this class should occupy at least 128 bytes.
    public long rp1, rp2, rp3, rp4, rp5, rp6, rp7, rp8, rp9;

    private InternalThreadLocalMap() {
        super(newIndexedVariableTable());
    }

    private static Object[] newIndexedVariableTable() {
        Object[] array = new Object[32];
        Arrays.fill(array, UNSET);
        return array;
    }

The constructor creates an Object array (initial length 32) and initializes every element to UNSET, which later operations use to tell whether a slot has been assigned (see the removeIndexedVariable and isIndexedVariableSet functions for details).
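
A sentinel is needed because null is a legal value for a thread-local variable, so null alone cannot mean "not assigned". The following sketch shows the idea; UnsetSentinelSketch is a hypothetical class whose method names only loosely mirror the functions mentioned above.

```java
import java.util.Arrays;

// Sketch: distinguish "never assigned" from "assigned null" with a sentinel object.
final class UnsetSentinelSketch {
    static final Object UNSET = new Object();      // marker for an unassigned slot
    private final Object[] slots = new Object[32];

    UnsetSentinelSketch() {
        Arrays.fill(slots, UNSET);
    }

    /** Stores a value (null is allowed) and returns true if the slot was previously unset. */
    boolean set(int index, Object value) {
        Object old = slots[index];
        slots[index] = value;
        return old == UNSET;
    }

    /** True once the slot has been assigned, even if the assigned value is null. */
    boolean isSet(int index) {
        return slots[index] != UNSET;
    }

    /** Clears the slot and returns the old value, or UNSET if nothing was stored. */
    Object remove(int index) {
        Object old = slots[index];
        slots[index] = UNSET;
        return old;
    }

    public static void main(String[] args) {
        UnsetSentinelSketch map = new UnsetSentinelSketch();
        System.out.println(map.isSet(3));      // false
        map.set(3, null);
        System.out.println(map.isSet(3));      // true, although the value is null
    }
}
```

The comparison uses reference identity (==), so no user-supplied value can be mistaken for the sentinel.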

Tips:

Just before the constructor, the class declares the fields public long rp1, rp2, rp3, rp4, rp5, rp6, rp7, rp8, rp9;.
These fields have no functional purpose. They exist to make an InternalThreadLocalMap instance larger than 128 bytes (the long fields above take 72 bytes, and the base class UnpaddedInternalThreadLocalMap contributes several more fields).
A CPU cache line is generally 64 or 128 bytes. If an instance is larger than 128 bytes, false sharing is greatly reduced.
(The Netty version at the time of writing is 4.1.38, in which an InternalThreadLocalMap instance is 136 bytes. This is because the cleanerFlags and arrayList variables were introduced after Netty 4.0.33 and the rp9 variable was never removed.)
For more about false sharing, see the article "JAVA scraps - CPU Cache and cache lines".

4.2 FastThreadLocal initialization

public class FastThreadLocal<V> {
    
    private final int index;

    // The index comes from incrementing an atomic counter; it is the storage index of this FastThreadLocal (InternalThreadLocalMap.nextVariableIndex, shown for reference):
    // public static int nextVariableIndex() {
    //     int index = nextIndex.getAndIncrement();
    //     if (index < 0) {
    //         nextIndex.decrementAndGet();
    //         throw new IllegalStateException("too many thread-local indexed variables");
    //     }
    //     return index;
    // }
    public FastThreadLocal() {
        index = InternalThreadLocalMap.nextVariableIndex();
    }
    
    // Set the value of this FastThreadLocal for the current thread
    public final void set(V value) {
        if (value != InternalThreadLocalMap.UNSET) {
            InternalThreadLocalMap threadLocalMap = InternalThreadLocalMap.get();
            setKnownNotUnset(threadLocalMap, value);
        } else {
            // If the value being set is UNSET, remove the value currently held by this FastThreadLocal
            remove();
        }
    }
    
    // Store the value; if the slot was previously unset, register this FastThreadLocal in the set of variables to remove later
    private void setKnownNotUnset(InternalThreadLocalMap threadLocalMap, V value) {
        if (threadLocalMap.setIndexedVariable(index, value)) {
            addToVariablesToRemove(threadLocalMap, this);
        }
    }
    
    // (Method of InternalThreadLocalMap, shown here for reference.) Store the value at the index assigned to the FastThreadLocal; returns true if the slot was previously UNSET.
    public boolean setIndexedVariable(int index, Object value) {
        Object[] lookup = indexedVariables;
        if (index < lookup.length) {
            Object oldValue = lookup[index];
            lookup[index] = value;
            return oldValue == UNSET;
        } else {
            expandIndexedVariableTableAndSet(index, value);
            return true;
        }
    }
    
    // (Method of InternalThreadLocalMap.) Grow the array to the smallest power of two greater than index, then store the value
    private void expandIndexedVariableTableAndSet(int index, Object value) {
        Object[] oldArray = indexedVariables;
        final int oldCapacity = oldArray.length;
        int newCapacity = index;
        newCapacity |= newCapacity >>>  1;
        newCapacity |= newCapacity >>>  2;
        newCapacity |= newCapacity >>>  4;
        newCapacity |= newCapacity >>>  8;
        newCapacity |= newCapacity >>> 16;
        newCapacity ++;

        Object[] newArray = Arrays.copyOf(oldArray, newCapacity);
        Arrays.fill(newArray, oldCapacity, newArray.length, UNSET);
        newArray[index] = value;
        indexedVariables = newArray;
    }
}

The code above is an excerpt of FastThreadLocal, together with the InternalThreadLocalMap functions it relies on.
The constructor shows that FastThreadLocal obtains a unique ID from nextVariableIndex of InternalThreadLocalMap when it is created.
This ID is obtained by incrementing an atomic variable. All later reads, updates, and removals of the variable go through this index.
When a value is set and indexedVariables (initial length 32) is too small, the array is expanded by expandIndexedVariableTableAndSet (>>> is the unsigned right shift: the high-order bits are filled with 0 whether the number is positive or negative). The shift-and-OR sequence rounds the capacity up to the smallest power of two greater than the index, so the length stays at 2^n.
Because a constant index is used, looking up a FastThreadLocal variable in Netty is O(1). Expansion is also simple: Arrays.copyOf copies the old array, with no rehash as in the JDK's ThreadLocal.
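
The index allocation and the expansion rule can be combined into a small standalone sketch. IndexedSlotsSketch and capacityFor are hypothetical names; the bit manipulation follows the excerpt above, and the sketch assumes indexes below 2^30 so that the result does not overflow.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: constant-index storage with power-of-two expansion.
final class IndexedSlotsSketch {
    static final Object UNSET = new Object();
    private static final AtomicInteger NEXT_INDEX = new AtomicInteger();

    private Object[] slots = new Object[32];

    IndexedSlotsSketch() {
        Arrays.fill(slots, UNSET);
    }

    /** Hands out a process-wide unique index, one per variable. */
    static int nextVariableIndex() {
        return NEXT_INDEX.getAndIncrement();
    }

    /** Smallest power of two strictly greater than index (assumes 0 <= index < 2^30). */
    static int capacityFor(int index) {
        int capacity = index;
        capacity |= capacity >>> 1;
        capacity |= capacity >>> 2;
        capacity |= capacity >>> 4;
        capacity |= capacity >>> 8;
        capacity |= capacity >>> 16;   // every bit below the highest set bit is now 1
        return capacity + 1;
    }

    /** Stores the value at the index, growing the array first if necessary. */
    void set(int index, Object value) {
        if (index >= slots.length) {
            int oldLength = slots.length;
            slots = Arrays.copyOf(slots, capacityFor(index));  // plain copy, no rehash
            Arrays.fill(slots, oldLength, slots.length, UNSET);
        }
        slots[index] = value;
    }

    /** O(1) read: one bounds check and one array access. */
    Object get(int index) {
        return index < slots.length ? slots[index] : UNSET;
    }

    int capacity() {
        return slots.length;
    }

    public static void main(String[] args) {
        IndexedSlotsSketch map = new IndexedSlotsSketch();
        map.set(40, "value");
        System.out.println(map.capacity());   // 64
        System.out.println(map.get(40));
    }
}
```

For index 40 (binary 101000) the shifts produce 63 (binary 111111), and adding one gives 64.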

4.3 Obtaining and deleting FastThreadLocal variables

public class FastThreadLocal<V> {

    private static final int variablesToRemoveIndex = InternalThreadLocalMap.nextVariableIndex();
    

    // After a thread finishes its work, decide according to the business scenario whether to call this function to remove all FastThreadLocal values of the thread.
    public static void removeAll() {
        InternalThreadLocalMap threadLocalMap = InternalThreadLocalMap.getIfSet();
        if (threadLocalMap == null) {
            return;
        }

        try {
            Object v = threadLocalMap.indexedVariable(variablesToRemoveIndex);
            if (v != null && v != InternalThreadLocalMap.UNSET) {
                @SuppressWarnings("unchecked")
                Set<FastThreadLocal<?>> variablesToRemove = (Set<FastThreadLocal<?>>) v;
                FastThreadLocal<?>[] variablesToRemoveArray =
                        variablesToRemove.toArray(new FastThreadLocal[0]);
                for (FastThreadLocal<?> tlv: variablesToRemoveArray) {
                    tlv.remove(threadLocalMap);
                }
            }
        } finally {
            // This only sets threadLocalMap in FastThreadLocalThread to null, or removes the map from slowThreadLocalMap.
            InternalThreadLocalMap.remove();
        }
    }
    
    @SuppressWarnings("unchecked")
    public final V get(InternalThreadLocalMap threadLocalMap) {
        Object v = threadLocalMap.indexedVariable(index);
        if (v != InternalThreadLocalMap.UNSET) {
            return (V) v;
        }

        // If the value has not been set yet, initialize it and return the result.
        return initialize(threadLocalMap);
    }

    // Initialize the value with the initialValue function, which the user may override
    private V initialize(InternalThreadLocalMap threadLocalMap) {
        V v = null;
        try {
            v = initialValue();
        } catch (Exception e) {
            PlatformDependent.throwException(e);
        }

        threadLocalMap.setIndexedVariable(index, v);
        addToVariablesToRemove(threadLocalMap, this);
        return v;
    }
    
    // Add the FastThreadLocal variable to the set of variables to remove
    @SuppressWarnings("unchecked")
    private static void addToVariablesToRemove(InternalThreadLocalMap threadLocalMap, FastThreadLocal<?> variable) {
        Object v = threadLocalMap.indexedVariable(variablesToRemoveIndex);
        Set<FastThreadLocal<?>> variablesToRemove;
        // If the set of variables to remove does not exist yet, create it
        if (v == InternalThreadLocalMap.UNSET || v == null) {
            variablesToRemove = Collections.newSetFromMap(new IdentityHashMap<FastThreadLocal<?>, Boolean>());
            threadLocalMap.setIndexedVariable(variablesToRemoveIndex, variablesToRemove);
        } else {
            variablesToRemove = (Set<FastThreadLocal<?>>) v;
        }

        variablesToRemove.add(variable);
    }
    

    @SuppressWarnings("unchecked")
    public final void remove(InternalThreadLocalMap threadLocalMap) {
        if (threadLocalMap == null) {
            return;
        }

        Object v = threadLocalMap.removeIndexedVariable(index);
        removeFromVariablesToRemove(threadLocalMap, this);
    
        // If the FastThreadLocal variable had a value, call onRemoval (which the user may override) to release it.
        if (v != InternalThreadLocalMap.UNSET) {
            try {
                onRemoval((V) v);
            } catch (Exception e) {
                PlatformDependent.throwException(e);
            }
        }
    }
    
    // Supplies the initial value (returns null unless the user overrides it)
    protected V initialValue() throws Exception {
        return null;
    }

    // The user can override this function to release the resource being removed
    protected void onRemoval(@SuppressWarnings("UnusedParameters") V value) throws Exception { }
}

When using FastThreadLocal, users can override the initialValue and onRemoval functions (so that creation and release of the values held by a FastThreadLocal are controlled by the user).

  • initialValue: when a FastThreadLocal value is read and has not been set, initialValue is called to create it (get and similar functions call initialize when they find the slot UNSET);
  • onRemoval: when a value that has been set is removed, either by remove (which set also calls when it is passed UNSET) or by removeAll, onRemoval is called to release it.

    this.threadLocal = new FastThreadLocal<Recycler.Stack<T>>() {
        protected Recycler.Stack<T> initialValue() {
            return new Recycler.Stack(Recycler.this, Thread.currentThread(), Recycler.this.maxCapacityPerThread, Recycler.this.maxSharedCapacityFactor, Recycler.this.ratioMask, Recycler.this.maxDelayedQueuesPerThread);
        }

        protected void onRemoval(Recycler.Stack<T> value) {
            if (value.threadRef.get() == Thread.currentThread() && Recycler.DELAYED_RECYCLED.isSet()) {
                ((Map)Recycler.DELAYED_RECYCLED.get()).remove(value);
            }

        }
    };

The code above shows how Recycler uses FastThreadLocal (Recycler is Netty's lightweight object pool).
Note that FastThreadLocal has a static variable variablesToRemoveIndex, which reserves a fixed slot in the indexedVariables array and stores a Set<FastThreadLocal<?>> variablesToRemove there.
Each time a variable is initialized or first set, the corresponding FastThreadLocal is added to variablesToRemove. When a value is removed (remove, or set with UNSET) the entry is taken out again, and when the thread's variables are cleaned up (removeAll) the program walks variablesToRemove and removes each one.
In this way, users of FastThreadLocalThread do not need to spend much effort on cleaning up thread-local resources (in Netty the threads in a pool are long-lived, so memory cleanup needs little attention. However, if FastThreadLocalThread is used in a general-purpose thread pool, the FastThreadLocal variables should be cleaned up after each task so that they do not affect later tasks).
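
The bookkeeping behind removeAll can be sketched as follows. This is a simplified single-threaded model with hypothetical names (RemoveAllSketch, Variable, TO_REMOVE): one static set plays the role of the per-thread variablesToRemove, and each variable stores its value in its own field instead of an indexed array.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Sketch: track initialized variables so that all of them can be removed at once.
final class RemoveAllSketch {
    // Identity-based set of every variable that currently holds a value.
    static final Set<Variable<?>> TO_REMOVE =
            Collections.newSetFromMap(new IdentityHashMap<Variable<?>, Boolean>());

    /** Hypothetical variable type with the two user hooks. */
    static class Variable<V> {
        private boolean set;
        private V value;

        protected V initialValue() { return null; }
        protected void onRemoval(V removed) { }

        final V get() {
            if (!set) {
                value = initialValue();   // lazy initialization on first read
                set = true;
                TO_REMOVE.add(this);      // remember it for removeAll
            }
            return value;
        }

        final void remove() {
            if (!set) {
                return;
            }
            V old = value;
            value = null;
            set = false;
            TO_REMOVE.remove(this);
            onRemoval(old);               // let the user release the resource
        }
    }

    /** Removes every tracked variable; iterates over a copy because remove() mutates the set. */
    static void removeAll() {
        for (Variable<?> variable : new ArrayList<>(TO_REMOVE)) {
            variable.remove();
        }
    }

    private static Variable<String> named(String name, List<String> log) {
        return new Variable<String>() {
            @Override protected String initialValue() { return name; }
            @Override protected void onRemoval(String removed) { log.add("removed " + removed); }
        };
    }

    /** Initializes two variables, removes them all, and returns the onRemoval log. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        Variable<String> a = named("a", log);
        Variable<String> b = named("b", log);
        a.get();
        b.get();
        removeAll();
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The set is identity-based because each variable object is itself the key, which matches the IdentityHashMap used in the excerpt. The order in which the variables are removed is unspecified.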

5. Summary

The source code analysis above shows that Netty makes several improvements to speed up thread-local access:

  • Custom FastThreadLocalThread and FastThreadLocal classes;
  • Padding that enlarges the InternalThreadLocalMap instance to avoid false sharing;
  • An ID obtained by incrementing an atomic variable serves as a constant index, so lookups are O(1) and the hash collisions and rehashing on expansion are avoided;
  • initialValue and onRemoval hooks that users can override to customize how FastThreadLocal resources are created and released;
  • Expansion of the value array (expandIndexedVariableTableAndSet) computes the new length with bit operations;
  • Different lookup paths for FastThreadLocal called from Thread and from FastThreadLocalThread, which preserves compatibility;
  • For more details, readers can consult the source code.

My personal understanding is as follows. Storing FastThreadLocal variables in an Object[] array raises the question of whether space is being sacrificed for performance:
By default Netty starts 2 * CPU cores threads, that is, twice the number of CPU cores, and this thread group lives for the whole life cycle of Netty.
Netty therefore does not create so many threads that they occupy too much memory (users who manually adjust the number of boss group and worker group threads tend to do so cautiously).
In addition, Netty reads and updates FastThreadLocal variables very frequently, so there is a real need to optimize ThreadLocal.
Therefore, it is reasonable to spend some extra space in exchange for faster lookups and updates.


Posted by guoxin on Sun, 20 Oct 2019 01:48:00 -0700