What if Lambda is slow to use for the first time? Look at the magic weapon

Keywords: Java Back-end

Don't say much about the description. Go directly to the figure above:

See the output? Why is the time difference between the first time and the second time so much? Let's think about it together. We can also look at the conclusion first and then look back at the analysis

Note: not only the second time, but every time after the first time

Give conjecture

  1. Is it related to operating system warm-up?
  2. Is it related to JIT (just in time compilation)?
  3. Is it related to ClassLoader class loading?
  4. Whether it is related to Lambda is not a foreach issue

Verify conjecture

Operating system warm-up

The concept of operating system preheating is the conclusion I got from consulting a big man. I didn't find the corresponding words in Baidu / Google, but in the simulation test, I tested it by ordinary traversal:

Basically, the speed of the first few times is slower and the speed of the later times is faster. Therefore, this factor may affect it, but the gap is not large. Therefore, this conclusion can not be used as the answer to the question.

JIT instant compilation

First, let's introduce JIT instant compilation:

After the initialization of the JVM is completed, during the calling and execution of the class, the execution engine will convert the bytecode into machine code, and then it can be executed in the operating system. In the process of converting bytecode into machine code, there is also a compilation in the virtual machine, that is, instant compilation.

Initially, the bytecode in the JVM is compiled by the Interpreter. When the virtual machine finds that a method or code block runs particularly frequently, it will recognize these codes as hot code.

In order to improve the execution efficiency of hot codes, the Just In Time compiler (JIT) will compile these codes into machine codes related to the local platform, optimize them at all levels, and then save them in memory

Another concept, back edge counter

The edge back counter is used to count the execution times of the loop body code in a method. The instruction that jumps after the control flow direction is encountered in the bytecode is called "Back Edge"

The main purpose of establishing the back edge counter is to trigger OSR (on stack replacement) compilation, that is, compilation on the stack. In some code segments with long cycle cycles, when the cycle reaches the back edge counter threshold, the JVM will consider this segment as hot code, and the JIT compiler will compile this segment of code into machine language and cache it. During this cycle period, It will directly replace the execution code and execute the cached machine language

From the above concepts, we can draw a conclusion: the first so-called operating system preheating probability is incorrect, because the ordinary traversal method is executed N times, and the subsequent execution time is relatively small, which is likely caused by JIT.

Is JIT real-time compilation the final answer? We tried to turn off the JIT to test it. Through querying the data, we found the following contents:


  • Use the  - D   option on the JVM command line to set the   java.compiler   Property to none or the empty string. Type the following command at a shell or command prompt: Java - DJava. Compiler = none < class > copy code

Note: this paragraph is from the official data of IBM. See < harvest >, let's not stop thinking

By configuring the IDEA JVM parameters:

The code test results in the execution problem are as follows:

# Before disabling
foreach time one: 38
 Split line...
foreach time two: 1

# After disabling
foreach time one: 40
 Split line...
foreach time two: 5

I have tested it many times and the results are very similar, so I can get another conclusion: JIT is not the cause of the problem (but it can improve execution efficiency)

Is it related to class loading?

During class loading verification, I still can't give up JIT. Therefore, after consulting a lot of materials, I know that a command can view the time-consuming situation of JIT compilation. The command is as follows:

java -XX:+CITime com.code.jvm.preheat.Demo >> log.txt

Explain the meaning of the order

# Directory of execution

# Command meaning: prints time spend in JIT compiler. (introduced in 1.4.0.)
# Print the time consumed by JIT compilation

# Represents the class I specified

# Output to log.txt file for easy viewing
>> log.txt

Show the whole content of a result:

foreach time one: 35
 Split line...
foreach time two: 1

Accumulated compiler times (for compiled methods only)
  Total compilation time   :  0.044 s
    Standard compilation   :  0.041 s, Average : 0.000
    On stack replacement   :  0.003 s, Average : 0.002
    Detailed C1 Timings
       Setup time:         0.000 s ( 0.0%)
       Build IR:           0.010 s (38.8%)
         Optimize:            0.001 s ( 2.3%)
         RCE:                 0.000 s ( 0.7%)
       Emit LIR:           0.010 s (40.7%)
         LIR Gen:           0.002 s ( 9.3%)
         Linear Scan:       0.008 s (31.0%)
       LIR Schedule:       0.000 s ( 0.0%)
       Code Emission:      0.003 s (12.4%)
       Code Installation:  0.002 s ( 8.2%)
       Instruction Nodes:   9170 nodes

  Total compiled methods   :    162 methods
    Standard compilation   :    160 methods
    On stack replacement   :      2 methods
  Total compiled bytecodes :  13885 bytes
    Standard compilation   :  13539 bytes
    On stack replacement   :    346 bytes
  Average compilation speed: 312157 bytes/s

  nmethod code size        : 168352 bytes
  nmethod total size       : 276856 bytes

The test results are as follows:

Through the above test results, it further illustrates a problem: as long as Lambda participates in the program, the compilation time is always longer

Again, by querying the data, we learned about the new command

java -verbose:class -verbose:jni -verbose:gc -XX:+PrintCompilation com.code.jvm.preheat.Demo

Explain the meaning of the order

# Output information about the jvm load class

# Output information about native method calls

# Output relevant information of each GC

# Print relevant information when a method is compiled

Execute the commands for the with and without Lambda respectively, and the results are as follows:

In terms of log file size, there is a difference of more than ten kb

Note: the file is too large and only part of the content is displayed

# Including Lambda
[Loaded java.lang.invoke.LambdaMetafactory from D:\JDK\jre1.8\lib\rt.jar]

# A lot of content is omitted in the middle, and lambda Metafactory is the most obvious difference (only from the name)

[Loaded java.lang.invoke.InnerClassLambdaMetafactory$1 from D:\JDK\jre1.8\lib\rt.jar]
   5143  220       4       java.lang.String::equals (81 bytes)
[Loaded java.lang.invoke.LambdaForm$MH/471910020 from java.lang.invoke.LambdaForm]
   5143  219       3       jdk.internal.org.objectweb.asm.ByteVector::<init> (13 bytes)
[Loaded java.lang.invoke.LambdaForm$MH/531885035 from java.lang.invoke.LambdaForm]
   5143  222       3       jdk.internal.org.objectweb.asm.ByteVector::putInt (74 bytes)
   5143  224       3       com.code.jvm.preheat.Demo$$Lambda$1/834600351::accept (8 bytes)
   5143  225       3       com.code.jvm.preheat.Demo::lambda$getTime$0 (6 bytes)
   5144  226       4       com.code.jvm.preheat.Demo$$Lambda$1/834600351::accept (8 bytes)
   5144  223       1       java.lang.Integer::intValue (5 bytes)
   5144  221       3       jdk.internal.org.objectweb.asm.ByteVector::putByteArray (49 bytes)
   5144  224       3       com.code.jvm.preheat.Demo$$Lambda$1/834600351::accept (8 bytes)   made not entrant
   5145  227 %     4       java.util.ArrayList::forEach @ 27 (75 bytes)
   5146    3       3       java.lang.String::equals (81 bytes)   made not entrant
foreach time one: 50
 Split line...
   5147  227 %     4       java.util.ArrayList::forEach @ -2 (75 bytes)   made not entrant
foreach time two: 1
[Loaded java.lang.Shutdown from D:\JDK\jre1.8\lib\rt.jar]
[Loaded java.lang.Shutdown$Lock from D:\JDK\jre1.8\lib\rt.jar]

# Lambda not included
   5095   45       1       java.util.ArrayList::access$100 (5 bytes)
   5095   46       1       java.lang.Integer::intValue (5 bytes)
   5096   47       3       java.util.ArrayList$Itr::hasNext (20 bytes)
   5096   49       3       java.util.ArrayList$Itr::checkForComodification (23 bytes)
   5096   48       3       java.util.ArrayList$Itr::next (66 bytes)
   5096   50       4       java.util.ArrayList$Itr::hasNext (20 bytes)
   5096   51       4       java.util.ArrayList$Itr::checkForComodification (23 bytes)
   5096   52       4       java.util.ArrayList$Itr::next (66 bytes)
   5097   47       3       java.util.ArrayList$Itr::hasNext (20 bytes)   made not entrant
   5097   49       3       java.util.ArrayList$Itr::checkForComodification (23 bytes)   made not entrant
   5097   48       3       java.util.ArrayList$Itr::next (66 bytes)   made not entrant
   5099   53 %     4       com.code.jvm.preheat.Demo::getTimeFor @ 11 (47 bytes)
   5101   50       4       java.util.ArrayList$Itr::hasNext (20 bytes)   made not entrant
foreach time one: 7
 Split line...
   5102   54       3       java.util.ArrayList$Itr::hasNext (20 bytes)
   5102   55       4       java.util.ArrayList$Itr::hasNext (20 bytes)
   5103   53 %     4       com.code.jvm.preheat.Demo::getTimeFor @ -2 (47 bytes)   made not entrant
foreach time two: 1
   5103   54       3       java.util.ArrayList$Itr::hasNext (20 bytes)   made not entrant
[Loaded java.lang.Shutdown from D:\JDK\jre1.8\lib\rt.jar]
[Loaded java.lang.Shutdown$Lock from D:\JDK\jre1.8\lib\rt.jar]

We can find two conclusions in combination with JIT compilation time and JVM loading class logs:

  1. If Lambda is used, the JVM will load LambdaMetafactory class additionally, which takes a long time
  2. When the Lambda method is called the second time, the JVM no longer needs to load the LambdaMetafactory class, so it performs faster

It perfectly confirms the question raised before: why is foreach slow for the first time and fast in the future, but is this the truth? Let's keep looking down

Eliminate interference from foreach

Let's first look at the implementation of foreach method in ArrayList:

public void forEach(Consumer<? super E> action) {
    final int expectedModCount = modCount;
    final E[] elementData = (E[]) this.elementData;
    final int size = this.size;
    for (int i=0; modCount == expectedModCount && i < size; i++) {
    if (modCount != expectedModCount) {
        throw new ConcurrentModificationException();

At first glance, it seems nothing special. Let's try to define the Consumer in advance. The code is as follows:

It can be found that the speed is very fast. Check the JIT compilation time and class loading. It is found that the time is short and there is no LambdaMetafactory loading

According to the conclusion just obtained, let's try to define Consumer in the form of Lambda

Consumer consumer = o -> {
    int curr = (int) o;

# Time consuming execution results
foreach time: 3

Let's take a look at the compilation time and class loading. We are surprised to find that JIT compilation takes a long time and LambdaMetafactory is loaded

Re explore the implementation principle of Lambda

For the details of the implementation principle of Lambda expression, I will publish a new article later. Let's talk about the conclusion today:

  • The anonymous inner class will have one more class at the compilation stage, but Lambda will not. It will only generate one more function
  • This function will generate a class through LambdaMetafactory factory in the running phase for subsequent calls

Why is Lamdba implemented like this?

Anonymous inner classes have some defects:

  1. The compiler generates a new class file for each anonymous internal class. It is not advisable to generate many class files, because each class file needs to be loaded and verified before use, which will affect the startup performance of the application. Loading may be an expensive operation, including disk I/O and decompressing the JAR file itself.
  2. If lambdas are converted to anonymous inner classes, each lambda has a new class file. Since each anonymous inner class will be loaded, it will occupy the meta space of the JVM. If the JVM compiles the code in each such anonymous inner class into machine code, it will be stored in the code cache.
  3. In addition, these anonymous inner classes will be instantiated as separate objects. Therefore, anonymous inner classes will increase the memory consumption of the application.
  4. Most importantly, choosing to implement lambdas using anonymous inner classes from the beginning will limit the scope of future lambda implementation changes and their ability to evolve based on future JVM improvements.


After understanding the implementation principle of anonymous internal classes and Lambda expressions, I am even more confused about the reason why Lambda takes a long time. After all, the time difference between generating a new Class from an anonymous internal Class and generating a new method from Lambda is not too much. Then there is also Class generation during operation, and the time consumption should not be too different. What is the problem?

Finally, we found the answer by querying the data:

You are obviously encountering the first-time initialization overhead of lambda expressions. As already mentioned in the comments, the classes for lambda expressions are generated at runtime rather than being loaded from your class path.

However, being generated isn't the cause for the slowdown. After all, generating a class having a simple structure can be even faster than loading the same bytes from an external source. And the inner class has to be loaded too. But when the application hasn't used lambda expressions before, even the framework for generating the lambda classes has to be loaded (Oracle's current implementation uses ASM under the hood). This is the actual cause of the slowdown, loading and initialization of a dozen internally used classes, not the lambda expression itself.

Roughly translated as follows:

Obviously, you encounter the first initialization overhead of a lambda expression. As already mentioned in the comments, the classes of lambda expressions are generated at run time, not loaded from the classpath.

However, generating classes is not the reason for the slowdown. After all, generating a simple analogy is faster than loading the same bytes from an external source. Inner classes must also be loaded. However, when an application has not previously used lambda expressions, it even has to load the framework for generating lambda classes (Oracle's current implementation uses ASM behind the scenes). This is the real reason why more than a dozen internally used classes (rather than lambda expressions themselves) slow down, load, and initialize.

Truth: when an application uses Lambda for the first time, it must load the framework for generating Lambda classes, so it needs more compilation and loading time

Looking back at the class loading log, I found the introduction of ASM framework:

[Loaded jdk.internal.org.objectweb.asm.ClassVisitor from F:\Java_JDK\JDK1.8\jre\lib\rt.jar]
[Loaded jdk.internal.org.objectweb.asm.ClassWriter from F:\Java_JDK\JDK1.8\jre\lib\rt.jar]
[Loaded jdk.internal.org.objectweb.asm.ByteVector from F:\Java_JDK\JDK1.8\jre\lib\rt.jar]
[Loaded jdk.internal.org.objectweb.asm.Item from F:\Java_JDK\JDK1.8\jre\lib\rt.jar]
[Loaded jdk.internal.org.objectweb.asm.MethodVisitor from F:\Java_JDK\JDK1.8\jre\lib\rt.jar]
[Loaded jdk.internal.org.objectweb.asm.MethodWriter from F:\Java_JDK\JDK1.8\jre\lib\rt.jar]
[Loaded jdk.internal.org.objectweb.asm.Type from F:\Java_JDK\JDK1.8\jre\lib\rt.jar]
[Loaded jdk.internal.org.objectweb.asm.Label from F:\Java_JDK\JDK1.8\jre\lib\rt.jar]
[Loaded jdk.internal.org.objectweb.asm.Frame from F:\Java_JDK\JDK1.8\jre\lib\rt.jar]
[Loaded jdk.internal.org.objectweb.asm.AnnotationVisitor from F:\Java_JDK\JDK1.8\jre\lib\rt.jar]
[Loaded jdk.internal.org.objectweb.asm.AnnotationWriter from F:\Java_JDK\JDK1.8\jre\lib\rt.jar]


  • The culprit causing abnormal data in foreach test is Lambda expression
  • When Lambda expressions are used for the first time in an application, the ASM framework needs to be loaded additionally, so it needs more compilation and loading time
  • The underlying implementation of Lambda expression is not the syntax sugar of anonymous inner class, but its optimized version
  • In fact, the underlying implementation of foreach is not fundamentally different from the enhanced for loop. One is an external iterator and the other is an internal iterator
  • Through foreach + Lambda, the efficiency is not low, but it needs to be preheated in advance (loading the frame)                     Hope to help you!

Posted by phprocky on Mon, 29 Nov 2021 04:34:09 -0800