A strange memcpy problem

Keywords: Programming less CentOS Linux glibc

When you see this title, you are probably thinking of bugs caused by overlapping memory. You are right, but that is not the whole story. If you are interested, read on; there is something to learn here.

First, some background. Recently a strange problem appeared in one of our company's projects: when a client sent several packets at once, the server occasionally failed to parse them. Our protocol format is a 4-byte header followed by the message body. Each time a complete packet is parsed, we use memcpy to copy the remaining data to the start of the buffer, with code like memcpy(buffer, buffer+offset, len). It is a simple line of code, yet every so often the copied data came out wrong. [Careful readers may notice that this memcpy is unnecessary. Copying after every packet is wasteful; you can simply advance a pointer. Yes, that is the right approach, but I am glad the code was written this way, otherwise I would have missed something interesting. Besides, when the data left over is less than a complete packet, you still need to move it to the front of the buffer, so sooner or later you would run into this problem anyway.]
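The pointer-offset approach mentioned in the aside above might look roughly like this. It is only a minimal sketch, not the project's actual code: the names read_be32 and consume_packets are hypothetical, and the big-endian 4-byte length header is assumed from the test program later in this post.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode the 4-byte big-endian length header. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Hypothetical helper: step over every complete packet in buf[0..len)
 * by advancing an offset, then move the incomplete tail (if any) to the
 * front of the buffer exactly once.
 * Returns the number of bytes left in the buffer; *packets receives the
 * number of complete packets that were stepped over.
 */
static size_t consume_packets(unsigned char *buf, size_t len, size_t *packets)
{
    size_t offset = 0;
    size_t rest;

    *packets = 0;
    while (len - offset >= 4) {
        uint32_t body = read_be32(buf + offset);
        if (len - offset - 4 < body)
            break;                      /* body not complete yet */
        /* A real server would hand buf + offset + 4 to its handler here. */
        offset += 4 + (size_t)body;
        (*packets)++;
    }

    rest = len - offset;
    if (offset > 0 && rest > 0)
        memmove(buf, buf + offset, rest);   /* the regions may overlap */
    return rest;
}
```

The single move at the end uses memmove rather than memcpy because the leftover bytes may overlap the front of the buffer.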

Before going into details, a brief introduction to memcpy. Anyone who works in C/C++ will be familiar with it: it copies a block of memory. Note, however, that bugs may occur when the source and destination overlap. There are three possible situations: 1. the source and destination do not overlap at all; 2. they overlap, and the destination address is lower than the source address; 3. they overlap, and the destination address is higher than the source address. In case 1, memcpy is fine. In case 3, memcpy goes wrong and must be replaced with memmove. In case 2, most people (me included) and most introductions online believe memcpy is safe, but I can now say with confidence that this is wrong, because that is exactly what caused this bug.
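To make the three situations concrete, here is a small sketch that classifies a pair of regions. The enum and the function classify_overlap are hypothetical names introduced only for illustration, and converting the pointers to uintptr_t assumes a flat address space such as x86-64 Linux.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the three cases listed above. */
enum overlap_kind {
    NO_OVERLAP            = 1,  /* case 1 */
    OVERLAP_DST_BELOW_SRC = 2,  /* case 2: dst < src */
    OVERLAP_DST_ABOVE_SRC = 3   /* case 3: dst > src */
};

/*
 * Classify how dst[0..n) and src[0..n) relate to each other.
 * Assumption: pointers can be compared as integers (flat address space).
 * Copying a region onto itself (dst == src, n > 0) is reported as case 3
 * here purely by convention.
 */
static enum overlap_kind classify_overlap(const void *dst, const void *src,
                                          size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    if (d < s)
        return (s - d < n) ? OVERLAP_DST_BELOW_SRC : NO_OVERLAP;
    return (d - s < n) ? OVERLAP_DST_ABOVE_SRC : NO_OVERLAP;
}
```

The call in this post, memcpy(ptr, ptr + 24, 34), falls into case 2: the destination is 24 bytes below the source and 34 bytes are copied.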

The server is a cloud instance. The environment is CentOS Linux release 7.4.1708 (Core), gcc 4.8.5, glibc 2.17. To make testing easier, I extracted the buggy code into a standalone program. With the same logic and the same data, the correct result is that both packets are parsed, but after running it a few times the "body not enough" error appears. If you are interested, try it yourself; different environments may give different results. The code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
        int count = 1;
        char ptr[] = {0,0,0,20,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,0,0,0,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60};
        int len = sizeof(ptr);
        int chunkSize = 0;
        char *ptr2 = malloc(len);

        memcpy(ptr2, ptr, len); //This call only makes the dynamic linker resolve memcpy up front, so that stepping into memcpy in gdb later lands directly in the function

        printf("total %d\n", len);

        do{
                if(len < 4){
                        printf("head not enough\n");
                        break;
                }

                chunkSize = (ptr[0] << 24) & 0x7fffffff;
                chunkSize|= ((ptr[1] << 16) & 0xff0000);
                chunkSize|= ((ptr[2] << 8) & 0xff00);
                chunkSize|= (ptr[3] & 0xff);
                if(len - 4 < chunkSize){
                        printf("body not enough, chunkSize=%d\n", chunkSize);
                        break;
                }

                len-= (4 + chunkSize);
                if(len > 0){
                        memcpy(ptr, ptr + 4 + chunkSize, len);
                }

                printf("parse chunk:%d, len=%d\n", chunkSize, len);
        }while(len > 0);

        printf("done\n");
}

The offending line in the code above is memcpy(ptr, ptr + 4 + chunkSize, len). To find the cause, I first looked at the glibc 2.17 source, but did not find the answer there; perhaps I was not looking in the right place. So the only option was to track the problem down at the assembly level. The assembly for the memcpy call above is as follows:

0x0000000000401185 <main+987>:       mov    -0x4(%rbp),%eax
0x0000000000401188 <main+990>:       movslq %eax,%rdx
0x000000000040118b <main+993>:       mov    -0xc(%rbp),%eax
0x000000000040118e <main+996>:       cltq   
0x0000000000401190 <main+998>:       lea    0x4(%rax),%rcx
0x0000000000401194 <main+1002>:      mov    -0x18(%rbp),%rax
0x0000000000401198 <main+1006>:      add    %rax,%rcx
0x000000000040119b <main+1009>:      mov    -0x18(%rbp),%rax
0x000000000040119f <main+1013>:      mov    %rcx,%rsi
0x00000000004011a2 <main+1016>:      mov    %rax,%rdi
0x00000000004011a5 <main+1019>:      callq  0x400520 <memcpy@plt>

Here ptr, ptr+4+chunkSize and len are placed in the RDI, RSI and RDX registers respectively. Let's check their contents first:

(gdb) x/8xb $rdi
0x7fffffffe3b0: 0x00    0x00    0x00    0x14    0x0b    0x0c    0x0d    0x0e
(gdb) x/8xb $rsi
0x7fffffffe3c8: 0x00    0x00    0x00    0x1e    0x1f    0x20    0x21    0x22
(gdb) p $rdx
$1 = 34

Single-stepping through the assembly gives the complete execution flow:

=> 0x0000000000400520 <memcpy@plt+0>:               jmpq   *0x201b12(%rip)        # 0x602038 <memcpy@got.plt>
=> 0x00007ffff7b619f0 <__memcpy_ssse3_back+0>:      mov    %rdi,%rax
=> 0x00007ffff7b619f3 <__memcpy_ssse3_back+3>:      cmp    $0x90,%rdx
=> 0x00007ffff7b619fa <__memcpy_ssse3_back+10>:     jae    0x7ffff7b61a30 <__memcpy_ssse3_back+64>
=> 0x00007ffff7b619fc <__memcpy_ssse3_back+12>:     cmp    %dil,%sil
=> 0x00007ffff7b619ff <__memcpy_ssse3_back+15>:     jbe    0x7ffff7b61a1a <__memcpy_ssse3_back+42>
=> 0x00007ffff7b61a01 <__memcpy_ssse3_back+17>:     add    %rdx,%rsi
=> 0x00007ffff7b61a04 <__memcpy_ssse3_back+20>:     add    %rdx,%rdi
=> 0x00007ffff7b61a07 <__memcpy_ssse3_back+23>:     lea    0x39202(%rip),%r11        # 0x7ffff7b9ac10
=> 0x00007ffff7b61a0e <__memcpy_ssse3_back+30>:     movslq (%r11,%rdx,4),%rdx
=> 0x00007ffff7b61a12 <__memcpy_ssse3_back+34>:     lea    (%r11,%rdx,1),%rdx
=> 0x00007ffff7b61a16 <__memcpy_ssse3_back+38>:     jmpq   *%rdx
=> 0x00007ffff7b63d12 <__memcpy_ssse3_back+8994>:   lddqu  -0x22(%rsi),%xmm0
=> 0x00007ffff7b63d17 <__memcpy_ssse3_back+8999>:   movdqu %xmm0,-0x22(%rdi)
=> 0x00007ffff7b63d1c <__memcpy_ssse3_back+9004>:   lddqu  -0x12(%rsi),%xmm0
=> 0x00007ffff7b63d21 <__memcpy_ssse3_back+9009>:   lddqu  -0x10(%rsi),%xmm1
=> 0x00007ffff7b63d26 <__memcpy_ssse3_back+9014>:   movdqu %xmm0,-0x12(%rdi)
=> 0x00007ffff7b63d2b <__memcpy_ssse3_back+9019>:   movdqu %xmm1,-0x10(%rdi)
=> 0x00007ffff7b63d30 <__memcpy_ssse3_back+9024>:   retq

Translated into C, the key part of the assembly above is roughly:

dst+= 34
src+= 34
memcpy(dst-34, src-34, 16)
memcpy(dst-18, src-18, 16)
memcpy(dst-16, src-16, 16)

This makes it clear that the copy starts from the head of src. After execution, the content of dst is as follows:

(gdb) x/8xb 0x7fffffffe3b0
0x7fffffffe3b0: 0x00    0x00    0x00    0x1e    0x1f    0x20    0x21    0x22

As you can see, memcpy copied src to dst without any problem. So far everything is fine, but what happens next is the key part of this article, so watch closely. Add a line char tmp[] just before char ptr[], with exactly the same content as ptr:

char tmp[] = {0,0,0,20,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,0,0,0,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60};

Run it under gdb again and go through the same steps (inspect memory, trace the assembly, check the result). This is what we get:

(gdb) x/8xb $rdi
0x7fffffffe3f0: 0x00    0x00    0x00    0x14    0x0b    0x0c    0x0d    0x0e
(gdb) x/8xb $rsi
0x7fffffffe408: 0x00    0x00    0x00    0x1e    0x1f    0x20    0x21    0x22
(gdb) p $rdx
$1 = 34


=> 0x0000000000400520 <memcpy@plt+0>:               jmpq   *0x201b12(%rip)        # 0x602038 <memcpy@got.plt>
=> 0x00007ffff7b619f0 <__memcpy_ssse3_back+0>:      mov    %rdi,%rax
=> 0x00007ffff7b619f3 <__memcpy_ssse3_back+3>:      cmp    $0x90,%rdx
=> 0x00007ffff7b619fa <__memcpy_ssse3_back+10>:     jae    0x7ffff7b61a30 <__memcpy_ssse3_back+64>
=> 0x00007ffff7b619fc <__memcpy_ssse3_back+12>:     cmp    %dil,%sil
=> 0x00007ffff7b619ff <__memcpy_ssse3_back+15>:     jbe    0x7ffff7b61a1a <__memcpy_ssse3_back+42>
=> 0x00007ffff7b61a1a <__memcpy_ssse3_back+42>:     lea    0x38faf(%rip),%r11        # 0x7ffff7b9a9d0
=> 0x00007ffff7b61a21 <__memcpy_ssse3_back+49>:     movslq (%r11,%rdx,4),%rdx
=> 0x00007ffff7b61a25 <__memcpy_ssse3_back+53>:     lea    (%r11,%rdx,1),%rdx
=> 0x00007ffff7b61a29 <__memcpy_ssse3_back+57>:     jmpq   *%rdx
=> 0x00007ffff7b6440c <__memcpy_ssse3_back+10780>:  lddqu  0x12(%rsi),%xmm0
=> 0x00007ffff7b64411 <__memcpy_ssse3_back+10785>:  movdqu %xmm0,0x12(%rdi)
=> 0x00007ffff7b64416 <__memcpy_ssse3_back+10790>:  lddqu  0x2(%rsi),%xmm0
=> 0x00007ffff7b6441b <__memcpy_ssse3_back+10795>:  lddqu  (%rsi),%xmm1
=> 0x00007ffff7b6441f <__memcpy_ssse3_back+10799>:  movdqu %xmm0,0x2(%rdi)
=> 0x00007ffff7b64424 <__memcpy_ssse3_back+10804>:  movdqu %xmm1,(%rdi)
=> 0x00007ffff7b64428 <__memcpy_ssse3_back+10808>:  retq 


(gdb) x/8xb 0x7fffffffe3f0
0x7fffffffe3f0: 0x33    0x34    0x35    0x36    0x37    0x38    0x39    0x3a

This time the copy went wrong: after memcpy returns, the content of dst is completely different from what src held. What caused it? Translating the assembly above into C gives:

memcpy(dst+18, src+18, 16)
memcpy(dst+2, src+2, 16)
memcpy(dst, src, 16)

The problem lies here. This path starts copying from the tail of src. How does copying from the tail corrupt the data? Let's walk through it (values are shown in decimal, as in the source array):

//Initial state
//Note: the starting address of src is immediately after dst
dst = 00,00,00,20,11,12,13,14,15,16,17,18,19,20,21,22,23,24,  25,26,27,28,29,30
src = 00,00,00,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,  45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60

//Execute memcpy(dst+18, src+18, 16)
//The 16 bytes at src+18 (the part after the space) are copied to dst+18. Since dst+24 is the same address as src, the last 10 of those bytes (the data at src+24) land on src itself. Note that src has now been modified
dst = 00,00,00,20,11,12,13,14,15,16,17,18,19,20,21,22,23,24,  45,46,47,48,49,50
src = 51,52,53,54,55,56,57,58,59,60,37,38,39,40,41,42,43,44,  45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60

//Execute memcpy(dst+2, src+2, 16)
//The two regions do not overlap, so this is a normal copy
dst = 00,00,  53,54,55,56,57,58,59,60,37,38,39,40,41,42,43,44,  45,46,47,48,49,50
src = 51,52,  53,54,55,56,57,58,59,60,37,38,39,40,41,42,43,44,  45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60


//Execute memcpy(dst, src, 16)
//Two sections of memory do not overlap, normal copy
dst = 51,52,53,54,55,56,57,58,59,60,37,38,39,40,41,42,  43,44,45,46,47,48,49,50
src = 51,52,53,54,55,56,57,58,59,60,37,38,39,40,41,42,  43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60
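The walk-through above can be reproduced without relying on any particular glibc. The sketch below is an illustration only: copy34_head_first and copy34_tail_first are hypothetical names that imitate the chunk order of the two traces, using temporary arrays in place of the xmm0/xmm1 registers, and sample repeats the test data from this post.

```c
#include <string.h>

/* The 58 bytes of test data used in this post. */
static const unsigned char sample[58] = {
    0,0,0,20, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
    0,0,0,30, 31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,
    51,52,53,54,55,56,57,58,59,60
};

/* Chunk order of the first trace: offsets 0, then 16 and 18 (34 bytes). */
static void copy34_head_first(unsigned char *dst, const unsigned char *src)
{
    unsigned char x0[16], x1[16];       /* stand-ins for xmm0 / xmm1 */

    memcpy(x0, src, 16);                /* lddqu  -0x22(%rsi),%xmm0 */
    memcpy(dst, x0, 16);                /* movdqu %xmm0,-0x22(%rdi) */
    memcpy(x0, src + 16, 16);           /* lddqu  -0x12(%rsi),%xmm0 */
    memcpy(x1, src + 18, 16);           /* lddqu  -0x10(%rsi),%xmm1 */
    memcpy(dst + 16, x0, 16);           /* movdqu %xmm0,-0x12(%rdi) */
    memcpy(dst + 18, x1, 16);           /* movdqu %xmm1,-0x10(%rdi) */
}

/* Chunk order of the second trace: offset 18, then 2 and 0 (34 bytes). */
static void copy34_tail_first(unsigned char *dst, const unsigned char *src)
{
    unsigned char x0[16], x1[16];

    memcpy(x0, src + 18, 16);           /* lddqu  0x12(%rsi),%xmm0 */
    memcpy(dst + 18, x0, 16);           /* movdqu %xmm0,0x12(%rdi) */
    memcpy(x0, src + 2, 16);            /* lddqu  0x2(%rsi),%xmm0 */
    memcpy(x1, src, 16);                /* lddqu  (%rsi),%xmm1 */
    memcpy(dst + 2, x0, 16);            /* movdqu %xmm0,0x2(%rdi) */
    memcpy(dst, x1, 16);                /* movdqu %xmm1,(%rdi) */
}
```

Copying sample into a scratch buffer buf and calling copy34_tail_first(buf, buf + 24) leaves 51..58 (0x33..0x3a) in the first eight bytes, matching the gdb dump of the failing run, while copy34_head_first(buf, buf + 24) produces the expected 0,0,0,30,31,...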

This analysis shows that the root of the problem is the order in which memcpy copies. Copying from front to back is fine here; copying from back to front is not. But why are the results so different on the same platform, with the same code and the same data? Look carefully and you will see that the two traces execute completely different code after line 6. The key is lines 5 and 6 of each trace:

=> 0x00007ffff7b619fc <__memcpy_ssse3_back+12>:     cmp    %dil,%sil
=> 0x00007ffff7b619ff <__memcpy_ssse3_back+15>:     jbe    0x7ffff7b61a1a <__memcpy_ssse3_back+42>

These two instructions compare the lowest byte of the src address with the lowest byte of the dst address. If the lowest byte of src is less than or equal to the lowest byte of dst, the copy runs from back to front; otherwise it runs from front to back. Now let's go back and compare the dst and src addresses in the two runs:

//Normal address
dst = 0x7fffffffe3b0
dil = 0xb0

src = 0x7fffffffe3c8
sil = 0xc8

//Abnormal address
dst = 0x7fffffffe3f0
dil = 0xf0
src = 0x7fffffffe408
sil = 0x08
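The decision made by those two instructions can be written out in C. This is only a sketch of the comparison observed in the trace for copies shorter than 0x90 bytes, not glibc source code; takes_tail_first_path is a hypothetical name.

```c
#include <stdint.h>

/*
 * Mirrors `cmp %dil,%sil` followed by `jbe`: the branch to the
 * tail-first path is taken when the lowest byte of the src address is
 * less than or equal to the lowest byte of the dst address (unsigned).
 * Returns 1 if the branch is taken, 0 otherwise.
 */
static int takes_tail_first_path(uint64_t dst_addr, uint64_t src_addr)
{
    uint8_t dil = (uint8_t)(dst_addr & 0xff);
    uint8_t sil = (uint8_t)(src_addr & 0xff);

    return sil <= dil;
}
```

With the addresses listed above, the normal run (0xb0 versus 0xc8) does not take the branch, while the abnormal run (0xf0 versus 0x08) does, even though src is 24 bytes above dst in both runs.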

Sure enough, different addresses give different results. So the truth is out: this memcpy implementation chooses its copy direction by comparing only the lowest byte of the two addresses. I could only look up at the sky and howl; what a flamboyant trick. I still don't understand why it is done this way; perhaps for performance. In fact, the manual states clearly that the memory areas passed to memcpy must not overlap and that memmove should be used when they do, since an implementation may copy in any order, including from back to front. Admittedly, most people would have guessed from the start that memmove would fix the problem, but without tracing it to the root, how would we have discovered something this interesting?

Finally, a reminder: unless you know the specific implementation, use memmove whenever the memory areas overlap (whether it is case 2 or case 3).
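For completeness, here is what the parsing loop from the test program might look like with memmove in place of memcpy. It is a sketch under the same assumptions as the test program (4-byte big-endian length header); the function name parse_packets and its return convention are hypothetical.

```c
#include <string.h>

/*
 * Same parsing loop as the test program, with memmove instead of memcpy.
 * Returns the number of complete packets found; *left receives the
 * number of unparsed bytes remaining at the front of buf.
 */
static int parse_packets(unsigned char *buf, int len, int *left)
{
    int parsed = 0;

    while (len >= 4) {
        unsigned int raw = ((unsigned int)buf[0] << 24) |
                           ((unsigned int)buf[1] << 16) |
                           ((unsigned int)buf[2] << 8)  |
                            (unsigned int)buf[3];
        int chunk = (int)(raw & 0x7fffffffu);

        if (len - 4 < chunk)
            break;                      /* body not complete yet */

        len -= 4 + chunk;
        if (len > 0)
            memmove(buf, buf + 4 + chunk, (size_t)len); /* overlap is fine */
        parsed++;
    }

    *left = len;
    return parsed;
}
```

Because memmove is specified to handle overlapping regions, the result no longer depends on where the buffer happens to sit in memory.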

Posted by cordex on Sat, 14 Dec 2019 11:41:16 -0800