One line of code to solve the problem of slow scp transmission over the Internet

Keywords: openssh TCP/IP

I recently ran into a case where using scp to transfer files over a long-haul link was unbearably slow. On a link with a 100-200 ms round-trip delay, wget downloads files at 40 MB/s, while scp manages only 9 MB/s.

At first I thought the bandwidth was being lost to encryption overhead, but the same test over HTTPS was fine.

To keep pacing from smoothing out the edges of events, I set the congestion control algorithm to CUBIC and captured the following trace waveforms:

This is either an rwnd-limited or an app-limited scenario, and definitely not a cwnd-limited one, since I did not simulate any packet loss or rate limit.

Use strace to check whether setsockopt is called to set the send and receive buffers:

$ sudo strace -f -F -e trace=setsockopt -p 1181607
...
[pid 1181977] setsockopt(5, SOL_SOCKET, SO_RCVBUFFORCE, [8388608], 4) = 0
[pid 1181977] setsockopt(5, SOL_SOCKET, SO_SNDBUFFORCE, [8388608], 4) = 0
...

Use the LD_PRELOAD shim below to bypass these calls and rule out the effect of setsockopt:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/socket.h>

typedef int (*orig_setsockopt_f_type)(int sockfd, int level, int optname,
                      const void *optval, socklen_t optlen);

int setsockopt(int sockfd, int level, int optname,
                      const void *optval, socklen_t optlen)
{
	int size;
	orig_setsockopt_f_type orig_setsockopt;
	orig_setsockopt = (orig_setsockopt_f_type)dlsym(RTLD_NEXT,"setsockopt");
	if (optname == SO_SNDBUFFORCE || optname == SO_RCVBUFFORCE ||
	    optname == SO_SNDBUF || optname == SO_RCVBUF) {
		//size = *(int *)optval;
		//size *= 10;
		//*(int *)optval = size;
		return 0;
	}
	return orig_setsockopt(sockfd, level, optname, optval, optlen);
}

// gcc -shared -fPIC -o bypass.so bypass.c  -ldl
// LD_PRELOAD=/root/bypass.so /usr/sbin/sshd -D
// LD_PRELOAD=/root/bypass.so scp root@192.168.56.101:/var/www/html/big /dev/null

The result is still the same.

The purpose of this article is not to describe how to optimize this particular case, but to put forward a way of thinking about optimizing network transmission. If you do not understand the protocol and confine yourself to the host protocol stack and programming, it is easy to fall into an abyss of detail.

I'm not proficient in the SSH protocol, and it took me almost an evening to get to the core of the problem:

  • SSH allows multiple channels to be multiplexed over a single TCP connection and requires per-channel flow control to ensure fairness, so each channel must do its own flow control instead of relying on TCP's. The OpenSSH implementation of this is where the problem lies.

For historical reasons, many protocols besides SSH, TCP included, do end-to-end flow control without considering the BDP (bandwidth-delay product) of the network itself. Assume unlimited bandwidth, no packet loss, and a fixed processing rate at the receiver. Before the next batch of data arrives, the receiver has to wait longer for a distant sender than for a nearby one. If the receiver is to have data to process during that wait, the distant sender must have more data in flight.
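
The BDP argument above is plain arithmetic: the amount of data that has to be in flight to keep a link busy is the rate multiplied by the round-trip time. Here is a minimal sketch of that calculation. The function name bdp_bytes is made up for illustration, and 40 MB/s is taken as 40,000,000 bytes per second.

```c
#include <stdint.h>

/*
 * Bandwidth-delay product: the number of bytes that must be in flight
 * to keep the link busy for one full round trip.
 *
 * rate_bytes_per_sec: target throughput in bytes per second
 * rtt_ms:             round-trip time in milliseconds
 *
 * Hypothetical helper for illustration; not part of OpenSSH.
 */
uint64_t bdp_bytes(uint64_t rate_bytes_per_sec, uint64_t rtt_ms)
{
    return rate_bytes_per_sec * rtt_ms / 1000;
}
```

With the figures from this test, bdp_bytes(40000000, 100) is 4,000,000 and bdp_bytes(40000000, 200) is 8,000,000. So any end-to-end window much smaller than roughly 4-8 MB cannot sustain 40 MB/s on this link, whichever layer enforces it.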

This explains how paced sending and burst sending each relate to the receive window:

  • pacing send: the pacing rate must match the receiver's processing rate.
  • burst send: the average rate across bursts must match the receiver's processing rate.

The next step is to confirm how OpenSSH maintains the channel receive window and whether it takes BDP into account.

My simple guess was no, because this information is hard to collect accurately even at the TCP layer, let alone in an application.

Download OpenSSH code:

git clone git://anongit.mindrot.org/openssh.git

The function channel_check_window, which builds the WINDOW_ADJUST message, does not consider BDP. In effect it advertises a window that exactly matches what the receiver's channel processed in the last unit of time.

With debug output enabled, you can see that the window advertised by the receiver follows its processing rate, no matter how much latency is configured:

debug2: channel 0: window 1982464 sent adjust 114688  

It's like my_read function in UNIX Network Programming.
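
The window value in that debug line already hints at the ceiling. If the sender can have at most one window of data outstanding per round trip, its rate cannot exceed window / RTT. Below is a rough sketch of that bound. It assumes the 1982464-byte window from the debug output is all the credit available per round trip; the function name is made up for illustration.

```c
#include <stdint.h>

/*
 * Upper bound on throughput when at most `window_bytes` may be
 * outstanding per round trip.
 *
 * window_bytes: flow-control window in bytes
 * rtt_ms:       round-trip time in milliseconds, must be non-zero
 *
 * Hypothetical helper for illustration; not part of OpenSSH.
 */
uint64_t window_limited_rate(uint64_t window_bytes, uint64_t rtt_ms)
{
    return window_bytes * 1000 / rtt_ms;
}
```

window_limited_rate(1982464, 200) is 9,912,320 bytes per second, just under 10 MB/s, which is in line with the 9 MB/s measured for scp at the upper end of the 100-200 ms delay range.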

Instead, add a small margin to the advertised window to cover the waiting time in network transit:

// Modify the function on the data receiver side.
static int
channel_check_window(struct ssh *ssh, Channel *c)
{
        int r;

        if (c->type == SSH_CHANNEL_OPEN &&
            !(c->flags & (CHAN_CLOSE_SENT|CHAN_CLOSE_RCVD)) &&
            ((c->local_window_max - c->local_window >
            c->local_maxpacket*3) ||
            c->local_window < c->local_window_max/2) &&
            c->local_consumed > 0) {
                if (!c->have_remote_id)
                        fatal_f("channel %d: no remote id", c->self);
                if ((r = sshpkt_start(ssh,
                    SSH2_MSG_CHANNEL_WINDOW_ADJUST)) != 0 ||
                    (r = sshpkt_put_u32(ssh, c->remote_id)) != 0 ||
                    //(r = sshpkt_put_u32(ssh, c->local_consumed)) != 0 ||
                    // Add 2000 to try!
                    (r = sshpkt_put_u32(ssh, c->local_consumed + 2000)) != 0 ||
                    (r = sshpkt_send(ssh)) != 0) {
                        fatal_fr(r, "channel %i", c->self);
                }
                debug2("channel %d: window %d sent adjust %d", c->self,
                    c->local_window, c->local_consumed);
                c->local_window += c->local_consumed;
                c->local_consumed = 0;
        }
        return 1;
}

Changing just this one line has a dramatic effect. In the figures, the left side is the receiver's rate and the right side is the sender's CPU utilization. First, the disappointing picture before the change:

After changing that line:

The CPU is saturated by encryption and decryption, and packets go out fast.

I tried increasing the margin in an attempt to reach the limit faster, and got an error during the transfer:

client_loop: send disconnect: Broken pipe
lost connection

This is expected: the margin is added on every announcement, so it accumulates until it overflows. The correct approach is to add the margin in each announcement only after deducting the portion that has already been used.
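
The accumulation can be shown with a toy model. This is only a sketch of the bookkeeping, not OpenSSH code. The starting window of 2097152 bytes and all names are assumptions for illustration. The receiver returns the consumed bytes to its own window but announces consumed + margin, so the sender's view runs ahead by one margin per announcement.

```c
#include <stdint.h>

/* Toy model of the two ends' views of one channel window (not OpenSSH code). */
struct window_view {
    uint64_t receiver_window; /* what the receiver accounts for locally */
    uint64_t sender_credit;   /* what the sender believes it may still send */
};

/*
 * One WINDOW_ADJUST: the receiver credits itself with `consumed` bytes
 * but tells the sender `consumed + margin`.
 */
void announce(struct window_view *v, uint32_t consumed, uint32_t margin)
{
    v->receiver_window += consumed;
    v->sender_credit += (uint64_t)consumed + margin;
}

/*
 * Gap between the two views after n rounds. In each round the sender
 * sends `consumed` bytes (which must fit in the starting window), the
 * receiver processes them, and one announcement follows.
 */
uint64_t drift_after(uint32_t n, uint32_t consumed, uint32_t margin)
{
    /* Assumed starting window, identical on both sides. */
    struct window_view v = { 2097152, 2097152 };

    for (uint32_t i = 0; i < n; i++) {
        v.receiver_window -= consumed;
        v.sender_credit -= consumed;
        announce(&v, consumed, margin);
    }
    return v.sender_credit - v.receiver_window;
}
```

In this model drift_after(1000, 114688, 2000) is 2,000,000: after a thousand announcements the sender believes it has about 2 MB more credit than the receiver has accounted for, and the gap keeps growing with every further announcement.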

The one-line change only works if the transfer finishes before the accumulated margin overflows. Fixing that is not hard; it just takes a few more lines of code:

static int
channel_check_window(struct ssh *ssh, Channel *c)
{
        int r;
+       int extra = 0;
        
        if (c->type == SSH_CHANNEL_OPEN &&
            !(c->flags & (CHAN_CLOSE_SENT|CHAN_CLOSE_RCVD)) &&
            ((c->local_window_max - c->local_window >
            c->local_maxpacket*3) ||
            c->local_window < c->local_window_max/2) &&
            c->local_consumed > 0) {
+               // Cannot exceed the size of 8388608 set by SO_RCVBUF
+               if (c->local_window_max < 8000000) { 
+                       extra = 200000; 
+                       c->local_window_max += extra;
+               }
                if (!c->have_remote_id)
                        fatal_f("channel %d: no remote id", c->self);
                if ((r = sshpkt_start(ssh,
                    SSH2_MSG_CHANNEL_WINDOW_ADJUST)) != 0 ||
                    (r = sshpkt_put_u32(ssh, c->remote_id)) != 0 ||
-                   (r = sshpkt_put_u32(ssh, c->local_consumed)) != 0 ||
+                   (r = sshpkt_put_u32(ssh, c->local_consumed + extra)) != 0 ||
                    (r = sshpkt_send(ssh)) != 0) {
                        fatal_fr(r, "channel %i", c->self);
                }
                debug2("channel %d: window %d sent adjust %d", c->self,
                    c->local_window, c->local_consumed);
                c->local_window += c->local_consumed;
+               // local_window plus the remainder.
+               c->local_window += extra;
                c->local_consumed = 0;
        }
        return 1;
}

OK, problem solved.

Although more than one line was changed in the end, it is still not many, probably fewer than 10. But it is only a POC after all. The open question is how the margin should adapt to different environments. That is not hard either: it is enough to work out the headroom left in the kernel's TCP buffer, and the TCP buffer size can be obtained through getsockopt.
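
Reading the buffer size is the easy part. The following is a minimal sketch of the getsockopt call; the wrapper name socket_buffer_size is made up for illustration, and how to turn the result into a margin is left open.

```c
#include <sys/socket.h>

/*
 * Return the kernel's current buffer size in bytes for a socket-level
 * option such as SO_RCVBUF or SO_SNDBUF, or -1 on error.
 *
 * Hypothetical helper for illustration; not part of OpenSSH.
 */
int socket_buffer_size(int fd, int optname)
{
    int size = 0;
    socklen_t len = sizeof(size);

    if (getsockopt(fd, SOL_SOCKET, optname, &size, &len) != 0)
        return -1;
    return size;
}
```

For example, socket_buffer_size(fd, SO_RCVBUF) on the connection's socket returns the receive buffer size. Note that on Linux the kernel doubles the value passed to setsockopt to leave room for bookkeeping overhead, and getsockopt returns the doubled value, so not all of it is usable payload space.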

Ultimately, it is about the protocol, not the code; about craft, not tools.

If you don't understand that SSH multiplexes multiple channels over one TCP connection, you won't know about per-channel flow control, which is the root cause of scp being app-limited. Without understanding this, it is hard to find which line to change. However well you handle code and tools, and however fluent you are in programming languages, it is all for nothing if you don't understand the protocol.

As for turning this into a proper, engineered solution, I am out of my depth, and I am not aware of all the potential dangers of the modification above. I admit I am not the one to do the big things.

Fortunately, after consulting various resources, I found that people had recognized this problem as early as 2004 and done great work on it, namely HPN-SSH:
https://www.psc.edu/hpn-ssh-home/

I also found an explanation for this by the author of HPN-SSH:
https://stackoverflow.com/questions/8849240/why-when-i-transfer-a-file-through-sftp-it-takes-longer-than-ftp

Here is an overview of SSH performance issues:
http://www.allanjude.com/bsd/AsiaBSDCon2017_-_SSH_Performance.pdf

I look forward to Chris Rapier's HPN-SSH being merged into mainline OpenSSH as soon as possible.

Okay, that's the story I'm going to tell this week.

Zhejiang Wenzhou leather shoes are wet, so you won't get fat when it rains.

Posted by pixy on Fri, 24 Sep 2021 14:52:44 -0700