A thorough look at the 1024 limit of Linux select (is select really limited to 1024? No)

Keywords: Linux Windows socket less

Many years ago, an interviewer asked me: why does the select call support at most 1024 file descriptors?

I couldn't answer. I didn't even know what select did.

In the years that followed, I asked candidates the same question...

By then I already had an answer in mind that would satisfy me, and it was roughly:

  • Macros in the Linux kernel limit fd_set to 1024 at most...

Talk is cheap, so I could also show you the code:

// include/uapi/linux/posix_types.h
#define __FD_SETSIZE    1024

typedef struct {
    unsigned long fds_bits[__FD_SETSIZE / (8 * sizeof(long))];
} __kernel_fd_set;

Well, yes. Back then, like many people, I had read some Linux kernel code and understood it, and then began to feel that I understood everything.

So the "correct" answer went: if you want to break through the fd_set limit of 1024, recompile the kernel!

It's a little humiliating to think about this after so many years. I used to impress people by reading the Linux kernel source code; I was a "source analyst". Before I understood a problem in depth, I would hold forth on it based on the source code alone.

It's embarrassing to imagine someone reading such a write-up, redefining the __FD_SETSIZE value and recompiling the kernel.

It was so simple to check, yet it never occurred to me to try it! Wouldn't I have known if I had just tried it myself? Why did I accept what others said every day without questioning it? I really ought to rebuke my former self; if I could meet him, I'd give him a piece of my mind!

That is how a member of the all-talk school was converted to the engineering school.

I don't write without an experiment, so whatever I write today must come with something you can see and touch.

Is select really limited by 1024?

It's such a simple thing, so just try it. The experiments start on the Linux platform.

From the definitions of fd_set and the FD_SET macro, we can see that fd_set is a bitmap stored as an array and FD_SET is a bit-set operation on it. I will not analyze the source code, but go straight to the following experiment:

#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>

int main(int argc, char **argv)
{
	// The unused locals are kept on purpose: they shape the stack layout this experiment relies on.
	int i = 2000, j;
	unsigned char gap = 0x12; // Anchor value; the out-of-bounds FD_SET below is expected to modify it.
	fd_set readfds;
	unsigned char *addr = &gap;
	int sd[2500];
	unsigned long dist;
	unsigned total;

	printf("gap value is :0x%x\n", gap);
	// dist is the number of bytes between readfds and the nearby gap, i.e. the most room readfds could use.
	// This assumes the compiler placed gap above readfds on the stack.
	dist = (unsigned long)&gap - (unsigned long)&readfds;
	FD_ZERO(&readfds);
	// Setting dist*8 + 1 bits runs exactly one bit past that room, into gap.
	// gap is 0x12 (binary 10010), so the overflowing FD_SET should set its lowest bit.
	// The result is 0x13.
	for (j = 0; j < dist*8 + 1; j++) {
		sd[j] = j;
		FD_SET(sd[j], &readfds);
	}
	printf("j %d .", j);
	printf("after FD_SET. gap value is :0x%01x   bytes space:%lu\n", gap, dist);
	return 0;
}

Look at the execution results:

[root@localhost select]# ./set
gap value is :0x12
j 1145 .after FD_SET. gap value is :0x13   bytes space:143

As expected.

This means that FD_SET itself does not enforce the 1024 limit: it will write past the end of fd_set without complaint, whatever the consequences of doing so.

In fact, if you do read the source code, it confirms this:

// /usr/include/sys/select.h
#define FD_SET(fd, fdsetp)  __FD_SET (fd, fdsetp)

// /usr/include/sys/select.h
#define __NFDBITS   (8 * (int) sizeof (__fd_mask))
#define __FD_ELT(d) ((d) / __NFDBITS)
#define __FD_MASK(d)    ((__fd_mask) 1 << ((d) % __NFDBITS))

// /usr/include/bits/select.h
#define __FD_SET(d, set) \
  ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d)))

As you can see, there is no check against 1024 at all! It is just an index and a shift!

If the example above does not make the out-of-bounds overwrite obvious, look at the following:

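#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>

unsigned char stub = 0x65; // ASCII 'e'

int main(int argc, char **argv)
{
	int i = 2000, j;
	unsigned char *pgap = &stub;
	fd_set readfds;
	int sd[2500];
	unsigned long dist;
	unsigned total;

	// We never assign to pgap again from here to the end.
	printf("gap value is :%c\n", *pgap);
	// Byte distance from readfds to the pgap pointer itself. This assumes the
	// compiler placed pgap above readfds on the stack; the layout is not guaranteed.
	dist = (unsigned long)&pgap - (unsigned long)&readfds;
	FD_ZERO(&readfds);
	// Run past the end of readfds until every bit of the pgap pointer has been set.
	for (j = 0; j < (dist + sizeof(pgap)) * 8; j++) {
		sd[j] = j;
		FD_SET(sd[j], &readfds);
	}
	printf("gap value is :%c\n", *pgap);
}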

We never touch the pgap pointer ourselves, and expect the out-of-bounds FD_SET to overwrite it:

[root@localhost select]# ./null
gap value is :e
//Segmentation fault

With the pgap pointer overwritten, dereferencing it is, of course, a segmentation fault.

As for how much room there is between readfds and its neighbors on the stack, that depends on the 1024-bit size and the alignment constraints together. In our experiment, it is:

&pgap - &readfds;

Which variables are overwritten, and how, depends on the layout of local variables on the stack.

The conclusion so far: calling FD_SET with a value of 1024 or more writes out of bounds, but the overwrite is not necessarily fatal (for example, if you never touch gap or pgap afterwards).

This is what the select manual page means when it says:

The behavior of these macros is undefined if a descriptor value is less than zero or
greater than or equal to FD_SETSIZE, which is normally at least equal to the maximum
number of descriptors supported by the system.

OK, we already know that FD_SET will write out of bounds. Next question: when the set contains a file descriptor above 1024, will select still work correctly?

Another experiment verifies this. In it, we move the variables off the stack so that they are not affected by fd_set overflowing its bounds:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>

#define SIZE 1200
// These variables are no longer on the stack, so they cannot be overwritten.
int i = 1001, j;
int sd[SIZE];
struct sockaddr_in serveraddr;
int main(int argc, char **argv)
{
	// readfds comes first; whatever it overflows into is memory we no longer care about.
	fd_set readfds;
	int childfd;

	FD_ZERO(&readfds);
	for (j = 0; j < SIZE; j++) {
		sd[j] = socket(AF_INET, SOCK_STREAM, 0);
		serveraddr.sin_family = AF_INET;
		serveraddr.sin_addr.s_addr = htonl(INADDR_ANY);
		serveraddr.sin_port = htons(i++);
		bind(sd[j], (struct sockaddr *) &serveraddr, sizeof(serveraddr));
		listen(sd[j], 5);
		FD_SET(sd[j], &readfds);
	}

	while (1) {
		// nfds is 1100, which is already beyond 1024...
		// Note: select overwrites readfds with the ready set, so only the first round watches everything.
		if (select(1100, &readfds, 0, 0, 0) < 0) {
			perror("ERROR in select");
		}
		for (j = 0; j < SIZE; j++) {
			// Descriptors >= 1100 were never passed to select, so their bits say nothing about readiness; skip them.
			if (sd[j] < 1100 && FD_ISSET(sd[j], &readfds)) {
				childfd = accept(sd[j], NULL, NULL);
				printf("#### %d\n", j);
				close(childfd);
			}
		}
	}
}

Clearly, select's first parameter, 1100, exceeds 1024. So what is the result?

[root@localhost ~]# telnet 127.0.0.1 2050
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
Connection closed by foreign host.

The connection succeeds; a file descriptor above 1024 still works:

[root@localhost select]# ./selectserver
#### 1049

How many file descriptors a select call hands to the kernel depends on its first parameter:

#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>

int num = 1024;
int i;

int main(int argc, char **argv)
{
	fd_set readfds;

	if (argc > 1)
		num = atoi(argv[1]);
	FD_ZERO(&readfds);
	// For num > 1024 this deliberately writes past the end of readfds.
	for (i = 0; i < num; i++) {
		FD_SET(i, &readfds);
	}

	if (select(num, &readfds, 0, 0, 0) < 0) {
		perror("ERROR in select");
	}
}

Run it and see:

[root@localhost select]# strace -e trace=select ./num 1234
select(1234, [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 ... 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233], NULL, NULL, NULL) = -1 EBADF (Bad file descriptor)
# Ignore this error because I didn't really create a socket
ERROR in select: Bad file descriptor
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xffffffffffffff07} ---
+++ killed by SIGSEGV +++

All of this happens in user space.

The 1024 limit is only a POSIX convention. If you do not abide by it, the out-of-bounds consequences are yours to bear.

What about kernel space? To be honest, the fd_set the kernel sees is just the bitmap itself, and it carries no such limit.

What if you want to break the 1024 limit?

  • Allocate the fd_set on the heap with malloc/mmap, as big as you like!

Let's try:

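#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>

#define MAX_FDS 8000	// capacity of the heap-allocated bitmap, in bits

int num = 1024;

int main(int argc, char **argv)
{
	// We move the variables back onto the stack because they will not be overwritten!
	unsigned char *pgap = (unsigned char *)&num;
	fd_set *readfds;
	int childfd;
	int i = 1000, j, maxfd = -1;
	int sd[10000];
	struct sockaddr_in serveraddr;

	// calloc zeroes the whole bitmap; FD_ZERO would only clear sizeof(fd_set) bytes of it.
	readfds = (fd_set *)calloc(1, MAX_FDS/8);
	if (argc < 2 || readfds == NULL)
		return 1;
	num = atoi(argv[1]);	// keep num well below MAX_FDS so that every descriptor fits

	memset(&serveraddr, 0, sizeof(serveraddr));
	printf("pgap :%p\n", (void *)pgap);
	for (j = 0; j < num; j++) {
		sd[j] = socket(AF_INET, SOCK_STREAM, 0);
		serveraddr.sin_family = AF_INET;
		serveraddr.sin_addr.s_addr = htonl(INADDR_ANY);
		serveraddr.sin_port = htons(i++);
		bind(sd[j], (struct sockaddr *) &serveraddr, sizeof(serveraddr));
		listen(sd[j], 5);
		FD_SET(sd[j], readfds);
		if (sd[j] > maxfd)
			maxfd = sd[j];
	}
	printf("after setting, pgap :%p\n", (void *)pgap);

	while (1) {
		// nfds must be the highest descriptor plus one.
		if (select(maxfd + 1, readfds, 0, 0, 0) < 0) {
			perror("ERROR in select");
		}
		for (j = 0; j < num; j++) {
			if (FD_ISSET(sd[j], readfds)) {
				childfd = accept(sd[j], NULL, NULL);
				printf("#### %d\n", j);
				close(childfd);
			}
			// select cleared the bits that were not ready, so re-arm each descriptor.
			FD_SET(sd[j], readfds);
		}
	}
}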

Let's run it:

[root@localhost select]# ulimit -a |grep open
open files                      (-n) 20000
[root@localhost select]# ./a.out 5000
pgap :0x601084
after setting, pgap :0x6010840xb1b010

TCP connection:

[root@localhost ~]# telnet 127.0.0.1 3050
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
Connection closed by foreign host.

[root@localhost select]# ./a.out 5000
pgap :0x601084
after setting, pgap :0x6010840xb1b010
#### 2050

It works!

Next, let's look at the behavior of select on the Windows platform.

I don't have a Windows environment, and I normally have nothing to do with development or debugging on Windows. I may fumble my way through this and get things wrong.

If people on the Linux platform like to boast of having read the kernel source code, on the Windows platform the equivalent is MSDN; and just as I dislike Linux source analysis, I don't like MSDN documents either. (Of course, I'm not qualified to comment on the deeper matters of the Windows platform, so the less said the better.)

So all I could do was download Dev-C++ into my Win8 virtual machine and tinker a little. My code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

int var = 0x1234; // Left over from the Linux test code; ignore it.
/* run this program using the console pauser or add your own getch, system("pause") or input loop */
int main(int argc, char *argv[]) {

    fd_set fset;

    printf("size of fset:%d   %d\n", (int)sizeof(fset), FD_SETSIZE);

    FD_ZERO(&fset); // Set a breakpoint here, then step through to see how FD_SET behaves!

    FD_SET(0, &fset);
    FD_SET(1, &fset);
    FD_SET(3, &fset);

    return 0;
}

In fact, my first test code was not the program above but the one from Linux. However, no matter how hard I tried, I could not reproduce the overwriting effect, so I decided to first figure out how the fd_set structure works on the Windows platform before saying anything. That is why I switched to the code above, which prints the size of fset and FD_SETSIZE. Is there any 1024 restriction on the Windows platform?

To my surprise, FD_SETSIZE on the Windows platform is only 64 (not 1024)! But fd_set is 520 bytes big!

64*8 = 512 bytes, plus 8 more bytes makes 520, so my guess was that where Linux uses a bitmap, Windows uses a byte map.

The so-called byte map is essentially the same idea as a bitmap, only one level up: each entry can be operated on as a whole in a uniform way, whereas efficient bit operations have to take many platform-specific characteristics into account.

Let's switch to debug mode, set a breakpoint at FD_ZERO, and step through to confirm:

OK, perfect! With such a simple data structure, a little experience is enough to hack on it directly.

Thus, Windows does limit the maximum number of file descriptors for a select, namely FD_SETSIZE=64. What if we want to break this limit?

This is simpler than on Linux: isn't it just the FD_SETSIZE macro definition? Redefine it and that's it!

OK, the puzzle is completely unraveled.

Let's briefly summarize:

  • select/fd_set on the Linux platform
    • The POSIX interface is limited to 1024 file descriptors.
    • The bit index of fd_set is the file descriptor index.
    • The Linux kernel has no such limitation.
    • POSIX's 1024 limit is easy to break through on the stack, but doing so may overwrite neighboring data out of bounds.
    • Breaking POSIX's 1024 limit properly takes heap memory, as big as you want it to be.
    • The size of fd_set in a select call depends on the first parameter.
  • Windows
    • windows.h limits select to 64 file descriptors by default.
    • The array element of fd_set is the file descriptor, and the array subscript is the file descriptor index.
    • Whether the kernel imposes a limit is unknown; I was not able to debug it.
    • The 64 limit of windows.h cannot be broken through directly; FD_SETSIZE needs to be redefined before the header file is included.
    • The size of fd_set in a select call depends on the fd_count field of its structure.

...

After this exploration, don't blindly trust whatever Linux kernel source is put in front of you. It does not tell the whole story. Besides the Linux kernel source there is glibc, there are various intermediate libraries, there are even your own bugs, and you are not necessarily running Linux at all... Things are far more complex than the Linux kernel source covers.

So, indeed, being able to read the Linux kernel and write a few comments as "source analysis" is not a big deal.

My experiments on Linux are all based on 2.6.8, 2.6.18, 2.6.32 and 3.10. Windows XP, Win7/8 and even DoS are old-fashioned platforms that don't innovate, patch, community, manager and artist.

Zhejiang Wenzhou leather shoes are wet, so you won't get fat when it rains.

Posted by mash on Mon, 04 May 2020 05:49:53 -0700