Solving the problem of Go creating too many threads for file operations

Keywords: Go Linux Optimize TCP/IP

GitHub code repository

A remote file service with performance tests, implemented in Go. It records the tuning process and solves the problem of Go creating too many threads when handling blocking file IO.

1. Configuration

The client and the server each have a config.json configuration file, described below:

 {
  "Port":"9999", // Service port number
  "UserProtoMsg":true, // Whether to use the protobuf protocol; otherwise the JSON protocol is used
  "PageSize":4096, // Maximum amount of data read per connection
  "MaxOpenFiles":102400, // Maximum number of file handles the service process may use
  "LogCountPerFile":2000000, // Maximum number of log lines recorded in a single log file
  "DataDir":"xx", // On the server, the parent directory where files are stored
  "UserPoolIoSched":"G", // c: use the C thread pool for file IO; G: use the Go goroutine pool; anything else: no IO pool
  "IoThreads":16, // Number of threads / goroutines in the IO pool
  "PriorIoThreads":3, // Number of high-priority cgo IO threads
  "SetCpuAffinity":true, // Set CPU affinity for the cgo IO threads
  "WaitingQueueLen":1000000 // Length of the buffered task queue of the IO pool
 }

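For reference, here is a minimal Go sketch of how a process might load this file. The Config type and the ParseConfig and applyMaxOpenFiles functions are hypothetical names, not the project's code. The sketch also assumes the real config.json contains no // comments, because encoding/json rejects them. MaxOpenFiles is applied with setrlimit(RLIMIT_NOFILE), which is one common way to raise a process's file-handle limit on Linux.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"syscall"
)

// Config mirrors the keys of config.json (hypothetical type, not the project's own).
type Config struct {
	Port            string `json:"Port"`
	UserProtoMsg    bool   `json:"UserProtoMsg"`
	PageSize        int    `json:"PageSize"`
	MaxOpenFiles    uint64 `json:"MaxOpenFiles"`
	LogCountPerFile int    `json:"LogCountPerFile"`
	DataDir         string `json:"DataDir"`
	UserPoolIoSched string `json:"UserPoolIoSched"`
	IoThreads       int    `json:"IoThreads"`
	PriorIoThreads  int    `json:"PriorIoThreads"`
	SetCpuAffinity  bool   `json:"SetCpuAffinity"`
	WaitingQueueLen int    `json:"WaitingQueueLen"`
}

// ParseConfig decodes a config.json body. encoding/json does not accept
// comments, so the file on disk is assumed to be plain JSON.
func ParseConfig(data []byte) (*Config, error) {
	var c Config
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

// applyMaxOpenFiles sets RLIMIT_NOFILE for the current process (Linux).
// Raising the hard limit above its current value normally requires root.
func applyMaxOpenFiles(n uint64) error {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return err
	}
	rl.Cur = n
	if rl.Max < n {
		rl.Max = n
	}
	return syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl)
}

func main() {
	data, err := os.ReadFile("config.json")
	if err != nil {
		fmt.Println("read config:", err)
		return
	}
	cfg, err := ParseConfig(data)
	if err != nil {
		fmt.Println("parse config:", err)
		return
	}
	if err := applyMaxOpenFiles(cfg.MaxOpenFiles); err != nil {
		fmt.Println("setrlimit:", err)
	}
	fmt.Printf("%+v\n", *cfg)
}
```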
2. Server

The code is in the fs_server directory. The server provides file upload, download, query, protocol performance testing, and other services.


$ make clean
$ make
$ ./FsServer 
current thread id: 140694714423040, CPU 0
current thread id: 140694689244928, CPU 3
current thread id: 140694706030336, CPU 1
current thread id: 140694337353472, CPU 4
current thread id: 140694320568064, CPU 1
current thread id: 140694697637632, CPU 2
Use C io thread pool.......
raise pprof http server....

3. Client

The code is in the fs_client directory. The client is mainly used to test the performance of the various interfaces and does not provide regular client functionality. For uploads and downloads it works only in memory and does not actually read or write files on the local disk, following the performance-test approach of FastDFS. If the network bandwidth between client and server is lower than the local disk throughput, it is recommended to run the client and server on the same machine to reduce the influence of the network.


$ make clean
$ make
$ ./FsClient
[clean/bench/upload/download/exist/delete] corotine_num loop_num [file_size]

Example run:

$ ./FsClient upload 500 2000000 5000
Parmas:  &{3 500 2000000 5000}
Total success tasks:  2000000 , cost:  252746 ms.
qps:  7913.0826996272945
  • Test interfaces:
    • clean: cleans up files that may be left over from previous performance tests
    • bench: tests protocol performance
    • upload: tests upload performance
    • download: tests download performance
    • exist: tests query performance
    • delete: tests deletion performance
  • corotine_num: concurrency, i.e., the number of goroutines used
  • loop_num: total number of requests
  • file_size: size of the file in each request (used only by the upload interface; can be omitted for the others)

4. Solving Go's IO thread problem

Anyone familiar with Go's scheduling mechanism knows that in Linux, if a file's f_op (its struct file_operations) does not implement the poll function, the file cannot be used with epoll, poll, or other multiplexing mechanisms; ordinary disk files fall into this category. For such files Go falls back to blocking system calls: if an IO operation blocks long enough to hold up the other goroutines (G) scheduled on the same thread (M), the runtime leaves that thread to the blocked IO and starts another thread (creating a new one if no idle thread is available) to run the remaining goroutines. In a remote file service the bottleneck is disk performance, i.e., file IO, so this eventually leads to a large number of threads being created, one after another. By default a Go program may use at most 10,000 threads and crashes if it exceeds that limit, so supporting more than about 10k concurrent blocking IO operations requires raising the limit with runtime/debug.SetMaxThreads. Beyond that, the system may impose further restrictions through ulimit -u.
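To see this effect, the following Linux-only sketch blocks n goroutines in raw read(2) calls on a pipe created with syscall.Pipe. Those descriptors are never registered with Go's netpoller, so each read blocks its OS thread the same way disk IO does, and the threadcreate profile should then report at least n threads. It also shows runtime/debug.SetMaxThreads, the knob for the 10,000-thread limit. The function names are illustrative, and the pipe merely stands in for a slow disk file.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"runtime/pprof"
	"sync"
	"sync/atomic"
	"syscall"
	"time"
)

// blockedReadersThreads starts n goroutines that each block in a raw read(2)
// on one pipe. The fds come from syscall.Pipe and are never registered with
// the netpoller, so, like disk file IO, each read blocks its OS thread.
// It returns the number of threads created while all n reads are blocked,
// then unblocks them.
func blockedReadersThreads(n int) (int, error) {
	fds := make([]int, 2)
	if err := syscall.Pipe(fds); err != nil {
		return 0, err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	var started int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]byte, 1)
			atomic.AddInt32(&started, 1)
			syscall.Read(fds[0], buf) // blocks this goroutine together with its thread
		}()
	}
	// Wait until every goroutine has run, then give the runtime time to hand
	// the processors of blocked threads over to other threads.
	for atomic.LoadInt32(&started) < int32(n) {
		time.Sleep(time.Millisecond)
	}
	time.Sleep(200 * time.Millisecond)
	threads := pprof.Lookup("threadcreate").Count()

	// Writing n bytes wakes every reader (each reads exactly one byte).
	if _, err := syscall.Write(fds[1], make([]byte, n)); err != nil {
		return threads, err
	}
	wg.Wait()
	return threads, nil
}

// raiseThreadLimit wraps debug.SetMaxThreads and returns the previous limit.
func raiseThreadLimit(n int) int {
	return debug.SetMaxThreads(n)
}

func main() {
	// In a fresh process the previous limit is the default of 10000.
	fmt.Println("previous thread limit:", raiseThreadLimit(20000))
	threads, err := blockedReadersThreads(32)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("threads created while 32 reads were blocked:", threads)
}
```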

The core difficulty is that Go's scheduling mechanism is not well suited to workloads with heavy file IO.

This project tackles the problem by combining a thread pool, cgo, system performance analysis tools, and pprof, Go's powerful profiling tool.

The main approach:

  • Wrap file IO operations as tasks, separately wrapping each Go interface that can lead to thread creation;
  • Wait on a channel for each file IO task to complete, so the waiting goroutine yields its thread to other goroutines instead of causing a new thread to be created while the IO takes too long;
  • Execute the file IO tasks in one of two ways:
    • (1) with cgo, hand the task to a thread pool written in C, whose threads call glibc interfaces to perform the disk IO;
    • (2) put the task into a goroutine pool for execution.

4.1 Data examples

  • PAGE_SIZE = 4K, request file size 50K, 200000 requests, 500 concurrent

Implementation                                            | Time (ms) | QPS  | System threads created
No IO pool                                                | 169934    | 1176 | 512
Go goroutine pool: 32                                     | 173167    | 1154 | 48
C thread pool: 32, with CPU affinity, no priority threads | 191358    | 1045 | 50
C thread pool: 32, no CPU affinity, no priority threads   | 181389    | 1102 | 50

5. Performance test records

Virtual machine specifications / log performance

fio disk performance test

Rename optimization

Protocol resolution performance

Raw Go IO interface performance

IO pooling performance

6. Conclusion

The idea of using an IO pool comes from the FastDFS project; if you are interested in FastDFS (implemented in C), it is worth a look. There is also a Go implementation of FastDFS on GitHub, but I did not see any code in it that addresses the problem of creating too many threads. Maybe I didn't read it carefully enough.

The goroutine-based pool simply lets its fixed set of pool goroutines be promoted to threads when they block in IO operations, which is enough to keep the thread count bounded. Its advantages are a simple implementation and a clear design.

The cgo-based IO thread pool requires dealing with the overhead of calling from Go to insert IO tasks into the task queue in C, lock contention, the thread pool implementation itself, notifying the caller when a task completes, and so on. Its advantages: it brings a certain sense of accomplishment, and the C code can control thread properties (priority, CPU affinity), thread lifecycle, and so on.

I hope this project is of some help to you, and I look forward to your advice and opinions.


Posted by sarabjit on Mon, 18 Oct 2021 11:34:54 -0700