Troubleshooting ES client request timeouts

Keywords: computer networks

1. Phenomenon

The three production ES nodes sit behind Nginx as a proxy, and the client -> Nginx -> ES connections are all long-lived (keep-alive) connections. The ES client logs roughly 30 timeouts per day, caused by the client's 60 s request timeout firing and actively raising an error. The error is shown below (the timeouts occur on searches, get-by-ID requests, and indexing):

java.net.SocketTimeoutException: 60,000 milliseconds timeout on connection http-outgoing-922 [ACTIVE]
	at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:808)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235)
	at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514)
	at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1484)
	at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1454)
	at org.elasticsearch.client.RestHighLevelClient.update(RestHighLevelClient.java:921)

Since the client sets its socket timeout to 60 seconds, we can be sure the request returned no data for a full minute before the error was raised. The corresponding error log in Nginx is as follows:

2021/05/20 13:26:58 [warn] 491406#491406: *94078113 an upstream response is buffered to a temporary file /data/openresty/nginx/proxy_temp/1/65/0000049651 while reading upstream, client: 10.81.127.209, server: 10.30.223.84, request: "POST /spprjm_customer/_search HTTP/1.1", upstream: "http://10.29.56.90:39200/spprjm_customer/_search", host: "10.30.223.84:86"
2021/05/20 13:26:59 [info] 491406#491406: *94076470 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while connecting to upstream, client: 10.45.179.90, server: 10.30.223.84, request: "POST /enterprise_v1/_search HTTP/1.1", upstream: "http://10.30.222.121:39200/enterprise_v1/_search", host: "10.30.223.84:86"

10.45.182.231 - - [21/Jun/2021:16:41:01 +0800] "POST /person_search_user_v1/_search HTTP/1.1" "/person_search_user_v1/_search" "60.052" "-" 499 0 "-" "Apache-HttpAsyncClient/4.1.4 (Java/1.8.0_144)"

The epoll_wait() entries here correspond one-to-one with the timeout exceptions in the service: after waiting 60 s the client closes the long-lived connection, so Nginx closes its upstream connection as well and finally records a 499 status code.
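
For reference, the 60 s limit comes from the client's socket timeout. A minimal sketch of how such a timeout is typically configured on the high-level REST client is shown below; the host and port are taken from the logs above for illustration only, not the actual production setup:

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class EsClientFactory {
    public static RestHighLevelClient build() {
        return new RestHighLevelClient(
                RestClient.builder(new HttpHost("10.30.223.84", 86, "http"))        // the Nginx proxy in front of ES
                        .setRequestConfigCallback(c -> c.setSocketTimeout(60_000))); // 60 s without data on the socket
                                                                                     // raises the SocketTimeoutException above
    }
}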

2. Troubleshooting the problem

Seeing these errors, I searched the Internet for answers. Here is what turned out to be useful:

The server has net.ipv4.tcp_keepalive_time = 1200 configured, which rules out the network being blocked because a firewall times out idle connections too early. A firewall's connection timeout is usually around an hour (I could not find documentation on how this is configured; the iptables connection-tracking default appears to be a 600 s timeout, and with my limited access there was no way to learn the real value on the server). Since TCP keepalive probes the connection repeatedly after 1200 s of idleness, a firewall dropping the connection prematurely should not be the problem.
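
For context, the keepalive behaviour referred to above is controlled by the following kernel parameters; the 1200 s value is the one from this investigation, while the probe interval and count shown are the stock Linux defaults:

# /etc/sysctl.conf
net.ipv4.tcp_keepalive_time = 1200   # start sending keepalive probes after 1200 s of idle time
net.ipv4.tcp_keepalive_intvl = 75    # seconds between probes (Linux default)
net.ipv4.tcp_keepalive_probes = 9    # unanswered probes before the connection is declared dead (Linux default)

The values take effect after running sysctl -p.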

I then pointed the investigation at the ES service itself. Monitoring in Kibana showed that recent request latency was not high, but there was a latency spike coinciding with the client timeouts, which suggested the problem was still related to ES. I found an article introducing the monitoring metrics, from which I gathered that the values shown are averages. So I enabled the ES slow-request log to see how large the latencies actually were. The configuration is as follows:

PUT /_all/_settings?preserve_existing=true
{
  "index.indexing.slowlog.level" : "info",
  "index.indexing.slowlog.source" : "1000",
  "index.indexing.slowlog.threshold.index.info" : "100ms",
  "index.search.slowlog.threshold.fetch.info" : "100ms",
  "index.search.slowlog.threshold.query.info" : "100ms"
}

After several hours of waiting, production finally reported another business timeout, and I checked the ES slow log. The maximum latency recorded was only about 1 s (it is also possible that the offending request was never logged at all; this needs careful verification). So it does not appear to be a slow ES response, and at this point I was out of ideas.

3. Reproducing the problem in the test environment

I started multiple threads locally to query ES (with the client connection pool configured large enough; otherwise the threads simply block while acquiring a connection and the request timeout is never reproduced). At 50 threads the downlink bandwidth reached 100%, and at 200 threads 10 s timeouts began to occur frequently. Inspecting the thread states with jvisualvm showed a large number of threads parked for long stretches (I/O waits caused by the saturated bandwidth), with the timeout exception thrown once a thread woke up. The Nginx error log, however, differed from production: production shows only the epoll_wait() message, while the test environment produced the errors below, and an entry consistent with production appeared only when I force-killed the program (a sketch of the reproduction program follows the log excerpt):

2021/06/21 15:07:52 [info] 14749#0: *58644745 recv() failed (104: Connection reset by peer) while sending to client, client: 172.16.9.119, server: 172.16.1.78, request: "POST /prep_statistics_service_process_detail/_search HTTP/1.1", upstream: "http://172.16.1.78:9203/prep_statistics_service_process_detail/_search", host: "172.16.1.78:86"
2021/06/21 20:01:13 [info] 14749#0: *61424251 writev() failed (104: Connection reset by peer) while sending to client, client: 172.16.20.14, server: 172.16.1.78, request: "POST /prep_srms_service_record_group/_search HTTP/1.1", upstream: "http://172.16.1.78:9201/prep_srms_service_record_group/_search", host: "172.16.1.78:86"
2021/06/21 15:07:52 [info] 14749#0: *58658403 client 172.16.9.119 closed keepalive connection (104: Connection reset by peer)
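
For reference, the local reproduction was roughly along the following lines; the proxy address and 10 s timeout come from the test setup above, while the index name and pool sizes are illustrative placeholders rather than the exact program used:

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EsTimeoutReproducer {
    public static void main(String[] args) {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("172.16.1.78", 86, "http"))        // test Nginx proxy from the logs above
                        .setRequestConfigCallback(c -> c.setSocketTimeout(10_000)) // the 10 s timeout seen in the test
                        .setHttpClientConfigCallback(h -> h.setMaxConnTotal(400)   // pool large enough that threads
                                .setMaxConnPerRoute(400)));                        // never block acquiring a connection
        ExecutorService pool = Executors.newFixedThreadPool(200);                  // 200 query threads
        for (int i = 0; i < 200; i++) {
            pool.submit(() -> {
                while (true) {
                    try {
                        client.search(new SearchRequest("test_index"), RequestOptions.DEFAULT); // hypothetical index name
                    } catch (Exception e) {
                        System.err.println(e.getMessage()); // SocketTimeoutException shows up here under load
                    }
                }
            });
        }
    }
}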

After some thought, my explanation is that while a client thread is parked, Nginx is still reading and writing on that request; when the client then times out and closes the connection, Nginx may be in the middle of a read or write, which produces the network errors above.

I then moved the test program to a virtual machine with a better network. Its CPU was saturated at 200 threads, and even with 500 threads only 50% of the bandwidth was used; at that point the 5 s timeouts I expected appeared. jvisualvm analysis again showed threads parked for long periods, this time long waits caused by CPU contention.

So I turned to the load on the application servers. The historical monitoring data showed that GC pauses were short, CPU load was low, and only about 10% of the access bandwidth was used, which again left me without a lead. However, the test team happened to be running a search load test against production today, and it produced roughly 100 of the 60 s request-timeout logs in about four hours, which makes me suspect that high ES load is the trigger.

Summary:

  • Nginx's proxy_next_upstream defaults to error timeout
  • The production proxy_connect_timeout of 3600 seconds is unreasonable (the client actively reports an error after 60 s), so Nginx's retry mechanism never gets a chance to run
  • A more reasonable configuration is to set proxy_connect_timeout 10 so that a connection timeout fires quickly and Nginx retries the next machine, giving the ES client a good chance of completing the request normally instead of waiting the full 60 s (see the sketch below)
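
A minimal sketch of what such an Nginx configuration could look like; the upstream addresses and listen port are taken from the logs above for illustration only, and the remaining directives are assumptions rather than the real production nginx.conf:

upstream es_cluster {
    server 10.29.56.90:39200;
    server 10.30.222.121:39200;
    keepalive 64;                            # keep long-lived connections to the ES nodes
}

server {
    listen 86;
    location / {
        proxy_pass http://es_cluster;
        proxy_http_version 1.1;
        proxy_set_header Connection "";      # required for keepalive to the upstream
        proxy_next_upstream error timeout;   # the default: retry the next upstream on error or timeout
        proxy_connect_timeout 10s;           # fail fast instead of 3600 s, so a retry can finish
                                             # well within the client's 60 s socket timeout
    }
}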

Posted by Snewzzer on Mon, 22 Nov 2021 00:25:58 -0800