Spring Cloud Gateway failure retry mechanism

Keywords: Java Spring Spring Boot

Spring Cloud Gateway can provide a failure retry mechanism for all routes at once. "Global" here means default filters, which are applied to every route.

# Enable retry for all routes:
spring.cloud.gateway.default-filters[0].name=Retry

# With the default parameters only GET requests are retried; list the methods explicitly to also retry POST
spring.cloud.gateway.default-filters[0].args.methods[0]=GET
spring.cloud.gateway.default-filters[0].args.methods[1]=POST

Spring Cloud Gateway does not start retrying just because spring-retry is added as a dependency, nor because some timeout parameters are set. The Retry filter has to be configured explicitly.
Let's take a look at the Spring documentation:

6.26. The Retry GatewayFilter Factory

The Retry GatewayFilter factory supports the following parameters:

  • retries: The number of retries that should be attempted.

  • statuses: The HTTP status codes that should be retried, represented by using org.springframework.http.HttpStatus.

  • methods: The HTTP methods that should be retried, represented by using org.springframework.http.HttpMethod.

  • series: The series of status codes to be retried, represented by using
    org.springframework.http.HttpStatus.Series.

  • exceptions: A list of thrown exceptions that should be retried.

  • backoff: The configured exponential backoff for the retries. Retries are performed after a backoff interval of firstBackoff * (factor ^ n), where n is the iteration. If maxBackoff is configured, the maximum backoff applied is limited to maxBackoff. If basedOnPreviousValue is true, the backoff is calculated by using prevBackoff * factor.
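To make the backoff formula concrete, here is a minimal sketch in plain Java. BackoffSketch and delays are hypothetical names of my own, not part of Spring Cloud Gateway. I am assuming that n starts at 0 (so the first retry waits exactly firstBackoff) and that no jitter is applied; the real filter may differ in both details.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Spring Cloud Gateway class.
final class BackoffSketch {

    // Delay before each retry: firstBackoff * factor^n, capped at maxBackoff.
    // Assumption: n starts at 0, so the first retry waits exactly firstBackoff.
    static List<Duration> delays(Duration firstBackoff, Duration maxBackoff,
                                 int factor, int retries) {
        List<Duration> result = new ArrayList<>();
        for (int n = 0; n < retries; n++) {
            long multiplier = (long) Math.pow(factor, n);
            Duration delay = firstBackoff.multipliedBy(multiplier);
            if (maxBackoff != null && delay.compareTo(maxBackoff) > 0) {
                delay = maxBackoff; // never wait longer than maxBackoff
            }
            result.add(delay);
        }
        return result;
    }

    public static void main(String[] args) {
        // firstBackoff 10ms, maxBackoff 50ms, factor 2, four retries
        System.out.println(delays(Duration.ofMillis(10), Duration.ofMillis(50), 2, 4));
    }
}
```

Under these assumptions, firstBackoff 10ms with factor 2 and maxBackoff 50ms gives waits of 10, 20, 40 and 50 milliseconds: the fourth value would be 80ms, but it is capped.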

The following defaults are configured for Retry filter, if enabled:

  • retries: Three times

  • series: 5XX series

  • methods: GET method

  • exceptions: IOException and TimeoutException

  • backoff: disabled

The following listing configures a Retry GatewayFilter:

Example 55. application.yml

spring:
  cloud:
    gateway:
      routes:
      - id: retry_test
        uri: http://localhost:8080/flakey
        predicates:
        - Host=*.retry.com
        filters:
        - name: Retry
          args:
            retries: 3
            statuses: BAD_GATEWAY
            methods: GET,POST
            backoff:
              firstBackoff: 10ms
              maxBackoff: 50ms
              factor: 2
              basedOnPreviousValue: false

The configuration above is the example from the official documentation. It applies the filter to one specific route.

Route configuration parameters are defined in org.springframework.cloud.gateway.config.GatewayProperties.
Routes are built from their definitions by RouteDefinitionRouteLocator, which implements RouteLocator, BeanFactoryAware, and ApplicationEventPublisherAware:

	private Route convertToRoute(RouteDefinition routeDefinition) {
		AsyncPredicate<ServerWebExchange> predicate = combinePredicates(routeDefinition);
		List<GatewayFilter> gatewayFilters = getFilters(routeDefinition);

		return Route.async(routeDefinition).asyncPredicate(predicate)
				.replaceFilters(gatewayFilters).build();
	}

	private List<GatewayFilter> getFilters(RouteDefinition routeDefinition) {
		List<GatewayFilter> filters = new ArrayList<>();

		// TODO: support option to apply defaults after route specific filters?
		if (!this.gatewayProperties.getDefaultFilters().isEmpty()) {
			filters.addAll(loadGatewayFilters(routeDefinition.getId(),
					new ArrayList<>(this.gatewayProperties.getDefaultFilters())));
		}

		if (!routeDefinition.getFilters().isEmpty()) {
			filters.addAll(loadGatewayFilters(routeDefinition.getId(),
					new ArrayList<>(routeDefinition.getFilters())));
		}

		AnnotationAwareOrderComparator.sort(filters);
		return filters;
	}
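
The merge in getFilters can be sketched with plain Java to show what the final sort does. Everything here (FilterOrderSketch, NamedFilter, number, merge) is a hypothetical stand-in, not Spring code. I am also assuming that each list gets its filters numbered 1, 2, 3... in declaration order before the merge, which is how I read loadGatewayFilters; treat that as an assumption rather than a guarantee.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for ordered gateway filters; not a Spring class.
final class FilterOrderSketch {

    static final class NamedFilter {
        final String name;
        final int order;

        NamedFilter(String name, int order) {
            this.name = name;
            this.order = order;
        }
    }

    // Assumption: filters in one list are numbered 1, 2, 3... in declaration order.
    static List<NamedFilter> number(List<String> names) {
        List<NamedFilter> numbered = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            numbered.add(new NamedFilter(names.get(i), i + 1));
        }
        return numbered;
    }

    // Default filters are added first, then route filters, then everything is sorted.
    static List<String> merge(List<String> defaultFilters, List<String> routeFilters) {
        List<NamedFilter> all = new ArrayList<>();
        all.addAll(number(defaultFilters));
        all.addAll(number(routeFilters));
        // List.sort is stable, so on equal order the default filter stays first.
        all.sort(Comparator.comparingInt((NamedFilter f) -> f.order));
        List<String> names = new ArrayList<>();
        for (NamedFilter f : all) {
            names.add(f.name);
        }
        return names;
    }
}
```

With one default filter (Retry) and two route filters, the default filter ends up first. With two default filters and one route filter, the route filter lands between them, because the sort goes by order and not by which list a filter came from.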

The decision whether to retry is made in the Retry filter itself. The relevant code:

ServerWebExchange exchange = context.applicationContext();
if (exceedsMaxIterations(exchange, retryConfig)) {
	return false;
}
// Check the status code first; statuses take priority over series
HttpStatus statusCode = exchange.getResponse().getStatusCode();

boolean retryableStatusCode = retryConfig.getStatuses()
		.contains(statusCode);

// null status code might mean a network exception?
// If the status code is not among the configured statuses, fall back to checking the series
if (!retryableStatusCode && statusCode != null) {
	// try the series
	retryableStatusCode = false;
	for (int i = 0; i < retryConfig.getSeries().size(); i++) {
		if (statusCode.series().equals(retryConfig.getSeries().get(i))) {
			retryableStatusCode = true;
			break;
		}
	}
}

final boolean finalRetryableStatusCode = retryableStatusCode;
trace("retryableStatusCode: %b, statusCode %s, configured statuses %s, configured series %s",
		() -> finalRetryableStatusCode, () -> statusCode,
		retryConfig::getStatuses, retryConfig::getSeries);

// Check whether the HTTP method is one that should be retried
HttpMethod httpMethod = exchange.getRequest().getMethod();
boolean retryableMethod = retryConfig.getMethods().contains(httpMethod);

trace("retryableMethod: %b, httpMethod %s, configured methods %s",
		() -> retryableMethod, () -> httpMethod, retryConfig::getMethods);
// Retry only if both the status code and the request method qualify
return retryableMethod && finalRetryableStatusCode;

The default value of series is SERVER_ERROR, so all 5xx responses are retried. To retry individual 4xx errors, add them through the statuses parameter.
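
The same decision can be written as a small standalone sketch with plain Java types, which makes it easy to try out combinations. RetryDecisionSketch and shouldRetry are hypothetical names of my own. Status series are represented here by their first digit (5 for 5xx), and the retry-count check (exceedsMaxIterations) is left out on purpose.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical re-statement of the status/method check; not the gateway's own class.
final class RetryDecisionSketch {

    // Defaults from the documentation: GET only, 5xx series, no explicit statuses.
    static final Set<String> DEFAULT_METHODS = new HashSet<>(Arrays.asList("GET"));
    static final Set<Integer> DEFAULT_SERIES = new HashSet<>(Arrays.asList(5));
    static final Set<Integer> NO_STATUSES = Collections.emptySet();

    static boolean shouldRetry(Integer statusCode, String method,
                               Set<Integer> statuses, Set<Integer> series,
                               Set<String> methods) {
        // Explicit statuses are checked first.
        boolean retryableStatus = statusCode != null && statuses.contains(statusCode);
        // Otherwise fall back to the series (first digit of the status code).
        if (!retryableStatus && statusCode != null) {
            retryableStatus = series.contains(statusCode / 100);
        }
        boolean retryableMethod = methods.contains(method);
        // Both conditions must hold.
        return retryableMethod && retryableStatus;
    }
}
```

With the defaults, a 503 on GET is retried, a 503 on POST is not, and a 404 on GET is not. Passing a statuses set that contains 404 makes the 404 retryable, and adding POST to the methods makes the 503 on POST retryable.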

Addendum: why I bothered with retry at all

With Eureka, which we used before, no failure retry was configured for services going online and offline. The series of delays involved when a service goes online or offline caused requests to the interfaces exposed through the gateway to fail.

Later, with Nacos, services go online and offline faster. By adjusting a series of timeout and refresh-interval parameters, the window in which API calls fail during that process can be reduced (to 1-10 seconds).

# Reduce the interval so the gateway refreshes its service list quickly and keeps it up to date
# However, a shorter refresh interval means more frequent thread sleep and wake-up, which is not good for efficiency
ribbon.ServerListRefreshInterval=1000

I also looked up other Ribbon parameters and tried to add a retry mechanism that way. Maybe the parameters I set were wrong; in any case, no retry happened.

hystrix.command.default.execution.timeout.enabled=true
hystrix.command.default.execution.isolation.thread.timeoutInMilliseconds=25000
ribbon.ReadTimeout=20000
ribbon.ConnectTimeout=5000
ribbon.MaxAutoRetries=1
ribbon.MaxAutoRetriesNextServer=1

However, for a while after a service goes offline, requests hit neither a connect timeout nor a read timeout: the connection is simply refused.
In addition, if the service is not shut down gracefully (for example with kill -9), the gateway reports the following error:

2021-09-07 11:52:51,446 [reactor-http-epoll-1] TRACE o.s.c.g.f.LoadBalancerClientFilter - LoadBalancerClientFilter url before: lb://xxx/xx-api/ext/wanyee/msg/list?pageNo=30&pageSize=2&beginTime=2020-09-07%2000:00:00&endTime=2021-09-07%2023:59:59&keyword=&suid=&uid=
2021-09-07 11:52:51,446 [reactor-http-epoll-1] TRACE o.s.c.g.f.LoadBalancerClientFilter - LoadBalancerClientFilter url chosen: http://192.168.2.1:9898/xx-api/list?pageNo=30&pageSize=2&beginTime=2020-09-07%2000:00:00&endTime=2021-09-07%2023:59:59&keyword=&suid=&uid=
2021-09-07 11:52:51,450 [reactor-http-epoll-1] ERROR o.s.b.a.w.r.e.AbstractErrorWebExceptionHandler - [dc72bc7f-127]  500 Server Error for HTTP GET "/xx-api/list?pageNo=30&pageSize=2&beginTime=2020-09-07%2000:00:00&endTime=2021-09-07%2023:59:59&keyword=&suid=&uid="
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: connection denied: /192.168.2.56:9198
	Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
Error has been observed at the following site(s):
	|_ checkpoint ⇢ org.springframework.cloud.gateway.filter.WeightCalculatorWebFilter [DefaultWebFilterChain]
	|_ checkpoint ⇢ org.springframework.boot.actuate.metrics.web.reactive.server.MetricsWebFilter [DefaultWebFilterChain]
	|_ checkpoint ⇢ HTTP GET "/ams-api/ext/wanyee/msg/list?pageNo=30&pageSize=2&beginTime=2020-09-07%2000:00:00&endTime=2021-09-07%2023:59:59&keyword=&suid=&uid=" [ExceptionHandlingWebHandler]
Stack trace:
Caused by: java.net.ConnectException: finishConnect(..) failed: connection denied
	at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
	at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
2021-09-07 11:52:51,452 [reactor-http-epoll-1] TRACE o.s.c.g.f.GatewayMetricsFilter - gateway.requests tags: [tag(httpMethod=GET),tag(httpStatusCode=500),tag(outcome=SERVER_ERROR),tag(routeId=ams-api),tag(routeUri=lb://ams),tag(status=INTERNAL_SERVER_ERROR)]

I searched online for the error keywords, but that turned up nothing useful.
Still, Spring surely had to have a solution for this problem, and it does: the retry mechanism.
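
One reason the Retry filter helps in this case: the error in the log is a java.net.ConnectException, which is a subclass of IOException, and IOException is one of the filter's default retryable exceptions. Here is a minimal sketch of that check. RetryableExceptionSketch and its methods are hypothetical names, and the matching rule (the exception itself or its direct cause) is my assumption about how the filter compares exceptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; not the gateway's own class.
final class RetryableExceptionSketch {

    // Default retryable exceptions according to the documentation.
    static List<Class<? extends Throwable>> defaultExceptions() {
        List<Class<? extends Throwable>> types = new ArrayList<>();
        types.add(IOException.class);
        types.add(TimeoutException.class);
        return types;
    }

    // Assumption: an error qualifies if it, or its direct cause,
    // is an instance of one of the configured exception types.
    static boolean isRetryable(Throwable error,
                               List<Class<? extends Throwable>> retryable) {
        for (Class<? extends Throwable> type : retryable) {
            if (type.isInstance(error)
                    || (error != null && type.isInstance(error.getCause()))) {
                return true;
            }
        }
        return false;
    }
}
```

Calling isRetryable(new java.net.ConnectException("refused"), defaultExceptions()) returns true, while an IllegalStateException does not qualify.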

Posted by jdsflash on Tue, 07 Sep 2021 16:40:26 -0700