The implementation and use of KeepAlive in jetcd

Keywords: Java less github MySQL

Preface

There are many open source Java clients for etcd. jetcd is the Java client hosted in etcd's official repository. Its overall API design and implementation are similar to those of the official Go client, and it is simple and easy to use. For lease renewal it provides two interfaces: keepAliveOnce and keepAlive. As the name implies, keepAliveOnce renews the lease a single time; to keep a lease alive with it you have to call it repeatedly yourself, so it is rarely used. keepAlive renews the lease automatically and fits most scenarios, but depending on the scenario there are several issues to consider, such as how to set the lease TTL and how to handle keepAlive exceptions.
Jetcd project address: https://github.com/etcd-io/jetcd

Background questions

We have an application that subscribes to data changes through the MySQL binlog, and very important online applications depend on this service. Because the service was a single point of failure, we later used jetcd's lock + keepAlive mechanism to implement failover between the active and standby instances within seconds. For details, see "etcd electing the primary and realizing the secondary switching high availability architecture".
After the system went live, we found that the binlog service often switched between active and standby. Yet the binlog service itself is very stable and had never suffered an online outage before the failover mechanism went live. We eventually found that the problem lay in the TTL setting. Having stated the problem and its cause, we will next look at how jetcd implements keepAlive and then analyze why the problem occurs.

KeepAlive implementation

Let's look at the usage of keepAlive.

    private long acquireActiveLease() throws InterruptedException, ExecutionException {
        long leaseId = leaseClient.grant(leaseTTL).get().getID();
        logger.debug("LeaderSelector get leaseId:[{}] and ttl:[{}]", leaseId, leaseTTL);
        this.leaseCloser = leaseClient.keepAlive(leaseId, new StreamObserver<LeaseKeepAliveResponse>() {
            @Override
            public void onNext(LeaseKeepAliveResponse value) {
                logger.debug("LeaderSelector lease keeps alive for [{}]s:", value.getTTL());
            }
            @Override
            public void onError(Throwable t) {
                // Pass the throwable itself; fillInStackTrace() would overwrite the original stack trace.
                logger.debug("LeaderSelector lease renewal Exception!", t);
                cancelTask();
            }
            @Override
            public void onCompleted() {
                logger.info("LeaderSelector lease renewal completed! start canceling task.");
                cancelTask();
            }
        });
        return leaseId;
    }

The lease implementation lives in the LeaseImpl class. After obtaining the LeaseImpl instance through EtcdClient, first call grant with the TTL to obtain a lease ID, then pass that ID as the first argument to keepAlive. The second argument is an observer object with three callbacks: onNext, triggered after a renewal response once the next renewal time has been determined; onError, triggered when renewal fails; and onCompleted, triggered after the lease expires.

keepAlive method code:

  public synchronized CloseableClient keepAlive(long leaseId, StreamObserver<LeaseKeepAliveResponse> observer) {
    if (this.closed) {
      throw newClosedLeaseClientException();
    }

    KeepAlive keepAlive = this.keepAlives.computeIfAbsent(leaseId, (key) -> new KeepAlive(leaseId));
    keepAlive.addObserver(observer);

    if (!this.hasKeepAliveServiceStarted) {
      this.hasKeepAliveServiceStarted = true;
      this.start();
    }

    return new CloseableClient() {
      @Override
      public void close() {
        keepAlive.removeObserver(observer);
      }
    };
  }

LeaseImpl internally maintains a map with the lease ID as key and a KeepAlive object as value. The KeepAlive class holds a collection of StreamObserver instances, the expiration time deadLine, the next renewal time nextKeepAlive, and the lease ID being renewed. The first time keepAlive is called, start() launches the renewal task (sendKeepAliveExecutor()) and the task that checks for expired leases (deadLineExecutor()).

  private void sendKeepAliveExecutor() {
    this.keepAliveResponseObserver = Observers.observer(
      response -> processKeepAliveResponse(response),
      error -> processOnError()
    );
    this.keepAliveRequestObserver = this.leaseStub.leaseKeepAlive(this.keepAliveResponseObserver);
    this.keepAliveFuture = scheduledExecutorService.scheduleAtFixedRate(
        () -> {
            // send keep alive req to the leases whose next keep alive is before now.
            this.keepAlives.entrySet().stream()
                .filter(entry -> entry.getValue().getNextKeepAlive() < System.currentTimeMillis())
                .map(Entry::getKey)
                .map(leaseId -> LeaseKeepAliveRequest.newBuilder().setID(leaseId).build())
                .forEach(keepAliveRequestObserver::onNext);
        },
        0,
        500,
        TimeUnit.MILLISECONDS
    );
  }

The sendKeepAliveExecutor method is the core of the keepAlive implementation. It is invoked only once per LeaseImpl instance and starts a scheduled task that runs every 500 ms. On each run, the task selects from keepAlives the KeepAlive objects whose nextKeepAlive time is earlier than the current time and sends a renewal request for each of them. nextKeepAlive is initialized to the current time when the KeepAlive instance is created, so a newly added lease is due for renewal almost immediately. The response observer then calls processKeepAliveResponse, which updates nextKeepAlive on the KeepAlive object.
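The selection step of that scheduled task can be tried out in isolation. The following is a minimal sketch, not jetcd's code: it assumes a plain map from lease ID to next renewal time (the class and method names are hypothetical) and applies the same "next renewal time is before now" filter.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Simplified stand-in for the polling step; the names here are hypothetical, not jetcd's.
class DueLeaseFilterSketch {

    // Returns the ids of the leases whose next renewal time is strictly before nowMillis.
    static List<Long> dueLeases(Map<Long, Long> nextKeepAliveByLease, long nowMillis) {
        return nextKeepAliveByLease.entrySet().stream()
            .filter(entry -> entry.getValue() < nowMillis)
            .map(Map.Entry::getKey)
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long now = 10_000L;
        Map<Long, Long> nextKeepAliveByLease = new TreeMap<>();
        nextKeepAliveByLease.put(1L, now - 1);     // already due
        nextKeepAliveByLease.put(2L, now + 2_000); // renewed recently, not due yet
        System.out.println(dueLeases(nextKeepAliveByLease, now)); // prints [1]
    }
}
```

Because the check only runs every 500 ms, a renewal request may go out up to roughly 500 ms after nextKeepAlive has passed.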

  private synchronized void processKeepAliveResponse(io.etcd.jetcd.api.LeaseKeepAliveResponse leaseKeepAliveResponse) {
    if (this.closed) {
      return;
    }
    final long leaseID = leaseKeepAliveResponse.getID();
    final long ttl = leaseKeepAliveResponse.getTTL();
    final KeepAlive ka = this.keepAlives.get(leaseID);
    if (ka == null) {
      // return if the corresponding keep alive has closed.
      return;
    }
    if (ttl > 0) {
      long nextKeepAlive = System.currentTimeMillis() + ttl * 1000 / 3;
      ka.setNextKeepAlive(nextKeepAlive);
      ka.setDeadLine(System.currentTimeMillis() + ttl * 1000);
      ka.onNext(leaseKeepAliveResponse);
    } else {
      // lease expired; close all keep alive
      this.removeKeepAlive(leaseID);
      ka.onError(
          newEtcdException(
            ErrorCode.NOT_FOUND,
            "etcdserver: requested lease not found"
          )
      );
    }
  }

As the code shows, each time a renewal response is processed, nextKeepAlive is set to the current time plus one third of the TTL. In other words, if a lease's TTL is set to 6s, keepAlive renews it about once every 2s. If the returned TTL is not greater than zero, the lease has expired and been deleted, so onError is triggered directly with a "requested lease not found" exception.
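To make the arithmetic concrete, here is a small sketch that evaluates the same expression, ttl * 1000 / 3, for a few TTL values. The class and method names are hypothetical; only the expression comes from processKeepAliveResponse. Note that it is integer division, so the result is truncated.

```java
// Evaluates the renewal delay expression from processKeepAliveResponse for sample TTLs.
class RenewalIntervalSketch {

    // Same expression as in the jetcd snippet: ttl (seconds) * 1000 / 3, using long division.
    static long renewalDelayMillis(long ttlSeconds) {
        return ttlSeconds * 1000 / 3;
    }

    public static void main(String[] args) {
        for (long ttl : new long[] {5, 6, 20}) {
            System.out.println("ttl=" + ttl + "s -> next renewal after "
                + renewalDelayMillis(ttl) + " ms");
        }
        // ttl=5s  -> 1666 ms
        // ttl=6s  -> 2000 ms
        // ttl=20s -> 6666 ms
    }
}
```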

Summary

Back to the frequent active/standby switching of the binlog service described at the beginning: the cause was that we had set the TTL too small, at 5s. Whenever the client is disconnected from the etcd service for more than 5s, during which keepAlive cannot renew the lease for whatever reason, an active/standby switch is triggered. The binlog service itself has no problem at that point, but it has to shut itself down because it has lost leadership. After the TTL was raised to 20s, the active/standby switch became much less sensitive.
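A rough back-of-the-envelope estimate shows why the larger TTL helps. The sketch below assumes the worst case: the connection drops just before the next renewal would have been sent, so one third of the TTL plus up to one 500 ms polling interval has already elapsed since the last successful renewal. It ignores network latency and reconnection time, and the class and method names are hypothetical.

```java
// Estimates how long an outage a lease can survive in the worst case, under the
// assumptions stated above (renewal at ttl/3, 500 ms polling, no network latency).
class OutageToleranceSketch {

    // Polling period of the scheduled renewal task shown earlier.
    static final long POLL_INTERVAL_MILLIS = 500;

    static long worstCaseToleranceMillis(long ttlSeconds) {
        long ttlMillis = ttlSeconds * 1000;
        long renewalDelayMillis = ttlMillis / 3; // same integer division as jetcd's expression
        return ttlMillis - renewalDelayMillis - POLL_INTERVAL_MILLIS;
    }

    public static void main(String[] args) {
        System.out.println("ttl=5s  -> " + worstCaseToleranceMillis(5) + " ms");  // 2834 ms
        System.out.println("ttl=20s -> " + worstCaseToleranceMillis(20) + " ms"); // 12834 ms
    }
}
```

Under these assumptions a 5s TTL leaves less than 3s of slack in the worst case, while a 20s TTL leaves close to 13s.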
In another scenario, keepAlive is also used when etcd serves as a service registry. Even with the TTL set to 20s, renewal can still fail, causing the registered service to expire and be deleted while the service process itself is still healthy. In this scenario, you need to grant a new lease and register a new keepAlive in the onError and onCompleted callbacks.
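The re-registration flow can be sketched as follows. To keep the example self-contained, it does not call jetcd: LeaseOps is a hypothetical abstraction over the grant and keepAlive calls, and FakeLeaseOps is an in-memory stand-in. With jetcd, the onLost callback would be invoked from the observer's onError and onCompleted methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Sketch of "grant a new lease and start a new keepAlive when the old one is lost".
// All types here are hypothetical stand-ins, not jetcd's API.
class LeaseReRegistrationSketch {

    interface LeaseOps {
        long grant(long ttlSeconds);
        // Starts automatic renewal; onLost runs when renewal fails or completes.
        void keepAlive(long leaseId, Runnable onLost);
    }

    static class ReRegisteringLease {
        private final LeaseOps ops;
        private final long ttlSeconds;
        private final Consumer<Long> register; // e.g. re-put the service key under the new lease
        private volatile long currentLeaseId;

        ReRegisteringLease(LeaseOps ops, long ttlSeconds, Consumer<Long> register) {
            this.ops = ops;
            this.ttlSeconds = ttlSeconds;
            this.register = register;
        }

        synchronized void start() {
            currentLeaseId = ops.grant(ttlSeconds);
            register.accept(currentLeaseId);
            ops.keepAlive(currentLeaseId, this::onLeaseLost);
        }

        long currentLeaseId() {
            return currentLeaseId;
        }

        private void onLeaseLost() {
            try {
                start(); // grant a fresh lease and register again
            } catch (RuntimeException e) {
                // Real code would log the failure and retry with a backoff.
            }
        }
    }

    // In-memory fake: hands out increasing lease ids and remembers the last callback.
    static class FakeLeaseOps implements LeaseOps {
        private final AtomicLong nextId = new AtomicLong(100);
        Runnable lastOnLost;

        @Override
        public long grant(long ttlSeconds) {
            return nextId.getAndIncrement();
        }

        @Override
        public void keepAlive(long leaseId, Runnable onLost) {
            lastOnLost = onLost;
        }
    }

    public static void main(String[] args) {
        FakeLeaseOps fake = new FakeLeaseOps();
        List<Long> registered = new ArrayList<>();
        ReRegisteringLease lease = new ReRegisteringLease(fake, 20, registered::add);
        lease.start();
        fake.lastOnLost.run();          // simulate onError / onCompleted
        System.out.println(registered); // prints [100, 101]
    }
}
```

A production version would also need to avoid tight retry loops when etcd is unreachable, for example by delaying the re-grant.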

Posted by nalleyp23 on Wed, 23 Oct 2019 00:24:06 -0700