Microservices source code analysis: the Nacos health check is this simple

Preface

We have mentioned the Nacos health check many times. For example, the article "Micro service: the service is too straightforward, and Nacos hasn't responded yet. What should we do?" customized and tuned the health check. So how are Nacos's health check and heartbeat mechanism implemented? And can the same mechanism be reused elsewhere in project practice?

This article walks through the source code behind the Nacos health check mechanism.

Nacos health check

Temporary (ephemeral) instances in Nacos stay alive by reporting heartbeats. The basic health check process is as follows: the Nacos client maintains a scheduled task that sends a heartbeat request every 5 seconds. If the Nacos server receives no heartbeat from the client for more than 15 seconds, it marks the instance as unhealthy. If it receives no heartbeat for more than 30 seconds, it removes the temporary instance.
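
The three time values can be read as a small state function over "time since the last heartbeat". The sketch below is only an illustration of that rule, not Nacos code: the class HeartbeatTimeline, the State enum and the classify method are hypothetical names, and the thresholds are the default values quoted above.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Nacos: classifies an instance by heartbeat age.
final class HeartbeatTimeline {
    // Default thresholds quoted in the text (15 s and 30 s).
    static final long HEART_BEAT_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(15);
    static final long IP_DELETE_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(30);

    enum State { HEALTHY, UNHEALTHY, REMOVED }

    // Returns the state the server would assign after the given silence.
    static State classify(long lastBeatMillis, long nowMillis) {
        long silence = nowMillis - lastBeatMillis;
        if (silence > IP_DELETE_TIMEOUT_MS) {
            return State.REMOVED;
        }
        if (silence > HEART_BEAT_TIMEOUT_MS) {
            return State.UNHEALTHY;
        }
        return State.HEALTHY;
    }

    public static void main(String[] args) {
        long lastBeat = 0L;
        for (long now : new long[] {4_000L, 16_000L, 31_000L}) {
            System.out.println(now / 1000 + "s of silence -> " + classify(lastBeat, now));
        }
    }
}
```

With a heartbeat every 5 seconds, a healthy client never gets close to the 15-second threshold, so a few lost heartbeats are tolerated before the instance is flagged.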

The principle is simple. Let's analyze the code-level implementation step by step.

Client heartbeat

An instance stays alive by reporting heartbeats, so the analysis starts with the client-side heartbeat implementation.

Spring Cloud provides a standard interface ServiceRegistry, and the corresponding implementation class of Nacos is NacosServiceRegistry. When the Spring Cloud project starts, it will instantiate the NacosServiceRegistry and call its register method to register the instance.

@Override
public void register(Registration registration) { 
   // ...
   NamingService namingService = namingService();
   String serviceId = registration.getServiceId();
   String group = nacosDiscoveryProperties.getGroup();

   Instance instance = getNacosInstanceFromRegistration(registration);

   try {
      namingService.registerInstance(serviceId, group, instance);
      log.info("nacos registry, {} {} {}:{} register finished", group, serviceId,
            instance.getIp(), instance.getPort());
   } catch (Exception e) {
      // ...
   }
}

There are two points to note in this method. The first is the getNacosInstanceFromRegistration method, which builds the Instance and sets its metadata. The parameters of the server-side health check can be configured through this Instance metadata. For example, the following parameters configured in Spring Cloud are passed to the Nacos server as metadata items during service registration.

spring:
  application:
    name: user-service-provider
  cloud:
    nacos:
      discovery:
        server-addr: 127.0.0.1:8848
        heart-beat-interval: 5000
        heart-beat-timeout: 15000
        ip-delete-timeout: 30000

The health check parameters heart-beat-interval, heart-beat-timeout and ip-delete-timeout are all reported as metadata.

The second point is the call to NamingService#registerInstance, which registers the instance. NamingService is provided by the Nacos client, that is, the client heartbeat is implemented within Nacos itself.

The call eventually reaches the following registerInstance implementation:

@Override
public void registerInstance(String serviceName, String groupName, Instance instance) throws NacosException {
    NamingUtils.checkInstanceIsLegal(instance);
    String groupedServiceName = NamingUtils.getGroupedName(serviceName, groupName);
    if (instance.isEphemeral()) {
        BeatInfo beatInfo = beatReactor.buildBeatInfo(groupedServiceName, instance);
        beatReactor.addBeatInfo(groupedServiceName, beatInfo);
    }
    serverProxy.registerService(groupedServiceName, groupName, instance);
}

BeatReactor#addBeatInfo is the entry point for heartbeat processing. The precondition is that the current instance is a temporary (ephemeral) instance.

The method is implemented as follows:

public void addBeatInfo(String serviceName, BeatInfo beatInfo) {
    NAMING_LOGGER.info("[BEAT] adding beat: {} to beat map.", beatInfo);
    String key = buildKey(serviceName, beatInfo.getIp(), beatInfo.getPort());
    BeatInfo existBeat = null;
    //fix #1733
    if ((existBeat = dom2Beat.remove(key)) != null) {
        existBeat.setStopped(true);
    }
    dom2Beat.put(key, beatInfo);
    executorService.schedule(new BeatTask(beatInfo), beatInfo.getPeriod(), TimeUnit.MILLISECONDS);
    MetricsMonitor.getDom2BeatSizeMonitor().set(dom2Beat.size());
}

In the penultimate line, you can see that the client handles the heartbeat with a scheduled task, and the actual heartbeat request is sent by BeatTask. The delay of the scheduled task is stored in BeatInfo. Tracing back, the period of BeatInfo comes from Instance#getInstanceHeartBeatInterval(), which is implemented as follows:

public long getInstanceHeartBeatInterval() {
    return this.getMetaDataByKeyWithDefault("preserved.heart.beat.interval", Constants.DEFAULT_HEART_BEAT_INTERVAL);
}

As you can see, the interval of the scheduled task is the preserved.heart.beat.interval value in the configured metadata, which is the same setting as the heart-beat-interval configuration above. The default is 5 seconds.
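
To make the "metadata value, otherwise default" lookup concrete, here is a minimal sketch of the idea. It is a stand-in, not the Nacos Instance class: MetadataDefaults, getLongOrDefault and heartBeatIntervalMillis are hypothetical names, and falling back to the default for a missing or non-numeric value is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the metadata lookup; not the Nacos Instance class.
final class MetadataDefaults {
    static final String HEART_BEAT_INTERVAL_KEY = "preserved.heart.beat.interval";
    static final long DEFAULT_HEART_BEAT_INTERVAL_MS = 5_000L;

    // Reads a long from the metadata map, or returns the default value.
    static long getLongOrDefault(Map<String, String> metadata, String key, long defaultValue) {
        if (metadata == null) {
            return defaultValue;
        }
        String raw = metadata.get(key);
        if (raw == null || raw.trim().isEmpty()) {
            return defaultValue;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return defaultValue; // assumption: malformed values fall back to the default
        }
    }

    static long heartBeatIntervalMillis(Map<String, String> metadata) {
        return getLongOrDefault(metadata, HEART_BEAT_INTERVAL_KEY, DEFAULT_HEART_BEAT_INTERVAL_MS);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        System.out.println("no metadata: " + heartBeatIntervalMillis(metadata));
        metadata.put(HEART_BEAT_INTERVAL_KEY, "3000");
        System.out.println("configured:  " + heartBeatIntervalMillis(metadata));
    }
}
```

Keeping the interval in metadata means the same value travels with the instance to the server at registration time, so no separate configuration channel is needed.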

The implementation of the BeatTask class is as follows:

class BeatTask implements Runnable {
    
    BeatInfo beatInfo;
    
    public BeatTask(BeatInfo beatInfo) {
        this.beatInfo = beatInfo;
    }
    
    @Override
    public void run() {
        if (beatInfo.isStopped()) {
            return;
        }
        long nextTime = beatInfo.getPeriod();
        try {
            JsonNode result = serverProxy.sendBeat(beatInfo, BeatReactor.this.lightBeatEnabled);
            long interval = result.get("clientBeatInterval").asLong();
            boolean lightBeatEnabled = false;
            if (result.has(CommonParams.LIGHT_BEAT_ENABLED)) {
                lightBeatEnabled = result.get(CommonParams.LIGHT_BEAT_ENABLED).asBoolean();
            }
            BeatReactor.this.lightBeatEnabled = lightBeatEnabled;
            if (interval > 0) {
                nextTime = interval;
            }
            int code = NamingResponseCode.OK;
            if (result.has(CommonParams.CODE)) {
                code = result.get(CommonParams.CODE).asInt();
            }
            if (code == NamingResponseCode.RESOURCE_NOT_FOUND) {
                Instance instance = new Instance();
                instance.setPort(beatInfo.getPort());
                instance.setIp(beatInfo.getIp());
                instance.setWeight(beatInfo.getWeight());
                instance.setMetadata(beatInfo.getMetadata());
                instance.setClusterName(beatInfo.getCluster());
                instance.setServiceName(beatInfo.getServiceName());
                instance.setInstanceId(instance.getInstanceId());
                instance.setEphemeral(true);
                try {
                    serverProxy.registerService(beatInfo.getServiceName(),
                            NamingUtils.getGroupName(beatInfo.getServiceName()), instance);
                } catch (Exception ignore) {
                }
            }
        } catch (NacosException ex) {
            NAMING_LOGGER.error("[CLIENT-BEAT] failed to send beat: {}, code: {}, msg: {}",
                    JacksonUtils.toJson(beatInfo), ex.getErrCode(), ex.getErrMsg());
            
        }
        executorService.schedule(new BeatTask(beatInfo), nextTime, TimeUnit.MILLISECONDS);
    }
}

In the run method, the heartbeat request is sent through NamingProxy#sendBeat. At the end of the run method, a new scheduled task is submitted, so heartbeat requests are sent periodically.

The NamingProxy#sendBeat method is implemented as follows:
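
The "one-shot task that schedules its own successor" pattern can be isolated from the Nacos classes. The sketch below only mimics that pattern under simplified assumptions: SelfReschedulingBeat and runUntil are hypothetical names, incrementing a counter stands in for the real heartbeat request, and the period is fixed instead of being adjusted from the server response.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; it mimics the rescheduling pattern only and is not Nacos code.
final class SelfReschedulingBeat {
    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger beatsSent = new AtomicInteger();
    private final CountDownLatch reached;
    private final long periodMillis;
    private volatile boolean stopped;

    private SelfReschedulingBeat(int targetBeats, long periodMillis) {
        this.reached = new CountDownLatch(targetBeats);
        this.periodMillis = periodMillis;
    }

    // One beat: do the work, then submit the next one-shot task.
    private void beat() {
        if (stopped) {
            return; // a stopped beat ends the chain, like the stopped flag on BeatInfo
        }
        beatsSent.incrementAndGet(); // stands in for the real heartbeat request
        reached.countDown();
        try {
            executor.schedule(this::beat, periodMillis, TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException ignored) {
            // The executor was shut down while this beat was running.
        }
    }

    // Runs until at least targetBeats beats were sent, then stops and returns the count.
    static int runUntil(int targetBeats, long periodMillis) {
        SelfReschedulingBeat demo = new SelfReschedulingBeat(targetBeats, periodMillis);
        demo.executor.schedule(demo::beat, periodMillis, TimeUnit.MILLISECONDS);
        try {
            if (!demo.reached.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("beats did not arrive in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            demo.stopped = true;
            demo.executor.shutdownNow();
        }
        return demo.beatsSent.get();
    }

    public static void main(String[] args) {
        System.out.println("beats sent: " + runUntil(3, 10L));
    }
}
```

Compared with a fixed-rate schedule, rescheduling after each run lets the delay of the next beat change between runs, which is what BeatTask does when the server returns a different clientBeatInterval.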

public JsonNode sendBeat(BeatInfo beatInfo, boolean lightBeatEnabled) throws NacosException {
    
    if (NAMING_LOGGER.isDebugEnabled()) {
        NAMING_LOGGER.debug("[BEAT] {} sending beat to server: {}", namespaceId, beatInfo.toString());
    }
    Map<String, String> params = new HashMap<String, String>(8);
    Map<String, String> bodyMap = new HashMap<String, String>(2);
    if (!lightBeatEnabled) {
        bodyMap.put("beat", JacksonUtils.toJson(beatInfo));
    }
    params.put(CommonParams.NAMESPACE_ID, namespaceId);
    params.put(CommonParams.SERVICE_NAME, beatInfo.getServiceName());
    params.put(CommonParams.CLUSTER_NAME, beatInfo.getCluster());
    params.put("ip", beatInfo.getIp());
    params.put("port", String.valueOf(beatInfo.getPort()));
    String result = reqApi(UtilAndComs.nacosUrlBase + "/instance/beat", params, bodyMap, HttpMethod.PUT);
    return JacksonUtils.toObj(result);
}

In effect, it calls the /nacos/v1/ns/instance/beat endpoint provided by the Nacos server.

The default parameters related to heartbeat are defined in the constant class Constants of the client:

static {
    DEFAULT_HEART_BEAT_TIMEOUT = TimeUnit.SECONDS.toMillis(15L);
    DEFAULT_IP_DELETE_TIMEOUT = TimeUnit.SECONDS.toMillis(30L);
    DEFAULT_HEART_BEAT_INTERVAL = TimeUnit.SECONDS.toMillis(5L);
}

These match the time thresholds of the Nacos health check mechanism described at the beginning.

Server receiving heartbeat

While analyzing the client, we saw that the requested endpoint is /nacos/v1/ns/instance/beat. On the Nacos server, it is implemented in InstanceController in the Naming project.

@CanDistro
@PutMapping("/beat")
@Secured(parser = NamingResourceParser.class, action = ActionTypes.WRITE)
public ObjectNode beat(HttpServletRequest request) throws Exception {

    // ...
    Instance instance = serviceManager.getInstance(namespaceId, serviceName, clusterName, ip, port);

    if (instance == null) {
        // ...
        instance = new Instance();
        instance.setPort(clientBeat.getPort());
        instance.setIp(clientBeat.getIp());
        instance.setWeight(clientBeat.getWeight());
        instance.setMetadata(clientBeat.getMetadata());
        instance.setClusterName(clusterName);
        instance.setServiceName(serviceName);
        instance.setInstanceId(instance.getInstanceId());
        instance.setEphemeral(clientBeat.isEphemeral());

        serviceManager.registerInstance(namespaceId, serviceName, instance);
    }

    Service service = serviceManager.getService(namespaceId, serviceName);
    // ...
    service.processClientBeat(clientBeat);
    // ...
    return result;
}

When the server receives the request, it mainly does two things: first, if the instance sending the heartbeat does not exist, it registers it; second, it calls the processClientBeat method of the instance's Service to process the heartbeat.

The processClientBeat method is implemented as follows:

public void processClientBeat(final RsInfo rsInfo) {
    ClientBeatProcessor clientBeatProcessor = new ClientBeatProcessor();
    clientBeatProcessor.setService(this);
    clientBeatProcessor.setRsInfo(rsInfo);
    HealthCheckReactor.scheduleNow(clientBeatProcessor);
}

ClientBeatProcessor is also a Runnable Task, which is executed immediately through the scheduleNow method defined by HealthCheckReactor.

scheduleNow method implementation:

public static ScheduledFuture<?> scheduleNow(Runnable task) {
    return GlobalExecutor.scheduleNamingHealth(task, 0, TimeUnit.MILLISECONDS);
}

Let's take a look at the implementation of specific tasks in ClientBeatProcessor:

@Override
public void run() {
    Service service = this.service;
    // logging    
    String ip = rsInfo.getIp();
    String clusterName = rsInfo.getCluster();
    int port = rsInfo.getPort();
    Cluster cluster = service.getClusterMap().get(clusterName);
    List<Instance> instances = cluster.allIPs(true);
    
    for (Instance instance : instances) {
        if (instance.getIp().equals(ip) && instance.getPort() == port) {
            // logging
            instance.setLastBeat(System.currentTimeMillis());
            if (!instance.isMarked()) {
                if (!instance.isHealthy()) {
                    instance.setHealthy(true);
                    // logging
                    getPushService().serviceChanged(service);
                }
            }
        }
    }
}

The run method iterates over the instances of the cluster and looks for the one whose IP and port match the instance that sent the heartbeat. For a match, it updates the last heartbeat time. In addition, if the instance is not marked and is currently unhealthy, it is set to healthy and the change is published through the event mechanism provided by PushService. The event is a ServiceChangeEvent, published through Spring's ApplicationContext.

Through the heartbeat handling above, the health status and last heartbeat time of the instance on the Nacos server are refreshed. So how does the server decide what to do when no heartbeat arrives?
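
The decision made for each incoming heartbeat can be reduced to a few lines. The following sketch is a simplified model rather than the Nacos implementation: BeatReceiver, Node and onBeat are hypothetical names, and the boolean return value stands in for "a change event should be published".

```java
import java.util.List;

// Hypothetical simplified model of heartbeat handling; not the Nacos ClientBeatProcessor.
final class BeatReceiver {
    static final class Node {
        final String ip;
        final int port;
        long lastBeat;
        boolean healthy;
        boolean marked;

        Node(String ip, int port, boolean healthy, boolean marked) {
            this.ip = ip;
            this.port = port;
            this.healthy = healthy;
            this.marked = marked;
        }
    }

    // Refreshes the matching node and returns true if its health status changed.
    static boolean onBeat(List<Node> nodes, String ip, int port, long nowMillis) {
        boolean changed = false;
        for (Node node : nodes) {
            if (node.ip.equals(ip) && node.port == port) {
                node.lastBeat = nowMillis; // always refresh the timestamp
                if (!node.marked && !node.healthy) {
                    node.healthy = true;   // an unmarked, unhealthy node recovers
                    changed = true;
                }
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Node node = new Node("10.0.0.1", 8080, false, false);
        boolean changed = onBeat(List.of(node), "10.0.0.1", 8080, System.currentTimeMillis());
        System.out.println("status changed: " + changed + ", healthy: " + node.healthy);
    }
}
```

Only the transition from unhealthy to healthy reports a change, so regular heartbeats from an already healthy instance do not trigger any push to subscribers.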

Server heartbeat check

The client initiates the heartbeat, and the server checks whether the client's heartbeat is normal, that is, whether the last heartbeat time recorded on the corresponding instance is recent enough.

The server-side heartbeat check is triggered when a service instance is registered. Registration is also implemented in InstanceController, in the register method:

@CanDistro
@PostMapping
@Secured(parser = NamingResourceParser.class, action = ActionTypes.WRITE)
public String register(HttpServletRequest request) throws Exception {
    // ...
    final Instance instance = parseInstance(request);

    serviceManager.registerInstance(namespaceId, serviceName, instance);
    return "ok";
}

The implementation code of ServiceManager#registerInstance is as follows:

public void registerInstance(String namespaceId, String serviceName, Instance instance) throws NacosException {
    
    createEmptyService(namespaceId, serviceName, instance.isEphemeral());
    // ...
}

The heartbeat-related logic is set up when the empty Service is created for the first time. The call eventually reaches the following method:

public void createServiceIfAbsent(String namespaceId, String serviceName, boolean local, Cluster cluster)
        throws NacosException {
    Service service = getService(namespaceId, serviceName);
    if (service == null) {
        
        Loggers.SRV_LOG.info("creating empty service {}:{}", namespaceId, serviceName);
        service = new Service();
        service.setName(serviceName);
        service.setNamespaceId(namespaceId);
        service.setGroupName(NamingUtils.getGroupName(serviceName));
        // now validate the service. if failed, exception will be thrown
        service.setLastModifiedMillis(System.currentTimeMillis());
        service.recalculateChecksum();
        if (cluster != null) {
            cluster.setService(service);
            service.getClusterMap().put(cluster.getName(), cluster);
        }
        service.validate();
        
        putServiceAndInit(service);
        if (!local) {
            addOrReplaceService(service);
        }
    }
}

Initialize the Service in the putServiceAndInit method:

private void putServiceAndInit(Service service) throws NacosException {
    putService(service);
    service = getService(service.getNamespaceId(), service.getName());
    service.init();
    consistencyService
            .listen(KeyBuilder.buildInstanceListKey(service.getNamespaceId(), service.getName(), true), service);
    consistencyService
            .listen(KeyBuilder.buildInstanceListKey(service.getNamespaceId(), service.getName(), false), service);
    Loggers.SRV_LOG.info("[NEW-SERVICE] {}", service.toJson());
}

service.init() method implementation:

public void init() {
    HealthCheckReactor.scheduleCheck(clientBeatCheckTask);
    for (Map.Entry<String, Cluster> entry : clusterMap.entrySet()) {
        entry.getValue().setService(this);
        entry.getValue().init();
    }
}

HealthCheckReactor#scheduleCheck method implementation:

public static void scheduleCheck(ClientBeatCheckTask task) {
    futureMap.computeIfAbsent(task.taskKey(),
            k -> GlobalExecutor.scheduleNamingHealth(task, 5000, 5000, TimeUnit.MILLISECONDS));
}

The first check runs after a 5-second delay, and the check then repeats every 5 seconds.

In the first line of the init method, you can see the task that performs the health check. The task is implemented by ClientBeatCheckTask. The core code of its run method is as follows:
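
The computeIfAbsent call is what keeps one periodic check per task key, no matter how often scheduleCheck is invoked. The sketch below illustrates that idea in isolation: CheckScheduler and its methods are hypothetical names, and using scheduleWithFixedDelay on a single-threaded executor is an assumption of this sketch, not a statement about GlobalExecutor.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of per-key deduplicated scheduling; not the Nacos HealthCheckReactor.
final class CheckScheduler {
    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    private final Map<String, ScheduledFuture<?>> futures = new ConcurrentHashMap<>();

    // Schedules the task only if the key is new; returns true if this call scheduled it.
    boolean scheduleCheck(String taskKey, Runnable task, long periodMillis) {
        boolean[] created = {false};
        futures.computeIfAbsent(taskKey, key -> {
            created[0] = true;
            return executor.scheduleWithFixedDelay(task, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        });
        return created[0];
    }

    // Cancels and forgets the periodic check for the key, if there is one.
    void cancelCheck(String taskKey) {
        ScheduledFuture<?> future = futures.remove(taskKey);
        if (future != null) {
            future.cancel(true);
        }
    }

    int scheduledCount() {
        return futures.size();
    }

    void shutdown() {
        executor.shutdownNow();
    }

    public static void main(String[] args) {
        CheckScheduler scheduler = new CheckScheduler();
        Runnable check = () -> System.out.println("checking heartbeats");
        System.out.println("first call scheduled:  " + scheduler.scheduleCheck("service-a", check, 5_000L));
        System.out.println("second call scheduled: " + scheduler.scheduleCheck("service-a", check, 5_000L));
        scheduler.shutdown();
    }
}
```

Storing the returned future under the task key also gives a handle for cancelling the check later, for example when the service is removed.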

@Override
public void run() {
    // ...        
    List<Instance> instances = service.allIPs(true);
    
    // first set health status of instances:
    for (Instance instance : instances) {
        if (System.currentTimeMillis() - instance.getLastBeat() > instance.getInstanceHeartBeatTimeOut()) {
            if (!instance.isMarked()) {
                if (instance.isHealthy()) {
                    instance.setHealthy(false);
                    // logging...
                    getPushService().serviceChanged(service);
                    ApplicationUtils.publishEvent(new InstanceHeartbeatTimeoutEvent(this, instance));
                }
            }
        }
    }
    
    if (!getGlobalConfig().isExpireInstance()) {
        return;
    }
    
    // then remove obsolete instances:
    for (Instance instance : instances) {
        
        if (instance.isMarked()) {
            continue;
        }
        
        if (System.currentTimeMillis() - instance.getLastBeat() > instance.getIpDeleteTimeout()) {
            // delete instance
            deleteIp(instance);
        }
    }
}

The first for loop checks whether the interval between the current time and the last heartbeat time is greater than the heartbeat timeout. If the instance has timed out, is not marked, and is currently healthy, its status is set to unhealthy and the state change event is published.

The second for loop only runs if instance expiration is enabled in the global configuration (isExpireInstance). In it, an instance that has been marked is skipped. If it is not marked and the interval between the current time and the last heartbeat time is greater than the IP delete timeout, the instance is deleted.

Summary

In this article we started from Spring Cloud, traced the heartbeat scheduling in the Nacos client, and then followed how the Nacos server receives heartbeats and checks whether instances are healthy. After walking through the source code, you should have a clear picture of how the Nacos heartbeat is implemented.

Posted by Paul Moran on Tue, 07 Dec 2021 00:38:09 -0800