k8s replicaset controller analysis - core processing logic analysis

Keywords: Kubernetes, architecture, source code analysis

replicaset controller analysis

Introduction to replicaset controller

The replicaset controller is one of the many controllers in the kube-controller-manager component. It is the controller for the replicaset resource object and watches both replicaset and pod resources. When either of these resources changes, the replicaset controller is triggered to tune (reconcile) the corresponding replicaset object so that the actual number of pods matches the expected number of replicas: it creates pods when there are fewer pods than expected and deletes pods when there are more pods than expected.

In short, the replicaset controller compares the number of pods expected by the replicaset object with the number of existing pods, then creates or deletes pods according to the result, so that the number of existing pods finally equals the expected number.

replicaset controller architecture diagram

The general composition and processing flow of the replicaset controller are shown in the figure below. The replicaset controller registers event handlers for pod and replicaset objects. When a watched pod or replicaset event arrives, the corresponding replicaset object is put into the queue; the syncReplicaSet method, which is the core processing logic of the replicaset controller, then takes replicaset objects out of the queue and tunes them.

replicaset controller analysis is divided into three parts:
(1) replicaset controller initialization and startup analysis;
(2) replicaset controller core processing logic analysis;
(3) replicaset controller expectations mechanism analysis.

This blog analyzes the core processing logic of replicaset controller.

Analysis of replicaset controller core processing logic

Based on v1.17.4

From the previous analysis of the initialization and startup of the replicaset controller, we know that the replicaset controller listens for add, update and delete events of replicaset and pod objects and then performs the corresponding tuning of the replicaset objects. Here we analyze the tuning (core processing) logic of the replicaset controller, taking rsc.syncHandler as the entry point.

rsc.syncHandler

rsc.syncHandler is the rsc.syncReplicaSet method. Its main logic is:
(1) Get the replicaset object and the list of pods it owns;
(2) Call rsc.expectations.SatisfiedExpectations to check whether the pod create/delete operations expected in the previous round of tuning of this replicaset have completed, i.e. whether the rsc.manageReplicas call made during the previous tuning of this object has finished its work;
(3) If the creation and deletion of pods expected in the previous round have completed, and the DeletionTimestamp field of the replicaset object is nil, call rsc.manageReplicas to do the core tuning work, i.e. create and delete pods;
(4) Call calculateStatus to calculate the status of the replicaset and update it.

// syncReplicaSet will sync the ReplicaSet with the given key if it has had its expectations fulfilled,
// meaning it did not expect to see any more of its pods created or deleted. This function is not meant to be
// invoked concurrently with the same key.
func (rsc *ReplicaSetController) syncReplicaSet(key string) error {
	startTime := time.Now()
	defer func() {
		klog.V(4).Infof("Finished syncing %v %q (%v)", rsc.Kind, key, time.Since(startTime))
	}()

	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	rs, err := rsc.rsLister.ReplicaSets(namespace).Get(name)
	if errors.IsNotFound(err) {
		klog.V(4).Infof("%v %v has been deleted", rsc.Kind, key)
		rsc.expectations.DeleteExpectations(key)
		return nil
	}
	if err != nil {
		return err
	}

	rsNeedsSync := rsc.expectations.SatisfiedExpectations(key)
	selector, err := metav1.LabelSelectorAsSelector(rs.Spec.Selector)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("error converting pod selector to selector: %v", err))
		return nil
	}

	// list all pods to include the pods that don't match the rs`s selector
	// anymore but has the stale controller ref.
	// TODO: Do the List and Filter in a single pass, or use an index.
	allPods, err := rsc.podLister.Pods(rs.Namespace).List(labels.Everything())
	if err != nil {
		return err
	}
	// Ignore inactive pods.
	filteredPods := controller.FilterActivePods(allPods)

	// NOTE: filteredPods are pointing to objects from cache - if you need to
	// modify them, you need to copy it first.
	filteredPods, err = rsc.claimPods(rs, selector, filteredPods)
	if err != nil {
		return err
	}

	var manageReplicasErr error
	if rsNeedsSync && rs.DeletionTimestamp == nil {
		manageReplicasErr = rsc.manageReplicas(filteredPods, rs)
	}
	rs = rs.DeepCopy()
	newStatus := calculateStatus(rs, filteredPods, manageReplicasErr)

	// Always updates status as pods come up or die.
	updatedRS, err := updateReplicaSetStatus(rsc.kubeClient.AppsV1().ReplicaSets(rs.Namespace), rs, newStatus)
	if err != nil {
		// Multiple things could lead to this update failing. Requeuing the replica set ensures
		// Returning an error causes a requeue without forcing a hotloop
		return err
	}
	// Resync the ReplicaSet after MinReadySeconds as a last line of defense to guard against clock-skew.
	if manageReplicasErr == nil && updatedRS.Spec.MinReadySeconds > 0 &&
		updatedRS.Status.ReadyReplicas == *(updatedRS.Spec.Replicas) &&
		updatedRS.Status.AvailableReplicas != *(updatedRS.Spec.Replicas) {
		rsc.queue.AddAfter(key, time.Duration(updatedRS.Spec.MinReadySeconds)*time.Second)
	}
	return manageReplicasErr
}
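
For reference, the controller.FilterActivePods call above keeps only pods that are neither Succeeded nor Failed and have no DeletionTimestamp set. Below is a lightly abridged sketch of these helpers from pkg/controller/controller_utils.go, shown here for context (details may differ slightly between versions):

// pkg/controller/controller_utils.go (abridged)
func FilterActivePods(pods []*v1.Pod) []*v1.Pod {
	var result []*v1.Pod
	for _, p := range pods {
		if IsPodActive(p) {
			result = append(result, p)
		} else {
			klog.V(4).Infof("Ignoring inactive pod %v/%v in state %v, deletion time %v",
				p.Namespace, p.Name, p.Status.Phase, p.DeletionTimestamp)
		}
	}
	return result
}

func IsPodActive(p *v1.Pod) bool {
	return v1.PodSucceeded != p.Status.Phase &&
		v1.PodFailed != p.Status.Phase &&
		p.DeletionTimestamp == nil
}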

1 rsc.expectations.SatisfiedExpectations

This method determines whether the pod create/delete operations expected in the previous round of tuning of the replicaset have completed; in other words, it checks whether the rsc.manageReplicas call made during the previous tuning of this replicaset object has finished its work. rsc.manageReplicas can only be called again after the previous round of pod creation and deletion has completed.

If rsc.manageReplicas has never been called for this replicaset object, or all the pods expected to be created/deleted in the previous round of tuning have been observed, or the timeout (5 minutes) has elapsed since rsc.manageReplicas was called, the method returns true, meaning the previous round of pod creation and deletion is considered complete and rsc.manageReplicas may be called again; otherwise it returns false.

expectations records the number of pods that the replicaset object expects to create/delete in one round of tuning. As the created/deleted pods are observed, the expected number is decremented accordingly. When the number of pods still expected to be created/deleted is less than or equal to 0, the expectation of the previous round of tuning has been met and true is returned.

The Expectations mechanism will be analyzed in detail later.

// pkg/controller/controller_utils.go
func (r *ControllerExpectations) SatisfiedExpectations(controllerKey string) bool {
	if exp, exists, err := r.GetExpectations(controllerKey); exists {
		if exp.Fulfilled() {
			klog.V(4).Infof("Controller expectations fulfilled %#v", exp)
			return true
		} else if exp.isExpired() {
			klog.V(4).Infof("Controller expectations expired %#v", exp)
			return true
		} else {
			klog.V(4).Infof("Controller still waiting on expectations %#v", exp)
			return false
		}
	} else if err != nil {
		klog.V(2).Infof("Error encountered while checking expectations %#v, forcing sync", err)
	} else {
		// When a new controller is created, it doesn't have expectations.
		// When it doesn't see expected watch events for > TTL, the expectations expire.
		//	- In this case it wakes up, creates/deletes controllees, and sets expectations again.
		// When it has satisfied expectations and no controllees need to be created/destroyed > TTL, the expectations expire.
		//	- In this case it continues without setting expectations till it needs to create/delete controllees.
		klog.V(4).Infof("Controller %v either never recorded expectations, or the ttl expired.", controllerKey)
	}
	// Trigger a sync if we either encountered and error (which shouldn't happen since we're
	// getting from local store) or this controller hasn't established expectations.
	return true
}

func (exp *ControlleeExpectations) isExpired() bool {
	return clock.RealClock{}.Since(exp.timestamp) > ExpectationsTimeout // ExpectationsTimeout = 5 * time.Minute
}
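
For reference, the exp.Fulfilled call used in SatisfiedExpectations simply checks whether both the add and del counters of the expectations record have dropped to 0 or below (abridged from pkg/controller/controller_utils.go, shown here for context):

// pkg/controller/controller_utils.go (abridged)
// Fulfilled returns true if this expectation has been fulfilled.
func (e *ControlleeExpectations) Fulfilled() bool {
	return atomic.LoadInt64(&e.add) <= 0 && atomic.LoadInt64(&e.del) <= 0
}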

2 Core create and delete pod method - rsc.manageReplicas

rsc.manageReplicas is the core method for creating and deleting pods. It compares the number of pods expected by the replicaset with the number of existing pods, then creates or deletes pods according to the result, so that the number of existing pods finally equals the expected number. Note that each call to rsc.manageReplicas creates or deletes at most 500 pods.

rsc.manageReplicas is not necessarily called on every tuning pass of a replicaset object. It is called only when rsc.expectations.SatisfiedExpectations returns true and the DeletionTimestamp of the replicaset object is empty.

First take a brief look at the code; a detailed analysis of its logic follows.

// pkg/controller/replicaset/replica_set.go
func (rsc *ReplicaSetController) manageReplicas(filteredPods []*v1.Pod, rs *apps.ReplicaSet) error {
	diff := len(filteredPods) - int(*(rs.Spec.Replicas))
	rsKey, err := controller.KeyFunc(rs)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("Couldn't get key for %v %#v: %v", rsc.Kind, rs, err))
		return nil
	}
	if diff < 0 {
		diff *= -1
		if diff > rsc.burstReplicas {
			diff = rsc.burstReplicas
		}
		// TODO: Track UIDs of creates just like deletes. The problem currently
		// is we'd need to wait on the result of a create to record the pod's
		// UID, which would require locking *across* the create, which will turn
		// into a performance bottleneck. We should generate a UID for the pod
		// beforehand and store it via ExpectCreations.
		rsc.expectations.ExpectCreations(rsKey, diff)
		glog.V(2).Infof("Too few replicas for %v %s/%s, need %d, creating %d", rsc.Kind, rs.Namespace, rs.Name, *(rs.Spec.Replicas), diff)
		// Batch the pod creates. Batch sizes start at SlowStartInitialBatchSize
		// and double with each successful iteration in a kind of "slow start".
		// This handles attempts to start large numbers of pods that would
		// likely all fail with the same error. For example a project with a
		// low quota that attempts to create a large number of pods will be
		// prevented from spamming the API service with the pod create requests
		// after one of its pods fails.  Conveniently, this also prevents the
		// event spam that those failures would generate.
		successfulCreations, err := slowStartBatch(diff, controller.SlowStartInitialBatchSize, func() error {
			boolPtr := func(b bool) *bool { return &b }
			controllerRef := &metav1.OwnerReference{
				APIVersion:         rsc.GroupVersion().String(),
				Kind:               rsc.Kind,
				Name:               rs.Name,
				UID:                rs.UID,
				BlockOwnerDeletion: boolPtr(true),
				Controller:         boolPtr(true),
			}
			err := rsc.podControl.CreatePodsWithControllerRef(rs.Namespace, &rs.Spec.Template, rs, controllerRef)
			if err != nil && errors.IsTimeout(err) {
				// Pod is created but its initialization has timed out.
				// If the initialization is successful eventually, the
				// controller will observe the creation via the informer.
				// If the initialization fails, or if the pod keeps
				// uninitialized for a long time, the informer will not
				// receive any update, and the controller will create a new
				// pod when the expectation expires.
				return nil
			}
			return err
		})

		// Any skipped pods that we never attempted to start shouldn't be expected.
		// The skipped pods will be retried later. The next controller resync will
		// retry the slow start process.
		if skippedPods := diff - successfulCreations; skippedPods > 0 {
			glog.V(2).Infof("Slow-start failure. Skipping creation of %d pods, decrementing expectations for %v %v/%v", skippedPods, rsc.Kind, rs.Namespace, rs.Name)
			for i := 0; i < skippedPods; i++ {
				// Decrement the expected number of creates because the informer won't observe this pod
				rsc.expectations.CreationObserved(rsKey)
			}
		}
		return err
	} else if diff > 0 {
		if diff > rsc.burstReplicas {
			diff = rsc.burstReplicas
		}
		glog.V(2).Infof("Too many replicas for %v %s/%s, need %d, deleting %d", rsc.Kind, rs.Namespace, rs.Name, *(rs.Spec.Replicas), diff)

		// Choose which Pods to delete, preferring those in earlier phases of startup.
		podsToDelete := getPodsToDelete(filteredPods, diff)

		// Snapshot the UIDs (ns/name) of the pods we're expecting to see
		// deleted, so we know to record their expectations exactly once either
		// when we see it as an update of the deletion timestamp, or as a delete.
		// Note that if the labels on a pod/rs change in a way that the pod gets
		// orphaned, the rs will only wake up after the expectations have
		// expired even if other pods are deleted.
		rsc.expectations.ExpectDeletions(rsKey, getPodKeys(podsToDelete))

		errCh := make(chan error, diff)
		var wg sync.WaitGroup
		wg.Add(diff)
		for _, pod := range podsToDelete {
			go func(targetPod *v1.Pod) {
				defer wg.Done()
				if err := rsc.podControl.DeletePod(rs.Namespace, targetPod.Name, rs); err != nil {
					// Decrement the expected number of deletes because the informer won't observe this deletion
					podKey := controller.PodKey(targetPod)
					glog.V(2).Infof("Failed to delete %v, decrementing expectations for %v %s/%s", podKey, rsc.Kind, rs.Namespace, rs.Name)
					rsc.expectations.DeletionObserved(rsKey, podKey)
					errCh <- err
				}
			}(pod)
		}
		wg.Wait()

		select {
		case err := <-errCh:
			// all errors have been reported before and they're likely to be the same, so we'll only return the first one we hit.
			if err != nil {
				return err
			}
		default:
		}
	}

	return nil
}

diff = number of existing pods - expected number of pods (rs.Spec.Replicas)

diff := len(filteredPods) - int(*(rs.Spec.Replicas))

(1) When the number of existing pods is less than expected (diff < 0), pods need to be created, and the create-pod code block is entered;
(2) When the number of existing pods is more than expected (diff > 0), pods need to be deleted, and the delete-pod code block is entered.
For example, with spec.replicas = 3: five owned pods give diff = 2, so two pods are deleted; one owned pod gives diff = -2, so two pods are created.

The maximum number of pods created or deleted in batch in one sync operation is rsc.burstReplicas, i.e. 500.

// pkg/controller/replicaset/replica_set.go
const (
	// Realistic value of the burstReplica field for the replica set manager based off
	// performance requirements for kubernetes 1.0.
	BurstReplicas = 500

	// The number of times we retry updating a ReplicaSet's status.
	statusUpdateRetries = 1
)
	if diff > rsc.burstReplicas {
		diff = rsc.burstReplicas
	}

Next, let's analyze the create-pod and delete-pod code blocks.

2.1 create pod logic code block

Main logic:
(1) Calculate the number of pods to be created, capped at 500;
(2) Call rsc.expectations.ExpectCreations to record in expectations the number of pods expected to be created in this round of tuning;
(3) Call the slowStartBatch function to handle the pod creation logic;
(4) After slowStartBatch returns, calculate the number of pods whose creation was never attempted or failed, and call rsc.expectations.CreationObserved that many times to decrement the number of pods still expected to be created in this round of tuning (a sketch of that helper follows the code block below).
Why decrement? expectations records the number of pods the replicaset object expects to create/delete in one round of tuning. When a pod is actually created/deleted, the replicaset controller observes the corresponding pod create/delete event and calls rsc.expectations.CreationObserved (or DeletionObserved) to decrement the expected count. For pods whose creation/deletion failed or was never attempted, no such event will ever be observed, so the expected count for this round must be decremented here; otherwise the expected count would never drop to 0 or below, and rsc.expectations.SatisfiedExpectations would only return true after the expectations timeout (5 minutes) expired.

		diff *= -1
		if diff > rsc.burstReplicas {
			diff = rsc.burstReplicas
		}

		rsc.expectations.ExpectCreations(rsKey, diff)
		klog.V(2).Infof("Too few replicas for %v %s/%s, need %d, creating %d", rsc.Kind, rs.Namespace, rs.Name, *(rs.Spec.Replicas), diff)

		successfulCreations, err := slowStartBatch(diff, controller.SlowStartInitialBatchSize, func() error {
			boolPtr := func(b bool) *bool { return &b }
			controllerRef := &metav1.OwnerReference{
				APIVersion:         rsc.GroupVersion().String(),
				Kind:               rsc.Kind,
				Name:               rs.Name,
				UID:                rs.UID,
				BlockOwnerDeletion: boolPtr(true),
				Controller:         boolPtr(true),
			}
			err := rsc.podControl.CreatePodsWithControllerRef(rs.Namespace, &rs.Spec.Template, rs, controllerRef)
			if err != nil && errors.IsTimeout(err) {
				// Pod is created but its initialization has timed out.
				// If the initialization is successful eventually, the
				// controller will observe the creation via the informer.
				// If the initialization fails, or if the pod keeps
				// uninitialized for a long time, the informer will not
				// receive any update, and the controller will create a new
				// pod when the expectation expires.
				return nil
			}
			return err
		})

		if skippedPods := diff - successfulCreations; skippedPods > 0 {
			glog.V(2).Infof("Slow-start failure. Skipping creation of %d pods, decrementing expectations for %v %v/%v", skippedPods, rsc.Kind, rs.Namespace, rs.Name)
			for i := 0; i < skippedPods; i++ {
				// Decrement the expected number of creates because the informer won't observe this pod
				rsc.expectations.CreationObserved(rsKey)
			}
		}
		return err
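
For reference, the CreationObserved call used above simply decrements the add counter of the expectations record by one, via LowerExpectations (abridged from pkg/controller/controller_utils.go, shown here for context; details may differ slightly between versions):

// pkg/controller/controller_utils.go (abridged)
// CreationObserved atomically decrements the `add` expectation count of the given controller.
func (r *ControllerExpectations) CreationObserved(controllerKey string) {
	r.LowerExpectations(controllerKey, 1, 0)
}

// LowerExpectations decrements the expectation counts of the given controller.
func (r *ControllerExpectations) LowerExpectations(controllerKey string, add, del int) {
	if exp, exists, err := r.GetExpectations(controllerKey); err == nil && exists {
		exp.Add(int64(-add), int64(-del))
		klog.V(4).Infof("Lowered expectations %#v", exp)
	}
}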

2.1.1 slowStartBatch

Looking at slowStartBatch, the algorithm for creating pods is:
(1) Pods are created in batches whose sizes are 1, 2, 4, 8, ..., increasing exponentially; within each batch, one goroutine is started per pod to be created, so the number of goroutines equals the batch size;
(2) The batches are executed one after another following this 1, 2, 4, 8, ... progression. If any pod creation in a batch fails (for example because the apiserver rate-limits and drops requests; note: timeouts are excluded, because pod initialization may time out), the remaining batches are skipped and the function returns.

// pkg/controller/replicaset/replica_set.go
// slowStartBatch tries to call the provided function a total of 'count' times,
// starting slow to check for errors, then speeding up if calls succeed.
//
// It groups the calls into batches, starting with a group of initialBatchSize.
// Within each batch, it may call the function multiple times concurrently.
//
// If a whole batch succeeds, the next batch may get exponentially larger.
// If there are any failures in a batch, all remaining batches are skipped
// after waiting for the current batch to complete.
//
// It returns the number of successful calls to the function.
func slowStartBatch(count int, initialBatchSize int, fn func() error) (int, error) {
	remaining := count
	successes := 0
	for batchSize := integer.IntMin(remaining, initialBatchSize); batchSize > 0; batchSize = integer.IntMin(2*batchSize, remaining) {
		errCh := make(chan error, batchSize)
		var wg sync.WaitGroup
		wg.Add(batchSize)
		for i := 0; i < batchSize; i++ {
			go func() {
				defer wg.Done()
				if err := fn(); err != nil {
					errCh <- err
				}
			}()
		}
		wg.Wait()
		curSuccesses := batchSize - len(errCh)
		successes += curSuccesses
		if len(errCh) > 0 {
			return successes, <-errCh
		}
		remaining -= batchSize
	}
	return successes, nil
}
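
To make the batch-size progression concrete, here is a minimal, self-contained sketch (a hypothetical standalone example, not kubernetes source) that only prints the batch sizes slowStartBatch would use. For count = 20 and initialBatchSize = 1 the batches are 1, 2, 4, 8 and finally 5, the last batch being capped by the remaining count:

// Standalone illustration of the slow-start batch sizes (hypothetical example, not k8s source).
package main

import "fmt"

func intMin(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	count, initialBatchSize := 20, 1
	remaining := count
	for batchSize := intMin(remaining, initialBatchSize); batchSize > 0; batchSize = intMin(2*batchSize, remaining) {
		fmt.Println("batch size:", batchSize) // prints 1, 2, 4, 8, 5
		remaining -= batchSize
	}
}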

rsc.podControl.CreatePodsWithControllerRef

The method called above to create pods is rsc.podControl.CreatePodsWithControllerRef (implemented by RealPodControl in pkg/controller/controller_utils.go).

func (r RealPodControl) CreatePodsWithControllerRef(namespace string, template *v1.PodTemplateSpec, controllerObject runtime.Object, controllerRef *metav1.OwnerReference) error {
	if err := validateControllerRef(controllerRef); err != nil {
		return err
	}
	return r.createPods("", namespace, template, controllerObject, controllerRef)
}

func (r RealPodControl) createPods(nodeName, namespace string, template *v1.PodTemplateSpec, object runtime.Object, controllerRef *metav1.OwnerReference) error {
	pod, err := GetPodFromTemplate(template, object, controllerRef)
	if err != nil {
		return err
	}
	if len(nodeName) != 0 {
		pod.Spec.NodeName = nodeName
	}
	if len(labels.Set(pod.Labels)) == 0 {
		return fmt.Errorf("unable to create pods, no labels")
	}
	newPod, err := r.KubeClient.CoreV1().Pods(namespace).Create(pod)
	if err != nil {
		// only send an event if the namespace isn't terminating
		if !apierrors.HasStatusCause(err, v1.NamespaceTerminatingCause) {
			r.Recorder.Eventf(object, v1.EventTypeWarning, FailedCreatePodReason, "Error creating: %v", err)
		}
		return err
	}
	accessor, err := meta.Accessor(object)
	if err != nil {
		klog.Errorf("parentObject does not have ObjectMeta, %v", err)
		return nil
	}
	klog.V(4).Infof("Controller %v created pod %v", accessor.GetName(), newPod.Name)
	r.Recorder.Eventf(object, v1.EventTypeNormal, SuccessfulCreatePodReason, "Created pod: %v", newPod.Name)

	return nil
}
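
For reference, validateControllerRef, called at the start of CreatePodsWithControllerRef, just sanity-checks the owner reference before the pod is created (abridged from pkg/controller/controller_utils.go, shown here for context; details may differ slightly between versions):

// pkg/controller/controller_utils.go (abridged)
func validateControllerRef(controllerRef *metav1.OwnerReference) error {
	if controllerRef == nil {
		return fmt.Errorf("controllerRef is nil")
	}
	if len(controllerRef.APIVersion) == 0 {
		return fmt.Errorf("controllerRef has empty APIVersion")
	}
	if len(controllerRef.Kind) == 0 {
		return fmt.Errorf("controllerRef has empty Kind")
	}
	if controllerRef.Controller == nil || !*controllerRef.Controller {
		return fmt.Errorf("controllerRef.Controller is not set to true")
	}
	if controllerRef.BlockOwnerDeletion == nil || !*controllerRef.BlockOwnerDeletion {
		return fmt.Errorf("controllerRef.BlockOwnerDeletion is not set")
	}
	return nil
}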

2.2 delete pod logic code block

Main logic:
(1) Calculate the number of pods to be deleted, capped at 500;
(2) Based on that number, call the getPodsToDelete function to work out the list of pods to be deleted;
(3) Call rsc.expectations.ExpectDeletions to record in expectations the keys of the pods expected to be deleted in this round of tuning (getPodKeys, sketched after the code block below, converts the pod list into keys);
(4) Start one goroutine for each pod to be deleted, each calling rsc.podControl.DeletePod to delete its pod;
(5) For each pod whose deletion fails, call rsc.expectations.DeletionObserved to decrement the number of pods still expected to be deleted in this round of tuning.
The reason for decrementing is the same as analyzed above for the create-pod code block.
(6) Wait for all goroutines to finish and return.

		if diff > rsc.burstReplicas {
			diff = rsc.burstReplicas
		}
		klog.V(2).Infof("Too many replicas for %v %s/%s, need %d, deleting %d", rsc.Kind, rs.Namespace, rs.Name, *(rs.Spec.Replicas), diff)

		relatedPods, err := rsc.getIndirectlyRelatedPods(rs)
		utilruntime.HandleError(err)

		// Choose which Pods to delete, preferring those in earlier phases of startup.
		podsToDelete := getPodsToDelete(filteredPods, relatedPods, diff)

		rsc.expectations.ExpectDeletions(rsKey, getPodKeys(podsToDelete))

		errCh := make(chan error, diff)
		var wg sync.WaitGroup
		wg.Add(diff)
		for _, pod := range podsToDelete {
			go func(targetPod *v1.Pod) {
				defer wg.Done()
				if err := rsc.podControl.DeletePod(rs.Namespace, targetPod.Name, rs); err != nil {
					// Decrement the expected number of deletes because the informer won't observe this deletion
					podKey := controller.PodKey(targetPod)
					glog.V(2).Infof("Failed to delete %v, decrementing expectations for %v %s/%s", podKey, rsc.Kind, rs.Namespace, rs.Name)
					rsc.expectations.DeletionObserved(rsKey, podKey)
					errCh <- err
				}
			}(pod)
		}
		wg.Wait()

		select {
		case err := <-errCh:
			// all errors have been reported before and they're likely to be the same, so we'll only return the first one we hit.
			if err != nil {
				return err
			}
		default:
		}
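
For reference, getPodKeys, used above when recording deletion expectations, simply converts the pods to be deleted into their namespace/name keys (abridged from pkg/controller/replicaset/replica_set.go, shown here for context):

// pkg/controller/replicaset/replica_set.go (abridged)
// getPodKeys returns a list of pod key strings from an array of pod objects.
func getPodKeys(pods []*v1.Pod) []string {
	podKeys := make([]string, 0, len(pods))
	for _, pod := range pods {
		podKeys = append(podKeys, controller.PodKey(pod))
	}
	return podKeys
}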

2.2.1 getPodsToDelete

getPodsToDelete ranks the list of existing pods according to the rules described below and returns the first diff pods as the pods to be deleted.

// pkg/controller/replicaset/replica_set.go
func getPodsToDelete(filteredPods, relatedPods []*v1.Pod, diff int) []*v1.Pod {
	// No need to sort pods if we are about to delete all of them.
	// diff will always be <= len(filteredPods), so not need to handle > case.
	if diff < len(filteredPods) {
		podsWithRanks := getPodsRankedByRelatedPodsOnSameNode(filteredPods, relatedPods)
		sort.Sort(podsWithRanks)
	}
	return filteredPods[:diff]
}

func getPodsRankedByRelatedPodsOnSameNode(podsToRank, relatedPods []*v1.Pod) controller.ActivePodsWithRanks {
	podsOnNode := make(map[string]int)
	for _, pod := range relatedPods {
		if controller.IsPodActive(pod) {
			podsOnNode[pod.Spec.NodeName]++
		}
	}
	ranks := make([]int, len(podsToRank))
	for i, pod := range podsToRank {
		ranks[i] = podsOnNode[pod.Spec.NodeName]
	}
	return controller.ActivePodsWithRanks{Pods: podsToRank, Rank: ranks}
}

Ranking logic for selecting pods to delete

The pods are ranked by applying the following rules from top to bottom; as soon as one rule distinguishes the two pods being compared, the comparison stops:
(1) Pods not yet bound to a node are deleted first;
(2) Pods in Pending status are deleted first, then Unknown, and finally Running;
(3) Not-ready pods are deleted before ready pods;
(4) Pods are ranked by the number of pods of the same replicaset on the same node; pods on nodes with more such colocated pods are deleted first;
(5) Ready pods are ranked by how long they have been ready; the pod that became ready most recently is deleted first;
(6) Pods whose containers have more restarts are deleted first;
(7) Pods are ranked by creation time; the most recently created pod is deleted first.

// pkg/controller/controller_utils.go
func (s ActivePodsWithRanks) Less(i, j int) bool {
	// 1. Unassigned < assigned
	// If only one of the pods is unassigned, the unassigned one is smaller
	if s.Pods[i].Spec.NodeName != s.Pods[j].Spec.NodeName && (len(s.Pods[i].Spec.NodeName) == 0 || len(s.Pods[j].Spec.NodeName) == 0) {
		return len(s.Pods[i].Spec.NodeName) == 0
	}
	// 2. PodPending < PodUnknown < PodRunning
	if podPhaseToOrdinal[s.Pods[i].Status.Phase] != podPhaseToOrdinal[s.Pods[j].Status.Phase] {
		return podPhaseToOrdinal[s.Pods[i].Status.Phase] < podPhaseToOrdinal[s.Pods[j].Status.Phase]
	}
	// 3. Not ready < ready
	// If only one of the pods is not ready, the not ready one is smaller
	if podutil.IsPodReady(s.Pods[i]) != podutil.IsPodReady(s.Pods[j]) {
		return !podutil.IsPodReady(s.Pods[i])
	}
	// 4. Doubled up < not doubled up
	// If one of the two pods is on the same node as one or more additional
	// ready pods that belong to the same replicaset, whichever pod has more
	// colocated ready pods is less
	if s.Rank[i] != s.Rank[j] {
		return s.Rank[i] > s.Rank[j]
	}
	// TODO: take availability into account when we push minReadySeconds information from deployment into pods,
	//       see https://github.com/kubernetes/kubernetes/issues/22065
	// 5. Been ready for empty time < less time < more time
	// If both pods are ready, the latest ready one is smaller
	if podutil.IsPodReady(s.Pods[i]) && podutil.IsPodReady(s.Pods[j]) {
		readyTime1 := podReadyTime(s.Pods[i])
		readyTime2 := podReadyTime(s.Pods[j])
		if !readyTime1.Equal(readyTime2) {
			return afterOrZero(readyTime1, readyTime2)
		}
	}
	// 6. Pods with containers with higher restart counts < lower restart counts
	if maxContainerRestarts(s.Pods[i]) != maxContainerRestarts(s.Pods[j]) {
		return maxContainerRestarts(s.Pods[i]) > maxContainerRestarts(s.Pods[j])
	}
	// 7. Empty creation time pods < newer pods < older pods
	if !s.Pods[i].CreationTimestamp.Equal(&s.Pods[j].CreationTimestamp) {
		return afterOrZero(&s.Pods[i].CreationTimestamp, &s.Pods[j].CreationTimestamp)
	}
	return false
}
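
For reference, the afterOrZero and podReadyTime helpers used in rules 5 and 7 above compare timestamps, treating a zero time as "after" any non-zero time, which is why pods with an empty ready/creation time sort first (abridged from pkg/controller/controller_utils.go, shown here for context; details may differ slightly between versions):

// pkg/controller/controller_utils.go (abridged)
// afterOrZero checks if time t1 is after time t2; if one of them
// is zero, the zero time is seen as after non-zero time.
func afterOrZero(t1, t2 *metav1.Time) bool {
	if t1.Time.IsZero() || t2.Time.IsZero() {
		return t1.Time.IsZero()
	}
	return t1.After(t2.Time)
}

// podReadyTime returns the LastTransitionTime of the PodReady condition,
// or the zero time if the pod is not ready.
func podReadyTime(pod *v1.Pod) *metav1.Time {
	if podutil.IsPodReady(pod) {
		for _, c := range pod.Status.Conditions {
			// we only care about the pod ready condition
			if c.Type == v1.PodReady && c.Status == v1.ConditionTrue {
				return &c.LastTransitionTime
			}
		}
	}
	return &metav1.Time{}
}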

2.2.2 rsc.podControl.DeletePod

The method used to delete a pod through the apiserver:

// pkg/controller/controller_utils.go
func (r RealPodControl) DeletePod(namespace string, podID string, object runtime.Object) error {
	accessor, err := meta.Accessor(object)
	if err != nil {
		return fmt.Errorf("object does not have ObjectMeta, %v", err)
	}
	klog.V(2).Infof("Controller %v deleting pod %v/%v", accessor.GetName(), namespace, podID)
	if err := r.KubeClient.CoreV1().Pods(namespace).Delete(podID, nil); err != nil && !apierrors.IsNotFound(err) {
		r.Recorder.Eventf(object, v1.EventTypeWarning, FailedDeletePodReason, "Error deleting: %v", err)
		return fmt.Errorf("unable to delete pods: %v", err)
	}
	r.Recorder.Eventf(object, v1.EventTypeNormal, SuccessfulDeletePodReason, "Deleted pod: %v", podID)

	return nil
}

3 calculateStatus

The calculateStatus function calculates and returns the status of the replicaset object.

How is the status calculated?
(1) The Replicas, FullyLabeledReplicas, ReadyReplicas and AvailableReplicas fields of the replicaset status are set from the number of existing pods, the number of pods whose labels match the pod template labels, the number of pods in Ready status, and the number of pods in Available status, respectively;
(2) Depending on the conditions already present in the existing replicaset status and on whether the rsc.manageReplicas call returned an error, a condition of type ReplicaFailure is added to or removed from the status.

When rsc.manageReplicas returns an error and the replicaset status does not yet contain a condition with conditionType ReplicaFailure, such a condition is added, indicating that the replicaset failed to create/delete pods;
when rsc.manageReplicas returns no error and the replicaset status does contain a condition with conditionType ReplicaFailure, that condition is removed, indicating that the replicaset is creating/deleting pods successfully again. calculateStatus is defined in pkg/controller/replicaset/replica_set_utils.go:

func calculateStatus(rs *apps.ReplicaSet, filteredPods []*v1.Pod, manageReplicasErr error) apps.ReplicaSetStatus {
	newStatus := rs.Status
	// Count the number of pods that have labels matching the labels of the pod
	// template of the replica set, the matching pods may have more
	// labels than are in the template. Because the label of podTemplateSpec is
	// a superset of the selector of the replica set, so the possible
	// matching pods must be part of the filteredPods.
	fullyLabeledReplicasCount := 0
	readyReplicasCount := 0
	availableReplicasCount := 0
	templateLabel := labels.Set(rs.Spec.Template.Labels).AsSelectorPreValidated()
	for _, pod := range filteredPods {
		if templateLabel.Matches(labels.Set(pod.Labels)) {
			fullyLabeledReplicasCount++
		}
		if podutil.IsPodReady(pod) {
			readyReplicasCount++
			if podutil.IsPodAvailable(pod, rs.Spec.MinReadySeconds, metav1.Now()) {
				availableReplicasCount++
			}
		}
	}

	failureCond := GetCondition(rs.Status, apps.ReplicaSetReplicaFailure)
	if manageReplicasErr != nil && failureCond == nil {
		var reason string
		if diff := len(filteredPods) - int(*(rs.Spec.Replicas)); diff < 0 {
			reason = "FailedCreate"
		} else if diff > 0 {
			reason = "FailedDelete"
		}
		cond := NewReplicaSetCondition(apps.ReplicaSetReplicaFailure, v1.ConditionTrue, reason, manageReplicasErr.Error())
		SetCondition(&newStatus, cond)
	} else if manageReplicasErr == nil && failureCond != nil {
		RemoveCondition(&newStatus, apps.ReplicaSetReplicaFailure)
	}

	newStatus.Replicas = int32(len(filteredPods))
	newStatus.FullyLabeledReplicas = int32(fullyLabeledReplicasCount)
	newStatus.ReadyReplicas = int32(readyReplicasCount)
	newStatus.AvailableReplicas = int32(availableReplicasCount)
	return newStatus
}
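
For reference, the condition helpers used in calculateStatus manage the Conditions slice of the replicaset status (abridged from pkg/controller/replicaset/replica_set_utils.go, shown here for context; details may differ slightly between versions):

// pkg/controller/replicaset/replica_set_utils.go (abridged)
// GetCondition returns a replicaset condition with the provided type if it exists.
func GetCondition(status apps.ReplicaSetStatus, condType apps.ReplicaSetConditionType) *apps.ReplicaSetCondition {
	for _, c := range status.Conditions {
		if c.Type == condType {
			return &c
		}
	}
	return nil
}

// SetCondition adds/replaces the given condition in the replicaset status. If the condition that we
// are about to add already exists and has the same status and reason, nothing is updated.
// filterOutCondition (not shown) returns a new slice without conditions of the given type.
func SetCondition(status *apps.ReplicaSetStatus, condition apps.ReplicaSetCondition) {
	currentCond := GetCondition(*status, condition.Type)
	if currentCond != nil && currentCond.Status == condition.Status && currentCond.Reason == condition.Reason {
		return
	}
	newConditions := filterOutCondition(status.Conditions, condition.Type)
	status.Conditions = append(newConditions, condition)
}

// RemoveCondition removes the condition with the provided type from the replicaset status.
func RemoveCondition(status *apps.ReplicaSetStatus, condType apps.ReplicaSetConditionType) {
	status.Conditions = filterOutCondition(status.Conditions, condType)
}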

4 updateReplicaSetStatus

Main logic:
(1) Check whether the newly calculated status fields, such as Replicas, FullyLabeledReplicas, ReadyReplicas, AvailableReplicas and Conditions, are the same as those in the existing replicaset status (and whether ObservedGeneration is up to date); if so, return directly without updating;
(2) Otherwise call c.UpdateStatus to update the replicaset status, retrying at most statusUpdateRetries (i.e. 1) additional time if the update fails.

// pkg/controller/replicaset/replica_set_utils.go
func updateReplicaSetStatus(c appsclient.ReplicaSetInterface, rs *apps.ReplicaSet, newStatus apps.ReplicaSetStatus) (*apps.ReplicaSet, error) {
	// This is the steady state. It happens when the ReplicaSet doesn't have any expectations, since
	// we do a periodic relist every 30s. If the generations differ but the replicas are
	// the same, a caller might've resized to the same replica count.
	if rs.Status.Replicas == newStatus.Replicas &&
		rs.Status.FullyLabeledReplicas == newStatus.FullyLabeledReplicas &&
		rs.Status.ReadyReplicas == newStatus.ReadyReplicas &&
		rs.Status.AvailableReplicas == newStatus.AvailableReplicas &&
		rs.Generation == rs.Status.ObservedGeneration &&
		reflect.DeepEqual(rs.Status.Conditions, newStatus.Conditions) {
		return rs, nil
	}

	// Save the generation number we acted on, otherwise we might wrongfully indicate
	// that we've seen a spec update when we retry.
	// TODO: This can clobber an update if we allow multiple agents to write to the
	// same status.
	newStatus.ObservedGeneration = rs.Generation

	var getErr, updateErr error
	var updatedRS *apps.ReplicaSet
	for i, rs := 0, rs; ; i++ {
		klog.V(4).Infof(fmt.Sprintf("Updating status for %v: %s/%s, ", rs.Kind, rs.Namespace, rs.Name) +
			fmt.Sprintf("replicas %d->%d (need %d), ", rs.Status.Replicas, newStatus.Replicas, *(rs.Spec.Replicas)) +
			fmt.Sprintf("fullyLabeledReplicas %d->%d, ", rs.Status.FullyLabeledReplicas, newStatus.FullyLabeledReplicas) +
			fmt.Sprintf("readyReplicas %d->%d, ", rs.Status.ReadyReplicas, newStatus.ReadyReplicas) +
			fmt.Sprintf("availableReplicas %d->%d, ", rs.Status.AvailableReplicas, newStatus.AvailableReplicas) +
			fmt.Sprintf("sequence No: %v->%v", rs.Status.ObservedGeneration, newStatus.ObservedGeneration))

		rs.Status = newStatus
		updatedRS, updateErr = c.UpdateStatus(rs)
		if updateErr == nil {
			return updatedRS, nil
		}
		// Stop retrying if we exceed statusUpdateRetries - the replicaSet will be requeued with a rate limit.
		if i >= statusUpdateRetries {
			break
		}
		// Update the ReplicaSet with the latest resource version for the next poll
		if rs, getErr = c.Get(rs.Name, metav1.GetOptions{}); getErr != nil {
			// If the GET fails we can't trust status.Replicas anymore. This error
			// is bound to be more interesting than the update failure.
			return nil, getErr
		}
	}

	return nil, updateErr
}

c.UpdateStatus

// staging/src/k8s.io/client-go/kubernetes/typed/apps/v1/replicaset.go
func (c *replicaSets) UpdateStatus(replicaSet *v1.ReplicaSet) (result *v1.ReplicaSet, err error) {
	result = &v1.ReplicaSet{}
	err = c.client.Put().
		Namespace(c.ns).
		Resource("replicasets").
		Name(replicaSet.Name).
		SubResource("status").
		Body(replicaSet).
		Do().
		Into(result)
	return
}

summary

replicaset controller architecture diagram

The general composition and processing flow of the replicaset controller are shown in the figure below. The replicaset controller registers event handlers for pod and replicaset objects. When a watched pod or replicaset event arrives, the corresponding replicaset object is put into the queue; the syncReplicaSet method, which is the core processing logic of the replicaset controller, then takes replicaset objects out of the queue and tunes them.

replicaset controller core processing logic

The core processing logic of the replicaset controller compares the number of pods expected by the replicaset object with the number of existing pods. When the expected number is greater than the number of existing pods, the create-pod algorithm is called to create new pods until the expected number is reached; when the expected number is less than the number of existing pods, the delete-pod algorithm is called, which ranks the list of existing pods according to certain rules, selects the redundant pods in order and deletes them until the expected number is reached.

replicaset controller create pod algorithm

The replicaset controller creates pods in multiple batches whose sizes follow the progression 1, 2, 4, 8, ... (at most 500 pods are created in one round of tuning; anything beyond that limit is left for the next round). If any pod creation in a batch fails (for example because the apiserver rate-limits and drops requests; note: timeouts are excluded, because pod initialization may time out), the remaining batches are skipped, and the create-pod algorithm is triggered again only on the next tuning of the replicaset object, until the expected number is reached.

replicaset controller delete pod algorithm

The replicaset controller deletes pods by ranking the list of existing pods according to certain rules, selecting the required number of pods in order, and starting one goroutine per pod to delete it (at most 500 pods are deleted in one round of tuning), then waiting for all goroutines to finish. If some pod deletions fail (for example because the apiserver rate-limits and drops requests) or the number of redundant pods exceeds the 500 limit, the delete-pod algorithm is triggered again only on the next tuning of the replicaset object, until the expected number is reached.

Ranking logic for selecting pods to delete

The pods are ranked by applying the following rules from top to bottom; as soon as one rule distinguishes the two pods being compared, the comparison stops:
(1) Pods not yet bound to a node are deleted first;
(2) Pods in Pending status are deleted first, then Unknown, and finally Running;
(3) Not-ready pods are deleted before ready pods;
(4) Pods are ranked by the number of pods of the same replicaset on the same node; pods on nodes with more such colocated pods are deleted first;
(5) Ready pods are ranked by how long they have been ready; the pod that became ready most recently is deleted first;
(6) Pods whose containers have more restarts are deleted first;
(7) Pods are ranked by creation time; the most recently created pod is deleted first.

expectations mechanism

The expectations mechanism will be analyzed in the next blog.
