Kubernetes Deployment best practices

0, Example

First, here is a complete example of Deployment + HPA + PodDisruptionBudget; each part is then explained in detail:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v3
  namespace: prod
  labels:
    app: my-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 10%  # During a rolling update, at most 10% extra Pods may be created at a time
      maxUnavailable: 0  # No unavailable Pods are allowed during the rolling update, i.e. three available replicas must be maintained at all times
  selector:
    matchLabels:
      app: my-app
      version: v3
  template:
    metadata:
      labels:
        app: my-app
        version: v3
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:  # Soft constraint, preferred but not required
          - weight: 100  # weight is used to score nodes; the highest-scoring node is chosen first (meaningless if there is only one rule)
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - my-app
                - key: version
                  operator: In
                  values:
                  - v3
              # Spread the Pods across multiple zones as much as possible
              topologyKey: topology.kubernetes.io/zone
          requiredDuringSchedulingIgnoredDuringExecution:  # Hard requirement (add this only if needed)
          # Note that there is no weight here; every term in the list must be satisfied
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - my-app
              - key: version
                operator: In
                values:
                - v3
            # Pods must run on different nodes
            topologyKey: kubernetes.io/hostname
      securityContext:
        # runAsUser: 1000  # Set user
        # runAsGroup: 1000  # Set user group
        runAsNonRoot: true  # The Pod must run as a non-root user
        seccompProfile:  # security compute mode
          type: RuntimeDefault
      nodeSelector:
        eks.amazonaws.com/nodegroup: common  # Use a dedicated node group; to use multiple node groups, use node affinity instead
      volumes:
      - name: tmp-dir
        emptyDir: {}
      containers:
      - name: my-app-v3
        image: my-app:v3  # A private image registry is recommended to avoid docker.io pull rate limits
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /tmp
          name: tmp-dir
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - "while [ $(netstat -plunt | grep tcp | wc -l | xargs) -ne 0 ]; do sleep 1; done"
        resources:  # Setting requests equal to limits is recommended to avoid resource contention (Guaranteed QoS)
          requests:
            cpu: 1000m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 1Gi
        securityContext:
          # Make the container filesystem read-only to prevent tampering with container files
          ## If temporary files must be written, mount an emptyDir volume to provide writable storage
          readOnlyRootFilesystem: true
          # Forbid any privilege escalation
          allowPrivilegeEscalation: false
          capabilities:
            # Dropping ALL capabilities is strict; relax as needed
            drop:
            - ALL
        startupProbe:  # Requires Kubernetes 1.18+
          httpGet:
            path: /actuator/health  # Just use the health check interface directly
            port: 8080
          periodSeconds: 5
          timeoutSeconds: 1
          failureThreshold: 20  # Allows up to 5s * 20 = 100s for the service to start
          successThreshold: 1
        livenessProbe:
          httpGet:
            path: /actuator/health  # General health check path for spring
            port: 8080
          periodSeconds: 5
          timeoutSeconds: 1
          failureThreshold: 5
          successThreshold: 1
        # Readiness probes are very important for a RollingUpdate to work properly,
        readinessProbe:
          httpGet:
            path: /actuator/health  # For simplicity, you can directly use the same interface as livenessProbe. Of course, you can also define additional interfaces
            port: 8080
          periodSeconds: 5
          timeoutSeconds: 1
          failureThreshold: 5
          successThreshold: 1
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app: my-app
  name: my-app-v3
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-v3
  maxReplicas: 50
  minReplicas: 3
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-v3
  namespace: prod
  labels:
    app: my-app
spec:
  minAvailable: 75%
  selector:
    matchLabels:
      app: my-app
      version: v3

1, Graceful shutdown and 502/504 errors

If a Pod is handling a large volume of requests (e.g. 1000+ QPS) and gets rescheduled due to a node failure or spot-node reclamation,
you may observe a small number of 502/504 errors during the period in which its containers are terminated.

To understand this problem, you need to understand the process of terminating a Pod:

  1. The Pod's status is set to "Terminating", and (almost simultaneously) the Pod is removed from all associated Service Endpoints
  2. The preStop hook is executed; it can be a command or an HTTP call to a container in the Pod
    1. If your program cannot exit gracefully on SIGTERM, consider using preStop
    2. If supporting graceful exit in the program itself is troublesome, preStop is a very good way to implement it
  3. SIGTERM is sent to all containers in the Pod
  4. Kubernetes keeps waiting until the time set in spec.terminationGracePeriodSeconds is exceeded (default 30s)
    1. Note that the graceful-exit wait starts at the same time as preStop; it does not wait for preStop to finish first! (see the sketch after the note below)
  5. If the containers still have not stopped when spec.terminationGracePeriodSeconds is exceeded, Kubernetes sends SIGKILL to them
  6. After all processes have terminated, the whole Pod is cleaned up

Note: steps 1 and 2 happen asynchronously, so the situation "the Pod is still listed in the Service Endpoints, but preStop is already running" can occur and must be taken into account.
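
Because the grace period and preStop run concurrently, a long preStop eats into the time the application has left to react to SIGTERM. A minimal sketch of accounting for this (the 15s/45s values are illustrative assumptions, not part of the manifest above):

    spec:
      # The grace period is counted from the moment termination starts, so it must
      # cover the preStop time plus the application's own shutdown time
      terminationGracePeriodSeconds: 45  # assumed: 15s preStop + up to 30s for the app to drain
      containers:
      - name: my-app
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sleep", "15"]  # wait for the endpoint removal to propagate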

After understanding the above process, we can analyze the causes of two error codes:

  • 502: the application exits immediately on receiving SIGTERM, so in-flight requests are cut off; the proxy layer returns 502 for these requests
  • 504: the Service Endpoints are not removed in time, so after the Pod has already been terminated a few requests are still routed to it and get no response, producing 504

The usual solution is to add a 15s wait in the Pod's preStop hook.
The rationale: once the Pod enters the Terminating state it is removed from the Service Endpoints, so no new requests arrive.
Waiting 15s in preStop is normally enough for all in-flight requests to finish before the container exits (most requests take well under 300ms to process).

A simple example follows: when the Pod is terminated, it always waits 15s before SIGTERM is sent to the container:

    containers:
    - name: my-app
      # Add the following section
      lifecycle:
        preStop:
          exec:
            command:
            - /bin/sleep
            - "15"

A better solution is to wait until all TCP connections are closed (netstat must be available in the image):

    containers:
    - name: my-app
      # Add the following section
      lifecycle:
        preStop:
          exec:
            command:
            - /bin/sh
            - -c
            - "while [ $(netstat -plunt | grep tcp | wc -l | xargs) -ne 0 ]; do sleep 1; done"

2, Node maintenance and PodDisruptionBudget

When we evict the Pods on a node with kubectl drain,
Kubernetes performs the eviction according to each Pod's PodDisruptionBudget (PDB).

If no explicit PodDisruptionBudget is set, the Pods are simply killed and rescheduled onto other nodes, which may cause a service interruption!

A PDB is a separate API resource, for example:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: podinfo-pdb
spec:
  # If the PDB cannot be satisfied, the eviction will fail!
  minAvailable: 1      # At least one Pod should be available
#   maxUnavailable: 1  # Maximum number of unavailable pods
  selector:
    matchLabels:
      app: podinfo

If evicting a Pod during node maintenance (kubectl drain) would violate its PDB, the eviction fails and is retried. Example:

> kubectl drain node-205 --ignore-daemonsets --delete-local-data
node/node-205 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-nfhj7, kube-system/kube-proxy-94dz5
evicting pod default/podinfo-7c84d8c94d-h9brq
evicting pod default/podinfo-7c84d8c94d-gw6qf
error when evicting pod "podinfo-7c84d8c94d-h9brq" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/podinfo-7c84d8c94d-h9brq
error when evicting pod "podinfo-7c84d8c94d-h9brq" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/podinfo-7c84d8c94d-h9brq
error when evicting pod "podinfo-7c84d8c94d-h9brq" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/podinfo-7c84d8c94d-h9brq
pod/podinfo-7c84d8c94d-gw6qf evicted
pod/podinfo-7c84d8c94d-h9brq evicted
node/node-205 evicted

In the example above, podinfo has two replicas, both running on node-205, and the disruption budget is set to minAvailable: 1.

When the Pods are evicted with kubectl drain, one Pod is evicted immediately, while evicting the other keeps failing for about 15 seconds,
because the first Pod has not yet started on a new node, so evicting the second one would violate minAvailable: 1.

About 15 seconds later, the first evicted Pod is running on its new node; the remaining Pod now satisfies the PDB and is finally evicted. This completes the drain of the node.

Node-scaling components such as Cluster Autoscaler also respect PodDisruptionBudgets when scaling nodes down. If your cluster uses dynamic node scaling such as Cluster Autoscaler, it is strongly recommended to set a PodDisruptionBudget for every service.

Best practice: Deployment + HPA + PodDisruptionBudget

Generally speaking, each version of a service should include the following three resources:

  • Deployment: manages the Pods of the service itself
  • HPA: scales the Pods in and out, usually based on a CPU metric
  • PodDisruptionBudget (PDB): it is recommended to derive the PDB from the HPA target value (see the sketch after this list)
    • For example, if the HPA CPU target is 60%, consider setting PDB minAvailable=65% so that at least 65% of the Pods stay available. Then, even in the theoretical worst case where all QPS is spread over the remaining 65% of Pods, there is no avalanche (assuming QPS and CPU scale perfectly linearly)
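
A minimal sketch of such a pairing (the names, namespace, and the 60%/65% values follow the example above and should be adjusted to your own service):

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60  # HPA CPU target
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
  namespace: prod
spec:
  minAvailable: 65%  # slightly above the HPA target, per the rule of thumb above
  selector:
    matchLabels:
      app: my-app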

3, Node affinity and node groups

In a cluster, we usually classify nodes into groups using labels, for example these node labels automatically added by Kubernetes:

  • kubernetes.io/os: usually linux
  • kubernetes.io/arch: amd64, arm64
  • topology.kubernetes.io/region and topology.kubernetes.io/zone: the cloud provider's region and availability zone

We most often use node affinity and Pod anti-affinity; the other strategies are used as appropriate.

1. Node affinity

If you use AWS EKS, there are also AWS-specific node labels:

  • eks.amazonaws.com/nodegroup: the name of the AWS EKS node group; nodes in the same group use the same EC2 launch template
    • For example, an arm64 node group and an amd64/x86_64 node group
    • Node groups with a high memory-to-CPU ratio (e.g. m-series instances) and compute-optimized node groups (e.g. c-series)
    • Spot-instance node groups: cheap, but highly dynamic and may be reclaimed at any time
    • On-demand node groups: expensive but stable

Suppose you want to run your Pods on spot instances by preference, falling back to on-demand instances when spot capacity is temporarily exhausted.
Then nodeSelector cannot express this; you need nodeAffinity. An example follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: xxx
  namespace: xxx
spec:
  # ...
  template:
    # ...
    spec:
      affinity:
        nodeAffinity:
          # Nodes in spot-group-c are preferred
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: eks.amazonaws.com/nodegroup
                operator: In
                values:
                - spot-group-c
            weight: 80  # weight is used to score nodes; the highest-scoring node is chosen first
          # If spot-group-c has no capacity, nodes in ondemand-group-c may also be selected
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: eks.amazonaws.com/nodegroup
                operator: In
                values:
                - spot-group-c
                - ondemand-group-c
      containers:
        # ...

2. Pod anti-affinity

It is generally recommended to configure Pod anti-affinity in every Deployment template to spread the Pods across nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: xxx
  namespace: xxx
spec:
  replicas: 3
  # ...
  template:
    # ...
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:  # Soft constraint, preferred but not required
          - weight: 100  # weight is used to score nodes; the highest-scoring node is chosen first
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - xxx
                - key: version
                  operator: In
                  values:
                  - v12
              # Spread the Pods across multiple zones as much as possible
              topologyKey: topology.kubernetes.io/zone
          requiredDuringSchedulingIgnoredDuringExecution:  # Hard requirement
          # Note that there is no weight here; every term in the list must be satisfied
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - xxx
              - key: version
                operator: In
                values:
                - v12
            # Pods must run on different nodes
            topologyKey: kubernetes.io/hostname

4, Readiness, liveness, and startup probes

Prior to Kubernetes 1.18, a common workaround was to add a long initialDelaySeconds to the readiness and liveness probes to approximate a startup probe, i.e. wait for a slow-starting container before probing it; a sketch of this workaround follows.
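
A minimal sketch of that pre-1.18 workaround (the 60s delay and the /actuator/health path are illustrative assumptions):

containers:
- name: my-app
  image: my-app:v3
  livenessProbe:
    httpGet:
      path: /actuator/health
      port: 8080
    initialDelaySeconds: 60  # wait 60s before the first liveness check so the slow container can finish starting
    periodSeconds: 5
    timeoutSeconds: 1
    failureThreshold: 5
  readinessProbe:
    httpGet:
      path: /actuator/health
      port: 8080
    initialDelaySeconds: 60  # same long delay for the readiness probe
    periodSeconds: 5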

A Pod supports the following three probes; each can use an exec command, an HTTP request, or a TCP socket to check the service.

  • startupProbe (Kubernetes v1.18 [beta]): the readiness and liveness probes only start running after this probe has passed
    • It protects slow-starting containers from being killed before they finish starting up
    • The program gets at most failureThreshold * periodSeconds to start. For example, with failureThreshold=20 and periodSeconds=5 the program has up to 100s to start; if the startup probe still has not passed after 100s, the container is killed.
  • readinessProbe:
    • If the readiness probe fails more than failureThreshold times (3 by default), the Pod is temporarily removed from the Service Endpoints until it passes successThreshold consecutive checks again
  • livenessProbe: detects whether the service is still alive; it can catch deadlocks and kill such containers promptly.
    • Possible causes of liveness failures:
      • The service is deadlocked and not responding to any request
      • All service threads are stuck waiting for external dependencies such as Redis/MySQL, so requests get no response
    • If the liveness probe fails more than failureThreshold times (3 by default), the container is killed and then restarted according to the restart policy.
      • kubectl describe pod will show the restart reason as Last State: Reason: Error, Exit Code: 137, and Events will contain a "Liveness probe failed: ..." message.

All three probe types share the same parameters; the five timing-related parameters and their defaults are:

# The following values are the default values for k8s
initialDelaySeconds: 0  # There is no delay time by default
periodSeconds: 10
timeoutSeconds: 1
failureThreshold: 3
successThreshold: 1

Example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v3
spec:
  # ...
  template:
    #  ...
    spec:
      containers:
      - name: my-app-v3
        image: xxx.com/app/my-app:v3
        imagePullPolicy: IfNotPresent 
        # ... omit several configurations
        startupProbe:
          httpGet:
            path: /actuator/health  # Just use the health check interface directly
            port: 8080
          periodSeconds: 5
          timeoutSeconds: 1
          failureThreshold: 20  # Allows up to 5s * 20 = 100s for the service to start
          successThreshold: 1
        livenessProbe:
          httpGet:
            path: /actuator/health  # General health check path for spring
            port: 8080
          periodSeconds: 5
          timeoutSeconds: 1
          failureThreshold: 5
          successThreshold: 1
        # Readiness probes are very important for a RollingUpdate to work properly,
        readinessProbe:
          httpGet:
            path: /actuator/health  # For simplicity, you can directly use the same interface as livenessProbe. Of course, you can also define additional interfaces
            port: 8080
          periodSeconds: 5
          timeoutSeconds: 1
          failureThreshold: 5
          successThreshold: 1

5, Pod security {#security}

Only Pod-level security parameters are covered here; cluster-wide security policies are out of scope.

1. Pod SecurityContext

By setting the SecurityContext of the Pod, you can set specific security policies for each Pod.

There are two kinds of SecurityContext:

  1. spec.securityContext: a PodSecurityContext object
    • As the name suggests, it applies to all containers in the Pod.
  2. spec.containers[*].securityContext: a SecurityContext object
    • The container's own SecurityContext

The parameters of the two only partially overlap; for the overlapping fields, spec.containers[*].securityContext takes precedence, as the sketch below shows.
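
A minimal sketch of this precedence (the UID values 1000/2000 and the busybox image are illustrative assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: securitycontext-demo
spec:
  securityContext:
    runAsUser: 1000  # Pod-level default, applies to every container
    runAsNonRoot: true
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "id && sleep 3600"]
    securityContext:
      runAsUser: 2000  # container-level value overrides the Pod-level 1000 for this container only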

Commonly encountered security settings that grant extra privileges are:

  1. Privileged container: spec.containers[*].securityContext.privileged
  2. Adding optional Linux capabilities: spec.containers[*].securityContext.capabilities.add
    1. Only a few workloads, such as an NTP time-sync service, should need this; note that it is very dangerous.
  3. Sysctls, i.e. kernel parameters: spec.securityContext.sysctls (see the sketch after this list)
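
A minimal sketch of setting a sysctl through the Pod SecurityContext (net.core.somaxconn and its value are illustrative; sysctls outside the default safe list also require the kubelet's allowed-unsafe-sysctls configuration):

apiVersion: v1
kind: Pod
metadata:
  name: sysctl-demo
spec:
  securityContext:
    sysctls:
    - name: net.core.somaxconn  # raise the listen backlog inside this Pod's network namespace
      value: "1024"
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]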

Security settings that restrict privileges (it is strongly recommended to configure the following on all Pods, as applicable):

  1. spec.volumes: read-only/read-write permissions can be set on all data volumes
  2. spec.securityContext.runAsNonRoot: true: the Pod must run as a non-root user
  3. spec.containers[*].securityContext.readOnlyRootFilesystem: true: makes the container filesystem read-only to prevent tampering with container files.
    1. If the microservice needs to read and write files, it is recommended to mount an additional emptyDir volume.
  4. spec.containers[*].securityContext.allowPrivilegeEscalation: false: the container is not allowed any privilege escalation.
  5. spec.containers[*].securityContext.capabilities.drop: drop optional Linux capabilities
Other settings, such as specifying the container's user and group, are not listed here; consult the Kubernetes documentation.

An example configuration for a stateless microservice Pod:

apiVersion: v1
kind: Pod
metadata:
  name: <Pod name>
spec:
  containers:
  - name: <container name>
    image: <image>
    imagePullPolicy: IfNotPresent 
    # ... omit 500 words here
    securityContext:
      readOnlyRootFilesystem: true  # Make the container filesystem read-only to prevent tampering with container files
      allowPrivilegeEscalation: false  # Forbid any privilege escalation
      capabilities:
        drop:
        # Forbid the container from using raw sockets; they are rarely needed outside of attacks or low-level diagnostics.
        # Raw sockets allow crafting network-layer data directly, bypassing the TCP/UDP stack and operating on raw IP/ICMP packets, enabling IP spoofing, custom protocols, etc.
        # Dropping NET_RAW makes tcpdump unusable inside the container; temporarily remove this setting when you need to capture packets
        - NET_RAW
        # Better option: disable all capabilities directly
        # - ALL
  securityContext:
    # runAsUser: 1000  # Set user
    # runAsGroup: 1000  # Set user group
    runAsNonRoot: true  # The Pod must run as a non-root user
    seccompProfile:  # security compute mode
      type: RuntimeDefault

2. seccomp: security compute mode

seccomp and seccomp-BPF allow filtering of system calls, preventing user binaries from performing dangerous operations against the host OS that they normally do not need. It is somewhat similar to Falco, but seccomp itself has no special awareness of containers.
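
Besides the RuntimeDefault profile used in the example above, a custom profile stored on the node can be referenced. A minimal sketch (the profiles/audit.json path is an illustrative assumption; the file must exist under the kubelet's seccomp directory on every node the Pod may run on):

apiVersion: v1
kind: Pod
metadata:
  name: seccomp-demo
spec:
  securityContext:
    seccompProfile:
      type: Localhost  # load a profile file from the node instead of the runtime default
      localhostProfile: profiles/audit.json  # path relative to the kubelet seccomp directory (assumed)
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]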
