0. Example
Let's start with a complete demo of a Deployment + HPA + PodDisruptionBudget, and then introduce each part in detail:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v3
  namespace: prod
  labels:
    app: my-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 10%      # During a rolling update, update at most 10% of Pods at a time
      maxUnavailable: 0  # No unavailable Pods are allowed during a rolling update, i.e. three available replicas must always be maintained
  selector:
    matchLabels:
      app: my-app
      version: v3
  template:
    metadata:
      labels:
        app: my-app
        version: v3
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:  # Soft (preferred) conditions
            - weight: 100  # weight is used to score nodes; the node with the highest score is selected first (meaningless if there is only one rule)
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - my-app
                    - key: version
                      operator: In
                      values:
                        - v3
                # Spread the Pods across multiple zones as much as possible
                topologyKey: topology.kubernetes.io/zone
          requiredDuringSchedulingIgnoredDuringExecution:  # Hard (required) conditions (add this as needed)
            # Note that there are no weights: every condition in the list must be met
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - my-app
                  - key: version
                    operator: In
                    values:
                      - v3
              # Pods must run on different nodes
              topologyKey: kubernetes.io/hostname
      securityContext:
        # runAsUser: 1000   # Set the user
        # runAsGroup: 1000  # Set the user group
        runAsNonRoot: true  # The Pod must run as a non-root user
        seccompProfile:     # secure computing mode
          type: RuntimeDefault
      nodeSelector:
        eks.amazonaws.com/nodegroup: common  # Use a dedicated node group; to use multiple node groups, use node affinity instead
      volumes:
        - name: tmp-dir
          emptyDir: {}
      containers:
        - name: my-app-v3
          image: my-app:v3  # A private image registry is recommended, to avoid docker.io image pull rate limits
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - mountPath: /tmp
              name: tmp-dir
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - "while [ $(netstat -plunt | grep tcp | wc -l | xargs) -ne 0 ]; do sleep 1; done"
          resources:  # It is recommended to set requests equal to limits, to avoid resource contention
            requests:
              cpu: 1000m
              memory: 1Gi
            limits:
              cpu: 1000m
              memory: 1Gi
          securityContext:
            # Make the container's root filesystem read-only to prevent tampering with container files
            ## If temporary files must be written, mount an emptyDir volume to provide writable storage
            readOnlyRootFilesystem: true
            # Forbid any privilege escalation
            allowPrivilegeEscalation: false
            capabilities:  # Dropping ALL is strict; relax as needed
              drop:
                - ALL
          startupProbe:  # Requires Kubernetes 1.18+
            httpGet:
              path: /actuator/health  # Simply reuse the health-check endpoint
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 1
            failureThreshold: 20  # The service gets at most 5s * 20 = 100s to start
            successThreshold: 1
          livenessProbe:
            httpGet:
              path: /actuator/health  # Spring's conventional health-check path
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 1
            failureThreshold: 5
            successThreshold: 1
          # Readiness probes are very important for a RollingUpdate to work properly
          readinessProbe:
            httpGet:
              path: /actuator/health  # For simplicity, reuse the livenessProbe endpoint; you can also define a separate one
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 1
            failureThreshold: 5
            successThreshold: 1
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  labels:
    app: my-app
  name: my-app-v3
  namespace: prod
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-v3
  maxReplicas: 50
  minReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-v3
  namespace: prod
  labels:
    app: my-app
spec:
  minAvailable: 75%
  selector:
    matchLabels:
      app: my-app
      version: v3
```
1. Graceful shutdown and 502/504 errors
If a Pod that is handling a large amount of traffic (say 1000+ QPS) is rescheduled because of a node failure or a spot-instance reclamation,
you may observe a small number of 502/504 errors while its containers are being terminated.
To understand this problem, you first need to understand the flow of Pod termination:
- The Pod's status is set to Terminating and, (almost) simultaneously, the Pod is removed from all associated Service Endpoints
- The preStop hook is executed; it can be a command or an HTTP call to a container in the Pod
  - If your program cannot exit gracefully on SIGTERM, consider using preStop
  - When modifying the program itself to support graceful exit is troublesome, implementing it via preStop is a very good option
- SIGTERM is sent to all containers in the Pod
- Kubernetes continues to wait, up to the time set in spec.terminationGracePeriodSeconds (30s by default)
  - Note that this graceful-exit countdown starts at the same time as preStop; it does not wait for preStop to finish!
- If the containers have still not stopped once spec.terminationGracePeriodSeconds is exceeded, Kubernetes sends SIGKILL to them
- After all processes terminate, the whole Pod is cleaned up completely
Note: steps 1 and 2 happen asynchronously, so "the Pod is still in the Service Endpoints, but preStop has already run" can occur; we have to account for this situation.
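Because the grace period and preStop run concurrently, spec.terminationGracePeriodSeconds must be long enough to cover both the preStop hook and the application's own shutdown. A minimal sketch, assuming a 15s preStop wait (the 60s value is illustrative; size it to your workload):

```yaml
spec:
  # Must cover the preStop duration plus the app's own graceful shutdown,
  # otherwise SIGKILL arrives before shutdown completes (default: 30s)
  terminationGracePeriodSeconds: 60
  containers:
    - name: my-app
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sleep", "15"]  # counts against the same 60s budget
```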
With the above process in mind, we can analyze the causes of the two error codes:
- 502: the application exits immediately upon receiving SIGTERM, so some in-flight requests are cut off; the proxy layer returns 502 to signal this
- 504: the Service Endpoints are not removed in time, so after the Pod has already been terminated, a few requests are still routed to it and get no response, resulting in 504
The usual fix is to add a 15s wait in the Pod's preStop step.
The rationale: once the Pod enters the Terminating state, it is removed from the Service Endpoints and receives no new requests.
Waiting 15s in preStop essentially guarantees that all in-flight requests finish before the container dies (generally, most requests take less than 300ms to process).
A simple example follows. When the Pod is terminated, it always waits 15s before SIGTERM is sent to the container:
```yaml
containers:
  - name: my-app
    # Add the following section
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sleep
            - "15"
```
A better approach is to wait until all TCP connections are closed (this requires netstat inside the image):
```yaml
containers:
  - name: my-app
    # Add the following section
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            - "while [ $(netstat -plunt | grep tcp | wc -l | xargs) -ne 0 ]; do sleep 1; done"
```
References:
- Kubernetes best practices: terminating with grace
- Graceful shutdown in Kubernetes is not always trivial
2. Node maintenance and Pod disruption budgets
When we evict the Pods on a node with kubectl drain,
Kubernetes evicts each Pod according to its PodDisruptionBudget (PDB).
If no explicit PodDisruptionBudget is set, the Pods are simply killed and then rescheduled on other nodes, which may cause a service interruption!
A PDB is a separate API resource. For example:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: podinfo-pdb
spec:
  # If the PDB cannot be satisfied, Pod eviction will fail!
  minAvailable: 1      # At least one Pod must remain available
  # maxUnavailable: 1  # Maximum number of unavailable Pods
  selector:
    matchLabels:
      app: podinfo
```
If evicting a Pod during node maintenance (kubectl drain) would violate its PDB, the eviction fails and drain keeps retrying. Example:
```shell
> kubectl drain node-205 --ignore-daemonsets --delete-local-data
node/node-205 cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-nfhj7, kube-system/kube-proxy-94dz5
evicting pod default/podinfo-7c84d8c94d-h9brq
evicting pod default/podinfo-7c84d8c94d-gw6qf
error when evicting pod "podinfo-7c84d8c94d-h9brq" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/podinfo-7c84d8c94d-h9brq
error when evicting pod "podinfo-7c84d8c94d-h9brq" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/podinfo-7c84d8c94d-h9brq
error when evicting pod "podinfo-7c84d8c94d-h9brq" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/podinfo-7c84d8c94d-h9brq
pod/podinfo-7c84d8c94d-gw6qf evicted
pod/podinfo-7c84d8c94d-h9brq evicted
node/node-205 evicted
```
In the example above, podinfo has two replicas, both running on node-205, and I set a disruption budget of minAvailable: 1.
When the Pods are evicted with kubectl drain, one of them is evicted right away, while evicting the other keeps failing for roughly 15 seconds:
the first Pod has not yet started on its new node, so evicting the second one would violate minAvailable: 1.
About 15 seconds later, the first evicted Pod came up on a new node; the remaining Pod then satisfied the PDB and was finally evicted, completing the drain of the node.
Cluster-node autoscalers such as Cluster Autoscaler also honor PodDisruptionBudgets when scaling nodes in. If your cluster uses such a dynamic node-scaling component, it is strongly recommended to set a PodDisruptionBudget for every service.
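To check that a PDB is in effect, you can query its status with kubectl get pdb, which shows how many voluntary disruptions are currently allowed. For the podinfo example above (two replicas, minAvailable: 1), the output would look roughly like:

```shell
> kubectl get pdb podinfo-pdb
NAME          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
podinfo-pdb   1               N/A               1                     5m
```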
Best practice: Deployment + HPA + PodDisruptionBudget
Generally speaking, each version of a service should include the following three resources:
- Deployment: manages the Pods of the service itself
- HPA: handles scaling the Pods in and out, usually based on the CPU metric
- PodDisruptionBudget (PDB): recommended to be set according to the HPA target
  - For example, with an HPA CPU target of 60%, consider setting the PDB to minAvailable: 65% so that at least 65% of the Pods stay available. Then, at the theoretical limit, even with the full QPS spread across the remaining 65% of Pods, there is no avalanche (assuming QPS scales perfectly linearly with CPU). A sketch of this pairing follows the list.
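A minimal sketch of that pairing, using placeholder names:

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # HPA CPU target: 60%
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  minAvailable: 65%  # slightly above the HPA target, so the surviving Pods can absorb the full load
  selector:
    matchLabels:
      app: my-app
```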
3. Node affinity and node groups
Within one cluster, we usually classify nodes into groups using labels, for instance these labels set automatically by Kubernetes:
- kubernetes.io/os: usually linux
- kubernetes.io/arch: amd64, arm64
- topology.kubernetes.io/region and topology.kubernetes.io/zone: the cloud provider's region and availability zone
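You can inspect the labels actually present on your nodes with kubectl:

```shell
# List all nodes together with their labels
kubectl get nodes --show-labels

# Filter nodes by a label, e.g. CPU architecture
kubectl get nodes -l kubernetes.io/arch=arm64
```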
Among the scheduling strategies, node affinity and Pod anti-affinity are used most often; the other two are used as appropriate.
1. Node affinity
If you use AWS, there are additional AWS-specific node labels:
- eks.amazonaws.com/nodegroup: the name of the AWS EKS node group; nodes in the same group use the same EC2 instance template
  - For example, arm64 node groups vs. amd64/x64 node groups
  - Node groups with a high memory-to-CPU ratio (such as M-series instances) vs. compute-optimized ones (such as C-series)
  - Spot-instance node groups: cheap, but highly dynamic; instances may be reclaimed at any time
  - On-demand node groups: these instances are expensive but stable
Suppose you want your Pods to run preferentially on spot instances and fall back to on-demand instances only when spot capacity is temporarily exhausted.
Then nodeSelector cannot express this; you need nodeAffinity. An example follows:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: xxx
  namespace: xxx
spec:
  # ...
  template:
    # ...
    spec:
      affinity:
        nodeAffinity:
          # Prefer nodes in spot-group-c
          preferredDuringSchedulingIgnoredDuringExecution:
            - preference:
                matchExpressions:
                  - key: eks.amazonaws.com/nodegroup
                    operator: In
                    values:
                      - spot-group-c
              weight: 80  # weight is used to score nodes; the node with the highest score is selected first
          # If spot-group-c is unavailable, nodes in ondemand-group-c may also be selected
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: eks.amazonaws.com/nodegroup
                    operator: In
                    values:
                      - spot-group-c
                      - ondemand-group-c
      containers:
        # ...
```
2. Pod anti-affinity
It is generally recommended to configure Pod anti-affinity in every Deployment's template, to spread the Pods across all nodes:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: xxx
  namespace: xxx
spec:
  replicas: 3
  # ...
  template:
    # ...
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:  # Soft (preferred) conditions
            - weight: 100  # weight is used to score nodes; the node with the highest score is selected first
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - xxx
                    - key: version
                      operator: In
                      values:
                        - v12
                # Spread the Pods across multiple zones as much as possible
                topologyKey: topology.kubernetes.io/zone
          requiredDuringSchedulingIgnoredDuringExecution:  # Hard (required) conditions
            # Note that there are no weights: every condition in the list must be met
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - xxx
                  - key: version
                    operator: In
                    values:
                      - v12
              # Pods must run on different nodes
              topologyKey: kubernetes.io/hostname
```
4. Readiness, liveness, and startup probes
Before Kubernetes 1.18, a common trick was to give the readiness and liveness probes a long initialDelaySeconds, approximating a startup probe: wait for the slow-starting container before probing (see the sketch below).
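A minimal sketch of that pre-1.18 workaround, assuming a 60s startup budget:

```yaml
containers:
  - name: my-app
    # ...
    livenessProbe:
      httpGet:
        path: /actuator/health
        port: 8080
      initialDelaySeconds: 60  # give the slow-starting container up to 60s before the first probe
      periodSeconds: 5
      failureThreshold: 5
```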
Pods provide the following three probes, all of which support checking service availability via a command, an HTTP API, or a TCP socket.
- startupProbe, the startup probe (Kubernetes v1.18 [beta]): the liveness and readiness probes only begin checking after this probe passes
  - It is used to check slow-starting containers, preventing them from being killed before they are up and running
  - The program gets at most failureThreshold * periodSeconds to start. For example, with failureThreshold=20 and periodSeconds=5, the startup budget is 100s; if the startup probe still has not passed after 100s, the container is killed.
- readinessProbe, the readiness probe:
  - If the readiness probe fails more than failureThreshold times (3 by default), the Pod is temporarily removed from the Service Endpoints until it passes successThreshold again
- livenessProbe, the liveness probe: checks whether the service is alive; it can catch deadlocks and kill such containers promptly.
  - Possible causes of liveness probe failure:
    - the application deadlocked and stopped responding to all requests
    - all application threads are stuck waiting on external dependencies such as redis/mysql, so requests get no response
  - If the liveness probe fails more than failureThreshold times (3 by default), the container is killed and then restarted according to the Pod's restart policy.
    - kubectl describe pod shows the restart reason as State.Last State.Reason = Error, Exit Code=137, and the Events contain a "Liveness probe failed: ..." entry.
The parameters of the three probe types are shared; the five time-related ones are listed below:
```yaml
# The values below are the Kubernetes defaults
initialDelaySeconds: 0  # no delay by default
periodSeconds: 10
timeoutSeconds: 1
failureThreshold: 3
successThreshold: 1
```
Example:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v3
spec:
  # ...
  template:
    # ...
    spec:
      containers:
        - name: my-app-v3
          image: xxx.com/app/my-app:v3
          imagePullPolicy: IfNotPresent
          # ... several configurations omitted
          startupProbe:
            httpGet:
              path: /actuator/health  # Simply reuse the health-check endpoint
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 1
            failureThreshold: 20  # The service gets at most 5s * 20 = 100s to start
            successThreshold: 1
          livenessProbe:
            httpGet:
              path: /actuator/health  # Spring's conventional health-check path
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 1
            failureThreshold: 5
            successThreshold: 1
          # Readiness probes are very important for a RollingUpdate to work properly
          readinessProbe:
            httpGet:
              path: /actuator/health  # For simplicity, reuse the livenessProbe endpoint; you can also define a separate one
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 1
            failureThreshold: 5
            successThreshold: 1
```
5. Pod security {#security}
Only the security-related parameters within a Pod are introduced here; cluster-wide security policies and the like are out of scope.
1. Pod SecurityContext
By setting a Pod's SecurityContext, you can apply a specific security policy to each Pod.
There are two kinds of SecurityContext:
- spec.securityContext: a PodSecurityContext object
  - As the name suggests, it applies to all containers in the Pod
- spec.containers[*].securityContext: a SecurityContext object
  - the container's private SecurityContext
The parameters of the two SecurityContexts only partially overlap; where they do overlap, spec.containers[*].securityContext takes precedence, as the sketch below shows.
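A minimal sketch of that precedence, with illustrative values:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: context-demo
spec:
  securityContext:      # PodSecurityContext: applies to every container...
    runAsNonRoot: true
    runAsUser: 1000
  containers:
    - name: app
      image: my-app:v3
      securityContext:  # ...but the container-level value wins where they overlap
        runAsUser: 2000 # this container runs as UID 2000, not 1000
```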
Security policies that escalate permissions, which we encounter from time to time:
- Privileged container: spec.containers[*].securityContext.privileged
- Adding optional system-level capabilities: spec.containers[*].securityContext.capabilities.add
  - Only a handful of containers, such as an ntp synchronization service, should enable this. Be aware that it is very dangerous.
- Sysctls, i.e. kernel parameter tuning: spec.securityContext.sysctls (a combined sketch follows this list)
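A minimal sketch combining these escalating settings in one Pod, for illustration only (the image name is hypothetical; enable such settings only when the workload truly needs them):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: escalation-demo
spec:
  securityContext:
    sysctls:                       # tune kernel parameters for this Pod
      - name: net.core.somaxconn  # a "safe" sysctl; unsafe ones must first be allowed by the kubelet
        value: "1024"
  containers:
    - name: ntp
      image: ntp-image:latest      # hypothetical image
      securityContext:
        # privileged: true         # full host access; almost never needed
        capabilities:
          add:
            - SYS_TIME             # e.g. an ntp service that must set the system clock
```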
Security policies that restrict permissions (it is strongly recommended to configure the following on all Pods as needed!):
- spec.volumes: read-only/read-write permissions can be set on all data volumes
- spec.securityContext.runAsNonRoot: true: the Pod must run as a non-root user
- spec.containers[*].securityContext.readOnlyRootFilesystem: true: makes the container's root filesystem read-only, preventing tampering with container files
  - If the microservice needs to read and write files, it is recommended to mount an extra emptyDir volume
- spec.containers[*].securityContext.allowPrivilegeEscalation: false: the Pod is not allowed to do any privilege escalation!
- spec.containers[*].securityContext.capabilities.drop: removes optional system-level capabilities
There are other options not listed here, such as specifying the user/group a container runs as; please consult the Kubernetes documentation.
An example Pod configuration for a stateless microservice:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: <Pod name>
spec:
  containers:
    - name: <container name>
      image: <image>
      imagePullPolicy: IfNotPresent
      # ... 500 words omitted here
      securityContext:
        readOnlyRootFilesystem: true     # Make the container's root filesystem read-only to prevent tampering with container files
        allowPrivilegeEscalation: false  # Forbid any privilege escalation
        capabilities:
          drop:
            # Forbid the container from using raw sockets; normally only hackers use them.
            # Raw sockets allow crafting network-layer data, bypassing the tcp/udp protocol stack and operating
            # directly on ip/icmp packets, enabling ip spoofing, custom protocols, and so on.
            # Dropping NET_RAW makes tcpdump unusable inside the container; remove this line temporarily when you need to capture packets.
            - NET_RAW
            # Better option: drop all capabilities directly
            # - ALL
  securityContext:
    # runAsUser: 1000   # Set the user
    # runAsGroup: 1000  # Set the user group
    runAsNonRoot: true  # The Pod must run as a non-root user
    seccompProfile:     # secure computing mode
      type: RuntimeDefault
```
2. seccomp: secure computing mode
seccomp and seccomp-bpf allow filtering of system calls, preventing user binaries from performing dangerous operations on the host operating system that are not normally needed. It is somewhat similar to Falco, though seccomp does not provide special support for containers.
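A minimal sketch of the two common ways to set a Pod-wide seccomp profile (the Localhost path is an assumption; such profiles live under the kubelet's seccomp directory, typically /var/lib/kubelet/seccomp):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-demo
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault  # use the container runtime's default profile
      # type: Localhost     # or a custom profile shipped to the node
      # localhostProfile: profiles/my-app.json  # path relative to the kubelet seccomp dir (assumed)
  containers:
    - name: app
      image: my-app:v3
```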