By Wang Siyu (Jiuzhu), Technical Expert at Alibaba
For application O&M engineers, deploying and delivering stateful applications is no easy task. Common stateful operations include configuring disk persistence, assigning independent and stable network identifiers to all machines, and defining the publishing order. Kubernetes provides StatefulSets, a type of controller or workload used to deploy and run stateful applications in a Kubernetes environment.
Kubernetes provides Deployments for managing application orchestration.
A Deployment provides the following functions:
Simply put, a Deployment manages pods of the same version as identical replicas. Deployment controllers process pods of the same version in the same way, no matter what applications are deployed or what pod behaviors are configured.
This capability can meet the requirements of stateless applications, but what if you run stateful applications?
Let's take a look at the following requirements:
The preceding requirements cannot be met through Deployments. Kubernetes provides StatefulSets for managing stateful applications.
Many stateless applications in the community are also managed through StatefulSets. This article will explore why.
As shown in the right part of the preceding figure, each pod in a StatefulSet has an ordinal, which ranges from 0 to the defined number of replicas minus 1. Each pod has an independent network identifier (that is, a hostname) and an independent Persistent Volume Claim (PVC) that is bound to its own Persistent Volume (PV). Every pod in the same StatefulSet therefore has a unique hostname and an exclusive storage disk, which meets the needs of many stateful applications.
As shown in the right part of the preceding figure:
As shown in the left part of the preceding figure, a headless service is configured to assign an independent hostname to each pod in a StatefulSet. The service name is nginx.
In the right part, a StatefulSet is configured, where serviceName is set to nginx under spec. serviceName indicates the service that matches the StatefulSet.
spec contains other fields, such as selector and template. The selector field indicates a label selector. The label selection logic defined by selector must match app: nginx in template.metadata.labels. The template field defines an NGINX container of the Alpine image version, with Port 80 exposed as a web service.
template.spec defines volumeMounts. The mounted volume comes from volumeClaimTemplates, which is the PVC template. The PVC template defines a PVC called www-storage. volumeMounts.name is set to this PVC name, and the volume is mounted to the /usr/share/nginx/html directory. In this way, each pod has an independent PVC that is mounted to the corresponding directory in the container.
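The figure with the orchestration file is not reproduced here, so the following is a minimal sketch of the two objects described above, written as Python dictionaries and printed as JSON (which kubectl also accepts). The exact image tag nginx:alpine, the port name web, the access mode, and the 1Gi storage request are assumptions made for illustration; the other names come from the description above.

```python
import json

# Headless Service: clusterIP "None" lets each pod get its own DNS record.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "nginx", "labels": {"app": "nginx"}},
    "spec": {
        "clusterIP": "None",
        "selector": {"app": "nginx"},
        "ports": [{"name": "web", "port": 80}],  # port name is an assumption
    },
}

statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "nginx-web"},
    "spec": {
        "serviceName": "nginx",  # must match the headless Service name
        "replicas": 3,
        "selector": {"matchLabels": {"app": "nginx"}},
        "template": {
            "metadata": {"labels": {"app": "nginx"}},
            "spec": {
                "containers": [{
                    "name": "nginx",
                    "image": "nginx:alpine",  # assumed tag for the Alpine image
                    "ports": [{"name": "web", "containerPort": 80}],
                    "volumeMounts": [{
                        "name": "www-storage",  # refers to the PVC template below
                        "mountPath": "/usr/share/nginx/html",
                    }],
                }],
            },
        },
        "volumeClaimTemplates": [{
            "metadata": {"name": "www-storage"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],               # assumption
                "resources": {"requests": {"storage": "1Gi"}},  # assumption
            },
        }],
    },
}

manifest = {"apiVersion": "v1", "kind": "List", "items": [service, statefulset]}
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to kubectl apply -f should create both objects. Building the manifest in code also makes it easy to check that the selector, the template labels, and the volume names agree.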
After you create a headless service and a StatefulSet, run the get command to verify that the NGINX service has been created.
The kubectl get endpoints nginx command output shows that the NGINX service registers three IP addresses and a port. The IP addresses map to the pod IP addresses, and the port maps to Port 80 configured under spec.
The kubectl get sts nginx-web command output includes the READY column with the value 3/3. The first number 3 indicates the number of pods that are currently in the Ready state, and the second number 3 indicates the desired number of pods in the StatefulSet.
The kubectl get pod command output shows that all three pods are in the Running state and are ready. The pod IP addresses map to the IP addresses in the kubectl get endpoints nginx command output.
In the kubectl get pvc command output, the NAME column includes names that consist of three elements: www-storage, nginx-web, and a number as the suffix. www-storage is the name defined in volumeClaimTemplates, nginx-web is the name of the StatefulSet, and the number is the ordinal of the pod. Each PVC is bound to one of the three pods. Each PVC is also bound to a PV, so different pods are bound to different PVs.
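The naming rule described above can be sketched in a few lines of Python. The helper names pod_name and pvc_name are hypothetical and exist only for this illustration.

```python
def pod_name(statefulset: str, ordinal: int) -> str:
    """Pod names are the StatefulSet name plus the ordinal."""
    return f"{statefulset}-{ordinal}"


def pvc_name(claim_template: str, statefulset: str, ordinal: int) -> str:
    """PVC names are the claim template name plus the pod name."""
    return f"{claim_template}-{pod_name(statefulset, ordinal)}"


# Map each of the three pods to the PVC it mounts.
pairs = {
    pod_name("nginx-web", i): pvc_name("www-storage", "nginx-web", i)
    for i in range(3)
}
for pod, pvc in pairs.items():
    print(pod, "->", pvc)
```

Because the PVC name is derived only from the template name, the StatefulSet name, and the ordinal, a recreated pod with the same ordinal resolves to the same PVC.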
Deployments use ReplicaSets to manage pod versions and keep the desired number of pods, whereas a StatefulSet controller manages its pods directly. A StatefulSet identifies the version of each managed pod through the pod label controller-revision-hash. This label is similar to the pod-template-hash label that a Deployment injects into its pods through a ReplicaSet.
As shown in the preceding figure, the kubectl get pod command output includes the controller-revision-hash label, whose hash value identifies the template version of the pod when it was first created. Here, the value ends with 677759c9b8. Note down this value so that you can check whether it changes when the pods are upgraded later.
As shown in the preceding figure, the command applies a StatefulSet configuration in which the image of the StatefulSet is upgraded to a newer version. Run the kubectl get pod command to query the revision hash. The command output shows that the controller-revision-hash values of the three pods have changed to the new revision hash, which ends with 7c55499668. By viewing the pod creation times, you can see that Pod 2 was created first, followed by Pod 1 and Pod 0. The creation times therefore reveal the upgrade order: Pod 2 -> Pod 1 -> Pod 0. The PVCs used by the pods remain unchanged after the upgrade, so the data stored in the PVs before the upgrade is mounted to the upgraded pods.
As shown in the right part of the preceding figure, there are several key fields in the status part of the StatefulSet:
All pods are upgraded to the target version if currentReplicas is the same as updatedReplicas and currentRevision is the same as updateRevision.
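As a minimal sketch, the completion check can be written as a function over the status part of the StatefulSet. The function name rollout_complete is hypothetical, the field names follow the Kubernetes API (which spells the replica counter updatedReplicas), and the sample statuses are made up for illustration using the two hash suffixes quoted earlier.

```python
def rollout_complete(status: dict) -> bool:
    """Return True when every pod runs the target template version."""
    revisions_match = status["currentRevision"] == status["updateRevision"]
    replicas_match = (
        status.get("currentReplicas", 0) == status.get("updatedReplicas", 0)
    )
    return revisions_match and replicas_match


# Illustrative status while the upgrade is still in progress.
in_progress = {
    "currentRevision": "nginx-web-677759c9b8",
    "updateRevision": "nginx-web-7c55499668",
    "currentReplicas": 2,
    "updatedReplicas": 1,
}

# Illustrative status after all three pods have been upgraded.
finished = {
    "currentRevision": "nginx-web-7c55499668",
    "updateRevision": "nginx-web-7c55499668",
    "currentReplicas": 3,
    "updatedReplicas": 3,
}

print(rollout_complete(in_progress))
print(rollout_complete(finished))
```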
Assume that you have connected to an Alibaba Cloud cluster, which has three nodes.
The following shows how to create a StatefulSet and a service. Let's look at the orchestration file.
As shown in the preceding figure, in the service configuration, Port 80 is exposed for NGINX. In the StatefulSet configuration, metadata.name is set to nginx-web, image information is defined in template.containers, and volumeClaimTemplates is defined as a PVC template.
Run the command shown in the preceding figure to create a service and a StatefulSet. Run the kubectl get pod command to verify that Pod 0 was created first. Run the kubectl get pvc command to verify that PVC 0 is bound to a PV.
The preceding command output shows that Pod 0 is being created and is in the ContainerCreating state.
After Pod 0 is created, Pod 1 and Pod 2 are created in sequence, and related PVCs are also created.
A PVC is created before each pod is created. After the PVC is created, it transitions from the Pending state to the Bound state, indicating that the PVC is bound to a PV. The pod then enters the ContainerCreating state and finally reaches the Running state.
Run the kubectl get sts nginx-web -o yaml command to view the StatefulSet status.
As shown in the preceding command output, the desired number of replicas is 3, the number of available replicas is also 3, and the current version is the latest.
Port 80 is exposed for the NGINX service, which has three IP addresses.
Run the kubectl get pod command to verify that the IP addresses of the three pods map to the IP addresses under ENDPOINTS.
In short, the three PVCs and three pods reach the desired status. In the status data of the StatefulSet, the values of currentReplicas and readyReplicas are both 3.
In the upgrade command, kubectl set image is the fixed part that declares an image change. statefulset indicates the resource type. nginx-web indicates the resource name. In nginx=nginx:mainline, nginx before the equal sign indicates the container name defined under template, and nginx:mainline indicates the target image version.
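The command itself appeared only in the figure. Assembled from the parts described above, it can be sketched as follows. The helper set_image_command is hypothetical; it only builds the argument list and does not contact a cluster.

```python
import shlex


def set_image_command(resource_type: str, resource_name: str,
                      container: str, image: str) -> list:
    """Build the argument list for a kubectl set image call."""
    return ["kubectl", "set", "image", resource_type, resource_name,
            f"{container}={image}"]


cmd = set_image_command("statefulset", "nginx-web", "nginx", "nginx:mainline")
print(shlex.join(cmd))
# prints: kubectl set image statefulset nginx-web nginx=nginx:mainline
```

Passing cmd to subprocess.run would execute it against whichever cluster the current kubectl context points to.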
By running the preceding command, you can upgrade the image in the StatefulSet to the target version.
Run the kubectl get pod command to view the pod status. nginx-web-1 and nginx-web-2 are in the Running state, and their controller-revision-hash values indicate the latest version. The original nginx-web-0 pod has been deleted, and the new pod is being created.
Check the pod status again, and you can see that all pods are in the Running state.
View the StatefulSet information. In the status part, currentRevision shows the latest version, indicating that the three pods in the StatefulSet are of the latest version.
How do we determine whether the three pods still use their original hostnames and PVCs after the upgrade?
The hostnames configured for the headless service are associated only with pod names and, therefore, can be reused by the upgraded pods as long as the pod names remain unchanged after the upgrade.
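To make this concrete, the DNS name that a headless service gives each pod can be derived from the pod name alone. The sketch below assumes the default namespace and the common cluster domain cluster.local; both are assumptions, and the helper pod_dns_name is hypothetical.

```python
def pod_dns_name(statefulset: str, ordinal: int, service: str,
                 namespace: str = "default",
                 cluster_domain: str = "cluster.local") -> str:
    """Return the stable DNS name of one StatefulSet pod."""
    pod = f"{statefulset}-{ordinal}"
    return f"{pod}.{service}.{namespace}.svc.{cluster_domain}"


names = [pod_dns_name("nginx-web", i, "nginx") for i in range(3)]
for name in names:
    print(name)
```

Nothing in the name depends on the pod IP address or the template version, which is why an upgraded pod with the same ordinal keeps the same hostname.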
The preceding command output shows that the PVC creation time is still the same as the time when the pod was created for the first time, indicating that the upgraded pods still use their original PVCs.
For example, by viewing the details of a pod, you can see that, in the pod's volumes declaration, the name www-storage-nginx-web-0 under persistentVolumeClaim matches the PVC with the ordinal 0, which is the PVC used by the old pod. When a pod is upgraded, the StatefulSet controller deletes the old pod and creates a pod with the same name as the old pod, so the new pod reuses the PVC of the old pod.
This enables the reuse of network storage after pod upgrades.
A StatefulSet supports the creation of three types of resources:
ControllerRevision allows the StatefulSet to manage different template versions.
For example, a ControllerRevision is created for the first template version of the NGINX service when this service is created. After the image version is modified, the StatefulSet controller creates another ControllerRevision. In other words, each ControllerRevision maps to one template version and to that version's revision hash. A ControllerRevision is named after the revision hash that is recorded in the controller-revision-hash pod label.
If you define volumeClaimTemplates in a StatefulSet, then the StatefulSet creates a PVC based on this template before creating a pod and adds the PVC to the pod volume.
If you do not define a PVC template under spec, then no independent PV is mounted to the created pod.
A StatefulSet creates, deletes, and upgrades pods in order. Each pod has a unique ordinal.
As shown in the preceding figure, the StatefulSet controller owns three types of resources: ControllerRevision, pod, and PVC.
In the current version, the StatefulSet adds OwnerReferences only to ControllerRevisions and pods, but not to PVCs. When a resource with OwnerReferences is deleted, its associated resources are also deleted by default. Therefore, after a StatefulSet is deleted, the ControllerRevisions and pods created by the StatefulSet are also deleted. However, the created PVCs are not deleted because they do not have OwnerReferences.
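The effect of OwnerReferences on deletion can be illustrated with a small model. This is only a sketch of the cascading behavior described above, not the real garbage collector; the Resource class and the cascade_delete function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Resource:
    kind: str
    name: str
    owner: Optional[str] = None  # name of the owning StatefulSet, if any


def cascade_delete(resources: List[Resource], owner_name: str) -> List[Resource]:
    """Return the resources that survive deleting the given owner."""
    return [r for r in resources if r.owner != owner_name]


resources = [
    Resource("ControllerRevision", "nginx-web-677759c9b8", owner="nginx-web"),
    Resource("Pod", "nginx-web-0", owner="nginx-web"),
    Resource("Pod", "nginx-web-1", owner="nginx-web"),
    Resource("Pod", "nginx-web-2", owner="nginx-web"),
    # PVCs carry no OwnerReferences, so they are not tied to the StatefulSet.
    Resource("PersistentVolumeClaim", "www-storage-nginx-web-0"),
    Resource("PersistentVolumeClaim", "www-storage-nginx-web-1"),
    Resource("PersistentVolumeClaim", "www-storage-nginx-web-2"),
]

survivors = cascade_delete(resources, "nginx-web")
print([r.name for r in survivors])
```

Only the three PVCs remain, so the data they hold is preserved until you delete them explicitly.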
The preceding figure shows the workflow of a StatefulSet controller.
The StatefulSet controller first registers the event handlers of the informer to process changes of the StatefulSet and its pods. In the controller logic, upon receiving a change of the StatefulSet or a pod, the StatefulSet controller adds the StatefulSet to a queue. Then, the StatefulSet controller dequeues the StatefulSet and performs the Update Revision operation. That is, the StatefulSet controller checks whether a ControllerRevision already exists for the current template of the StatefulSet. If no matching ControllerRevision is available, the template has been updated. In this case, the StatefulSet controller creates a revision, which results in a new ControllerRevision hash.
Then, the StatefulSet controller fetches the pods of all versions and sorts them by ordinal. If any pods are missing, they are created in the order of their ordinals. If any pods are redundant, they are deleted in the order of their ordinals. When the number of pods and the pod ordinals are consistent with the number and ordinals of replicas, the StatefulSet controller checks whether to upgrade the pods. In the Manage pods in order process, the StatefulSet controller checks whether the pods match the expected ordinals. In the Update in order process, the StatefulSet controller checks whether pods are of the desired version. If not, the StatefulSet controller upgrades the pods in the order of their ordinals.
The Update in order process is essentially a process of deleting pods. After a pod is deleted, the StatefulSet controller finds in the next cycle that this pod is missing from the acquired StatefulSet. Then, the StatefulSet controller creates the pod again during the Manage pods in order process. After that, the StatefulSet controller updates the status. The resulting status can be displayed by running the command described in the preceding sections.
Through this workflow, the StatefulSet controller can manage stateful applications.
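The workflow above can be approximated with a toy reconcile loop. This is a simplified sketch and not the real controller code: it tracks only the revision of each pod, treats every created pod as immediately ready, and performs at most one action per cycle. The function reconcile_step and the revision names rev1 and rev2 are hypothetical.

```python
def reconcile_step(pods: dict, replicas: int, update_revision: str) -> str:
    """Run one cycle on pods (ordinal -> revision) and describe the action."""
    # Manage pods in order: create missing pods, lowest ordinal first.
    for ordinal in range(replicas):
        if ordinal not in pods:
            pods[ordinal] = update_revision
            return f"create pod-{ordinal} at {update_revision}"

    # Manage pods in order: delete redundant pods, highest ordinal first.
    redundant = sorted((o for o in pods if o >= replicas), reverse=True)
    if redundant:
        del pods[redundant[0]]
        return f"delete pod-{redundant[0]} (redundant)"

    # Update in order: delete the outdated pod with the greatest ordinal.
    for ordinal in range(replicas - 1, -1, -1):
        if pods[ordinal] != update_revision:
            del pods[ordinal]
            return f"delete pod-{ordinal} (outdated)"

    return "in sync"


pods = {0: "rev1", 1: "rev1", 2: "rev1"}
actions = []
while True:
    action = reconcile_step(pods, replicas=3, update_revision="rev2")
    if action == "in sync":
        break
    actions.append(action)

for line in actions:
    print(line)
```

The printed sequence alternates between deleting an outdated pod and recreating it, starting with pod-2 and ending with pod-0, which matches the delete-then-recreate behavior described above.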
Assume that the initial configuration of a StatefulSet is as follows: The number of replicas is 1 and one pod, Pod 0, is managed. If you change the number of replicas from 1 to 3, Pod 1 is created first, and Pod 2 is created when Pod 1 is in the Ready state.
As shown in the preceding figure, pods in the StatefulSet are created and numbered from 0. The ordinals of the pods in a StatefulSet with N replicas fall in the range [0, N). When N is greater than 0, the ordinals of pods range from 0 to N - 1.
If you do not want to create or delete pods in the order of their ordinals, StatefulSets allow you to do so through other logic. This is why some people in the community also use StatefulSets to manage stateless applications. A StatefulSet assigns a unique network identifier and independent network storage to each of its managed pods and supports scaling pods concurrently.
StatefulSet.spec includes the podManagementPolicy field, which can be set to OrderedReady (default value) or Parallel.
If podManagementPolicy is unspecified, the StatefulSet controller uses OrderedReady by default, and pods are scaled in order. This means a pod is scaled only after the previous pods are in the Ready state. When pods are scaled down, they are deleted in the descending order of their ordinals.
For example, in the preceding figure, when the StatefulSet is scaled from Pod 0 to Pod 0, Pod 1, and Pod 2, Pod 1 is created first, and Pod 2 is created only when Pod 1 is in the Ready state. If Pod 0 changes to the Not Ready state due to a host or application error while Pod 1 is being created, the StatefulSet controller does not create Pod 2. In other words, a pod is created only when all the preceding pods are in the Ready state. In this example, the StatefulSet controller can create Pod 2 only after both Pod 0 and Pod 1 are in the Ready state.
If podManagementPolicy is set to Parallel, pods are scaled in parallel, without having to wait until all preceding pods are ready or pods with greater ordinals are deleted.
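The difference between the two policies during scale-out can be sketched as a function that decides which pods may be created in the current cycle. This is a simplified model of the behavior described above; the function pods_to_create is hypothetical.

```python
def pods_to_create(ready: dict, desired_replicas: int,
                   policy: str = "OrderedReady") -> list:
    """Return ordinals that may be created now; ready maps ordinal -> bool."""
    missing = [o for o in range(desired_replicas) if o not in ready]
    if policy == "Parallel":
        # Parallel: create every missing pod without waiting.
        return missing
    if not missing:
        return []
    # OrderedReady: create only the next pod, and only if all pods with
    # smaller ordinals exist and are Ready.
    candidate = missing[0]
    if all(ready.get(o, False) for o in range(candidate)):
        return [candidate]
    return []


print(pods_to_create({0: True}, 3))                      # [1]
print(pods_to_create({0: True}, 3, policy="Parallel"))   # [1, 2]
print(pods_to_create({0: False, 1: True}, 3))            # []
print(pods_to_create({0: True, 1: True}, 3))             # [2]
```

The third call models the case described above: Pod 0 has become Not Ready, so Pod 2 is held back even though Pod 1 is ready.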
Assume that StatefulSet Template 1 maps to Revision 1 in the logic. The three pods in the StatefulSet are of the Revision 1 version. After you modify the template, such as modifying the image, the StatefulSet controller upgrades pods one by one in descending order of their ordinals. The StatefulSet controller first creates Revision 2, which maps to the ControllerRevision 2 resource, whose name is used as the new revision hash. It then upgrades Pod 2 by deleting the pod and recreating it with the new version. After Pod 2 is upgraded, the StatefulSet controller deletes and recreates Pod 1 and then Pod 0 in the same way.
The logic here is simple. In the upgrade process, the StatefulSet controller deletes the old-version pod with the greatest ordinal. During the next reconcile cycle, the StatefulSet controller finds that this pod is missing and creates a pod of the new version in its place.
Let's take a look at the following spec fields:
As shown in the right part of the preceding figure, the StatefulSetUpdateStrategyType field can be set to RollingUpdate or OnDelete.
RollingUpdateStatefulSetStrategy has the Partition field, which indicates the number of pods that keep the old version during the rolling upgrade. Note that this is the number of pods that stay on the old version, not the number of pods that are upgraded to the new version during the grayscale upgrade.
For example, assume that a StatefulSet has 10 replicas and that the Partition field is set to 8. In this case, eight pods keep the old version, and the other two pods are upgraded to the new version during the grayscale upgrade. When there are 10 replicas, the pod ordinals fall in the range [0, 10). When Partition is set to 8, the eight pods in the ordinal range [0, 8) keep the old version, and the two pods in the ordinal range [8, 10) are upgraded to the new version.
Assume replicas = N and Partition = M (M < N). The pods that keep the old version fall in the ordinal range [0, M), and the pods upgraded to the new version fall in the ordinal range [M, N). The Partition field can be used for the grayscale upgrade, which is currently not supported by Deployments.
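The ranges above can be checked with a short sketch. The function split_by_partition is hypothetical, and for simplicity it only accepts partition values between 0 and the number of replicas.

```python
def split_by_partition(replicas: int, partition: int):
    """Return (ordinals kept on the old version, ordinals upgraded).

    The upgraded ordinals are listed in upgrade order, highest first.
    """
    if not 0 <= partition <= replicas:
        raise ValueError("this sketch expects 0 <= partition <= replicas")
    kept = list(range(0, partition))                   # [0, M)
    upgraded = list(range(replicas - 1, partition - 1, -1))  # [M, N), descending
    return kept, upgraded


kept, upgraded = split_by_partition(replicas=10, partition=8)
print(kept)      # [0, 1, 2, 3, 4, 5, 6, 7]
print(upgraded)  # [9, 8]
```

Lowering the partition step by step, for example from 8 to 5 and then to 0, widens the upgraded range until every pod runs the new version.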
Let's summarize what we have learned in this article: