Manually limit the permissions of the worker RAM role to enhance node security - Container Service for Kubernetes

To ensure node security for an ACK managed cluster, you can manually limit the permissions of the worker Resource Access Management (RAM) role of the cluster based on the least privilege principle.

Prerequisites

An ACK managed cluster that runs Kubernetes 1.18 or later is created. ACK managed clusters are classified into ACK Pro clusters and ACK Basic clusters. For more information, see Create an ACK managed cluster and Update an ACK cluster.
If you want to limit the permissions of the worker RAM role of an ACK dedicated cluster, you must first migrate from the ACK dedicated cluster to an ACK Pro cluster. For more information, see Hot migration from ACK dedicated clusters to ACK Pro clusters.
Default roles are assigned to ACK to grant the permissions required by ACK managed clusters. For more information, see Assign default roles to ACK.

Step 1: Confirm whether permission limits are required

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage. On the cluster details page, click the Cluster Resources tab. Click the hyperlink on the right side of Worker RAM Role to go to the RAM console.
On the Permissions tab of the role details page, check whether policies are displayed.
- If no policy is displayed, you do not need to limit the permissions of the worker RAM role.
- If a policy is displayed, such as k8sWorkerRolePolicy-db8ad5c7***, you may need to limit the permissions of the worker RAM role. In this case, we recommend that you limit the permissions of the worker RAM role based on your requirements and the least privilege principle.
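The same check can be approximated from a terminal. The following is a minimal sketch, assuming that the Alibaba Cloud CLI (`aliyun`) is installed and configured and that your account can call the RAM `ListPoliciesForRole` operation. The `ALIYUN` override variable and the role name in the usage comment are illustrative and do not come from this topic.

```shell
# Command used to call the Alibaba Cloud CLI. Override it for dry runs, for example ALIYUN=echo.
ALIYUN="${ALIYUN:-aliyun}"

# List the policies attached to a RAM role.
# $1: name of the worker RAM role, as shown on the Cluster Resources tab.
list_role_policies() {
  role_name="${1:-}"
  if [ -z "$role_name" ]; then
    echo "usage: list_role_policies <role_name>" >&2
    return 2
  fi
  "$ALIYUN" ram ListPoliciesForRole --RoleName "$role_name"
}

# Example (hypothetical role name):
#   list_role_policies KubernetesWorkerRole-example
```

If the response lists no policies, the role matches the first case above and no further action is needed.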

Step 2: Update system components

Update the key system components of the ACK managed cluster to the minimum required version or the latest version. For more information, see Manage system components.

Important

Do not update multiple components at the same time. Instead, update them one at a time. Before you start to update a component, make sure that the previous component is successfully updated.
Before you update a component, we recommend that you read and understand the remarks of the component in the following table.

Components can be installed from the Add-ons page in the ACK console or by using node pools. The following sections describe how to update components installed by each method.

Components installed from the Add-ons page

Go to the Add-ons page and update the installed components based on the descriptions in the following table. If a component is already at the minimum required version or the latest version, redeploy the component by running the corresponding command in the following table or by clicking Redeploy in the ACK console.

Component	Minimum required version	Redeploy command	Remarks
metrics-server	v0.3.9.4-ff225cd-aliyun	`kubectl -n kube-system rollout restart deployment/metrics-server`	None
alicloud-monitor-controller	v1.5.5	`kubectl -n kube-system rollout restart deployment/alicloud-monitor-controller`	None
logtail-ds	v1.0.29.1-0550501-aliyun	`kubectl -n kube-system rollout restart daemonset/logtail-ds` `kubectl -n kube-system rollout restart deployment/alibaba-log-controller`
terway	v1.0.10.333-gfd2b7b8-aliyun	`kubectl -n kube-system rollout restart daemonset/terway`	Update the Terway component that corresponds to the Terway mode that is enabled. For more information about the Terway modes, see Introduction to Terway. After Terway is updated, you need to manually modify the configurations of Terway. For more information, see Check the configurations of Terway.
terway-eni	v1.0.10.333-gfd2b7b8-aliyun	`kubectl -n kube-system rollout restart daemonset/terway-eni`	Same as terway.
terway-eniip	v1.0.10.333-gfd2b7b8-aliyun	`kubectl -n kube-system rollout restart daemonset/terway-eniip`	Same as terway.
terway-controlplane	v1.2.1	`kubectl -n kube-system rollout restart deployment/terway-controlplane`	None
flexvolume	v1.14.8.109-649dc5a-aliyun	`kubectl -n kube-system rollout restart daemonset/flexvolume`	Upgrade from FlexVolume to CSI.
csi-plugin	v1.18.8.45-1c5d2cd1-aliyun	`kubectl -n kube-system rollout restart daemonset/csi-plugin`	None
csi-provisioner	v1.18.8.45-1c5d2cd1-aliyun	`kubectl -n kube-system rollout restart deployment/csi-provisioner`	None
storage-operator	v1.18.8.55-e398ce5-aliyun	`kubectl -n kube-system rollout restart deployment/storage-auto-expander` `kubectl -n kube-system rollout restart deployment/storage-cnfs` `kubectl -n kube-system rollout restart deployment/storage-monitor` `kubectl -n kube-system rollout restart deployment/storage-snapshot-manager` `kubectl -n kube-system rollout restart deployment/storage-operator`	None
alicloud-disk-controller	v1.14.8.51-842f0a81-aliyun	`kubectl -n kube-system rollout restart deployment/alicloud-disk-controller`	None
ack-node-problem-detector	1.2.16	`kubectl -n kube-system rollout restart deployment/ack-node-problem-detector-eventer`	None
aliyun-acr-credential-helper	v23.02.06.2-74e2172-aliyun	`kubectl -n kube-system rollout restart deployment/aliyun-acr-credential-helper`	Before you start the update, you must grant permissions. If you do not need to acquire custom RAM permissions or pull images across Alibaba Cloud accounts, go to the Add-ons page and modify the configurations of the component by setting the tokenMode parameter to managedRole. If you do not need to use the password-free image pulling feature, you can uninstall the component.
ack-cost-exporter	1.0.10	`kubectl -n kube-system rollout restart deployment/ack-cost-exporter`	Before you start the update, you must grant permissions.
mse-ingress-controller	1.1.5	`kubectl -n mse-ingress-controller rollout restart deployment/ack-mse-ingress-controller`	Before you start the update, you must grant permissions.
arms-prometheus	1.1.11	`kubectl -n arms-prom rollout restart deployment/arms-prometheus-ack-arms-prometheus`	None
ack-onepilot	3.0.11	`kubectl -n ack-onepilot rollout restart deployment/ack-onepilot-ack-onepilot`	Before you start the update, you must grant permissions.
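Comparing an installed version against the minimum required version in the table can be scripted. The following is a minimal sketch, assuming that versions are ordered by their dotted numeric part and that the leading `v` and the build suffix (such as `-ff225cd-aliyun`) can be ignored. It relies on `sort -V` from GNU coreutils. The function names are illustrative.

```shell
# Keep only the dotted numeric part: drop a leading "v" and any "-<build>" suffix.
normalize_version() {
  printf '%s\n' "${1:-}" | sed -e 's/^v//' -e 's/-.*$//'
}

# Succeed if the installed version ($1) is at least the minimum required version ($2).
version_ge() {
  installed=$(normalize_version "${1:-}")
  minimum=$(normalize_version "${2:-}")
  # sort -V orders version strings; the minimum must sort first (or be equal).
  lowest=$(printf '%s\n%s\n' "$installed" "$minimum" | sort -V | head -n 1)
  [ "$lowest" = "$minimum" ]
}

# Example, using the metrics-server row of the table:
#   version_ge "v0.3.9.4-ff225cd-aliyun" "v0.3.9.4-ff225cd-aliyun" && echo "meets the minimum"
```

The installed version itself is still best read from the Add-ons page, because an image tag does not necessarily match the component version shown in the console.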

cluster-autoscaler installed by using node pools

Component	Minimum required version	Redeploy command	Remarks
cluster-autoscaler	v1.3.1-bcf13de9-aliyun	`kubectl -n kube-system rollout restart deployment/cluster-autoscaler`	See the notes after this table.

You can use the following methods to view the version of cluster-autoscaler. For more information about how to update cluster-autoscaler, see [Component updates] Update cluster-autoscaler.

View the version of cluster-autoscaler in the ACK console. For more information, see View the version of cluster-autoscaler.

View the version of cluster-autoscaler by running the following command:

kubectl -n kube-system get deployment/cluster-autoscaler -o yaml | grep acs/autoscaler

Check the configurations of Terway

If terway, terway-eni, or terway-eniip is installed in your cluster, you need to manually check the configuration file of Terway, which is the eni-config ConfigMap in the kube-system namespace.

Run the following command to view and modify the eni-config ConfigMap:
```
kubectl edit cm eni-config -n kube-system
```
- If the "credential_path": "/var/addon/token-config", setting is included in the eni-config ConfigMap, no additional action is required.
- If the "credential_path": "/var/addon/token-config", setting is not included in the eni-config ConfigMap, add a new line below the min_pool_size parameter and specify "credential_path": "/var/addon/token-config", in that line.
```
"credential_path": "/var/addon/token-config",
```
Run the corresponding command in the preceding table to redeploy Terway.
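To confirm the setting without opening an editor, the ConfigMap content can be piped through a small check. The following is a minimal sketch. The function name is illustrative, and the `eni_conf` data key used in the usage comment is an assumption about how the ConfigMap stores the Terway configuration, so adjust it if your cluster differs.

```shell
# Read Terway configuration text on stdin and succeed only if it contains
# the credential_path setting with the expected value.
has_credential_path() {
  grep -q '"credential_path"[[:space:]]*:[[:space:]]*"/var/addon/token-config"'
}

# Example (the eni_conf key is an assumption):
#   kubectl -n kube-system get cm eni-config -o jsonpath='{.data.eni_conf}' | has_credential_path \
#     && echo "credential_path is set" \
#     || echo "credential_path is missing"
```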

Step 3: Use ActionTrail to collect cluster logs

Use ActionTrail to collect API audit logs to analyze the API operations performed in the cluster. This way, you can identify the applications that rely on the RAM policy attached to the worker RAM role of the cluster. For more information about the Alibaba Cloud services that work with ActionTrail, see Services that work with ActionTrail.

Note

We recommend that you collect audit logs that cover at least one week.

Go to the ActionTrail console and create a single-account trail in the region where the cluster resides. When you create the single-account trail, select Delivery to Simple Log Service. For more information, see Create a single-account trail.

Step 4: Perform a functional test on the cluster

After the preceding steps are completed, perform a functional test on the cluster to check whether the cluster works as expected.

Test item	Description	Reference
Computing	Whether the cluster can scale nodes as expected.	Scale a node pool
Network	Whether the cluster can assign IP addresses to pods as expected.	Application deployment
Storage	Whether the cluster can deploy workloads that use external storage as expected if external storage is enabled.	Storage - CSI
Monitoring	Whether the cluster can generate alerts as expected.	Observability
Scalability	Whether the cluster can automatically scale nodes as expected if auto scaling is enabled.	Auto scaling of nodes
Security	Whether the cluster can use the password-free image pulling feature as expected if the feature is enabled.	Use the aliyun-acr-credential-helper component to pull images without a password
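For the Network item, a quick way to spot pods that have no IP address is to filter a pod listing. The following is a minimal sketch, assuming that `kubectl` is configured for the cluster. The function name is illustrative. Pods that are still pending can also appear without an IP address, so treat the output as a starting point rather than a verdict.

```shell
# Read lines of "<namespace> <pod name> <pod IP>" on stdin and print
# "<namespace>/<pod name>" for every pod whose IP column is "<none>".
pods_without_ip() {
  awk '$3 == "<none>" { print $1 "/" $2 }'
}

# Example: kubectl prints "<none>" for a missing field in custom columns.
#   kubectl get pods -A --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,IP:.status.podIP' \
#     | pods_without_ip
```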

Important

After the functional test is completed, verify the logic of the business deployed in your cluster to ensure that the business runs as expected.

Step 5: Analyze the logs collected by ActionTrail

Log on to the Simple Log Service console.
In the Projects section, click the project that you want to manage.
On the details page of the project, choose Log Storage > Logstores and click the Logstore that you want to manage on the Logstores tab.
The name of the Logstore that stores the logs collected by ActionTrail in Step 3 is in the actiontrail_<trail name> format.
Use the following query statement to retrieve the API operations that the worker RAM role of the cluster performs by using STS tokens.
Replace <worker_role_name> with the name of the worker RAM role of the cluster.
```
* and event.userIdentity.userName: <worker_role_name> | select "event.serviceName", "event.eventName", count(*) as total GROUP BY "event.eventName", "event.serviceName"
```
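If you run the query for several clusters, substituting the role name by hand is error-prone. The following is a minimal sketch that builds the query statement shown above for a given role name. The function name and the role name in the usage comment are hypothetical.

```shell
# Print the Simple Log Service query statement for one worker RAM role.
# $1: name of the worker RAM role.
build_query() {
  role_name="${1:-}"
  if [ -z "$role_name" ]; then
    echo "usage: build_query <worker_role_name>" >&2
    return 2
  fi
  printf '* and event.userIdentity.userName: %s | select "event.serviceName", "event.eventName", count(*) as total GROUP BY "event.eventName", "event.serviceName"\n' "$role_name"
}

# Example (hypothetical role name):
#   build_query KubernetesWorkerRole-example
```

Paste the printed statement into the search box of the Logstore.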

Step 6: Limit the permissions of the worker RAM role

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage. On the cluster details page, click the Cluster Resources tab. Click the hyperlink on the right side of Worker RAM Role to go to the RAM console.
On the Permissions tab of the role details page, click the RAM policy that you want to manage. On the Policy Content tab, click Modify Policy Document.
Important
Before you modify the policy, make a copy of the original policy content in case you need to roll back the policy.
Delete permissions from the policy based on your business requirements and the analysis result generated in Step 5. For example, you can delete the API operations that are not included in the analysis result from the Action section of the policy content. If you confirm that none of the API operations in the policy content are required, you can detach the RAM policy from the worker RAM role.
Redeploy the affected system components. For more information, see the redeploy commands in Step 2.
Repeat Step 4, Step 5, and Step 6 until the worker RAM role provides only the minimum permissions required by the components and applications in your cluster.
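Finding the actions that the policy grants but the analysis never observed is a set difference. The following is a minimal sketch, assuming that you have saved two plain-text files with one action per line in the same `<service>:<Action>` format: one copied from the Action section of the policy and one derived from the Step 5 result. The function name and file layout are illustrative. Wildcard entries in the policy are compared literally, so review them by hand.

```shell
# Print the actions that appear in the policy file but not in the observed file.
# $1: file with the actions granted by the policy, one per line.
# $2: file with the actions observed in the audit logs, one per line.
unused_actions() {
  policy_file="${1:-}"
  observed_file="${2:-}"
  if [ ! -r "$policy_file" ] || [ ! -r "$observed_file" ]; then
    echo "usage: unused_actions <policy_actions_file> <observed_actions_file>" >&2
    return 2
  fi
  # comm requires sorted input; -23 keeps lines that are unique to the first input.
  observed_sorted=$(mktemp) || return 1
  sort -u "$observed_file" > "$observed_sorted"
  sort -u "$policy_file" | comm -23 - "$observed_sorted"
  rm -f "$observed_sorted"
}

# Example:
#   unused_actions policy_actions.txt observed_actions.txt
```

The printed actions are candidates for removal only. Keep the backup of the original policy until the functional test in Step 4 passes again.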

References

For more information about the authorization system of ACK, see Best practices of authorization.