Custom auto scaling rules let EMR automatically add or remove task nodes in response to workload fluctuations, keeping jobs running without over-provisioning resources.
Prerequisites
Before you begin, make sure that you have:
A DataLake, Dataflow, online analytical processing (OLAP), DataServing, or custom cluster
A task node group with pay-as-you-go or preemptible instances in the cluster. For details, see Create a node group.
Limitations
You can add up to 10 Elastic Compute Service (ECS) instance types per node group. When the cluster scales out, EMR tries each instance type in order, starting with the first one. If an instance type is unavailable, EMR falls back to the next type.
Only clusters with the YARN service deployed support load-based scaling rules.
Usage notes
Auto scaling rules are applied per node group. If no rule is configured for a node group, that node group never scales automatically.
EMR automatically finds and lists ECS instance types that match your specified instance specifications. Select at least one instance type from the list so that the cluster can scale based on that type.
When multiple rules are configured and their trigger conditions are met simultaneously, EMR executes them in the following order:
1. Scale-out rules take precedence over scale-in rules.
2. Time-based and load-based rules run in the order they are triggered.
3. Load-based rules triggered by the same metric run in the order they were configured.
Do not set scale-out and scale-in rules to execute at the same point in time. Overlapping execution times can cause conflicts that prevent scaling operations from completing.
Choose a scaling type
Before configuring rules, identify which scaling type fits your workload:
| Scaling type | Use when | Example |
|---|---|---|
| Time-based scaling | Workloads follow predictable schedules | Add nodes at 08:00 and remove them at 20:00 every weekday; scale down a dev cluster after business hours |
| Load-based scaling | Workload fluctuations are unpredictable | Order processing spikes that vary by time of day or user demand |
Only clusters with YARN deployed support load-based scaling. If your cluster does not have YARN, use time-based scaling.
Configure auto scaling rules
Method 1: Add rules to an existing cluster
1. Log on to the EMR console.
2. In the top navigation bar, select a region and a resource group.
3. On the EMR on ECS page, click the cluster name in the Cluster ID/Name column.
4. Click the Auto Scaling tab.
5. In the Configure Auto Scaling Rule section, click the Configure Auto Scaling tab, then click Custom Auto Scaling Rule. In the Modify Auto Scaling Rule dialog, click Reconfigure.
6. Find the target node group and click Edit in the Actions column.
7. In the Configure Auto Scaling panel, set the parameters. See Auto scaling parameters.
8. Click Save And Apply. Scale-out activities are triggered for the node group when the configured conditions are met.
Method 2: Add rules when creating a cluster
1. Log on to the EMR console.
2. In the top navigation bar, select a region and a resource group.
3. Click Create Cluster. For details, see Create a cluster.
4. Add a pay-as-you-go task node group before configuring auto scaling rules.
5. In the Cluster Scaling section, click Custom Auto Scaling Rule. Find the target node group and click Edit in the Actions column.
6. In the Configure Auto Scaling panel, set the parameters. See Auto scaling parameters.
7. Click Save And Apply. After the cluster is created, scaling activities are triggered when the configured conditions are met.
Method 3: Configure rules with the SDK
Call the following API operations to configure auto scaling rules programmatically:
| Operation | Description |
|---|---|
| CreateCluster | Create a cluster with auto scaling rules |
| CreateNodeGroup | Create a task node group with auto scaling rules |
| PutAutoScalingPolicy | Configure or update auto scaling rules for an existing task node group |
The following Java example configures a load-based scale-out rule using PutAutoScalingPolicy.
Use a Security Token Service (STS) token to initialize the credentials client. STS tokens provide short-lived access and are more secure than long-term credentials. For details, see Manage access credentials.
The auto-generated sample is truncated here; the following is a minimal sketch of the same call. It assumes the tea-generated client and model classes of the EMR 2021-03-20 Java SDK — class names, setter names, the endpoint, and the placeholder IDs are illustrative, so verify them against the SDK version you install.

```java
package com.aliyun.sample;

import com.aliyun.emr20210320.Client;
import com.aliyun.emr20210320.models.PutAutoScalingPolicyRequest;
import com.aliyun.teaopenapi.models.Config;

public class Sample {
    public static void main(String[] args) throws Exception {
        // Read STS credentials from environment variables; do not hardcode keys.
        Config config = new Config()
                .setAccessKeyId(System.getenv("ALIBABA_CLOUD_ACCESS_KEY_ID"))
                .setAccessKeySecret(System.getenv("ALIBABA_CLOUD_ACCESS_KEY_SECRET"))
                .setSecurityToken(System.getenv("ALIBABA_CLOUD_SECURITY_TOKEN"))
                .setEndpoint("emr.cn-hangzhou.aliyuncs.com");
        Client client = new Client(config);

        // Replace the placeholder IDs with your cluster and task node group.
        // Populate the scaling rules and node count constraints according to
        // the parameter tables in this topic.
        PutAutoScalingPolicyRequest request = new PutAutoScalingPolicyRequest()
                .setClusterId("c-xxxxxxxxxxxx")
                .setNodeGroupId("ng-xxxxxxxxxxxx");
        client.putAutoScalingPolicy(request);
    }
}
```

Auto scaling parameters
Node count limits
Set the upper and lower bounds for the number of nodes in the node group. To modify these bounds, click Modify Limit.
| Parameter | Description |
|---|---|
| Maximum Number Of Instances | The node group stops scaling out when it reaches this count. |
| Minimum Number Of Instances | The node group stops scaling in when it reaches this count. |
Time-based scaling parameters
Time-based rules add or remove a fixed number of nodes at scheduled times — daily, weekly, or monthly. Configure both scale-out and scale-in rules to manage the full scaling cycle.
Scale-out and scale-in rules must not share the same execution time.
The following table describes the scale-out rule parameters. Scale-in rules use the same parameters.
| Parameter | Description |
|---|---|
| Scale Out Type | Select Scale Out by Time. |
| Rule Name | A unique name for the rule within the cluster. |
| Frequency | Execute Repeatedly: triggers at the same time every day, week, or month. Execute Only Once: triggers once at the specified time. |
| Execution Time | The time at which the rule triggers. |
| Rule Expiration Time | The date after which the rule stops triggering. Available only when Frequency is set to Execute Repeatedly. |
| Retry Time Range | How long EMR retries if scaling fails at the scheduled time. Valid values: 0–3600 seconds. EMR retries every 30 seconds within this window. For example, if another scaling operation is running or in cooldown at the scheduled time, EMR keeps retrying until the window expires or conditions are met. |
| Nodes For Each Scale Out | The number of nodes to add each time the rule triggers. |
| Best-effort Delivery | When enabled, EMR delivers all successfully provisioned nodes even if the full requested count is unavailable. For example, if you request 100 nodes and only 90 are available, EMR delivers 90 instead of failing the entire operation. Turn this on to maximize scale-out success. |
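The arithmetic behind Retry Time Range is simple but easy to misread, so here is a small sketch. It is illustrative only, derived from the parameter description above (one attempt every 30 seconds within a 0–3600 second window); EMR performs this retry loop internally.

```java
public class RetryWindow {
    // Upper bound on retry attempts EMR can make within the configured
    // window, given one attempt every 30 seconds.
    static int maxRetries(int retryWindowSeconds) {
        if (retryWindowSeconds < 0 || retryWindowSeconds > 3600) {
            throw new IllegalArgumentException("valid range is 0-3600 seconds");
        }
        return retryWindowSeconds / 30;
    }

    public static void main(String[] args) {
        // A 10-minute window allows up to 20 retries.
        System.out.println(maxRetries(600)); // prints 20
    }
}
```

Setting the window to 0 disables retries entirely: if the scheduled attempt collides with a running scaling operation or a cooldown, that trigger is simply skipped.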
Load-based scaling parameters
Load-based scaling requires the YARN service.
Load-based rules trigger when YARN resource metrics cross a threshold. This lets the cluster respond to actual workload pressure rather than a fixed schedule.
The following table describes the scale-out rule parameters. Scale-in rules use the same parameters.
| Parameter | Description |
|---|---|
| Scale Out Type | Select Scale Out by Load. |
| Rule Name | A unique name for the rule within the cluster. |
| Load Metric-based Trigger Conditions | Select at least one YARN load metric and configure its threshold. Click Add Metric to add multiple metrics. For available metrics, see YARN load metrics. |
| Multi-metric Relationship | All Metrics Meet the Conditions: all selected metrics must breach their thresholds to trigger the rule. Any Metric Meets the Condition: any single metric breaching its threshold triggers the rule. |
| Statistical Period | The evaluation window for each metric. EMR collects data, aggregates it (average, maximum, or minimum), and compares the result to the threshold. A shorter period makes the rule more sensitive to spikes. |
| Repetitions That Trigger Scale Out | The number of consecutive evaluation periods in which the metric must breach the threshold before the rule triggers. |
| Nodes For Each Scale Out | The number of nodes to add each time the rule triggers. |
| Best-effort Delivery | When enabled, EMR delivers all successfully provisioned nodes even if the full requested count is unavailable. Decide based on whether partial scale-out is acceptable for your workload. |
| Cooldown Time | The minimum interval between two consecutive scale-out activities. During cooldown, scale-out conditions are monitored but no action is taken. After cooldown ends, the next breach triggers a new scale-out. |
| Effective Time Period | The time window in which the rule is active. Outside this window, the rule does not trigger even if conditions are met. Leave blank to keep the rule active 24 hours a day. |
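The interplay of Statistical Period, Repetitions That Trigger Scale Out, and Cooldown Time can be pictured with a simplified model. This is not EMR's implementation — the metric values, threshold, and period counts below are made up — but it shows how consecutive breaches accumulate and how cooldown suppresses triggers:

```java
import java.util.List;

public class LoadTrigger {
    // Count how many scale-outs a rule would fire over a series of per-period
    // metric aggregates: `repetitions` consecutive threshold breaches are
    // required, and `cooldownPeriods` evaluation periods are skipped after
    // each trigger (conditions are monitored but no action is taken).
    static int countTriggers(List<Double> periodValues, double threshold,
                             int repetitions, int cooldownPeriods) {
        int triggers = 0, consecutive = 0, cooldown = 0;
        for (double value : periodValues) {
            if (cooldown > 0) {          // in cooldown: no action
                cooldown--;
                continue;
            }
            consecutive = value > threshold ? consecutive + 1 : 0;
            if (consecutive >= repetitions) {
                triggers++;              // rule fires, nodes are added
                consecutive = 0;
                cooldown = cooldownPeriods;
            }
        }
        return triggers;
    }

    public static void main(String[] args) {
        // Threshold 80, 3 consecutive breaches required, cooldown of 2 periods.
        List<Double> metric = List.of(85.0, 90.0, 95.0, 99.0, 99.0, 85.0, 85.0, 85.0);
        System.out.println(countTriggers(metric, 80.0, 3, 2)); // prints 2
    }
}
```

Note how the two periods after the first trigger breach the threshold but are ignored; the breach streak must rebuild from zero once cooldown ends.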
YARN load metrics
For queue-related metrics, queue_name specifies the queue name. The default value is root. Specify a custom queue name as needed.
For partition-related metrics, partition_name specifies the partition name. This field cannot be left blank.

| EMR auto scaling metric | Service | Description |
|---|---|---|
| yarn_resourcemanager_queue_AvailableVCores | YARN | The number of available virtual CPU cores for the specified queue. |
| yarn_resourcemanager_queue_PendingVCores | YARN | The number of pending virtual CPU cores for the specified queue. |
| yarn_resourcemanager_queue_AllocatedVCores | YARN | The number of allocated virtual CPU cores for the specified queue. |
| yarn_resourcemanager_queue_ReservedVCores | YARN | The number of reserved virtual CPU cores for the specified queue. |
| yarn_resourcemanager_queue_AvailableMB | YARN | The amount of available memory for the specified queue. |
| yarn_resourcemanager_queue_PendingMB | YARN | The amount of pending memory for the specified queue. |
| yarn_resourcemanager_queue_AllocatedMB | YARN | The amount of allocated memory for the specified queue. |
| yarn_resourcemanager_queue_ReservedMB | YARN | The amount of reserved memory for the specified queue. |
| yarn_resourcemanager_queue_AppsRunning | YARN | The number of running jobs in the specified queue. |
| yarn_resourcemanager_queue_AppsPending | YARN | The number of pending jobs in the specified queue. |
| yarn_resourcemanager_queue_AppsKilled | YARN | The number of killed jobs in the specified queue. |
| yarn_resourcemanager_queue_AppsFailed | YARN | The number of failed jobs in the specified queue. |
| yarn_resourcemanager_queue_AppsCompleted | YARN | The number of completed jobs in the specified queue. |
| yarn_resourcemanager_queue_AppsSubmitted | YARN | The number of submitted jobs in the specified queue. |
| yarn_resourcemanager_queue_AllocatedContainers | YARN | The number of allocated containers in the specified queue. |
| yarn_resourcemanager_queue_PendingContainers | YARN | The number of pending containers in the specified queue. |
| yarn_resourcemanager_queue_ReservedContainers | YARN | The number of reserved containers in the specified queue. |
| yarn_resourcemanager_queue_AvailableMBPercentage | YARN | The percentage of available memory in the specified queue. Calculated as: AvailableMemory / Total Memory. Available for EMR V3.43.0, V5.9.0, and later minor versions. |
| yarn_resourcemanager_queue_PendingContainersRatio | YARN | The ratio of pending containers to allocated containers in the specified queue. Calculated as: PendingContainers / AllocatedContainers. Available for EMR V3.43.0, V5.9.0, and later minor versions. |
| yarn_resourcemanager_queue_AvailableVCoresPercentage | YARN | The percentage of available CPU cores in the specified queue. Calculated as: AvailableVCores / (ReservedVCores + AvailableVCores + AllocatedVCores) * 100. Available for EMR V3.43.0, V5.9.0, and later minor versions. |
| yarn_cluster_numContainersByPartition | YARN | The number of containers for the specified partition. Available for EMR V3.44.0, V5.10.0, and later minor versions. |
| yarn_cluster_usedMemoryMBByPartition | YARN | The amount of used memory for the specified partition. Available for EMR V3.44.0, V5.10.0, and later minor versions. |
| yarn_cluster_availMemoryMBByPartition | YARN | The amount of available memory for the specified partition. Available for EMR V3.44.0, V5.10.0, and later minor versions. |
| yarn_cluster_usedVirtualCoresByPartition | YARN | The number of used CPU cores for the specified partition. Available for EMR V3.44.0, V5.10.0, and later minor versions. |
| yarn_cluster_availableVirtualCoresByPartition | YARN | The number of available CPU cores for the specified partition. Available for EMR V3.44.0, V5.10.0, and later minor versions. |
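As a worked example of one of the formulas in the table, yarn_resourcemanager_queue_AvailableVCoresPercentage evaluates as follows (the core counts are illustrative):

```java
public class VCoresPercentage {
    // AvailableVCores / (ReservedVCores + AvailableVCores + AllocatedVCores) * 100,
    // per the metric description in the table above.
    static double availableVCoresPercentage(int available, int reserved, int allocated) {
        int total = reserved + available + allocated;
        return total == 0 ? 0.0 : 100.0 * available / total;
    }

    public static void main(String[] args) {
        // 20 available, 4 reserved, 56 allocated: 20 / 80 * 100 = 25%.
        System.out.println(availableVCoresPercentage(20, 4, 56)); // prints 25.0
    }
}
```

A scale-out rule on this metric would use a condition such as "less than 20" — i.e., trigger when under 20% of the queue's cores remain available.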
Troubleshooting
Scale-out fails or succeeds only partially
Scale-out failures typically have two causes:
ECS resource shortage: The requested instance type has insufficient inventory in the selected zone. ECS resource availability is dynamic and can be temporarily exhausted during high-demand periods.
Single instance type configured: If only one instance type is set for the node group and that type runs out, the entire scale-out request fails.
To improve scale-out reliability:
Enable Best-effort Delivery: When only some of the requested instances are available, EMR delivers those instead of failing the whole operation.
Configure multiple instance types: Add up to 10 instance types to the task node group. EMR tries each type in order — if the first is unavailable, it falls back to the next. For details on adding instance types, see Create a node group.
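The fallback order can be sketched as follows. This is illustrative only — EMR performs the selection internally, and the instance type names and availability below are examples:

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class InstanceFallback {
    // Return the first configured instance type that currently has inventory,
    // mirroring how EMR walks the configured list in order during a scale-out.
    static Optional<String> pickInstanceType(List<String> configured, Set<String> available) {
        for (String type : configured) {
            if (available.contains(type)) {
                return Optional.of(type);
            }
        }
        return Optional.empty(); // every configured type is out of stock
    }

    public static void main(String[] args) {
        List<String> configured = List.of("ecs.g7.xlarge", "ecs.g6.xlarge", "ecs.c7.xlarge");
        Set<String> available = Set.of("ecs.g6.xlarge", "ecs.c7.xlarge"); // g7 exhausted
        System.out.println(pickInstanceType(configured, available).orElse("none")); // prints ecs.g6.xlarge
    }
}
```

With a single configured type, the empty result corresponds to a failed scale-out; each additional type reduces that risk.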
What's next
To monitor scaling activity, check the Auto Scaling tab of your cluster.
To create or modify the task node group where rules are applied, see Create a node group.
To programmatically manage auto scaling policies, see PutAutoScalingPolicy.