Distributed applications routinely need to coordinate sequences of operations that span seconds, minutes, or hours: order processing across payment, inventory, and fulfilment services; document ingestion through OCR, extraction, and classification stages; and periodic data preparation jobs that chain multiple compute tasks. Implementing these flows directly inside application code creates well-documented operational fragility: state must be persisted between steps, retries must be coded around every failure mode, timeouts must be handled at every boundary, and visibility into in-flight executions usually depends on ad-hoc logging.
Alibaba Cloud Serverless Workflow (CloudFlow) addresses this class of problem by externalising the orchestration logic from application code into a managed coordinator. The service executes flows defined in a declarative specification language, persists execution state, and handles retries, timeouts, branching, and parallel execution as native primitives. This article documents the architecture, the structure of flow definitions, and the engineering decisions that govern reliable multi-step orchestration on the service.

Figure 1: Serverless Workflow orchestration architecture.
Serverless Workflow executes flows defined in the Flow Definition Language (FDL), a YAML-based specification in which each step is a named state with an explicit type: task, choice, parallel, foreach, pass, succeed, fail, or wait. Steps are connected by named transitions rather than positional sequence, allowing non-linear topologies without manual control-flow code. Each execution of a flow is assigned a unique execution ID and persisted by the service from start to terminal state, with intermediate inputs and outputs of every step retained in execution history.
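The idea of named steps joined by named transitions can be sketched in Python. This is a minimal illustration, not FDL syntax: the dictionary layout, the field names (`start`, `steps`, `next`, `branches`), and the `validate_flow` helper are all assumptions made for this example.

```python
# Hypothetical in-memory model of a flow; the field names are illustrative
# and do not reproduce the actual FDL schema.
STEP_TYPES = {"task", "choice", "parallel", "foreach",
              "pass", "succeed", "fail", "wait"}

FLOW = {
    "start": "validate_order",
    "steps": {
        "validate_order": {"type": "task", "next": "route"},
        "route": {"type": "choice", "branches": ["charge", "reject"]},
        "charge": {"type": "task", "next": "done"},
        "reject": {"type": "fail"},
        "done": {"type": "succeed"},
    },
}


def validate_flow(flow):
    """Return a list of problems: unknown step types or dangling transitions."""
    steps = flow["steps"]
    problems = []
    if flow["start"] not in steps:
        problems.append(f"start step {flow['start']!r} is not defined")
    for name, step in steps.items():
        if step["type"] not in STEP_TYPES:
            problems.append(f"{name}: unknown type {step['type']!r}")
        # Collect every named transition leaving this step.
        targets = list(step.get("branches", []))
        if "next" in step:
            targets.append(step["next"])
        for target in targets:
            if target not in steps:
                problems.append(f"{name}: transition to undefined step {target!r}")
    return problems
```

Because transitions are names rather than positions, reordering the entries in `steps` does not change the topology; only the `next` and `branches` references do.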
State persistence is the property that distinguishes managed orchestration from chained function invocations. If a task step invokes a Function Compute function that fails or times out, the execution does not terminate; it pauses at that step, the failure is recorded, and the retry policy attached to the step governs subsequent attempts. After all retries are exhausted, the flow either transitions to a configured catch handler or terminates in a failed state with the full execution trace queryable through the service console or the API. This removes the requirement for application code to implement its own state machine, durable storage, or replay logic.
The task step is the workhorse of most orchestrations. It invokes an external service, most commonly a Function Compute function, and waits for the response before transitioning. The integration uses the function's qualified ARN and supports both synchronous invocation, in which the function executes within the workflow's wait, and asynchronous invocation with a callback token for operations that exceed Function Compute's synchronous execution limits.
For long-running operations such as batch document processing, large file transformations, or external API calls that do not return immediately, the callback pattern is the correct choice. The task step issues an asynchronous invocation with a task token, the function or external system performs its work, and a separate API call to ReportTaskSucceeded or ReportTaskFailed resumes the flow with the result payload. The workflow itself can wait in this state for an extended period without consuming function execution time, since the wait is held by the orchestrator rather than a running function instance.
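The token handshake behind the callback pattern can be modelled in a few lines of Python. This is a local simulation only: `CallbackRegistry` and its methods are hypothetical stand-ins for the orchestrator's bookkeeping and do not call any Alibaba Cloud API.

```python
import uuid


class CallbackRegistry:
    """Hypothetical stand-in for the orchestrator's table of waiting steps."""

    def __init__(self):
        self._pending = {}  # task token -> name of the step that is waiting
        self.results = {}   # step name -> (status, payload)

    def dispatch(self, step_name):
        """Park a step and return the token handed to the external worker."""
        token = uuid.uuid4().hex
        self._pending[token] = step_name
        return token

    def report_succeeded(self, token, output):
        """Resume the waiting step with a result; a token is usable only once."""
        step = self._pending.pop(token)  # raises KeyError if unknown or reused
        self.results[step] = ("succeeded", output)
        return step

    def report_failed(self, token, error):
        """Resume the waiting step with an error instead of a result."""
        step = self._pending.pop(token)
        self.results[step] = ("failed", error)
        return step

    def is_waiting(self, step_name):
        return step_name in self._pending.values()
```

In this sketch nothing runs between `dispatch` and the report call, which mirrors the point made above: the wait is a table entry held by the coordinator, not a running function instance.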
Input and output mapping between steps uses a JSONPath-based projection mechanism. The InputMapping field of a step extracts a subset of the execution context to pass as the task input, and the OutputMapping field merges the task result back into the execution context under a named key. This contract keeps step interfaces narrow and prevents the entire execution context from being passed implicitly through every function, a pattern that becomes unmaintainable as flows grow beyond a handful of steps.
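The projection contract can be illustrated with a small Python sketch. It assumes a deliberately reduced path syntax (only dotted keys such as `$.order.id`, not full JSONPath), and the helper names `resolve`, `project_input`, and `merge_output` are invented for this example.

```python
def resolve(path, context):
    """Resolve a simplified '$.a.b' path against a nested dict.

    Only dotted keys are supported here; real JSONPath is much richer.
    """
    if not path.startswith("$"):
        raise ValueError(f"path must start with '$': {path!r}")
    value = context
    for key in path[1:].strip(".").split("."):
        if key:  # a bare '$' yields one empty key and returns the whole context
            value = value[key]
    return value


def project_input(mapping, context):
    """Build the task input from selected parts of the execution context."""
    return {target: resolve(source, context) for target, source in mapping.items()}


def merge_output(context, key, result):
    """Return a new context with the task result stored under a named key."""
    merged = dict(context)
    merged[key] = result
    return merged
```

For example, with a context of `{"order": {"id": "A1", "total": 40}, "customer": {"tier": "gold"}}`, the mapping `{"orderId": "$.order.id"}` passes only `{"orderId": "A1"}` to the task, and the customer data never reaches the function.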
Choice steps implement conditional branching using expression-based predicates evaluated against the execution context. Each branch declares a condition and a target step; the first matching branch is taken, with an optional default branch handling the unmatched case. Conditions support comparison operators, logical composition, and JSONPath references into the input payload, enabling routing decisions to be expressed declaratively rather than inside a function whose sole purpose is to return the name of the next step.
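First-match evaluation with a default can be sketched as follows. The conditions are plain Python callables here rather than the service's expression language, and `choose_next` is a hypothetical helper name.

```python
def choose_next(branches, context, default=None):
    """Return the target of the first branch whose condition holds.

    `branches` is an ordered list of (condition, target_step) pairs.
    """
    for condition, target in branches:
        if condition(context):
            return target
    if default is None:
        raise LookupError("no branch matched and no default is configured")
    return default


# Order matters: an order of 1500 satisfies both conditions below,
# but only the first matching branch is taken.
BRANCHES = [
    (lambda ctx: ctx["total"] > 1000, "manual_review"),
    (lambda ctx: ctx["total"] > 100, "standard_checkout"),
]
```

Listing the most specific condition first is the usual way to keep overlapping predicates unambiguous under first-match semantics.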
Parallel steps execute multiple branches concurrently and wait for all branches to complete before transitioning. Each branch is an independent sub-flow with its own steps and error handling. Output from the parallel step is an ordered array of branch results, indexed by branch position. This pattern fits scenarios such as enriching a record from multiple independent sources simultaneously, or running validation, scoring, and notification stages in parallel, where none depend on the others.
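The join-all behaviour and position-indexed output can be approximated locally with a thread pool. This is a sketch of the semantics only: each branch is modelled as a Python callable, and `run_parallel` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(branches, payload):
    """Run every branch concurrently and return results in branch order.

    `branches` must be a non-empty list of callables. The call returns only
    after all branches have finished; if a branch raised, its exception is
    re-raised when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = [pool.submit(branch, payload) for branch in branches]
        # Collect by submission order, not completion order, so that
        # result[i] always belongs to branches[i].
        return [future.result() for future in futures]
```

A call such as `run_parallel([validate, score, notify], record)`, with three independent callables of your own, returns a three-element list in which position 0 is always the validation result, regardless of which branch finished first.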
Foreach steps iterate over an input array and execute a sub-flow for each element, with configurable concurrency. Setting concurrency above one fans out the iteration across multiple workers; setting it to one enforces sequential processing where downstream rate limits or ordering constraints require it. The aggregated output is an array of sub-flow results in input order. Foreach is the appropriate construct for batch operations: processing each item in an uploaded file, dispatching notifications to a list of recipients, or applying the same transformation to every record in a partition.
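A bounded fan-out that preserves input order can be sketched with the standard library. The sub-flow is modelled as a callable, and `run_foreach` and its `concurrency` parameter are assumptions for illustration rather than service fields.

```python
from concurrent.futures import ThreadPoolExecutor


def run_foreach(items, sub_flow, concurrency=1):
    """Apply `sub_flow` to every item with at most `concurrency` in flight.

    Results come back in input order even when later items finish first.
    With concurrency=1 the items are processed strictly one after another.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Executor.map yields results in the order of the input iterable.
        return list(pool.map(sub_flow, items))
```

Choosing `concurrency` from the downstream service's rate limit, rather than from the number of items, keeps the fan-out from overwhelming the dependency.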
Every task step accepts a retry array in which each entry specifies an error code pattern, a maximum attempt count, an initial interval in seconds, and a backoff multiplier. Errors are matched against entries in order, allowing different policies for different failure classes. For example, transient network errors are retried five times with exponential backoff, while validation errors are caught immediately without retry. The retry interval grows by the multiplier after each attempt up to a configurable maximum interval, preventing unbounded backoff on persistently failing dependencies.
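The interaction between ordered matching, the multiplier, and the interval cap can be worked through in Python. The error codes (`Network.*`, `Validation.*`), the dictionary keys, and both helper names are hypothetical; glob-style matching via `fnmatch` is an assumption about how a pattern might be interpreted, not the service's documented matching rule.

```python
import fnmatch

# Hypothetical policies, checked in order; the first match wins.
POLICIES = [
    {"errors": ["Validation.*"], "max_attempts": 0,
     "interval_seconds": 0, "multiplier": 1.0},
    {"errors": ["Network.*"], "max_attempts": 5,
     "interval_seconds": 2, "multiplier": 2.0},
]


def select_policy(error_code, policies):
    """Return the first policy with a pattern matching the error code."""
    for policy in policies:
        if any(fnmatch.fnmatchcase(error_code, pattern)
               for pattern in policy["errors"]):
            return policy
    return None


def backoff_intervals(max_attempts, interval_seconds, multiplier,
                      max_interval_seconds):
    """List the wait before each retry, capped at the maximum interval."""
    intervals = []
    delay = interval_seconds
    for _ in range(max_attempts):
        intervals.append(min(delay, max_interval_seconds))
        delay *= multiplier
    return intervals
```

With an initial interval of 2 seconds, a multiplier of 2, and a cap of 20 seconds, five retries wait 2, 4, 8, 16, and 20 seconds: the fifth wait would have been 32 seconds without the cap.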
Catch handlers operate at the step level and route specific error patterns to a recovery step rather than terminating the execution. A common pattern pairs a retry policy that exhausts attempts for transient failures with a catch handler that routes permanent failures to a compensation step: releasing reserved inventory, refunding a charge, or writing the failed payload to a dead-letter queue for later inspection. This separation between retryable and non-retryable failure paths is the foundation of correct exception handling in long-running flows.
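The split between retryable and non-retryable paths can be simulated in plain Python. The exception classes, the `run_with_recovery` helper, and the shape of the returned dictionary are all invented for this sketch, and backoff delays are omitted to keep it short.

```python
class TransientError(Exception):
    """Stands in for a failure class that is worth retrying."""


class PermanentError(Exception):
    """Stands in for a failure class that should go straight to recovery."""


def run_with_recovery(task, payload, max_retries, catch_target):
    """Retry transient failures, then hand off to a named recovery step."""
    retries = 0
    while True:
        try:
            output = task(payload)
            return {"status": "succeeded", "output": output, "retries": retries}
        except TransientError as exc:
            if retries < max_retries:
                retries += 1
                continue  # try the same step again
            error = exc   # retries exhausted: fall through to the catch path
        except PermanentError as exc:
            error = exc   # never retried
        return {"status": "caught", "next": catch_target,
                "error": str(error), "retries": retries}
```

The happy-path task contains no compensation logic at all; the caller decides which step (`catch_target`) performs the refund or release.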
Step timeouts are declared independently of the underlying function's timeout and bound the maximum time the workflow will wait for a task to complete. The workflow timeout is configured at the flow definition level and bounds the total execution duration, after which the flow transitions to a timed-out state and any configured cleanup steps are invoked. These bounds should be derived from measured execution profiles. The 95th percentile of historical step duration is a defensible starting point for step timeouts, with a margin added for variance.
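Deriving a step timeout from measured durations can be sketched as below. The nearest-rank percentile method and the 20 percent default margin are assumptions chosen for illustration, not recommendations from the service, and `suggest_timeout` is a hypothetical helper.

```python
import math


def suggest_timeout(durations_seconds, percentile=95, margin_percent=20):
    """Suggest a step timeout: a nearest-rank percentile plus a margin.

    `percentile` and `margin_percent` are whole numbers so the rank
    calculation stays in exact integer arithmetic.
    """
    if not durations_seconds:
        raise ValueError("at least one measured duration is required")
    ordered = sorted(durations_seconds)
    # Nearest-rank: smallest rank r with r / n >= percentile / 100.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    base = ordered[rank - 1]
    return math.ceil(base * (100 + margin_percent) / 100)
```

For 100 samples of 1 to 100 seconds, the 95th-percentile sample is 95 seconds and the suggested timeout is 114 seconds. With very few samples the percentile collapses to the maximum observed value, so the margin does most of the work.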
Each execution carries a full audit trail accessible through the service console: the input and output of every step, the timing of each transition, the contents of any thrown errors, and the retry history of failed attempts. This trace is the primary tool for diagnosing failures in production flows, eliminating the need to correlate function logs across multiple services to reconstruct what an execution did and why it ended where it did.
Execution metrics (count of running executions, success and failure rates per flow, step duration percentiles, and retry counts) are exported to CloudMonitor and can be wired to alarm rules. A common operational baseline is to alarm on a sustained failure rate above a threshold, on execution duration exceeding the expected upper bound, and on the count of executions stuck in a running state beyond the configured workflow timeout. Logs from underlying Function Compute invocations are emitted to Log Service with the execution ID and step name available as queryable fields, allowing function-level diagnostic logs to be joined to the workflow execution context.
Flows are started by a direct API call to StartExecution, by EventBridge rules that translate cloud events into flow inputs, by Function Compute triggers that invoke a flow from within a function, or on a schedule via the CloudFlow scheduler. EventBridge integration is the typical pattern for event-driven orchestrations. An object uploaded to Object Storage Service emits an event, an EventBridge rule pattern matches the bucket and prefix, and the matched event becomes the input payload of a new workflow execution. This decouples the source of the trigger from the orchestration logic and allows the same flow to be invoked from multiple event sources without modification.
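The bucket-and-prefix matching step can be illustrated with a small Python function. The event shape used here (`id`, `data.bucket`, `data.key`) is a simplified, hypothetical layout and not the actual Object Storage Service event schema; `match_upload` is likewise an invented name.

```python
def match_upload(event, bucket, prefix):
    """Return a flow input if the event matches the rule, otherwise None.

    The event layout is a simplified assumption for illustration only.
    """
    data = event.get("data", {})
    if data.get("bucket") != bucket:
        return None
    key = data.get("key", "")
    if not key.startswith(prefix):
        return None
    # Only the fields the flow needs are forwarded as its input payload.
    return {"bucket": bucket, "objectKey": key, "eventId": event.get("id")}
```

Because the flow only sees the returned dictionary, a second event source can start the same flow as long as it is mapped to the same input shape.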
Outbound integration with services beyond Function Compute is handled through task steps that invoke Message Service queues, HTTP endpoints, or other Alibaba Cloud APIs through service integrations. For external systems that operate over webhooks, the callback pattern described earlier is the standard mechanism: the workflow dispatches the request with a task token, the external system processes the request and posts the result back to the ReportTaskSucceeded API with the same token, and the flow resumes at the next step.
Serverless Workflow externalises the orchestration concerns of distributed multi-step processes (state persistence, retry, timeout, branching, parallelism, and observability) from application code into a managed coordinator. The declarative flow definition becomes the source of truth for process logic, with individual function code reduced to single-purpose units that perform one task and return a result. For flows that span more than a handful of steps, that need to survive transient failures of their dependencies, or that operate over time horizons longer than a single function invocation can hold, this separation typically results in a smaller code surface, clearer failure modes, and execution traces that are diagnosable without correlating logs across services.
Engineers adopting the service should size workflow timeouts against measured step latencies rather than nominal estimates, place compensation logic behind catch handlers rather than embedding it in happy-path function code, and prefer the callback pattern over synchronous task invocation for any step that may approach Function Compute's synchronous execution limit. Iteration with foreach should be sized against the rate limits of downstream services rather than the workflow's own parallelism ceiling, since concurrent fan-out is most often constrained by the dependency rather than the orchestrator.
Disclaimer: The views expressed herein are for reference only and don’t necessarily represent the official views of Alibaba Cloud.