DevOps and Application Services on Alibaba Cloud: How Modern Software Gets Built

This article traces how a change moves from a developer commit to running, observable production software on Alibaba Cloud, and the architectural decisions that shape each transition along the way.

Software delivery has shifted from periodic releases of monolithic artefacts to continuous delivery of small, independently deployable units across heterogeneous runtimes. What ties this shift together is a discipline rather than a tool: the practice of treating the entire path from source to production as a single engineered system. Alibaba Cloud's application and DevOps services are designed around this premise, sharing a common identity model through Resource Access Management, a common observability surface through Application Real-Time Monitoring Service and Log Service, and a common networking fabric through Virtual Private Cloud. The result is that pipelines, container clusters, function runtimes, and application registries can be composed without the need for manual integration glue.

Figure 1: DevOps and application services architecture on Alibaba Cloud.

A change begins in a source repository on Apsara DevOps, where branch protection rules require a pull request review and successful pipeline checks before merging. The pipeline definition lives alongside the application code, so the rules governing how changes are admitted evolve under the same version control as the changes themselves. On merge, the pipeline produces a versioned container image pushed to Container Registry. The distinction between mutable tags such as 'latest' and the immutable image digest is consequential; downstream deployment objects should reference the digest, because it cannot be reassigned and therefore provides an unambiguous rollback target. Static analysis, dependency scanning, and image vulnerability scanning belong in this stage as gating checks rather than advisory warnings; treating them as advisory predictably produces a backlog no one resolves.
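The tag-versus-digest distinction can be enforced mechanically. The following is a minimal sketch of such a gating check, assuming image references arrive as plain strings; the helper names and the registry host in the usage example are hypothetical and not part of any Alibaba Cloud API.

```python
import re

# An image digest suffix: "@sha256:" followed by 64 lowercase hex characters.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True when the reference ends in an immutable sha256 digest."""
    return DIGEST_RE.search(image_ref) is not None


def unpinned_images(image_refs: list[str]) -> list[str]:
    """Return the references that rely on a mutable tag (or on no tag at all)."""
    return [ref for ref in image_refs if not is_digest_pinned(ref)]


if __name__ == "__main__":
    # Hypothetical registry host and repository name, for illustration only.
    refs = [
        "registry.example.com/shop/checkout:latest",
        "registry.example.com/shop/checkout@sha256:" + "a" * 64,
    ]
    for ref in unpinned_images(refs):
        print(f"not pinned by digest: {ref}")
```

A pipeline step could fail the build when `unpinned_images` returns a non-empty list, which keeps the check gating rather than advisory.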

A modern application is rarely a single workload type, and the platform reflects this. Container Service for Kubernetes (ACK) is the default destination for long-running stateless and stateful workloads, with node pools structured to reflect workload heterogeneity: stateless tiers on burstable instances, stateful workloads on instances with local NVMe storage, and GPU workloads on tainted pools. Mixing classes in a single pool makes scheduling unpredictable and capacity hard to attribute. Enterprise Distributed Application Service (EDAS) sits a layer above ACK and Elastic Compute Service (ECS), providing application-aware abstractions for teams whose services use Spring Cloud or Dubbo; it handles service discovery, configuration distribution, and traffic shifting without requiring the team to operate the registry and configuration centre directly. Function Compute covers the remaining surface: event-driven and intermittently invoked workloads where always-on capacity is uneconomic, such as object-storage-triggered transforms, message processing, and webhook receivers. The right runtime for each component is determined by its load profile, not by a uniform policy.
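One way to read the paragraph above is as a small decision rule. The sketch below is a deliberate simplification: the `Workload` fields, the framework strings, and the order of the checks are assumptions made for illustration, and a real placement decision would weigh more than three attributes.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    event_driven: bool   # invoked by events rather than serving continuously
    intermittent: bool   # idle for long stretches between invocations
    framework: str = ""  # hypothetical labels: "spring-cloud", "dubbo", or ""


def choose_runtime(workload: Workload) -> str:
    """Pick a runtime from the load profile, following the article's heuristic."""
    # Event-driven and intermittently invoked: always-on capacity is uneconomic.
    if workload.event_driven and workload.intermittent:
        return "Function Compute"
    # Spring Cloud or Dubbo services benefit from application-aware abstractions.
    if workload.framework in {"spring-cloud", "dubbo"}:
        return "EDAS"
    # Default destination for long-running stateless and stateful workloads.
    return "ACK"


if __name__ == "__main__":
    thumbnailer = Workload("thumbnailer", event_driven=True, intermittent=True)
    print(thumbnailer.name, "->", choose_runtime(thumbnailer))
```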

Declarative deployment, where the pipeline writes the desired state into a target system and a controller reconciles toward it, is the default for ACK workloads and increasingly for EDAS-managed applications. The deployment object becomes the audit record: the running version is whatever the controller most recently reconciled, eliminating the failure mode where the pipeline reports success but the running state diverges silently. Transition mechanics matter as much as state representation. Rolling updates replace instances in batches and bound blast radius; blue-green deployments run both versions concurrently and shift traffic in a single cutover, simplifying rollback at the cost of double capacity during the window; canary deployments route a small traffic fraction to the new version and progressively increase it. The right pattern follows the recovery objective: canary where detection latency dominates risk, and blue-green where rollback latency dominates. In all cases, every release that reaches production should be redeployable through the same pipeline mechanism that produced it; a rollback that requires reconstructing a previous build is, operationally, a reverse-direction release.
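The canary pattern described above can be sketched as a loop over traffic weights with an abort condition. This is an illustrative outline only: `run_canary`, the step percentages, and the error-rate threshold are hypothetical, and `observe_error_rate` stands in for whatever metric query a real rollout controller would issue.

```python
from typing import Callable, Optional, Sequence


def run_canary(
    steps: Sequence[int],
    observe_error_rate: Callable[[int], float],
    max_error_rate: float,
) -> Optional[int]:
    """Advance through canary traffic weights, stopping at the first bad step.

    Returns None when every step stayed within the error budget, otherwise
    the traffic weight (in percent) at which the rollout was aborted.
    """
    for weight in steps:
        # A real controller would shift traffic to `weight` percent here and
        # wait for an observation window before reading the metric.
        if observe_error_rate(weight) > max_error_rate:
            return weight
    return None


if __name__ == "__main__":
    # Simulated metric: the new version starts failing once it takes >= 25%.
    aborted_at = run_canary(
        steps=[5, 25, 50, 100],
        observe_error_rate=lambda w: 0.05 if w >= 25 else 0.0,
        max_error_rate=0.01,
    )
    print("aborted at weight:", aborted_at)
```

Smaller early steps shorten the exposure before a regression is detected, which is why the pattern suits cases where detection latency dominates risk.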

A release does not end when the deployment object reports success. It ends when its operational consequences have been observed long enough to conclude the change is behaving as expected. ARMS provides distributed tracing, application metrics, and frontend monitoring through a shared trace context. For JVM workloads, the agent attaches at startup without code modification, while OpenTelemetry is the preferred path for Go, Node.js, and Python. Sampling strategy carries direct cost and signal-quality implications. Head-based sampling at a fixed rate produces predictable telemetry volume but discards tail-latency outliers; tail-based sampling defers the decision until the trace completes, retaining slow and failed traces at full fidelity while discarding fast successful ones, which makes it the right default for SLO and error-investigation workloads. Alerts should be structured around symptoms such as error rate, latency percentile, and saturation, routed to paging rotations, and kept separate from cause signals such as garbage collection or thread pool exhaustion, which belong on dashboards. Mixing the two categories produces alert fatigue and erodes on-call response quality.
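The tail-based sampling decision can be illustrated with a short sketch. It is not the ARMS or OpenTelemetry implementation; the `Trace` type, the `keep_trace` function, and its parameters are hypothetical. The `baseline_rate` parameter is an added assumption: a value of 0.0 reproduces the behaviour described above (fast successful traces are discarded), while a small positive value retains a sample of them.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(
    trace: Trace,
    latency_threshold_ms: float,
    baseline_rate: float = 0.0,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide, after the trace has completed, whether to retain it."""
    # Failed traces are always kept for error investigation.
    if trace.has_error:
        return True
    # Slow traces are always kept so tail latency stays visible.
    if trace.duration_ms >= latency_threshold_ms:
        return True
    # Fast successful traces are kept only at the (assumed) baseline rate.
    return rng() < baseline_rate


if __name__ == "__main__":
    slow = Trace("t-1", duration_ms=1800.0, has_error=False)
    fast = Trace("t-2", duration_ms=12.0, has_error=False)
    print(keep_trace(slow, latency_threshold_ms=500.0))
    print(keep_trace(fast, latency_threshold_ms=500.0))
```

The decision needs the finished trace, so spans must be buffered until completion; that buffering is the operational cost tail-based sampling trades for signal quality.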

The services above are useful individually. They function as a platform through three properties that are easy to overlook and expensive to retrofit. The first is identity. Every pipeline run, deployment, and function invocation acts under a RAM role scoped to the minimum permissions required, with credentials rotated on a defined cycle; cross-environment promotion belongs to a dedicated promotion role, not a shared role with access to both source and target as if they were the same environment. The second is environmental parity. Development, staging, and production should differ in scale and data, not topology; Resource Orchestration Service (ROS) templates applied across environments with environment-scoped parameters maintain parity by construction and surface drift as a reviewable diff. The third is cost attribution. Tags applied consistently at the project, application, and environment level allow Billing and Cost Management to map spend to the teams accountable for it; untagged resources should be detected through scheduled queries and resolved at source, since retrospective re-tagging cannot recover historical attribution.
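The scheduled detection of untagged resources mentioned above might look like the following sketch. The inventory is assumed to be a plain mapping from resource ID to tag dictionary; how that inventory is fetched is left out, and the tag keys and resource IDs shown are hypothetical.

```python
# Tag keys assumed for illustration, mirroring the project, application,
# and environment levels named in the text.
REQUIRED_TAGS = ("project", "application", "environment")


def missing_tags(resource_tags: dict[str, str]) -> list[str]:
    """Return the required tag keys that are absent or blank on one resource."""
    return [key for key in REQUIRED_TAGS if not resource_tags.get(key, "").strip()]


def untagged_report(inventory: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each non-compliant resource ID to the tag keys it is missing."""
    report = {}
    for resource_id, tags in inventory.items():
        missing = missing_tags(tags)
        if missing:
            report[resource_id] = missing
    return report


if __name__ == "__main__":
    # Hypothetical resource IDs and tag values.
    inventory = {
        "i-001": {"project": "shop", "application": "checkout", "environment": "prod"},
        "i-002": {"project": "shop", "environment": ""},
    }
    for resource_id, missing in untagged_report(inventory).items():
        print(resource_id, "is missing", ", ".join(missing))
```

Running such a check on a schedule shortens the window in which spend accrues without an owner, which matters because later re-tagging does not restore historical attribution.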

DevOps is best understood as an engineering discipline applied to the path from source code to running software. A platform built on Apsara DevOps, ACK, EDAS, ARMS, Function Compute, and the supporting identity, networking, and observability services provides the substrate on which that discipline can be exercised, but the substrate alone does not produce good outcomes. Teams extending this architecture should consider three additions: GitOps controllers such as Argo CD running inside ACK, for fleets requiring stronger audit; Alibaba Cloud Service Mesh (ASM), where service portfolios have grown large enough to justify mutual TLS and mesh-level traffic shifting; and EventBridge as an event router, for portfolios increasingly composed of event-driven workloads.

Disclaimer: The views expressed herein are for reference only and don’t necessarily represent the official views of Alibaba Cloud.
