As cloud environments grow in complexity, traditional click-ops monitoring no longer cuts it. Manually configuring dashboards, alerts, and log queries across hundreds of microservices? It's error-prone, inconsistent, and scales poorly. On Alibaba Cloud, we can address this with Observability as Code (OaC).
Let’s dive into how Alibaba Cloud lets you treat your observability setup with the same rigor as your application code.
Think Infrastructure as Code (IaC) with tools like Terraform or ROS, but applied specifically to your observability stack. OaC means defining, managing, and version-controlling your monitoring and observability configurations (dashboards, alerts, log collection rules, tracing setups) in declarative code files.
Instead of clicking through the Alibaba Cloud console for every new service or environment, you use a format such as YAML, JSON, or HCL to describe what to monitor, how to alert, and where the logs should flow. This code is checked into Git, reviewed, tested, and deployed alongside your application code, following the same lifecycle.
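To make the idea of a declarative definition concrete, here is a minimal Python sketch. The `AlertRule` class and its fields are hypothetical and tool-agnostic (they are not a ROS or Terraform schema); the point is only that an alert becomes plain data that can be serialized deterministically and committed to Git.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass(frozen=True)
class AlertRule:
    """A hypothetical, tool-agnostic description of one alert."""
    name: str
    expression: str   # condition that must hold for the alert to fire
    duration: str     # how long the condition must hold, e.g. "1m"
    labels: dict = field(default_factory=dict)


def to_document(rules):
    """Serialize rules with a stable order so Git diffs stay small."""
    ordered = sorted(rules, key=lambda r: r.name)
    return json.dumps([asdict(r) for r in ordered], indent=2, sort_keys=True)


rules = [
    AlertRule(
        name="MyApp-ErrorRate-High-PROD",
        expression="error_ratio > 0.05",
        duration="1m",
        labels={"severity": "critical", "team": "backend"},
    ),
    AlertRule(name="MyApp-Down-PROD", expression="up == 0", duration="2m"),
]
document = to_document(rules)
print(document)
```

Because the output does not depend on the order in which rules are declared, two engineers describing the same alerts produce byte-identical files, which keeps reviews focused on real changes.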
1. Consistency & Standardization: Enforce uniform monitoring and alerting practices across all teams, projects, and environments (Dev, Test, Prod). No more "snowflake" dashboards or forgotten alerts.
2. Version Control & Auditability: Track changes to your observability setup over time. Easily see who changed what, when, and why. Roll back quickly when needed, and integrate seamlessly with your CI/CD pipelines.
3. Reproducibility: Spin up a new environment? Apply your OaC definitions and instantly have consistent, production-grade observability configured. Disaster recovery becomes significantly smoother and more practical!
4. Scalability & Efficiency: Managing monitoring for hundreds of resources becomes feasible. Automation eliminates tedious, repetitive manual configuration tasks.
5. Collaboration: Developers, SREs, and Ops can collaborate on observability definitions using familiar workflows such as pull requests and code reviews.
6. Shift-Left Observability: Define alerting thresholds and dashboard requirements during development, ensuring observability is baked in, not bolted on.
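The auditability benefit can be illustrated with a small sketch. Assuming two versions of an alert configuration have already been loaded into dictionaries (the rule names and fields below are made up for illustration), a few lines of Python are enough to report what a change would add, remove, or modify, which is roughly what a reviewer sees in a pull request.

```python
def diff_definitions(old, new):
    """Return (added, removed, changed) rule names between two versions.

    Both arguments map a rule name to its definition.
    """
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(n for n in set(old) & set(new) if old[n] != new[n])
    return added, removed, changed


# Hypothetical "before" and "after" versions of the same configuration.
previous = {
    "error-rate-high": {"threshold": 0.05, "duration": "1m"},
    "latency-p99-high": {"threshold": 2.0, "duration": "5m"},
}
current = {
    "error-rate-high": {"threshold": 0.02, "duration": "1m"},
    "disk-usage-high": {"threshold": 0.9, "duration": "10m"},
}

added, removed, changed = diff_definitions(previous, current)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```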
Alibaba Cloud provides a robust suite of services well suited to implementing OaC:
1. Resource Orchestration Service (ROS): The cornerstone for OaC. ROS uses templates (JSON/YAML) to declaratively provision and manage a wide range of Alibaba Cloud resources, including:
• Application Real-Time Monitoring Service (ARMS): Define Prometheus instances, Grafana dashboards, alert contact groups, alert rules, and integration settings.
• Simple Log Service (SLS): Configure Logstores, define Logtail collection configurations, set up indexes, create saved searches, and configure alerts based on log patterns.
• Managed Service for Grafana (MSG): Provision Grafana workspaces, define data sources (connecting to ARMS Prometheus, SLS, etc.), and crucially, manage Grafana dashboards as code (using JSON definitions).
• Managed Service for Prometheus (MSP): Deploy and manage Prometheus instances.
• ActionTrail: Configure audit trail settings.
• CloudMonitor: Define event rules and notification settings (though ARMS and SLS often supersede it for application-level observability).
2. Terraform (Alibaba Cloud Provider): A popular open-source IaC tool. It supports resources for ARMS, SLS, CloudMonitor, etc., allowing you to manage Alibaba observability alongside other infrastructure using HCL.
3. ARMS & SLS OpenAPI/SDKs: For ultimate flexibility, use the Alibaba Cloud SDKs (Java, Python, Go, Node.js, etc.) or direct API calls to programmatically configure observability resources within your custom automation scripts or tools.
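For the SDK route, custom automation usually boils down to reconciling a desired state against what already exists. The sketch below shows that pattern only. `InMemoryAlertClient` and its methods are hypothetical stand-ins, not calls from any Alibaba Cloud SDK; in a real script they would be replaced by the corresponding ARMS or SLS API operations.

```python
class InMemoryAlertClient:
    """Hypothetical stand-in for a real SDK client; NOT an Alibaba Cloud API."""

    def __init__(self, existing=None):
        self.rules = dict(existing or {})

    def list_rules(self):
        return dict(self.rules)

    def put_rule(self, name, definition):
        self.rules[name] = definition

    def delete_rule(self, name):
        del self.rules[name]


def reconcile(client, desired):
    """Make the remote rules match `desired` and return the actions taken."""
    actions = []
    actual = client.list_rules()
    for name, definition in sorted(desired.items()):
        if name not in actual:
            client.put_rule(name, definition)
            actions.append(("create", name))
        elif actual[name] != definition:
            client.put_rule(name, definition)
            actions.append(("update", name))
    # Anything that exists remotely but is not in code gets removed.
    for name in sorted(set(actual) - set(desired)):
        client.delete_rule(name)
        actions.append(("delete", name))
    return actions


client = InMemoryAlertClient({"old-rule": {"threshold": 1}})
desired = {"error-rate-high": {"threshold": 0.05}}
first_run = reconcile(client, desired)
second_run = reconcile(client, desired)
print(first_run)
print(second_run)
```

The second run performs no actions, which is the property to aim for: applying the same definitions twice should be a no-op.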
Let's look at an illustrative snippet of a ROS template defining an ARMS alert contact group and an alert rule:
ROSTemplateFormatVersion: '2015-09-01'
Resources:
  MyCriticalErrorAlert:
    Type: ALIYUN::ARMS::AlertContactGroup
    Properties:
      ContactGroupName: Critical-Ops-Team
      ContactIds:
        - cp-1234567890xxxx # Your existing contact ID
  MyAppHighErrorRateAlert:
    Type: ALIYUN::ARMS::AlertRule
    Properties:
      AlertName: "MyApp-ErrorRate-High-PROD"
      ClusterId: "your-arms-prometheus-cluster-id"
      Expression: "sum(rate(http_server_requests_errors_total{application=\"my-app\", environment=\"prod\"}[5m])) by (instance) / sum(rate(http_server_requests_total{application=\"my-app\", environment=\"prod\"}[5m])) by (instance) > 0.05"
      Duration: "1m"
      Message: "High Error Rate ({{ $value }}) detected on instance {{ $labels.instance }} in PROD for application my-app."
      NotifyType: "DISPATCH_RULE"
      DispatchRuleId: "dr-1234567890xxxx" # Ref a dispatch rule (or define inline)
      Labels:
        - Key: severity
          Value: critical
        - Key: team
          Value: backend
      Annotations:
        - Key: summary
          Value: "High HTTP error rate (>5%) on {{ $labels.instance }}"
        - Key: description
          Value: "Investigate application logs (SLS) and metrics immediately."
      ContactGroupIds:
        - { Ref: MyCriticalErrorAlert } # Reference the contact group above
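To see what the alert expression in this template evaluates, the following Python sketch reproduces the same arithmetic outside Prometheus: a per-instance error ratio compared against the 5% threshold. The sample rates and instance names are made up, and skipping instances with no traffic is a simplification of how PromQL treats division by zero.

```python
THRESHOLD = 0.05  # the same 5% threshold used in the alert expression


def instances_over_threshold(error_rates, request_rates, threshold=THRESHOLD):
    """Return {instance: ratio} for instances whose error ratio is too high.

    Both arguments map an instance name to a per-second rate measured over
    the same window. Instances with no traffic are skipped (a simplification).
    """
    firing = {}
    for instance, requests in request_rates.items():
        if requests <= 0:
            continue
        ratio = error_rates.get(instance, 0.0) / requests
        if ratio > threshold:
            firing[instance] = ratio
    return firing


# Made-up sample rates (per second), for illustration only.
errors = {"10.0.0.1:8080": 3.0, "10.0.0.2:8080": 0.5}
requests = {"10.0.0.1:8080": 40.0, "10.0.0.2:8080": 50.0, "10.0.0.3:8080": 0.0}

firing = instances_over_threshold(errors, requests)
print(firing)
```

Here only the first instance fires: 3 errors out of 40 requests per second is 7.5%, while 0.5 out of 50 is 1%.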
1. Inventory & Standardize: Document the current critical alerts, dashboards, and log setups. Define team/project standards.
2. Choose the Right Tool: ROS (native) and Terraform (ecosystem familiarity) are the primary choices; use the SDKs for advanced cases.
3. Start Small: Pick one critical service or environment. Codify its ARMS alerts or a key SLS dashboard first.
4. Version Control: Store your OaC definitions in a Git repository (e.g., Codeup, GitHub, GitLab).
5. Integrate with CI/CD: Use tools like Jenkins, GitLab CI, or Alibaba Cloud CI/CD to automatically lint, validate, and deploy your OaC changes upon merge. Deploy with a ROS stack create or update operation, or with terraform apply.
6. Iterate & Expand: Gradually migrate more configurations. Encourage each team to own the OaC definitions for its own services.
7. Govern & Review: Implement code reviews for OaC changes, just like application code.
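As one way to picture the CI/CD step above, here is a minimal sketch of a validation gate a pipeline could run before deploying. The required fields and labels are a hypothetical team standard, and the JSON layout is an assumption for illustration rather than a ROS or Terraform format; a non-zero return value would fail the pipeline.

```python
import json

# Hypothetical team standard: every alert must carry these fields and labels.
REQUIRED_FIELDS = ("name", "expression", "duration")
REQUIRED_LABELS = ("severity", "team")


def validate_rule(rule):
    """Return a list of human-readable problems; empty means the rule passes."""
    problems = []
    for field_name in REQUIRED_FIELDS:
        if not rule.get(field_name):
            problems.append(f"missing field '{field_name}'")
    labels = rule.get("labels", {})
    for label in REQUIRED_LABELS:
        if label not in labels:
            problems.append(f"missing label '{label}'")
    return problems


def check_document(text):
    """Validate every rule in a JSON document and return a CI-style exit code."""
    exit_code = 0
    for rule in json.loads(text):
        for problem in validate_rule(rule):
            print(f"{rule.get('name', '<unnamed>')}: {problem}")
            exit_code = 1
    return exit_code


good = json.dumps([{
    "name": "MyApp-ErrorRate-High-PROD",
    "expression": "error_ratio > 0.05",
    "duration": "1m",
    "labels": {"severity": "critical", "team": "backend"},
}])
bad = json.dumps([{"name": "orphan-alert", "expression": "up == 0"}])

print("good document exit code:", check_document(good))
print("bad document exit code:", check_document(bad))
```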
Modularize: Break down large templates into reusable nested stacks or modules (ROS) or modules (Terraform).
Parameterize: Use parameters (ROS) or variables (Terraform) for environment-specific values (cluster IDs, thresholds).
Lint & Validate: Use ROS template validation (the ValidateTemplate API operation) or terraform validate religiously. Consider custom policy checks.
Test: Treat OaC like production code. Test deployments in a staging environment.
Documentation: Clearly comment your code and maintain READMEs explaining what each resource does.
Leverage Managed Services: Use MSG for Grafana and MSP for Prometheus to reduce operational overhead.
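The parameterization practice can be sketched in a few lines of Python. The metric names come from the alert expression used earlier in this article; the environment names and threshold values are assumptions chosen for illustration. ROS parameters and Terraform variables serve the same purpose natively.

```python
from string import Template

# One expression template shared by every environment.
RULE_TEMPLATE = Template(
    'sum(rate(http_server_requests_errors_total{environment="$env"}[5m]))'
    ' / sum(rate(http_server_requests_total{environment="$env"}[5m]))'
    " > $threshold"
)

# Hypothetical per-environment values; only these differ between stages.
ENVIRONMENTS = {
    "dev": {"threshold": 0.2},
    "prod": {"threshold": 0.05},
}


def render(env):
    """Render the alert expression for one environment.

    substitute() raises KeyError if a placeholder has no value, so a
    missing parameter fails loudly instead of producing a broken rule.
    """
    params = ENVIRONMENTS[env]
    return RULE_TEMPLATE.substitute(env=env, threshold=params["threshold"])


for name in sorted(ENVIRONMENTS):
    print(name, "->", render(name))
```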
Observability as Code is not just a trend but a necessity for managing modern, dynamic cloud-native applications at scale. By adopting OaC with ROS, Terraform, or the SDKs, you gain control, consistency, and efficiency in your monitoring setup. Move beyond fragile manual configurations and start building reliable, observable systems defined in code.
Start codifying your observability today and make reliability on Alibaba Cloud something you can review, test, and reproduce.
Alibaba Cloud ROS Documentation: https://www.alibabacloud.com/help/en/resource-orchestration-service
Alibaba Cloud Terraform Provider: https://registry.terraform.io/providers/aliyun/alicloud/latest/docs
ARMS Documentation: https://www.alibabacloud.com/help/en/arms
SLS Documentation: https://www.alibabacloud.com/help/en/sls
Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.