This topic describes best practices for managing Hologres computing resources, helping you handle complex and diverse read and write requests efficiently, intelligently, and cost-effectively.
Background
In an e-commerce company, data is handled by data teams and used by analysts and recommendation teams. The data team is responsible for data processing and serving data to various consumers across the company. The following table describes the business tasks and characteristics of each team.
Team | Responsibilities | Tasks | Characteristics |
Data team | Responsible for data writing and processing for all business teams, including real-time, near real-time, and batch data ETL links. | Real-time and near real-time ETL:
Batch ETL:
|
|
Data analysts | Sales data analytics for various roles, such as operational staff, outlet staff, and sales staff |
|
|
Recommendation team | Personalized product recommendations | Real-time recommendation: Retrieve user features based on primary keys and tailor recommendations for users using recommendation algorithms. | Traffic from users on the e-commerce platform peaks every night. |
Optimization principles
For stable workloads, it's sufficient to reserve dedicated computing resources. However, in real-world use cases, workloads can be highly concurrent, have peak hours, or be extremely high. This can exhaust computing resources, affecting instance stability. In this case, use the following decision tree to select a resource optimization solution.
Use Serverless Computing for extremely large workloads, such as massive data backfilling, full table scans without filters, complex join operations with over 10 tables, and multiple nested subqueries. This avoids affecting other queries running on the instance. For these tasks, we recommend enabling adaptive serverless computing.
Requests optimized by Fixed Plan do not support execution using Serverless Computing or Query Queues. Therefore, you must handle peak workloads of these requests by scaling up virtual warehouses, but note that write peaks cannot be handled by auto-elasticity of virtual warehouses.
For other requests, select a solution based on your performance needs, peak hours, and request types.
When using Serverless Computing, see Challenges and resource management solutions to select a suitable solution.
Challenges and resource management solutions
This section recommends optimal solutions for the following business pain points.
Challenge 1: Resource contention and mutual interference between different teams
Description: Data ETL and query tasks can compete for computing resources and interfere with each other.
Solution: Use a Hologres instance with multiple virtual warehouses to implement workload isolation among different teams. See Architecture. Example:
Primary virtual warehouse (init_warehouse): Used by the data team for data writing and ETL.
Read-only virtual warehouse 1: Used by the analytics team.
Read-only virtual warehouse 2: Used by the recommendation team.
Challenge 2: Fixed peak hours
Use scheduled scaling (beta) to scale up or down virtual warehouses at scheduled times. If you require resource expansion for less than 16 hours per day, scheduled scaling saves costs compared to using only dedicated (reserved) resources. Example:
Real-time ETL:
As these tasks are optimized via Fixed Plan, you can only address peak hours by scaling up virtual warehouses.
Enable scheduled scaling for the primary virtual warehouse to handle peak hours.
Data analytics: Configure scheduled scaling for the read-only virtual warehouse based on the traffic peak characteristics (daytime work hours).
Real-time recommendation:
These tasks are point queries optimized by Fixed Plan. Therefore, you can only scale the virtual warehouse to handle workload surges.
Because peak traffic occurs in the evening, configure scheduled scaling for the read-only virtual warehouse to scale up computing resources in the evening.
Challenge 3: Large tasks frequently cause OOM errors and occupy resources for a long time, affecting other tasks
The following scenarios might involve large tasks:
Batch ETL: A single task typically needs a large amount of resources. The task itself is prone to OOM errors and can occupy computing resources for a prolonged time, blocking other tasks.
Solution 1 (prioritizing stability): Execute all batch ETL tasks using serverless computing resources. You can configure this at the SQL or user level. See Use Serverless Computing resources to execute read and write tasks.
Solution 2 (balancing stability and cost): If the batch ETL tasks are of varying sizes, run small tasks using the primary virtual warehouse and large tasks using Serverless Computing.
Near real-time ETL with Dynamic Tables: An incremental refresh task is executed for each table every minute. The amount of computation required for a task is related to the amount of incremental data, which can be unstable.
Solution 1 (prioritizing stability): Run all of these ETL tasks using serverless computing resources. See CREATE DYNAMIC TABLE.
Solution 2 (balancing stability and cost): Execute refresh tasks for large tables or tables with significant data volume fluctuations using Serverless Computing. Run other tasks using the primary virtual warehouse.
BI dashboards for data analytics: There are many data dashboards, and the task sizes vary. Large tasks can occupy computing resources for a long time, blocking small tasks.
Solution 1 (balancing stability, cost, and development effort): Enable the Adaptive Serverless Computing at the database or user level. This automatically directs large tasks to use serverless computing resources, while keeping small tasks in the read-only virtual warehouse.
Solution 2 (balancing stability and cost): Extract the SQL fingerprints of large tasks from data dashboards, create a query queue, and configure all requests in this query queue to run on serverless computing resources. To implement this solution, do the following:
Test the data dashboard to ensure that Hologres executes all requests from the dashboard.
In the Hologres slow query logs, filter for large tasks using the
cpu_time_msfield and extract thedigestfield, which is the SQL fingerprint. For more information, see Query and analyze slow query logs.Create a query queue for large tasks using their SQL fingerprints. For more information, see Configure matching rules for a classifier.
Configure all requests in this query queue to run on serverless computing resources. For more information, see Use Serverless Computing resources to execute queries in a query queue.
Solution 3 (prioritizing cost savings): Prioritize running all queries in the virtual warehouse. At the same time, enable large query auto-rerun for query queues. This uses serverless computing resources to rerun timed-out and OOM queries. The user does not notice the timeout or OOM error. For more information, see Large query control.
Challenge 4: Sudden large tasks fully occupy the resource pool, affecting system stability
Sporadic analytical tasks can create sudden and high overhead, affecting instance stability.
Solution 1 (prioritizing stability): Configure all ad-hoc analytical request to run on Serverless Computing at the user level. For more information, see Configure at the user level.
Solution 2 (balancing stability and cost): Enable Adaptive Serverless Computing at the user level. This automatically directs large tasks initiated by the user to serverless resources, while small tasks continue to run in the virtual warehouse.
Solution 3 (prioritizing cost savings): Prioritize running all requests in the virtual warehouse. At the same time, enable automatic rerun for large queries for query queues. This uses serverless computing resources to rerun timed-out and OOM queries. The user does not notice the timeout or OOM error. For more information, see Large query control.
Challenge 5: Different data freshness and stability requirements
BI dashboards are queried by various roles across the company, such as data developers, operational staff, sales staff, and senior management. Their access purposes also vary, including customer-facing demonstration or internal viewing.
High performance needs:
Solution 1 (prioriotizing stability): If the workload is stable, create a separate virtual warehouse to handle the requests.
Solution 2 (balancing stability and cost): Execute all requests using Serverless Computing, or use the Adaptive Serverless Computing feature.
Latency can be tolerated:
If the request types for the data dashboard are fixed (e.g., a role consistently accesses a specific dashboard), manually configure throttling. First, perform stress testing to determine the virtual warehouse's read capacity. Then, configure a fixed concurrency limit for a query queue. For more information, see Create a query queue.
If the request types are not fixed (e.g., multiple dashboards are accessed by different roles), enable automatic throttling. Hologres automatically adjusts the concurrency limit of the query queue based on the workload. For more information, see Automatic throttling (beta).
Challenge 6: Unexpected request surges affect instance stability
Unexpected nighttime query surges pose a challenge. Serverless computing cannot handle these due to unknown users, tables, and SQL statements. Scheduled scaling is ineffective against unpredictable spikes.
Solution: Enable auto-scaling. This allows virtual warehouses to automatically scale out during peak loads and scale in when demand decreases. For details, see Multi-cluster and Auto Scaling (Beta).
Advanced settings
Configure task priorities for Serverless Computing
If multiple services use Serverless Computing, configure priorities for their requests at the user level. For more information, see Set the priority for serverless computing tasks. Example:
Query requests from roles with high performance needs: Set the priority to 5. These requests are executed first when serverless resources are scarce.
Batch ETL tasks: Set the priority to 1. These tasks wait in a queue when serverless resources are scarce.
Other tasks: Use the default priority of 3.
Configure daily quotas for Serverless Computing
If multiple services can use Serverless Computing, the costs may be unexpectedly high. In this case, configure daily quotas of serverless computing resources for an instance. You can also configure resource quotas for different users. This helps you use serverless computing resources flexibly and cost-efficiently. See Daily cumulative usage limit.
Enable high availability for read-only virtual warehouses
For read-only virtual warehouses, especially those that handle online recommendation services, configure multiple shard-level replicas. This ensures lossless queries when a query node fails. For more information, see Shard-level replication for a Hologres instance.