
Practical Use of MaxCompute Metadata: Job Accounting

This article mainly introduces MaxCompute's tenant-level Information Schema and focuses on job accounting through the TASKS_HISTORY view of metadata.

By Haiqing

The tenant-level Information Schema of MaxCompute provides project metadata and historical data from the perspective of tenants. You can pull a type of metadata from all your projects in the same metadata center at a time for statistical analysis. We are launching a series of articles about metadata usage practices.

This article focuses on job accounting through the TASKS_HISTORY view of metadata.

If you have not used the tenant-level Information Schema before, read the background information, feature description, fees, usage limits, and precautions in Tenant-level Information Schema first to avoid unnecessary problems.

Count and Analyze CU Consumption

Scenario

The current billing mode of MaxCompute is pay-as-you-go, and you want to evaluate whether it is suitable to switch to a subscription-based model and determine the appropriate number of subscription CUs (compute units) to purchase.

Solution

This scenario aims to save costs. Under the pay-as-you-go billing mode of MaxCompute, SQL jobs are not billed by CU, making it difficult to calculate the final cost directly. Instead, you can evaluate the CU consumption distribution of historical jobs by using the tenant-level SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY view.

The approach is as follows:

  1. Calculate the total daily cost_cpu consumption over the last month for all the projects that you plan to switch to the subscription (prepaid) billing method. In the SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY view, the original unit of cost_cpu is core·s, so it must be converted to CUs (cost_cpu/100/3600). Identify the day with the highest normal consumption; a sample query is given in the sketch after this list.
  2. Calculate the CU consumption per hour of all jobs on the day with the highest normal consumption.
  3. Analyze the results and divide the day into time periods based on business activity. For instance, if the hours from 0:00 to 7:59 carry the heaviest ETL load, define them as the ETL night peak period and treat the remaining low-activity hours as another period.
  4. Estimate the number of CUs required during the peak period by dividing the total CU consumption of that period by its number of hours. For instance, if 800 CU-hours are consumed in the 8 hours from 0:00 to 7:59, then 100 CUs running for 8 hours can complete the same workload. However, this is an ideal scenario, and additional CUs may need to be allocated based on the specific tasks; extreme cases, such as tasks that require a large number of CUs at specific times, must also be considered. Adjust the number of CUs based on business needs: if a project cannot tolerate any delay, increase the number of CUs accordingly; otherwise, the project may not be suitable for subscription resources.
  5. It is advisable to transition projects gradually instead of all at once. After each project is migrated, monitor resource usage through MaxCompute Management and adjust the number of CUs as needed.
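For step 1, the following minimal sketch returns the total daily CU consumption of each project; it mirrors the hourly query shown later and uses only columns from the same view, so adjust the filter to cover the projects you plan to migrate:

SET odps.namespace.schema = true; -- If you have enabled the tenant-level schema syntax switch for your current tenant, you do not need to execute this flag.
SELECT  ds
        ,task_catalog
        ,SUM(cost_cpu / 100 / 3600) cuh -- Daily CU consumption per project.
FROM    SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY
WHERE   ds <= xxx -- Enter the time period required for statistical analysis. Only data from the last 15 days at most is available.
GROUP BY ds
         ,task_catalog
;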

The code for calculating hourly CU consumption (step 2) is as follows:

SET odps.namespace.schema = true; -- If you have enabled the tenant-level schema syntax switch for your current tenant, you do not need to execute this flag.
SELECT  task_catalog
        ,SUBSTR(end_time,1,13) end_time
        ,SUM(cost_cpu / 100 / 3600) cuh
FROM    SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY
WHERE   ds <= xxx -- Enter the time period required for statistical analysis. Only data from the last 15 days at most is available. Note that if the queried period contains a large amount of data, this query itself consumes a large number of CUs; be cautious because it runs on pay-as-you-go computing resources.
GROUP BY task_catalog
         ,SUBSTR(end_time,1,13)
;

If you perform SQL analysis in the MaxCompute console, you can analyze the results as follows:

(Figure: hourly CU consumption per project, visualized with the SQL analysis feature of the console)

As shown in the preceding figure, CU consumption is unevenly distributed: only one hour per day consumes a large number of CUs. In this situation, not all projects are suitable for subscription CUs, but you can still analyze and evaluate the data project by project.
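To estimate the number of CUs needed for the peak period in step 4, you can aggregate the same data over the peak window. The following is a minimal sketch that assumes a night peak window from 0:00 to 7:59 (8 hours); adjust the hour range and the divisor to match your own time periods:

SET odps.namespace.schema = true; -- If you have enabled the tenant-level schema syntax switch for your current tenant, you do not need to execute this flag.
SELECT  SUBSTR(end_time,1,10) end_date
        ,SUM(cost_cpu / 100 / 3600) peak_cuh        -- Total CU consumption in the peak window.
        ,SUM(cost_cpu / 100 / 3600) / 8 required_cu -- Ideal number of CUs if the workload is spread evenly over 8 hours.
FROM    SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY
WHERE   ds <= xxx -- Enter the time period required for statistical analysis.
AND     SUBSTR(end_time,12,2) BETWEEN '00' AND '07' -- Jobs that end between 0:00 and 7:59.
GROUP BY SUBSTR(end_time,1,10)
;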

Count the Consumption of Each Job

If you use MaxCompute on a pay-as-you-go basis and have cost verification and audit requirements or job-based cost optimization requirements, you can use the SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY view to count related costs.
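For example, the following minimal sketch lists the jobs with the highest resource consumption on a given day, which is often the starting point for cost verification or optimization. It assumes the view also exposes an input_bytes column for the amount of data scanned; the actual fee must still be calculated according to the official billing rules:

SET odps.namespace.schema = true; -- If you have enabled the tenant-level schema syntax switch for your current tenant, you do not need to execute this flag.
SELECT  task_catalog
        ,inst_id
        ,owner_name
        ,task_type
        ,cost_cpu / 100 / 3600 cuh                  -- CU consumption of the job.
        ,input_bytes / 1024 / 1024 / 1024 input_gb  -- Amount of data scanned, in GB (assumed column).
FROM    SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY
WHERE   ds = xxx -- Enter the time period required for statistical analysis.
ORDER BY cuh DESC
LIMIT   100
;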

Common Statistical Analysis

Count the Number of Jobs

Count the trend of the number of jobs of different types in each project:

SET odps.namespace.schema = true; -- If you have enabled the tenant-level schema syntax switch for your current tenant, you do not need to execute this flag.
SELECT  ds
        ,task_catalog
        ,task_type
        ,COUNT(inst_id) inst_id_cnt
FROM    SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY
WHERE   ds <= xxx -- Enter the time period required for statistical analysis. Only data from the last 15 days at most is available. Note that if the queried period contains a large amount of data, this query itself consumes a large number of CUs; be cautious because it runs on pay-as-you-go computing resources.
GROUP BY ds
         ,task_catalog
         ,task_type
;

Valid values of task_type:

• SQL: SQL jobs
• CUPID: Spark or Mars jobs
• SQLCost: SQL cost estimation jobs
• SQLRT: MaxCompute Query Acceleration (MCQA) SQL jobs
• LOT: MapReduce jobs
• PS: Parameter Server of PAI
• AlgoTask: Machine Learning Platform for AI (PAI) jobs

If you perform SQL analysis in the MaxCompute console, you can analyze the results as follows:

(Figure: daily job count trend by project and job type)
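If you also want to see how much each job type contributes to overall resource consumption rather than just the job count, a similar aggregation on cost_cpu works (a minimal sketch based on the same view):

SET odps.namespace.schema = true; -- If you have enabled the tenant-level schema syntax switch for your current tenant, you do not need to execute this flag.
SELECT  ds
        ,task_type
        ,COUNT(inst_id) inst_id_cnt
        ,SUM(cost_cpu / 100 / 3600) cuh -- CU consumption per job type.
FROM    SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY
WHERE   ds <= xxx -- Enter the time period required for statistical analysis.
GROUP BY ds
         ,task_type
;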

View Who Is Accessing Table Data

View who is reading data from or writing data to a specific table.

SET odps.namespace.schema = true; -- If you have enabled the tenant-level schema syntax switch for your current tenant, you do not need to execute this flag.
SELECT  task_catalog
        ,task_type
        ,inst_id
        ,owner_name
        ,start_time
        ,end_time
        ,input_tables
        ,output_tables
        ,operation_text
        ,ext_platform_id
        ,ext_node_id
        ,ext_node_onduty
FROM    SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY
WHERE   ds = xxxx -- Enter the time period required for statistical analysis. Only data from the last 15 days at most is available. Note that if the queried period contains a large amount of data, this query itself consumes a large number of CUs; be cautious because it runs on pay-as-you-go computing resources.
AND     input_tables LIKE '%dwd_github_events_odps%' -- input_tables may contain multiple tables, so match with wildcards.
;

• task_catalog: The project to which the job belongs.
• owner_name: The account that actually ran the MaxCompute job. In some scenarios, a business uses a single account for all jobs, which obscures the actual developer; for instance, DataWorks typically uses a designated account to run MaxCompute jobs in the production environment.
• operation_text: The SQL statement of the job.
• ext_platform_id: The upstream platform from which the job was initiated. Currently, some jobs from DataWorks and Dataphin carry this information; it is empty for jobs whose platform does not pass it.
• ext_node_id: The name of the upstream node (task) that initiated the job.
• ext_node_onduty: The uid of the owner of the upstream job. It can be joined with the SYSTEM_CATALOG.INFORMATION_SCHEMA.USERS view to find the corresponding user name, as shown in the sketch below.
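As a hedged sketch, the uid in ext_node_onduty can be translated into an account name by joining the USERS view; the user_id join key below is an assumption, so verify the column names in your environment:

SET odps.namespace.schema = true; -- If you have enabled the tenant-level schema syntax switch for your current tenant, you do not need to execute this flag.
SELECT  t.task_catalog
        ,t.inst_id
        ,t.ext_node_id
        ,t.ext_node_onduty
        ,u.user_name
FROM    SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY t
LEFT JOIN SYSTEM_CATALOG.INFORMATION_SCHEMA.USERS u
ON      t.ext_node_onduty = u.user_id -- Assumed join key; verify the column name in your environment.
WHERE   t.ds = xxxx -- Enter the time period required for statistical analysis.
AND     t.input_tables LIKE '%dwd_github_events_odps%'
;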

Summary

The scenarios outlined are just a few examples. More information can be accessed through the SYSTEM_CATALOG.INFORMATION_SCHEMA.TASKS_HISTORY view to address varying business needs.
