MaxCompute security white paper

Legal disclaimer

Alibaba Cloud reminds you to carefully read and fully understand the terms and conditions of this legal disclaimer before you read or use this document. If you have read or used this document, it shall be deemed as your total acceptance of this legal disclaimer.

You shall download and obtain this document from the Alibaba Cloud website or other channels that are authorized by Alibaba Cloud, and use this document for your own legal business activities only. The content of this document is considered confidential information of Alibaba Cloud. You shall strictly abide by the confidentiality obligations. No part of this document shall be disclosed or provided to any third party for use without the prior written consent of Alibaba Cloud.
No part of this document shall be excerpted, translated, reproduced, transmitted, or disseminated by any organization, company, or individual in any form or by any means without the prior written consent of Alibaba Cloud.
The content of this document may be changed due to product version upgrades, adjustments, or other reasons. Alibaba Cloud reserves the right to modify the content of this document without notice. The updated versions of this document will be occasionally released through channels that are authorized by Alibaba Cloud. You shall pay attention to the version changes of this document as they occur and download and obtain the most up-to-date version of this document from channels that are authorized by Alibaba Cloud.
This document serves only as a reference guide for your use of Alibaba Cloud products and services. Alibaba Cloud provides the document in the context that Alibaba Cloud products and services are provided on an as is, with all faults, and as available basis. Alibaba Cloud makes every effort to provide relevant operational guidance based on existing technologies. However, Alibaba Cloud hereby makes a clear statement that it in no way guarantees the accuracy, integrity, applicability, and reliability of the content of this document, either explicitly or implicitly. Alibaba Cloud shall not bear any liability for any errors or financial losses incurred by any organizations, companies, or individuals arising from their download, use, or trust in this document. Alibaba Cloud shall not, under any circumstances, bear responsibility for any indirect, consequential, exemplary, incidental, special, or punitive damages, including lost profits arising from the use or trust in this document, even if Alibaba Cloud has been notified of the possibility of such a loss.
By law, all the content of the Alibaba Cloud website, including but not limited to works, products, images, archives, information, materials, website architecture, website graphic layout, and webpage design, are intellectual property of Alibaba Cloud and/or its affiliates. This intellectual property includes, but is not limited to, trademark rights, patent rights, copyrights, and trade secrets. No part of the Alibaba Cloud website, product programs, or content shall be used, modified, reproduced, publicly transmitted, changed, disseminated, distributed, or published without the prior written consent of Alibaba Cloud and/or its affiliates. The names owned by Alibaba Cloud include, but are not limited to, Alibaba Cloud, Aliyun, HiChina, and other brands of Alibaba Cloud and/or its affiliates, which appear separately or in combination, as well as the auxiliary signs and patterns of the preceding brands, or anything similar to the company names, trade names, trademarks, product or service names, domain names, patterns, logos, marks, signs, or special descriptions that third parties identify as Alibaba Cloud and/or its affiliates.
Please contact Alibaba Cloud directly if you discover any errors in this document.

Overview of data security and compliance

MaxCompute builds a comprehensive data security system based on confidentiality, integrity, and availability, and provides comprehensive data access control capabilities and a secure and trusted computing environment. The cluster high availability and disaster recovery solutions are provided to ensure business continuity. MaxCompute records detailed user operation logs and task runtime logs for in-process O&M monitoring and post-event security auditing. MaxCompute is built on top of Alibaba Cloud Infrastructure as a service (IaaS) and leverages the security capabilities of the cloud infrastructure. MaxCompute is linked with the security products of the cloud platform, such as Resource Access Management (RAM), Security Center of DataWorks, and Data Security Guard of DataWorks, to implement more security control scenarios.

An audit report describes how MaxCompute complies with the principles of security, availability, and confidentiality for trusted Alibaba Cloud services, in accordance with the relevant standards of the American Institute of Certified Public Accountants (AICPA). The report is issued by an independent third-party audit firm. For more information about the audit report, see SOC 3 Report.

Access control

Authentication

MaxCompute supports the following user identities: Alibaba Cloud accounts, RAM users, and RAM roles. MaxCompute also supports authentication based on an AccessKey pair, multi-factor authentication (MFA), and Security Token Service (STS) authorization.

You can create an AccessKey pair in the RAM console. An AccessKey pair consists of an AccessKey ID and an AccessKey secret. The AccessKey ID is public and uniquely identifies a user, whereas the AccessKey secret is private and used to authenticate a user. Before a client sends a request to MaxCompute, the client generates a string to be signed in the format that is required by MaxCompute and then generates a signature for the request by using the AccessKey secret. After MaxCompute receives the request, MaxCompute finds the AccessKey secret based on the AccessKey ID and then generates a signature. If the signature is the same as the signature that is sent by the client, the request is valid. Otherwise, MaxCompute rejects the request and returns an HTTP 403 error.

For more information, see User authentication.
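The signature check described above can be sketched as follows. This is a minimal illustration, not the MaxCompute implementation: the HMAC-SHA256 algorithm, the format of the string to be signed, the in-memory key store, and the names `SECRETS`, `sign`, and `verify` are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical key store: AccessKey ID -> AccessKey secret (illustrative values only).
SECRETS = {"demo-access-key-id": "demo-access-key-secret"}


def sign(string_to_sign, secret):
    """Client side: sign the request string with the AccessKey secret.

    HMAC-SHA256 is an assumption; the source does not name the algorithm.
    """
    return hmac.new(secret.encode(), string_to_sign.encode(), hashlib.sha256).hexdigest()


def verify(access_key_id, string_to_sign, client_signature):
    """Server side: look up the secret by AccessKey ID, recompute, and compare.

    Returns 200 if the signatures match and 403 otherwise.
    """
    secret = SECRETS.get(access_key_id)
    if secret is None:
        return 403
    expected = sign(string_to_sign, secret)
    # compare_digest avoids leaking information through comparison timing.
    return 200 if hmac.compare_digest(expected, client_signature) else 403
```

Because the AccessKey secret never travels with the request, a party that intercepts the request can read the AccessKey ID but cannot produce a valid signature for a modified request.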

Authorization

MaxCompute provides the following access control mechanisms to perform fine-grained access control: access control list (ACL)-based access control, policy-based access control, download control, LabelSecurity, and row-level permissions. Authorization objects include projects, quotas, network connections, tables, functions, resources, instances, external tables, and external volumes.

RAM-based authorization

MaxCompute supports RAM-based authorization to grant access and management permissions on MaxCompute resources of your Alibaba Cloud account to RAM users and RAM roles. This way, you can assign minimum permissions to users based on your business requirements. This reduces information security risks for enterprises. For more information, see RAM permissions.

ACL-based access control

An ACL is used to implement object-based authorization. An ACL specifies permissions on an object and is considered a subresource of the object. An ACL takes effect only if the object exists. If the object is deleted, the ACL of the object is also deleted. ACL-based access control is similar to the authorization mechanism that is implemented by using the GRANT and REVOKE statements defined in SQL-92. You can execute these statements to grant or revoke permissions on an object. To manage permissions, you must specify the effect (grant or revoke), object (such as a table or resource), subject (user or role), and action (such as read, write, or delete). For more information, see ACL-based access control.

Policy-based access control

A policy is used to define role permissions. After you assign a role to a user, the permissions of the role take effect on the user. Compared with ACL-based access control, policy-based access control supports not only the whitelist mechanism but also the blacklist mechanism. When you perform policy-based access control, you can specify a policy to allow a role to perform specified operations on specified objects or deny a role from performing specified operations on specified objects. If both the whitelist and blacklist mechanisms are used for the same object, the blacklist mechanism takes precedence. For more information, see Policy-based access control.
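The precedence rule above (a blacklist entry overrides a whitelist entry for the same object) can be sketched as follows. The `Statement` structure, the field names, and the exact-match comparison are assumptions made for illustration. They do not reproduce the MaxCompute policy syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    effect: str    # "Allow" (whitelist) or "Deny" (blacklist)
    action: str    # for example "Select"
    resource: str  # for example "table/orders"


def is_allowed(statements, action, resource):
    """Evaluate a role's statements for one action on one object."""
    matches = [s for s in statements if s.action == action and s.resource == resource]
    # Any matching Deny wins, regardless of how many Allow statements match.
    if any(s.effect == "Deny" for s in matches):
        return False
    # Without a matching Allow, access is not granted.
    return any(s.effect == "Allow" for s in matches)
```

With this rule, an administrator can grant broad access through one statement and then carve out exceptions with Deny statements, without reordering or removing the original grant.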

LabelSecurity

LabelSecurity allows you to flexibly control user access to column-level sensitive data by using data sensitivity level labels. Data in tables or columns must be classified based on data sensitivity levels, and users must be classified based on data access levels. You can set the sensitivity level to a value from 0 to 9 to adapt to different data classification standards. The following security policies are supported:

No-ReadUp: Users cannot read data that has a higher data sensitivity level than their data access level, unless the users are explicitly granted permissions.
Trusted-User: Users are allowed to write data of all sensitivity levels. The default sensitivity level of newly written data is 0.

For more information, see Label-based access control.
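The No-ReadUp policy can be sketched as a single comparison. The function name `can_read`, the `explicit_grant` flag, and the assumption that user access levels use the same 0 to 9 range as data sensitivity levels are illustrative and are not taken from the product.

```python
def can_read(user_level, column_level, explicit_grant=False):
    """Return True if a user may read a column under No-ReadUp.

    user_level: data access level of the user (assumed to be 0 to 9).
    column_level: sensitivity level of the column (0 to 9).
    explicit_grant: True if the user was explicitly granted access to this data.
    """
    for level in (user_level, column_level):
        if not 0 <= level <= 9:
            raise ValueError("levels must be between 0 and 9")
    # Reading is allowed at or below the user's own level, or by explicit grant.
    return explicit_grant or column_level <= user_level
```

For example, a user at level 2 can read a level 2 column but needs an explicit grant to read a level 3 column.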

Access modes

MaxCompute provides multiple endpoint-based access modes, including the Internet, virtual private cloud (VPC), and PrivateLink. You can enable endpoints for network isolation based on your business requirements. You can configure an IP address whitelist for each endpoint to limit connections from clients.
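An IP address whitelist check of the kind described above can be sketched with the Python standard library. The function name and the whitelist format (single addresses or CIDR blocks) are assumptions for the example. The source does not specify the format that MaxCompute accepts.

```python
import ipaddress


def is_client_allowed(client_ip, whitelist):
    """Return True if client_ip matches any whitelist entry.

    Each entry is a single address ("10.1.2.3") or a CIDR block ("192.168.0.0/24").
    """
    addr = ipaddress.ip_address(client_ip)
    # A single address parses as a /32 (IPv4) or /128 (IPv6) network.
    return any(addr in ipaddress.ip_network(entry, strict=False) for entry in whitelist)
```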

Secure and trusted computing environment

MaxCompute provides secure computing containers and Java and Python sandboxes to isolate task processes and prevent malicious code from affecting cluster computing tasks.

MaxCompute provides hybrid computing modes for computing flexibility and extensibility. MaxCompute supports user-defined functions (UDFs) in SQL engines and third-party computing frameworks such as Spark and Python in compute engines. However, these features allow untrusted code to run, which can cause unintended system damage or enable malicious code attacks. MaxCompute runs untrusted code in lightweight secure computing containers and language-level sandboxes to achieve process-level security isolation.

MaxCompute provides security assurance to meet the network communication requirements of internal data synchronization and external data access in computing tasks. MaxCompute builds an overlay virtual network for each computing task among its running security containers to achieve security isolation from the host network. This ensures that all nodes in the task can communicate by using private IP addresses but cannot access the host network. If a computing task needs to access the data service API over the Internet or in a VPC, MaxCompute supports task-level networking by using network connections. You must declare the destination network that you want to access and meet the permission check requirements when you start a job. For more information, see Network connection process.

The code runtime environment provided by MaxCompute belongs to users. Any legal liability arising from code execution shall be borne by users.

Confidentiality

MaxCompute provides security measures for data in storage, in computing, and in transmission.

Transparent encryption at the data storage layer

MaxCompute integrates Key Management Service (KMS) and Bring Your Own Key (BYOK) to automatically encrypt and decrypt data files in storage media. Applications can meet data ciphertext storage requirements without modification, and can use keys to encrypt or decrypt persistent data files of tables and partitions. Encryption algorithms such as AES256 are supported. Automatic key rotation is supported. If the customer disables the key service, the encrypted data cannot be decrypted and accessed. This meets the requirements of data confidentiality and regulatory compliance. For more information, see Storage encryption.

Data content encryption

MaxCompute supports column-level content encryption for sensitive data, such as personally identifiable information (PII), financial information, accounts, and passwords. Applications can call encryption functions to encrypt sensitive data before writing data and decrypt the data during data reading. This prevents sensitive data leaks caused by attacks, such as SQL injection or data breach. MaxCompute can be connected to KMS to encrypt data by using keysets that contain multiple keys. MaxCompute supports the following encryption algorithms: AES-GCM-256, AES-SIV-CMAC-128, and AES-SIV-CMAC-256. These encryption algorithms provide higher encryption reliability to prevent data cracking. For more information, see Encryption and decryption functions.

Dynamic data masking

MaxCompute supports the dynamic data masking feature to protect sensitive data such as PII during business development and testing, data sharing, and O&M. Data masking policies include masking, hashing, character replacement, numeric value rounding, and date rounding. Data masking policies can be used together with the data classification feature of Data Security Guard to meet your masking requirements for sensitive data, such as identity information, bank card numbers, addresses, and phone numbers. MaxCompute applies data masking at the point closest to where data is read from storage. This ensures that data is masked during queries, downloads, joins, and UDF computing, which prevents sensitive data leaks. For more information, see Dynamic data masking.
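The following sketch illustrates four of the masking policy types named above: masking, hashing, numeric value rounding, and date rounding. The function names, parameters, and specific behaviors (for example, keeping two characters at each end, or rounding dates down to the first day of the month) are assumptions chosen for the example. They are not the built-in MaxCompute masking rules.

```python
import hashlib
from datetime import date


def mask_keep_ends(value, keep=2, fill="*"):
    """Masking: hide the middle of a string and keep `keep` characters at each end."""
    if len(value) <= 2 * keep:
        return fill * len(value)
    return value[:keep] + fill * (len(value) - 2 * keep) + value[-keep:]


def hash_value(value):
    """Hashing: replace the value with its SHA-256 hex digest."""
    return hashlib.sha256(value.encode()).hexdigest()


def round_number(value, step):
    """Numeric value rounding: round an integer down to a multiple of `step`."""
    return (value // step) * step


def round_date_to_month(d):
    """Date rounding: keep the year and month and reset the day to 1."""
    return d.replace(day=1)
```

For example, `mask_keep_ends("13812345678")` returns `13*******78`, and `round_date_to_month(date(2024, 5, 17))` returns `date(2024, 5, 1)`.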

Encrypted transmission

When you connect to MaxCompute by using the MaxCompute client, SDK, or API, the connection is encrypted over HTTPS with TLS 1.2. This prevents data from being intercepted or tampered with during transmission.

Data lifecycle management

MaxCompute allows you to specify a data retention period. After you specify a data retention period, MaxCompute automatically cleans up expired data to reduce data leak risks and storage costs. You can specify the lifecycle of a table based on your business requirements and data usage frequency. MaxCompute determines whether the interval between the latest update time (LastModifiedTime) of each table or partition and the current time exceeds the lifecycle. If the interval exceeds the lifecycle, MaxCompute performs the reclaim operation. For more information, see Lifecycle management operations.
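The expiration rule above can be sketched as a date comparison. The function name is hypothetical, and expressing the lifecycle in days is an assumption for the example.

```python
from datetime import datetime, timedelta


def is_expired(last_modified, lifecycle_days, now):
    """Return True if the time since LastModifiedTime exceeds the lifecycle."""
    return now - last_modified > timedelta(days=lifecycle_days)
```

Under this rule, any write that updates LastModifiedTime restarts the countdown for that table or partition.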

MaxCompute also supports tiered storage for cold and hot data. Infrequent Access (IA) storage and long-term storage can help you limit the access frequency of historical data and reduce storage costs. For more information, see Configure storage tiers for storage resources.

Integrity

MaxCompute provides data integrity protection measures during data processing.

ACID

MaxCompute supports atomicity, consistency, isolation, durability (ACID) for large-scale data processing jobs. Delta tables use the multiversion concurrency control (MVCC) model to ensure read/write snapshot isolation and optimistic concurrency control (OCC) to control transaction concurrency. Row-level or file-level transaction concurrency control is not supported. Instead, each batch data processing operation is managed as a separate transaction. The transaction conflict logic of some frequently performed operations is optimized based on the semantics of the operations to better support concurrency control while ensuring correctness. MaxCompute uses cyclic redundancy check (CRC) codes to verify data integrity during data storage and transmission. For more information, see ACID semantics.
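The CRC-based integrity check mentioned above can be illustrated with the standard-library CRC-32 function. The choice of CRC-32 and the function names are assumptions. The source does not state which CRC variant MaxCompute uses.

```python
import zlib


def checksum(payload):
    """Compute a CRC-32 checksum when data is written or sent."""
    return zlib.crc32(payload)


def verify_crc(payload, expected):
    """Recompute the checksum on read or receipt and compare it with the stored value."""
    return zlib.crc32(payload) == expected
```

A CRC detects accidental corruption, such as a flipped bit on disk or in transit. It is not a defense against deliberate tampering, which is what encryption and signatures address.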

Multi-replica data storage

MaxCompute uses a distributed file system to automatically create multiple replicas for stored data. By default, three replicas are created. The replicas are distributed across different physical machines and racks to prevent data loss caused by single points of failure (SPOFs) and ensure data durability and integrity.

MaxCompute stores data in the Apsara Distributed File System. The system provides a flat linear storage space and slices the linear address space into shards, each of which is called a chunk. Each chunk has three replicas, which are stored on different nodes in the cluster based on specific policies. This prevents data unavailability due to the failure of one chunk server or one rack. The operations of adding, modifying, and deleting data are synchronized to the three replicas to ensure data integrity and consistency. The file system reclaims the storage space that is released after data is deleted, prohibits other users from accessing the storage space, and erases data to ensure that the deleted data cannot be restored.
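One simple policy that satisfies the constraint above (replicas on different nodes and racks) is to pick the first node from each distinct rack. The sketch below is only an illustration of that idea. The actual placement policies of the Apsara Distributed File System are not described in this document, and the function name and input format are hypothetical.

```python
def place_replicas(nodes, replicas=3):
    """Choose `replicas` nodes so that no two chosen nodes share a rack.

    nodes: list of (node_name, rack_name) tuples in preference order.
    """
    chosen = []
    used_racks = set()
    for name, rack in nodes:
        if rack in used_racks:
            continue  # a replica already lives on this rack
        chosen.append(name)
        used_racks.add(rack)
        if len(chosen) == replicas:
            return chosen
    raise ValueError("not enough distinct racks for the requested replicas")
```

With this constraint, the loss of a single chunk server or a single rack removes at most one of the three replicas.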

Task fault tolerance

The task scheduling system of MaxCompute has high fault tolerance and provides the task retry mechanism. When an SQL job runs, the system builds a directed acyclic graph (DAG) to allocate appropriate computing resource nodes for the job and optimizes the execution process. This eliminates unnecessary shuffles and network jitters and prevents the overall job from being affected by partial node faults. This ensures the accuracy of the execution result and execution efficiency.

MaxCompute Tunnel supports resumable transmission for batch data uploads and downloads. MaxCompute Tunnel also allows you to specify the number of data rows that can contain errors and adjust the buffer size. This ensures data integrity for data processing.

Backup and restoration

By default, the backup and restoration feature is enabled for MaxCompute tables. This feature effectively prevents data loss caused by misoperations. MaxCompute automatically backs up historical versions of data each time data is modified or deleted, or a table is deleted. Historical versions of data are retained for up to 30 days. You can quickly restore data based on the data version number.

Delta tables support time travel queries. You can query data snapshots of a delta table at any time in the previous seven days. You can also query incremental data within a specified time range or version range.

For more information, see Backup and restoration.

Availability

MaxCompute provides security measures to ensure data availability.

Data sharing

MaxCompute provides the package-based access control mechanism for cross-project data sharing scenarios. This mechanism packages the data resources and related permissions of a project. After the package is installed in another project, that project can access the authorized data resources of the source project. For example, the administrator of Project A creates a package and authorizes Project B to install the package. After the administrator of Project B installs the package, the administrator of Project B decides whether to further grant permissions on the package to users in Project B. For more information, see Cross-project resource access based on packages.

Limits on data exchange

MaxCompute provides the project protection mechanism for scenarios where data can be imported but not exported. For example, users who have access permissions on multiple projects can transfer data between different projects. If data in a project is sensitive and cannot be exported to other projects, the administrator can use the project protection mechanism to protect the data. This mechanism allows data to flow into the project but not out of the project. For more information, see Project data protection.

Disaster recovery

MaxCompute supports zone-disaster recovery and cross-region disaster recovery to provide disaster recovery capabilities for customers.

Cross-zone high availability

Zone-disaster recovery of MaxCompute extends data storage and computing services to three zones in the same city to improve the resilience of your business. Zone-disaster recovery is suitable for industries such as finance and critical infrastructure. This feature ensures that services of business systems do not stop due to the failure of a single data center. This feature can provide data redundancy backup to reduce business downtime and meet industry compliance requirements for better customer experience on upper-layer applications.

This feature allows you to extend the availability of project-level data from a single zone to three zones in the same city. The three zones are physically isolated, but the network connection has low latency. This implements fault isolation and real-time data synchronization across data centers to ensure data integrity and availability if a disaster occurs.

Cross-region disaster recovery

Cross-region disaster recovery of MaxCompute allows you to select a region that is more than 1,000 kilometers away from the current region as a data disaster backup cluster to establish a complete geo-redundancy for the project-level data of customers and protect the data from regional natural disasters.

Monitoring and auditing

MaxCompute provides in-process monitoring and post-audit capabilities for data authorization and data usage.

Audit logging

MaxCompute provides detailed ActionTrail logs, which record information about operations related to jobs (instances), tables, users, roles, and permissions. ActionTrail logs can meet the compliance requirements for log retention and security management requirements such as real-time monitoring and backtracking analysis.

MaxCompute records all operations triggered by users. You can view and retrieve user behavior logs in ActionTrail logs and store the logs in Simple Log Service projects or specified OSS buckets for a long period of time. For more information, see Audit logging.

Information schema

MaxCompute allows you to query the project metadata and job running history. Based on the ANSI SQL-92 Information Schema, MaxCompute also allows you to analyze data in real time and periodically export the following information for security audit: user information, role information, table partition authorization information, historical job records of the previous 14 days, and batch upload and download task records. For more information, see Information schema.