The AnalyticDB for MySQL confidential engine powered by Apache Spark is one of the first products certified by the China Academy of Information and Communications Technology (CAICT) for its performance and security in trusted execution environments (TEEs). The engine encrypts sensitive data to prevent data breaches and is ideal for privacy-preserving computing. This topic describes the scenarios and benefits of the Apache Spark based confidential engine and compares the features of the Basic and High-performance editions.
Scenarios
The AnalyticDB for MySQL (Enterprise Edition, Basic Edition, and Data Lakehouse Edition) confidential engine powered by Apache Spark encrypts sensitive data to prevent data breaches and meet compliance requirements. Common scenarios include the following:
Secure data storage and computing: In untrusted environments, such as third-party platforms, the Apache Spark based confidential engine provides data protection for key data analytics applications, such as investment and financial analysis. This ensures that data is secure during storage and computation and reduces the risk of plaintext data breaches.
Sensitive data compliance: In untrusted environments, such as third-party platforms, the Apache Spark based confidential engine protects the sensitive end-user data that application services handle. For example, private data such as personally identifiable information (PII) and genetic data must meet end-to-end encryption compliance requirements when managed by a third party.
Secure data sharing: You can control key ownership to manage data usage rights and access frequency. This enables secure data sharing and prevents data breaches. The following figure shows this scenario.
Editions
The Apache Spark based confidential engine is available in two editions: Basic Edition and High-performance Edition. The differences are as follows:
Basic Edition: The Basic Edition of the Apache Spark based confidential engine transmits and stores sensitive data as ciphertext. Only key owners can decrypt the data, which prevents data breaches. You must use client tools for encryption and decryption to convert data between plaintext and ciphertext.
High-performance Edition (Recommended): Building on the data encryption capabilities of the Basic Edition, the High-performance Edition of the Apache Spark based confidential engine integrates Apache Gluten and Velox to provide vectorization. This ensures secure data transmission and storage while improving data processing efficiency.
The following table compares the Basic Edition and the High-performance Edition of the Apache Spark based confidential engine:
Edition | Confidential data format | Performance (compared to open source Apache Spark) | Compatibility (compared to open source Apache Spark) | Tool dependency | Key mechanism |
Basic Edition | EncBlocksSource format | 0.5 times | | Depends on client tools provided by Apache Spark to encrypt and decrypt data. | Supports two types of keys: master encryption key (MEK) and data key (DK). For more information, see Keys and Encryption. |
High-performance Edition | Parquet modular encryption format | 1.9 times | | No dependencies. You can use any tool that supports Parquet modular encryption to encrypt and decrypt data. | Supports three types of keys: master encryption key (MEK), key encryption key (KEK), and data key (DK). For more information, see Keys and Encryption. |
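The three-tier key mechanism listed for the High-performance Edition can be pictured as a chain of wrapped keys. The following is a minimal sketch of the general envelope encryption pattern, not the engine's implementation: it assumes the data key encrypts the data, the KEK wraps the data key, and the MEK wraps the KEK. The function names are hypothetical, and the cipher is a toy stand-in built on SHA-256, used only so that the example runs with the Python standard library. Real Parquet modular encryption uses AES-based ciphers.

```python
import hashlib
import secrets


def _xor_stream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). Illustration only, NOT AES."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def seal(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt with a fresh random nonce and prepend the nonce to the result."""
    nonce = secrets.token_bytes(12)
    return nonce + _xor_stream(key, nonce, plaintext)


def unseal(key: bytes, sealed: bytes) -> bytes:
    """Reverse seal(): split off the nonce, then decrypt the remainder."""
    nonce, body = sealed[:12], sealed[12:]
    return _xor_stream(key, nonce, body)


def encrypt_three_tier(mek: bytes, plaintext: bytes) -> dict:
    """Assumed hierarchy: MEK wraps KEK, KEK wraps the data key, data key encrypts data."""
    kek = secrets.token_bytes(16)
    data_key = secrets.token_bytes(16)
    return {
        "wrapped_kek": seal(mek, kek),
        "wrapped_data_key": seal(kek, data_key),
        "ciphertext": seal(data_key, plaintext),
    }


def decrypt_three_tier(mek: bytes, record: dict) -> bytes:
    """Only the MEK owner can unwrap the chain and recover the plaintext."""
    kek = unseal(mek, record["wrapped_kek"])
    data_key = unseal(kek, record["wrapped_data_key"])
    return unseal(data_key, record["ciphertext"])


mek = secrets.token_bytes(16)
record = encrypt_three_tier(mek, b"account=42;balance=1000")
print(decrypt_three_tier(mek, record))
```

Because the toy cipher has no authentication, decrypting with the wrong MEK returns garbage bytes instead of raising an error. An authenticated cipher would reject the wrong key outright.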
Benefits
Rich features and ease of use
Supports all standard SQL operators. Confidential computing applications can be used with simple configurations and do not require SQL modifications.
Usage is consistent with open source Apache Spark.
The High-performance Edition supports hybrid processing of data at different privacy levels, including joins between plaintext tables, between plaintext and ciphertext tables, and between ciphertext tables.
Computation results can be encrypted for output to enhance data security.
Data control
Key management supports Bring-Your-Own-Key (BYOK), giving you full control over your keys. The High-performance Edition introduces the Parquet modular encryption format, which lets you use your own keys to encrypt and decrypt data for complete data control. For more information, see Keys.
In the High-performance Edition, encryption keys are managed by the application. During computation, the keys are held by the InMemoryKMS class in the default application and are destroyed after the computation is complete.
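As a rough illustration of application-managed keys, the following sketch builds the configuration entries that open source Apache Spark documents for Parquet columnar encryption with the InMemoryKMS mock client. The property names and class names come from the open source Apache Spark and Apache Parquet documentation. Whether AnalyticDB for MySQL accepts exactly the same property names is an assumption, and `build_encryption_conf` and the key IDs `kf` and `kc` are hypothetical names used only for this example.

```python
# Class names documented for open source Parquet modular encryption.
CRYPTO_FACTORY = "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory"
IN_MEMORY_KMS = "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS"


def build_encryption_conf(master_keys):
    """Return Spark conf entries that pass application-held keys to InMemoryKMS.

    master_keys maps a key ID to a base64-encoded key.
    Assumption: the target engine reads the same properties as open source Spark.
    """
    if not master_keys:
        raise ValueError("at least one master key is required")
    key_list = ",".join(f"{key_id}:{b64}" for key_id, b64 in master_keys.items())
    return {
        "spark.hadoop.parquet.crypto.factory.class": CRYPTO_FACTORY,
        "spark.hadoop.parquet.encryption.kms.client.class": IN_MEMORY_KMS,
        "spark.hadoop.parquet.encryption.key.list": key_list,
    }


conf = build_encryption_conf({
    "kf": "AAECAwQFBgcICQoLDA0ODw==",  # sample 16-byte key, base64-encoded
    "kc": "AAECAAECAAECAAECAAECAA==",  # sample 16-byte key, base64-encoded
})
for name, value in conf.items():
    print(f"{name}={value}")
```

The sample key values are placeholders taken from public documentation examples. Do not use them to protect real data.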
High performance
The High-performance Edition of the Apache Spark based confidential engine delivers 1.9 times the performance of open source Apache Spark 3.2.0 and about 4 times the performance of the Basic Edition. The second figure follows from the edition comparison: 1.9 / 0.5 ≈ 3.8.
Encryption is flexible: you can encrypt individual columns in data files instead of entire files, which reduces unnecessary data I/O overhead.
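Column-level encryption can be sketched with the writer options that open source Apache Parquet defines for modular encryption, `parquet.encryption.footer.key` and `parquet.encryption.column.keys`. The helper `column_encryption_options`, the key IDs, and the column names are hypothetical. Whether the confidential engine expects these exact options is an assumption.

```python
def column_encryption_options(footer_key_id, columns_by_key):
    """Build Parquet writer options that encrypt only the listed columns.

    columns_by_key maps a master key ID to the columns that key protects.
    Columns that are not listed are written as plaintext.
    """
    groups = []
    for key_id, columns in columns_by_key.items():
        if not columns:
            raise ValueError(f"no columns listed for key {key_id!r}")
        # Open source Parquet format: "keyId:col1,col2;otherKeyId:col3"
        groups.append(f"{key_id}:{','.join(columns)}")
    return {
        "parquet.encryption.footer.key": footer_key_id,
        "parquet.encryption.column.keys": ";".join(groups),
    }


options = column_encryption_options("kf", {"kc": ["id_number", "phone"]})
print(options)
```

With open source PySpark, each entry could be passed to a DataFrame writer through `df.write.option(name, value)` before `.parquet(path)` is called.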
Keys and Encryption
Keys
Master encryption key (MEK)
Key encryption key (KEK)
Data encryption key (DEK)
Encryption and decryption
Basic Edition
High-performance Edition
Notes
When you use the BYOK key management method, you must keep your keys secure. If a key is lost, the data cannot be decrypted.
Different compute engines may process data with different precision. If you encounter problems when using the Apache Spark based confidential engine, submit a ticket.
References
Examples of how to use the Basic Edition of the Apache Spark based confidential engine
Examples of how to use the High-performance Edition of the Apache Spark based confidential engine