The High-performance Edition Spark confidential engine of AnalyticDB for MySQL provides all features of the Basic Edition Spark confidential engine, supports Parquet modular encryption, and is compatible with engines such as Apache Spark, Hadoop, and Hive. The engine ensures secure data transmission and storage and improves data processing efficiency. This topic describes how to use the High-performance Edition Spark confidential engine to encrypt data and run SQL queries on ciphertext tables.
Prerequisites
An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.
An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.
A job resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create a resource group.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you must create a privileged account. For more information, see the "Create a privileged account" section of the Create a database account topic.
If you use a Resource Access Management (RAM) user, you must create both a privileged account and a standard account and associate the standard account with the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user.
The AliyunADBFullAccess policy, the AliyunADBSparkProcessingDataRole role, and read and write permissions on AnalyticDB for MySQL databases and tables are granted to the RAM user. For more information, see Perform authorization.
Data preparations
The data file that you want to encrypt must be in the Parquet format. You can download the sample Spark confidential data.
Procedure
You can encrypt plaintext data in the AnalyticDB for MySQL console or by using an encryption tool. If the data that you want to encrypt is stored on an on-premises device, you can use an encryption tool to encrypt the data. If the data is stored in a cloud database, you can encrypt the data in the AnalyticDB for MySQL console. The following encryption methods involve different operations:
AnalyticDB for MySQL console: Upload plaintext data to OSS, and then encrypt the data.
Encryption tool: Encrypt data on the on-premises device, and then upload the ciphertext data to OSS.
Encrypt data and create a ciphertext table in the AnalyticDB for MySQL console
Upload the plaintext data in the "Data preparations" section of this topic to an OSS bucket. In this example, oss://testBucketName/adb/Spark/customer is used. For more information, see Simple upload.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose .
On the SQLConsole tab, select the Spark engine and a job resource group.
Create a ciphertext table.
Enable the Spark confidential engine, specify a list of master encryption keys (MEKs), and then create a database.
```sql
-- Enable native computing.
SET spark.adb.native.enabled=true;
-- Configure resources.
SET spark.driver.resourceSpec=medium;
SET spark.executor.instances=2;
SET spark.executor.resourceSpec=medium;
SET spark.app.name=Spark SQL Encryption Test;
-- Specify a list of MEKs, a Key Management Service (KMS) client class, and a CryptoFactory class.
-- After you enable the Spark confidential engine, the engine can process both plaintext and ciphertext data.
SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
-- Create a database.
CREATE DATABASE IF NOT EXISTS adb_external_db;
```

The following table describes the parameters.
Parameter
Description
spark.hadoop.parquet.encryption.key.list
A list of MEKs. Each MEK corresponds to an MEK ID. Separate multiple MEK entries with commas (,). Separate each MEK ID from its MEK with a colon (:). Format: <MEK ID 1>:<MEK 1 in Base64 encoding>,<MEK ID 2>:<MEK 2 in Base64 encoding>. For more information, see the "Keys" section of the Overview topic.
In this example, kf:MDEyMzQ1Njc4OTAxMjdy****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG**** is used.
Warning: You can use a common tool such as OpenSSL to generate a random MEK. An MEK is the root credential that is used to access encrypted data. If you lose the MEK, you can no longer access the encrypted data. Keep the MEK confidential.
spark.hadoop.parquet.encryption.kms.client.class
The name of the KMS client class. Set the value to io.glutenproject.encryption.InMemoryKMS.
spark.hadoop.parquet.crypto.factory.class
The name of the CryptoFactory class. Set the value to org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory.
Create an external table named customer to store plaintext data. The LOCATION parameter specifies the OSS path of the plaintext data. In this example, oss://testBucketName/adb/Spark/customer is used.

```sql
SET spark.adb.native.enabled=true;
SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
CREATE TABLE IF NOT EXISTS adb_external_db.customer (
    c_custkey long,
    c_name string,
    c_address string,
    c_nationkey long,
    c_phone string,
    c_acctbal decimal(12, 2),
    c_mktsegment string,
    c_comment string
)
USING parquet
LOCATION 'oss://testBucketName/adb/Spark/customer';
```

Note: If the adb_external_db database already contains a plaintext table, you can skip this step. If the plaintext data is stored in another cloud database, you must create the external table in the corresponding location. For information about how to create an external table, see CREATE EXTERNAL TABLE.
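The Warning above suggests generating MEKs with a common tool such as OpenSSL. As a minimal sketch (Python used for illustration; the helper names are hypothetical), the following shows how random Base64-encoded MEKs could be generated and assembled into the comma- and colon-separated format that spark.hadoop.parquet.encryption.key.list expects:

```python
import base64
import secrets

def generate_mek(num_bytes=16):
    """Generate a random MEK and Base64-encode it, similar to
    `openssl rand -base64 16`. 16 random bytes form a 128-bit key."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")

def build_key_list(meks):
    """Format {MEK ID: Base64 MEK} into the key.list value:
    <MEK ID 1>:<MEK 1>,<MEK ID 2>:<MEK 2>,..."""
    return ",".join(f"{key_id}:{mek}" for key_id, mek in meks.items())

# One footer key and two column keys, as in this topic's examples.
meks = {"kf": generate_mek(), "kc1": generate_mek(), "kc2": generate_mek()}
print(build_key_list(meks))
```

Keep the generated values confidential: as the Warning notes, losing an MEK makes the encrypted data permanently inaccessible.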
Create an external table named enc_customer to store ciphertext data. In this example, the enc_customer external table is stored in the oss://testBucketName/adb/Spark/enc_customer path.

```sql
SET spark.adb.native.enabled=true;
SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
CREATE TABLE IF NOT EXISTS adb_external_db.enc_customer
USING parquet
OPTIONS (
    'parquet.encryption.column.keys'='kc1:c_name;kc2:c_phone',
    'parquet.encryption.footer.key'='kf'
)
LOCATION 'oss://testBucketName/adb/Spark/enc_customer'
AS SELECT * FROM adb_external_db.customer;
```

The following table describes the parameters.
Parameter
Required
Description
parquet.encryption.column.keys
Yes
The MEKs and the columns that you want to encrypt. Each MEK can encrypt multiple columns. Separate an MEK ID and column names with a colon (:). Separate multiple column names with commas (,). Separate multiple MEKs with semicolons (;).
parquet.encryption.footer.key
Yes
The footer key that is used to encrypt the metadata of the Parquet file.
Note: The footer is a data structure at the end of a Parquet file. It stores the metadata of the file, such as the version number, row group metadata, column metadata, and key metadata.
Important: You must specify the parquet.encryption.column.keys and parquet.encryption.footer.key parameters at the same time. Otherwise, the file is not encrypted.
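The separator rules for parquet.encryption.column.keys (colons between an MEK ID and its columns, commas between columns, semicolons between MEK entries) can be illustrated with a small parsing sketch. This is for illustration only; the actual parsing is performed by the Parquet library, and the function name is hypothetical:

```python
def parse_column_keys(value):
    """Parse a parquet.encryption.column.keys value such as
    'kc1:c_name,c_address;kc2:c_phone' into {MEK ID: [column names]}."""
    mapping = {}
    for entry in value.split(";"):            # MEK entries are separated by semicolons
        key_id, columns = entry.split(":")    # MEK ID and columns by a colon
        mapping[key_id] = columns.split(",")  # column names by commas
    return mapping

print(parse_column_keys("kc1:c_name,c_address;kc2:c_phone"))
# → {'kc1': ['c_name', 'c_address'], 'kc2': ['c_phone']}
```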
(Optional) Delete the customer external table.

```sql
DROP TABLE IF EXISTS adb_external_db.customer;
```

Important: The DROP TABLE statement deletes only the customer external table. We recommend that you also delete the plaintext data files in the OSS bucket to prevent plaintext data leaks.
Create an external table named enc_customer_output and write the SQL computing results of the enc_customer table to the enc_customer_output external table. In this example, the enc_customer_output external table is stored in the oss://testBucketName/adb/Spark/enc_customer_output path.

```sql
SET spark.adb.native.enabled=true;
SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
CREATE TABLE IF NOT EXISTS adb_external_db.enc_customer_output
USING parquet
OPTIONS (
    'parquet.encryption.column.keys'='kc1:c_name;kc2:c_phone',
    'parquet.encryption.footer.key'='kf'
)
LOCATION 'oss://testBucketName/adb/Spark/enc_customer_output'
AS SELECT * FROM adb_external_db.enc_customer WHERE c_custkey < 15;
```

Decrypt the computing results.
Create an external table named customer_output and write the decrypted data of the enc_customer_output table to the customer_output external table. In this example, the customer_output external table is stored in the oss://testBucketName/adb/Spark/customer_output path.

```sql
SET spark.adb.native.enabled=true;
SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
CREATE TABLE IF NOT EXISTS adb_external_db.customer_output
USING parquet
LOCATION 'oss://testBucketName/adb/Spark/customer_output'
AS SELECT * FROM adb_external_db.enc_customer_output;
```

Query the data of the customer_output table.

```sql
SELECT * FROM adb_external_db.customer_output;
```
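As an optional sanity check, not part of the original procedure, you can verify whether a Parquet file written this way actually has an encrypted footer by inspecting its trailing magic bytes: per the Parquet format, files with a plaintext footer end with PAR1, while files whose footer is encrypted (the default when parquet.encryption.footer.key is set) end with PARE. A hypothetical check in Python:

```python
import os
import tempfile

def footer_magic(path):
    """Return the last 4 bytes of a file. For Parquet, b'PAR1' indicates a
    plaintext footer; b'PARE' indicates an encrypted footer."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        return f.read(4)

# Demo with a stand-in file (not a real Parquet file).
with tempfile.NamedTemporaryFile(delete=False, suffix=".parquet") as f:
    f.write(b"...dummy contents...PARE")
print(footer_magic(f.name))  # b'PARE'
```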
Use an encryption tool to encrypt data and create a ciphertext table
Use an encryption tool to convert an on-premises plaintext file into a ciphertext dataset. For information about the encryption tool, see the "Columnar Encryption" section of the Apache Spark documentation.
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.SparkConf

// Initialize a SparkSession and specify the encryption and decryption parameters.
val conf = new SparkConf()
  .set("spark.hadoop.parquet.encryption.kms.client.class", "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
  .set("spark.hadoop.parquet.encryption.key.list", "kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****")
  .set("spark.hadoop.parquet.crypto.factory.class", "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
val spark = SparkSession.builder().appName("SquareDataFrame").config(conf).getOrCreate()

// Read the plaintext file named customer.
val df = spark.read.parquet("customer")

// Encrypt the customer file to create a ciphertext file.
// Use kc1 to encrypt the c_name column and kf to encrypt the footer.
df.write
  .option("parquet.encryption.column.keys", "kc1:c_name")
  .option("parquet.encryption.footer.key", "kf")
  // The name and on-premises path of the ciphertext dataset.
  .parquet("enc_customer")
```

The following table describes the parameters.
Parameter
Required
Description
spark.hadoop.parquet.encryption.kms.client.class
Yes
The name of the KMS client class.
When you encrypt data on the on-premises device, set this parameter to org.apache.parquet.crypto.keytools.mocks.InMemoryKMS.
When you create a ciphertext table in the AnalyticDB for MySQL console, set this parameter to io.glutenproject.encryption.InMemoryKMS.
spark.hadoop.parquet.encryption.key.list
Yes
A list of MEKs. Each MEK corresponds to an MEK ID. Separate multiple MEK entries with commas (,). Separate each MEK ID from its MEK with a colon (:). Format: <MEK ID 1>:<MEK 1 in Base64 encoding>,<MEK ID 2>:<MEK 2 in Base64 encoding>. For more information, see the "Keys" section of the Overview topic.
In this example, kf:MDEyMzQ1Njc4OTAxMjdy****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG**** is used.
Warning: You can use a common tool such as OpenSSL to generate a random MEK. An MEK is the root credential that is used to access encrypted data. If you lose the MEK, you can no longer access the encrypted data. Keep the MEK confidential.
spark.hadoop.parquet.crypto.factory.class
Yes
The name of the CryptoFactory class. Set the value to org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory.
parquet.encryption.column.keys
Yes
The MEKs and the columns that you want to encrypt. Each MEK can encrypt multiple columns. Separate an MEK ID and column names with a colon (:). Separate multiple column names with commas (,). Separate multiple MEKs with semicolons (;).
parquet.encryption.footer.key
Yes
The footer key that is used to encrypt the metadata of the Parquet file.
Note: The footer is a data structure at the end of a Parquet file. It stores the metadata of the file, such as the version number, row group metadata, column metadata, and key metadata.
Upload the enc_customer.parquet ciphertext dataset to an OSS path. In this example, oss://testBucketName/adb/Spark/enc_customer.parquet is used. For more information, see Simple upload.
Create a ciphertext table.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose .
On the SQLConsole tab, select the Spark engine and a job resource group.
Create a ciphertext table.
Enable the Spark confidential engine and create a database.
```sql
-- Enable native computing.
SET spark.adb.native.enabled=true;
-- Configure resources.
SET spark.driver.resourceSpec=medium;
SET spark.executor.instances=2;
SET spark.executor.resourceSpec=medium;
-- Specify a list of MEKs, a KMS client class, and a CryptoFactory class.
-- After you enable the Spark confidential engine, the engine can process both plaintext and ciphertext data.
SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
-- Create a database.
CREATE DATABASE IF NOT EXISTS adb_external_db;
```

Create an external table named enc_customer. The LOCATION parameter specifies the OSS path of the enc_customer ciphertext dataset. In this example, oss://testBucketName/adb/Spark/enc_customer.parquet is used.

```sql
SET spark.adb.native.enabled=true;
SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
CREATE TABLE IF NOT EXISTS adb_external_db.enc_customer
USING parquet
LOCATION 'oss://testBucketName/adb/Spark/enc_customer';
```
Create an external table named enc_customer_output and write the SQL computing results of the enc_customer table to the enc_customer_output external table. In this example, the enc_customer_output external table is stored in the oss://testBucketName/adb/Spark/enc_customer_output path.

```sql
SET spark.adb.native.enabled=true;
SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
CREATE TABLE IF NOT EXISTS adb_external_db.enc_customer_output
USING parquet
OPTIONS (
    'parquet.encryption.column.keys'='kc1:c_name;kc2:c_phone',
    'parquet.encryption.footer.key'='kf'
)
LOCATION 'oss://testBucketName/adb/Spark/enc_customer_output'
AS SELECT * FROM adb_external_db.enc_customer WHERE c_custkey < 15;
```

Download and decrypt the ciphertext computing results.
Download the ciphertext computing results from the oss://testBucketName/adb/Spark/enc_customer_output path to the on-premises device. For more information, see Download objects.
Decrypt the ciphertext result dataset and store the decrypted data in the customer_output file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

// Decrypt the ciphertext result dataset.
val conf = new SparkConf()
  .set("spark.hadoop.parquet.encryption.kms.client.class", "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
  .set("spark.hadoop.parquet.encryption.key.list", "kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****")
  .set("spark.hadoop.parquet.crypto.factory.class", "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
val spark = SparkSession.builder().appName("SquareDataFrame").config(conf).getOrCreate()
val df2 = spark.read.parquet("enc_customer_output")

// Write the decrypted data to the customer_output path on the on-premises device.
df2.write.parquet("customer_output")
```