
AnalyticDB: Use the High-performance Edition Spark confidential engine

Last Updated: Oct 10, 2024

The High-performance Edition Spark confidential engine of AnalyticDB for MySQL provides all features of the Basic Edition Spark confidential engine, supports Parquet modular encryption, and is compatible with engines such as Apache Spark, Hadoop, and Hive. It ensures secure data transmission and storage while improving data processing efficiency. This topic describes how to use the High-performance Edition Spark confidential engine to encrypt data and perform SQL queries on ciphertext tables.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.

  • An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.

  • A job resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create a resource group.

  • A database account is created for the AnalyticDB for MySQL cluster.

  • The AliyunADBFullAccess policy, the AliyunADBSparkProcessingDataRole role, and read and write permissions on AnalyticDB for MySQL databases and tables are granted to the RAM user. For more information, see Perform authorization.

Data preparations

The data file that you want to encrypt must be in the Parquet format. You can download the sample Spark confidential data.
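
If you are not sure whether a local file is valid Parquet, you can check it with a local Spark session before you continue. The following Scala sketch is for illustration only and assumes that the sample data is stored in an on-premises directory named customer:

  import org.apache.spark.sql.SparkSession

  // Read the sample data with a local Spark session. The read succeeds only if
  // the data is valid Parquet, and printSchema shows the column names and types.
  val spark = SparkSession.builder().appName("ParquetCheck").master("local[*]").getOrCreate()
  spark.read.parquet("customer").printSchema()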

Procedure

You can encrypt plaintext data in the AnalyticDB for MySQL console or by using an encryption tool. If the data that you want to encrypt is stored on an on-premises device, use an encryption tool. If the data is stored in a cloud database, encrypt it in the AnalyticDB for MySQL console. The two methods involve different operations:

  • AnalyticDB for MySQL console: Upload plaintext data to OSS, and then encrypt the data.

  • Encryption tool: Encrypt data on the on-premises device, and then upload the ciphertext data to OSS.

Encrypt data and create a ciphertext table in the AnalyticDB for MySQL console

  1. Upload the plaintext data in the "Data preparations" section of this topic to an OSS bucket. In this example, oss://testBucketName/adb/Spark/customer is used. For more information, see Simple upload.

  2. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

  3. In the left-side navigation pane, choose Job Development > SQL Development.

  4. On the SQLConsole tab, select the Spark engine and a job resource group.

  5. Create a ciphertext table.

    1. Enable the Spark confidential engine, specify a list of master encryption keys (MEKs), and then create a database.

      -- Enable native computing.
      SET spark.adb.native.enabled=true;
      -- Configure resources.
      SET spark.driver.resourceSpec=medium;
      SET spark.executor.instances=2;
      SET spark.executor.resourceSpec=medium;
      SET spark.app.name=Spark SQL Encryption Test;
      -- Specify a list of MEKs, a Key Management Service (KMS) client class, and a CryptoFactory class. After you enable the Spark confidential engine, it can process both plaintext and ciphertext data.
      SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
      SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
      SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
      -- Create a database.
      CREATE database IF NOT EXISTS adb_external_db;

      The following list describes the parameters:

      • spark.hadoop.parquet.encryption.key.list: a list of MEKs. Each MEK corresponds to an MEK ID. Separate each MEK ID from its MEK with a colon (:) and separate multiple MEK entries with commas (,). Format: <MEK ID 1>:<MEK 1 in Base64 encoding>,<MEK ID 2>:<MEK 2 in Base64 encoding>. For more information, see the "Keys" section of the Overview topic. In this example, kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG**** is used.

        Warning: You can use a common tool such as OpenSSL to generate a random MEK (a sketch follows this list). An MEK is the root credential that can be used to access encrypted data. If you lose an MEK, you can no longer access the existing data. Keep MEKs confidential.

      • spark.hadoop.parquet.encryption.kms.client.class: the name of the KMS client class. Set the value to io.glutenproject.encryption.InMemoryKMS.

      • spark.hadoop.parquet.crypto.factory.class: the name of the CryptoFactory class. Set the value to org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory.
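
      As an alternative to OpenSSL, the following Scala sketch (an illustration, not part of the product) generates a random 128-bit MEK in Base64 encoding, which matches the <MEK ID>:<MEK in Base64 encoding> format:

      import java.security.SecureRandom
      import java.util.Base64

      // Generate a random 128-bit (16-byte) MEK and Base64-encode it. Parquet
      // modular encryption accepts 16-, 24-, or 32-byte master keys.
      val mek = new Array[Byte](16)
      new SecureRandom().nextBytes(mek)
      println(s"kf:${Base64.getEncoder.encodeToString(mek)}")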

    2. Create an external table named customer to store plaintext data. The LOCATION parameter specifies the OSS path of the plaintext data. In this example, oss://testBucketName/adb/Spark/customer is used.

      SET spark.adb.native.enabled=true;
      SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
      SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
      SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
      CREATE TABLE IF NOT EXISTS adb_external_db.customer 
       (
          c_custkey long,
          c_name       string,
          c_address    string,
          c_nationkey long,
          c_phone      string,
          c_acctbal    decimal(12, 2),
          c_mktsegment string,
          c_comment    string
      )
      USING parquet 
      LOCATION 'oss://testBucketName/adb/Spark/customer';
      Note
      • If the adb_external_db database already contains a plaintext table, you can skip this step.

      • If the plaintext data is stored in another cloud database, you must create an external table in the corresponding location. For information about how to create an external table, see CREATE EXTERNAL TABLE.

    3. Create an external table named enc_customer to store ciphertext data. In this example, the enc_customer external table is stored in the oss://testBucketName/adb/Spark/enc_customer path.

      SET spark.adb.native.enabled=true;
      SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
      SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
      SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
      CREATE TABLE IF NOT EXISTS adb_external_db.enc_customer
      USING Parquet
      OPTIONS (
       'parquet.encryption.column.keys'='kc1:c_name;kc2:c_phone',
       'parquet.encryption.footer.key'='kf'
      )
      LOCATION 'oss://testBucketName/adb/Spark/enc_customer'
      AS
      SELECT *
      FROM adb_external_db.customer;

      The following list describes the parameters. Both parameters are required.

      • parquet.encryption.column.keys: the MEKs and the columns that you want to encrypt. Each MEK can encrypt multiple columns. Separate an MEK ID from its column names with a colon (:), separate multiple column names with commas (,), and separate multiple MEK entries with semicolons (;). For example, kc1:c_name;kc2:c_phone encrypts the c_name column with kc1 and the c_phone column with kc2.

      • parquet.encryption.footer.key: the footer key that is used to encrypt the metadata of the Parquet file.

        Note: The footer is a data structure at the end of a Parquet file. It stores the metadata of the file, such as the version number, row group metadata, column metadata, and key metadata.

      Important: You must specify the parquet.encryption.column.keys and parquet.encryption.footer.key parameters at the same time. Otherwise, the file is not encrypted. The sketch below shows how to verify that encryption took effect.
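
      To verify that both parameters took effect, you can attempt to read the ciphertext files without a key list. The following local Scala sketch is illustrative and assumes that the ciphertext dataset was downloaded to an on-premises directory named enc_customer. Because the footer is encrypted, the read is expected to fail with a Parquet crypto error:

      import org.apache.spark.sql.SparkSession

      // No key list is configured in this session, so the encrypted footer
      // cannot be parsed and the read should fail with a Parquet crypto exception.
      val spark = SparkSession.builder().appName("NoKeyRead").master("local[*]").getOrCreate()
      try {
        spark.read.parquet("enc_customer").show()
      } catch {
        case e: Exception => println(s"Read failed as expected: ${e.getMessage}")
      }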

    4. (Optional) Delete the customer external table.

      DROP TABLE IF EXISTS adb_external_db.customer;
      Important

      The DROP TABLE statement deletes only the metadata of the customer external table. We recommend that you also delete the plaintext data from the OSS bucket to prevent plaintext data leaks.

  6. Create an external table named enc_customer_output and write the SQL computing results of the enc_customer table to the enc_customer_output external table. In this example, the enc_customer_output external table is stored in the oss://testBucketName/adb/Spark/enc_customer_output path.

    SET spark.adb.native.enabled=true;
    SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
    SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
    SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
    CREATE TABLE IF NOT EXISTS adb_external_db.enc_customer_output
    USING Parquet
    OPTIONS (
     'parquet.encryption.column.keys'='kc1:c_name;kc2:c_phone',
     'parquet.encryption.footer.key'='kf'
    )
    LOCATION 'oss://testBucketName/adb/Spark/enc_customer_output'
    AS
    SELECT *
    FROM adb_external_db.enc_customer
    WHERE c_custkey < 15;
  7. Decrypt the computing results.

    1. Create an external table named customer_output and write the decrypted data of the enc_customer_output table to the customer_output external table. In this example, the customer_output external table is stored in the oss://testBucketName/adb/Spark/customer_output path.

      SET spark.adb.native.enabled=true;
      SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
      SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
      SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
      CREATE TABLE IF NOT EXISTS adb_external_db.customer_output
      USING Parquet
      LOCATION 'oss://testBucketName/adb/Spark/customer_output'
      AS
      SELECT *
      FROM adb_external_db.enc_customer_output;
    2. Query the data of the customer_output table.

      SELECT * FROM adb_external_db.customer_output;

Use an encryption tool to encrypt data and create a ciphertext table

  1. Use an encryption tool to convert an on-premises plaintext file into a ciphertext dataset. For information about the encryption tool, see the "Columnar Encryption" section of the Apache Spark documentation.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.SparkConf
    
    // Initialize a SparkSession and specify the encryption and decryption parameters.
    val conf = new SparkConf()
      .set("spark.hadoop.parquet.encryption.kms.client.class", "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
      .set("spark.hadoop.parquet.encryption.key.list", "kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****")
      .set("spark.hadoop.parquet.crypto.factory.class", "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
    
    val spark = SparkSession.builder().appName("SquareDataFrame").config(conf).getOrCreate()
    
    // Read the plaintext file named customer.
    val df = spark.read.parquet("customer")
    // Encrypt the customer data. Use kc1 to encrypt the c_name column and kf to encrypt the footer.
    df.write
      .option("parquet.encryption.column.keys", "kc1:c_name")
      .option("parquet.encryption.footer.key", "kf")
      // Write the ciphertext dataset to the on-premises path enc_customer.
      .parquet("enc_customer")

    The following list describes the parameters. All of the parameters are required.

    • spark.hadoop.parquet.encryption.kms.client.class: the name of the KMS client class. When you encrypt data on an on-premises device, set the value to org.apache.parquet.crypto.keytools.mocks.InMemoryKMS. When you create a ciphertext table in the AnalyticDB for MySQL console, set the value to io.glutenproject.encryption.InMemoryKMS.

    • spark.hadoop.parquet.encryption.key.list: a list of MEKs. Each MEK corresponds to an MEK ID. Separate each MEK ID from its MEK with a colon (:) and separate multiple MEK entries with commas (,). Format: <MEK ID 1>:<MEK 1 in Base64 encoding>,<MEK ID 2>:<MEK 2 in Base64 encoding>. For more information, see the "Keys" section of the Overview topic. In this example, kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG**** is used.

      Warning: You can use a common tool such as OpenSSL to generate a random MEK. An MEK is the root credential that can be used to access encrypted data. If you lose an MEK, you can no longer access the existing data. Keep MEKs confidential.

    • spark.hadoop.parquet.crypto.factory.class: the name of the CryptoFactory class. Set the value to org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory.

    • parquet.encryption.column.keys: the MEKs and the columns that you want to encrypt. Each MEK can encrypt multiple columns. Separate an MEK ID from its column names with a colon (:), separate multiple column names with commas (,), and separate multiple MEK entries with semicolons (;).

    • parquet.encryption.footer.key: the footer key that is used to encrypt the metadata of the Parquet file.

      Note: The footer is a data structure at the end of a Parquet file. It stores the metadata of the file, such as the version number, row group metadata, column metadata, and key metadata.
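
    Before you upload the ciphertext dataset, you can optionally verify it from the same Spark session, because the key list is already configured there. The following lines are a continuation of the preceding snippet and serve as a sanity check, not a required step:

    // Read the ciphertext dataset back with the same SparkSession. The data is
    // decrypted transparently, so the row count should match the plaintext input.
    val roundTrip = spark.read.parquet("enc_customer")
    assert(roundTrip.count() == df.count())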

  2. Upload the enc_customer ciphertext dataset to an OSS path. In this example, oss://testBucketName/adb/Spark/enc_customer is used. For more information, see Simple upload.

  3. Create a ciphertext table.

    1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

    2. In the left-side navigation pane, choose Job Development > SQL Development.

    3. On the SQLConsole tab, select the Spark engine and a job resource group.

    4. Create a ciphertext table.

      1. Enable the Spark confidential engine and create a database.

        -- Enable native computing.
        SET spark.adb.native.enabled=true;
        -- Configure resources.
        SET spark.driver.resourceSpec=medium;
        SET spark.executor.instances=2;
        SET spark.executor.resourceSpec=medium;
        -- Specify a list of MEKs, a KMS client class, and a CryptoFactory class. After you enable the Spark confidential engine, it can process both plaintext and ciphertext data.
        SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
        SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
        SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
        -- Create a database.
        CREATE DATABASE IF NOT EXISTS adb_external_db;
      2. Create an external table named enc_customer. The LOCATION parameter specifies the OSS path of the enc_customer ciphertext dataset. In this example, oss://testBucketName/adb/Spark/enc_customer is used.

        SET spark.adb.native.enabled=true;
        SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
        SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
        SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
        CREATE TABLE IF NOT EXISTS adb_external_db.enc_customer 
        USING parquet
        LOCATION 'oss://testBucketName/adb/Spark/enc_customer';
  4. Create an external table named enc_customer_output and write the SQL computing results of the enc_customer table to the enc_customer_output external table. In this example, the enc_customer_output external table is stored in the oss://testBucketName/adb/Spark/enc_customer_output path.

    SET spark.adb.native.enabled=true;
    SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
    SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
    SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
    CREATE TABLE IF NOT EXISTS adb_external_db.enc_customer_output
    USING Parquet
    OPTIONS (
     'parquet.encryption.column.keys'='kc1:c_name;kc2:c_phone',
     'parquet.encryption.footer.key'='kf'
    )
    LOCATION 'oss://testBucketName/adb/Spark/enc_customer_output'
    AS
    SELECT *
    FROM adb_external_db.enc_customer
    WHERE c_custkey < 15;
  5. Download and decrypt the ciphertext computing results.

    1. Download the ciphertext computing results from the oss://testBucketName/adb/Spark/enc_customer_output path to the on-premises device. For more information, see Download objects.

    2. Decrypt the ciphertext result dataset and store the decrypted data in the customer_output file.

      // Decrypt the ciphertext result dataset.
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.SparkConf
      
      // Initialize a SparkSession with the same key list that was used for encryption.
      val conf = new SparkConf()
        .set("spark.hadoop.parquet.encryption.kms.client.class", "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
        .set("spark.hadoop.parquet.encryption.key.list", "kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****")
        .set("spark.hadoop.parquet.crypto.factory.class", "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
      val spark = SparkSession.builder().appName("SquareDataFrame").config(conf).getOrCreate()
      // Read the ciphertext result dataset. The data is decrypted transparently.
      val df2 = spark.read.parquet("enc_customer_output")
      // Write the decrypted data to the on-premises customer_output directory.
      df2.write
        .parquet("customer_output")