
AnalyticDB: Use the High-performance Edition Spark confidential engine

Last Updated: Oct 10, 2024

The High-performance Edition Spark confidential engine of AnalyticDB for MySQL provides all features of the Basic Edition Spark confidential engine, supports Parquet modular encryption, and is compatible with engines such as Apache Spark, Hadoop, and Hive. It ensures secure data transmission and storage while improving data processing efficiency. This topic describes how to use the High-performance Edition Spark confidential engine to encrypt data and perform SQL queries on ciphertext tables.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.

  • An Object Storage Service (OSS) bucket is created in the same region as the AnalyticDB for MySQL cluster.

  • A job resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create a resource group.

  • A database account is created for the AnalyticDB for MySQL cluster.

  • The AliyunADBFullAccess policy, the AliyunADBSparkProcessingDataRole role, and read and write permissions on AnalyticDB for MySQL databases and tables are granted to the RAM user. For more information, see Perform authorization.

Data preparations

The data file that you want to encrypt must be in the Parquet format. You can download the sample Spark confidential data.
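
If you are not sure whether a local file is valid Parquet, you can check it with a local Spark session before you continue. The following Scala sketch is for illustration only and assumes that the sample data is stored in an on-premises directory named customer:

  import org.apache.spark.sql.SparkSession

  // Read the sample data with a local Spark session. The read succeeds only if
  // the data is valid Parquet, and printSchema shows the column names and types.
  val spark = SparkSession.builder().appName("ParquetCheck").master("local[*]").getOrCreate()
  spark.read.parquet("customer").printSchema()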

Procedure

You can encrypt plaintext data in the AnalyticDB for MySQL console or by using an encryption tool. If the data that you want to encrypt is stored on an on-premises device, use an encryption tool. If the data is stored in a cloud database, encrypt it in the AnalyticDB for MySQL console. The two methods involve different operations:

  • AnalyticDB for MySQL console: Upload plaintext data to OSS, and then encrypt the data.

  • Encryption tool: Encrypt data on the on-premises device, and then upload the ciphertext data to OSS.

Encrypt data and create a ciphertext table in the AnalyticDB for MySQL console

  1. Upload the plaintext data in the "Data preparations" section of this topic to an OSS bucket. In this example, oss://testBucketName/adb/Spark/customer is used. For more information, see Simple upload.

  2. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

  3. In the left-side navigation pane, choose Job Development > SQL Development.

  4. On the SQLConsole tab, select the Spark engine and a job resource group.

  5. Create a ciphertext table.

    1. Enable the Spark confidential engine, specify a list of master encryption keys (MEKs), and then create a database.

      -- Enable native computing.
      SET spark.adb.native.enabled=true;
      -- Configure resources.
      SET spark.driver.resourceSpec=medium;
      SET spark.executor.instances=2;
      SET spark.executor.resourceSpec=medium;
      SET spark.app.name=Spark SQL Encryption Test;
      -- Specify a list of MEKs, a Key Management Service (KMS) client class, and a CryptoFactory class. After you enable the Spark confidential engine, it can process both plaintext and ciphertext data.
      SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
      SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
      SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
      -- Create a database.
      CREATE database IF NOT EXISTS adb_external_db;

      The following list describes the parameters:

      • spark.hadoop.parquet.encryption.key.list: a list of MEKs. Each MEK corresponds to an MEK ID. Separate each MEK ID from its MEK with a colon (:) and separate multiple MEK entries with commas (,). Format: <MEK ID 1>:<MEK 1 in Base64 encoding>,<MEK ID 2>:<MEK 2 in Base64 encoding>. For more information, see the "Keys" section of the Overview topic. In this example, kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG**** is used.

        Warning: You can use a common tool such as OpenSSL to generate a random MEK (a sketch follows this list). An MEK is the root credential that can be used to access encrypted data. If you lose an MEK, you can no longer access the existing data. Keep MEKs confidential.

      • spark.hadoop.parquet.encryption.kms.client.class: the name of the KMS client class. Set the value to io.glutenproject.encryption.InMemoryKMS.

      • spark.hadoop.parquet.crypto.factory.class: the name of the CryptoFactory class. Set the value to org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory.
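
      As an alternative to OpenSSL, the following Scala sketch (an illustration, not part of the product) generates a random 128-bit MEK in Base64 encoding, which matches the <MEK ID>:<MEK in Base64 encoding> format:

      import java.security.SecureRandom
      import java.util.Base64

      // Generate a random 128-bit (16-byte) MEK and Base64-encode it. Parquet
      // modular encryption accepts 16-, 24-, or 32-byte master keys.
      val mek = new Array[Byte](16)
      new SecureRandom().nextBytes(mek)
      println(s"kf:${Base64.getEncoder.encodeToString(mek)}")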

    2. Create an external table named customer to store plaintext data. The LOCATION parameter specifies the OSS path of the plaintext data. In this example, oss://testBucketName/adb/Spark/customer is used.

      SET spark.adb.native.enabled=true;
      SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
      SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
      SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
      CREATE TABLE IF NOT EXISTS adb_external_db.customer 
       (
          c_custkey long,
          c_name       string,
          c_address    string,
          c_nationkey long,
          c_phone      string,
          c_acctbal    decimal(12, 2),
          c_mktsegment string,
          c_comment    string
      )
      USING parquet 
      LOCATION 'oss://testBucketName/adb/Spark/customer';
      Note
      • If the adb_external_db database already contains a plaintext table, you can skip this step.

      • If the plaintext data is stored in another cloud database, you must create an external table in the corresponding location. For information about how to create an external table, see CREATE EXTERNAL TABLE.

    3. Create an external table named enc_customer to store ciphertext data. In this example, the enc_customer external table is stored in the oss://testBucketName/adb/Spark/enc_customer path.

      SET spark.adb.native.enabled=true;
      SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
      SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
      SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
      CREATE TABLE IF NOT EXISTS adb_external_db.enc_customer
      USING Parquet
      OPTIONS (
       'parquet.encryption.column.keys'='kc1:c_name;kc2:c_phone',
       'parquet.encryption.footer.key'='kf'
      )
      LOCATION 'oss://testBucketName/adb/Spark/enc_customer'
      AS
      SELECT *
      FROM adb_external_db.customer;

      The following list describes the parameters. Both parameters are required.

      • parquet.encryption.column.keys: the MEKs and the columns that you want to encrypt. Each MEK can encrypt multiple columns. Separate an MEK ID from its column names with a colon (:), separate multiple column names with commas (,), and separate multiple MEK entries with semicolons (;). For example, kc1:c_name;kc2:c_phone encrypts the c_name column with kc1 and the c_phone column with kc2.

      • parquet.encryption.footer.key: the footer key that is used to encrypt the metadata of the Parquet file.

        Note: The footer is a data structure at the end of a Parquet file. It stores the metadata of the file, such as the version number, row group metadata, column metadata, and key metadata.

      Important: You must specify the parquet.encryption.column.keys and parquet.encryption.footer.key parameters at the same time. Otherwise, the file is not encrypted. The sketch below shows how to verify that encryption took effect.
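
      To verify that both parameters took effect, you can attempt to read the ciphertext files without a key list. The following local Scala sketch is illustrative and assumes that the ciphertext dataset was downloaded to an on-premises directory named enc_customer. Because the footer is encrypted, the read is expected to fail with a Parquet crypto error:

      import org.apache.spark.sql.SparkSession

      // No key list is configured in this session, so the encrypted footer
      // cannot be parsed and the read should fail with a Parquet crypto exception.
      val spark = SparkSession.builder().appName("NoKeyRead").master("local[*]").getOrCreate()
      try {
        spark.read.parquet("enc_customer").show()
      } catch {
        case e: Exception => println(s"Read failed as expected: ${e.getMessage}")
      }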

    4. (Optional) Delete the customer external table.

      DROP TABLE IF EXISTS adb_external_db.customer;
      Important

      The DROP TABLE statement deletes only the metadata of the customer external table. We recommend that you also delete the plaintext data from the OSS bucket to prevent plaintext data leaks.

  6. Create an external table named enc_customer_output and write the SQL computing results of the enc_customer table to the enc_customer_output external table. In this example, the enc_customer_output external table is stored in the oss://testBucketName/adb/Spark/enc_customer_output path.

    SET spark.adb.native.enabled=true;
    SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
    SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
    SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
    CREATE TABLE IF NOT EXISTS adb_external_db.enc_customer_output
    USING Parquet
    OPTIONS (
     'parquet.encryption.column.keys'='kc1:c_name;kc2:c_phone',
     'parquet.encryption.footer.key'='kf'
    )
    LOCATION 'oss://testBucketName/adb/Spark/enc_customer_output'
    AS
    SELECT *
    FROM adb_external_db.enc_customer
    WHERE c_custkey < 15;
  7. Decrypt the computing results.

    1. Create an external table named customer_output and write the decrypted data of the enc_customer_output table to the customer_output external table. In this example, the customer_output external table is stored in the oss://testBucketName/adb/Spark/customer_output path.

      SET spark.adb.native.enabled=true;
      SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
      SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
      SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
      CREATE TABLE IF NOT EXISTS adb_external_db.customer_output
      USING Parquet
      LOCATION 'oss://testBucketName/adb/Spark/customer_output'
      AS
      SELECT *
      FROM adb_external_db.enc_customer_output;
    2. Query the data of the customer_output table.

      SELECT * FROM adb_external_db.customer_output;

Use an encryption tool to encrypt data and create a ciphertext table

  1. Use an encryption tool to convert an on-premises plaintext file into a ciphertext dataset. For information about the encryption tool, see the "Columnar Encryption" section of the Apache Spark documentation.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.SparkConf
    
    // Initialize a SparkSession and specify the encryption and decryption parameters.
    val conf = new SparkConf()
      .set("spark.hadoop.parquet.encryption.kms.client.class", "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
      .set("spark.hadoop.parquet.encryption.key.list", "kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****")
      .set("spark.hadoop.parquet.crypto.factory.class", "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
    
    val spark = SparkSession.builder().appName("SquareDataFrame").config(conf).getOrCreate()
    
    // Read the plaintext file named customer.
    val df = spark.read.parquet("customer")
    // Encrypt the customer data. Use kc1 to encrypt the c_name column and kf to encrypt the footer.
    df.write
      .option("parquet.encryption.column.keys", "kc1:c_name")
      .option("parquet.encryption.footer.key", "kf")
      // Write the ciphertext dataset to the on-premises path enc_customer.
      .parquet("enc_customer")

    The following list describes the parameters. All of the parameters are required.

    • spark.hadoop.parquet.encryption.kms.client.class: the name of the KMS client class. When you encrypt data on an on-premises device, set the value to org.apache.parquet.crypto.keytools.mocks.InMemoryKMS. When you create a ciphertext table in the AnalyticDB for MySQL console, set the value to io.glutenproject.encryption.InMemoryKMS.

    • spark.hadoop.parquet.encryption.key.list: a list of MEKs. Each MEK corresponds to an MEK ID. Separate each MEK ID from its MEK with a colon (:) and separate multiple MEK entries with commas (,). Format: <MEK ID 1>:<MEK 1 in Base64 encoding>,<MEK ID 2>:<MEK 2 in Base64 encoding>. For more information, see the "Keys" section of the Overview topic. In this example, kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG**** is used.

      Warning: You can use a common tool such as OpenSSL to generate a random MEK. An MEK is the root credential that can be used to access encrypted data. If you lose an MEK, you can no longer access the existing data. Keep MEKs confidential.

    • spark.hadoop.parquet.crypto.factory.class: the name of the CryptoFactory class. Set the value to org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory.

    • parquet.encryption.column.keys: the MEKs and the columns that you want to encrypt. Each MEK can encrypt multiple columns. Separate an MEK ID from its column names with a colon (:), separate multiple column names with commas (,), and separate multiple MEK entries with semicolons (;).

    • parquet.encryption.footer.key: the footer key that is used to encrypt the metadata of the Parquet file.

      Note: The footer is a data structure at the end of a Parquet file. It stores the metadata of the file, such as the version number, row group metadata, column metadata, and key metadata.
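
    Before you upload the ciphertext dataset, you can optionally verify it from the same Spark session, because the key list is already configured there. The following lines are a continuation of the preceding snippet and serve as a sanity check, not a required step:

    // Read the ciphertext dataset back with the same SparkSession. The data is
    // decrypted transparently, so the row count should match the plaintext input.
    val roundTrip = spark.read.parquet("enc_customer")
    assert(roundTrip.count() == df.count())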

  2. Upload the enc_customer ciphertext dataset to an OSS path. In this example, oss://testBucketName/adb/Spark/enc_customer is used. For more information, see Simple upload.

  3. Create a ciphertext table.

    1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.

    2. In the left-side navigation pane, choose Job Development > SQL Development.

    3. On the SQLConsole tab, select the Spark engine and a job resource group.

    4. Create a ciphertext table.

      1. Enable the Spark confidential engine and create a database.

        -- Enable native computing.
        SET spark.adb.native.enabled=true;
        -- Configure resources.
        SET spark.driver.resourceSpec=medium;
        SET spark.executor.instances=2;
        SET spark.executor.resourceSpec=medium;
        -- Specify a list of MEKs, a KMS client class, and a CryptoFactory class. After you enable the Spark confidential engine, it can process both plaintext and ciphertext data.
        SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
        SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
        SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
        -- Create a database.
        CREATE DATABASE IF NOT EXISTS adb_external_db;
      2. Create an external table named enc_customer. The LOCATION parameter specifies the OSS path of the enc_customer ciphertext dataset. In this example, oss://testBucketName/adb/Spark/enc_customer is used.

        SET spark.adb.native.enabled=true;
        SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
        SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
        SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
        CREATE TABLE IF NOT EXISTS adb_external_db.enc_customer 
        USING parquet
        LOCATION 'oss://testBucketName/adb/Spark/enc_customer';
  4. Create an external table named enc_customer_output and write the SQL computing results of the enc_customer table to the enc_customer_output external table. In this example, the enc_customer_output external table is stored in the oss://testBucketName/adb/Spark/enc_customer_output path.

    SET spark.adb.native.enabled=true;
    SET spark.hadoop.parquet.encryption.key.list=kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****;
    SET spark.hadoop.parquet.encryption.kms.client.class=io.glutenproject.encryption.InMemoryKMS;
    SET spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory;
    CREATE TABLE IF NOT EXISTS adb_external_db.enc_customer_output
    USING Parquet
    OPTIONS (
     'parquet.encryption.column.keys'='kc1:c_name;kc2:c_phone',
     'parquet.encryption.footer.key'='kf'
    )
    LOCATION 'oss://testBucketName/adb/Spark/enc_customer_output'
    AS
    SELECT *
    FROM adb_external_db.enc_customer
    WHERE c_custkey < 15;
  5. Download and decrypt the ciphertext computing results.

    1. Download the ciphertext computing results from the oss://testBucketName/adb/Spark/enc_customer_output path to the on-premises device. For more information, see Download objects.

    2. Decrypt the ciphertext result dataset and store the decrypted data in the customer_output file.

      // Decrypt the ciphertext result dataset.
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.SparkConf
      
      // Initialize a SparkSession with the same key list that was used for encryption.
      val conf = new SparkConf()
        .set("spark.hadoop.parquet.encryption.kms.client.class", "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
        .set("spark.hadoop.parquet.encryption.key.list", "kf:MDEyMzQ1Njc4OTAxMjM0****,kc1:bvCDwqcOJGSdZSEMLjfk****,kc2:kflI/sq+uf50Qhl1MmtG****")
        .set("spark.hadoop.parquet.crypto.factory.class", "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
      val spark = SparkSession.builder().appName("SquareDataFrame").config(conf).getOrCreate()
      // Read the ciphertext result dataset. The data is decrypted transparently.
      val df2 = spark.read.parquet("enc_customer_output")
      // Write the decrypted data to the on-premises customer_output directory.
      df2.write
        .parquet("customer_output")