This topic describes how to use AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark SQL to read data from and write data to MaxCompute.
Background information
MaxCompute is a fast and fully-managed data warehousing service that can process terabytes or petabytes of data. You can use AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark SQL to read data from and write data to MaxCompute.
Prerequisites
An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created. For more information, see Create a cluster.
A database account is created.
If you use an Alibaba Cloud account, you must create a privileged database account. For more information, see Create a database account.
If you use a Resource Access Management (RAM) user, you must create both a privileged database account and a standard database account and associate the standard database account with the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user.
A job resource group is created. For more information, see Create a resource group.
A MaxCompute project is created in the same region as the Data Lakehouse Edition (V3.0) cluster. For more information, see Create a MaxCompute project. In this topic, a project named spark_on_maxcompute is created in the China (Hangzhou) region.
Procedure
You can choose to read data from and write data to MaxCompute in batch or interactive mode. For information about the differences between the two modes, see Spark SQL execution modes.
Use the batch mode to access MaxCompute
Step 1: Configure a Spark job
Log on to the AnalyticDB for MySQL console.
In the upper-left corner of the page, select the region where the cluster resides.
In the left-side navigation pane, click Clusters.
On the Data Lakehouse Edition (V3.0) tab, find the cluster and click the cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development.
On the SQLConsole tab, select the Spark engine and the job resource group.
On the SQLConsole tab, enter the following configurations in the SET Key=Value; format.
SET spark.sql.catalog.odps = org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog;
SET spark.hadoop.odps.access.id = LTAI5tMnzDS5EyqqMsT****;
SET spark.hadoop.odps.access.key = A2kHFzEgFidOKrDKqAbJIPt8****;
SET spark.hadoop.odps.endpoint = http://service.cn-hangzhou.maxcompute.aliyun-inc.com/api;
SET spark.hadoop.odps.project = spark_on_maxcompute;
SET spark.adb.connectors = odps;
spark.sql.catalog.<catalog_name>: The method that Spark SQL uses to configure the data source. Set the value to org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog. Note: You can specify a custom value for catalog_name. In this example, odps is used.
spark.hadoop.odps.access.id: The AccessKey ID of the Alibaba Cloud account or RAM user that is used to access MaxCompute. For information about how to obtain your AccessKey ID, see Obtain an AccessKey pair.
spark.hadoop.odps.access.key: The AccessKey secret of the Alibaba Cloud account or RAM user that is used to access MaxCompute. For information about how to obtain your AccessKey secret, see Obtain an AccessKey pair.
spark.hadoop.odps.endpoint: The Virtual Private Cloud (VPC) endpoint of the MaxCompute project. For information about the VPC endpoint of each region, see Endpoints.
spark.hadoop.odps.project: The name of the MaxCompute project.
spark.adb.connectors: The data source of the Spark job. Set the value to odps.
In the message that appears, click Proceed.
Step 2: Use Spark SQL to read data from and write data to MaxCompute
Enter the following statement and click Execute (F8). In the message that appears, click Proceed to create a MaxCompute table.
CREATE TABLE odps.spark_on_maxcompute.spark_test(k1 int, k2 string) partitioned by (part int);
Important: In this example, the odps.spark_on_maxcompute.spark_test table name consists of three parts:
The first part is the catalog_name value that is specified in the spark.sql.catalog.<catalog_name> parameter. In this example, odps is used.
The second part is the name of the MaxCompute project.
The third part is the name of the MaxCompute table.
You can also specify the table name in the <maxcompute_project_name>.<table_name> format. In this case, before you execute the statement to create the table, you must execute the USE <catalog_name>; statement. If you also execute the USE <maxcompute_project_name>; statement, you only need to specify the table name when you create the table.
Enter the following statement and click Execute (F8). In the message that appears, click Proceed to write data to the MaxCompute table.
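For example, the three-part name can be shortened by setting the current catalog and project first. The following is a sketch that uses the catalog and project names from this topic; the spark_test2 table name is illustrative:

```sql
-- Set the current catalog to the catalog_name configured earlier.
USE odps;
-- Optionally, also set the current MaxCompute project.
USE spark_on_maxcompute;
-- The table can now be referenced by its table name alone.
CREATE TABLE spark_test2(k1 int, k2 string) partitioned by (part int);
```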
INSERT INTO odps.spark_on_maxcompute.spark_test values (1, 'aaa', 101), (2, 'bbb', 102);
Enter the following statement and click Execute (F8). In the message that appears, click Proceed to query data in the MaxCompute table.
SELECT * FROM odps.spark_on_maxcompute.spark_test;
Note: In the DataWorks console, you can find the MaxCompute project and check whether a MaxCompute table is created and has data written to it.
Use the interactive mode to access MaxCompute
Step 1: Configure a Spark job
Log on to the AnalyticDB for MySQL console.
In the upper-left corner of the page, select the region where the cluster resides.
In the left-side navigation pane, click Clusters.
On the Data Lakehouse Edition (V3.0) tab, find the cluster and click the cluster ID.
In the left-side navigation pane, choose Clusters > Resource Groups.
Find the job resource group that you want to manage and click Advanced Settings in the Actions column.
Click the icon next to Other Settings and configure the parameters that are described in the following table.
spark.sql.catalog.<catalog_name>: The method that Spark SQL uses to configure the data source. Set the value to org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog. Note: You can specify a custom value for catalog_name. In this example, odps is used.
spark.hadoop.odps.access.id: The AccessKey ID of the Alibaba Cloud account or RAM user that is used to access MaxCompute. For information about how to obtain your AccessKey ID, see Obtain an AccessKey pair.
spark.hadoop.odps.access.key: The AccessKey secret of the Alibaba Cloud account or RAM user that is used to access MaxCompute. For information about how to obtain your AccessKey secret, see Obtain an AccessKey pair.
spark.hadoop.odps.endpoint: The Virtual Private Cloud (VPC) endpoint of the MaxCompute project. For information about the VPC endpoint of each region, see Endpoints.
spark.hadoop.odps.project: The name of the MaxCompute project.
spark.adb.connectors: The data source of the Spark job. Set the value to odps.
Click Start. When Running is displayed on the Advanced Settings (Spark Thrift Server Settings) page, the settings take effect.
Step 2: Use Spark SQL to read data from and write data to MaxCompute
In the left-side navigation pane, choose Job Development > SQL Development.
On the SQLConsole tab, select the Spark engine and the job resource group.
Enter the following statement and click Execute (F8) to create a MaxCompute table.
CREATE TABLE odps.spark_on_maxcompute.spark_test(k1 int, k2 string) partitioned by (part int);
Important: In this example, the odps.spark_on_maxcompute.spark_test table name consists of three parts:
The first part is the catalog_name value that is specified in the spark.sql.catalog.<catalog_name> parameter. In this example, odps is used.
The second part is the name of the MaxCompute project.
The third part is the name of the MaxCompute table.
You can also specify the table name in the <maxcompute_project_name>.<table_name> format. In this case, before you execute the statement to create the table, you must execute the USE <catalog_name>; statement. If you also execute the USE <maxcompute_project_name>; statement, you only need to specify the table name when you create the table.
Enter the following statement and click Execute (F8) to write data to the MaxCompute table.
INSERT INTO odps.spark_on_maxcompute.spark_test values (1, 'aaa', 101), (2, 'bbb', 102);
Enter the following statement and click Execute (F8) to query data in the MaxCompute table.
SELECT * FROM odps.spark_on_maxcompute.spark_test;
Note: In the DataWorks console, you can find the MaxCompute project and check whether a MaxCompute table is created and has data written to it.
Supported SQL statements and data types
Supported SQL statements
SHOW DATABASES
SHOW TABLES
CREATE TABLE
DROP TABLE
SELECT
INSERT INTO
INSERT INTO PARTITION
The INSERT OVERWRITE PARTITION statement is not supported.
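As a sketch of the INSERT INTO PARTITION syntax listed above, the following statement writes a row into a specific partition of the example table from this topic. The partition value 103 is illustrative:

```sql
-- The partition column (part) is specified in the PARTITION clause
-- instead of in the VALUES list.
INSERT INTO odps.spark_on_maxcompute.spark_test PARTITION (part = 103) VALUES (3, 'ccc');
```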
Supported field types
BOOLEAN
DOUBLE
INT
BIGINT
STRING
DECIMAL
DATE
TIMESTAMP
FLOAT
SMALLINT
TINYINT
BINARY
ARRAY
MAP
STRUCT
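The following sketch combines several of the supported field types in one table definition. The type_demo table name and its columns are illustrative, not part of this topic's example:

```sql
CREATE TABLE odps.spark_on_maxcompute.type_demo(
  c_bool boolean,
  c_int int,
  c_big bigint,
  c_str string,
  c_dec decimal(10, 2),
  c_arr array<string>,
  c_map map<string, int>,
  c_struct struct<x: int, y: double>
);
```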
Supported partition field types
INT
BIGINT
STRING
TINYINT
SMALLINT