This topic describes how to use AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark SQL to read data from and write data to MaxCompute.
Background information
MaxCompute is a fast and fully-managed data warehousing service that can process terabytes or petabytes of data. You can use AnalyticDB for MySQL Data Lakehouse Edition (V3.0) Spark SQL to read data from and write data to MaxCompute.
Prerequisites
An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created. For more information, see Create a cluster.
A database account is created.
If you use an Alibaba Cloud account, you must create a privileged database account. For more information, see Create a database account.
If you use a Resource Access Management (RAM) user, you must create both a privileged database account and a standard database account and associate the standard database account with the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user.
A job resource group is created. For more information, see Create a resource group.
A MaxCompute project is created in the same region as the Data Lakehouse Edition (V3.0) cluster. For more information, see Create a MaxCompute project. In this topic, a project named spark_on_maxcompute is created in the China (Hangzhou) region.
Procedure
You can choose to read data from and write data to MaxCompute in batch or interactive mode. For information about the differences between the two modes, see Spark SQL execution modes.
Use the batch mode to access MaxCompute
Step 1: Configure a Spark job
Log on to the AnalyticDB for MySQL console.
In the upper-left corner of the page, select the region where the cluster resides.
In the left-side navigation pane, click Clusters.
On the Data Lakehouse Edition (V3.0) tab, find the cluster and click the cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development.
On the SQLConsole tab, select the Spark engine and the job resource group.
On the SQLConsole tab, enter the following configurations in the SET Key=Value; format.
SET spark.sql.catalog.odps = org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog;
SET spark.hadoop.odps.access.id = LTAI5tMnzDS5EyqqMsT****;
SET spark.hadoop.odps.access.key = A2kHFzEgFidOKrDKqAbJIPt8****;
SET spark.hadoop.odps.endpoint = http://service.cn-hangzhou.maxcompute.aliyun-inc.com/api;
SET spark.hadoop.odps.project = spark_on_maxcompute;
SET spark.adb.connectors = odps;
spark.sql.catalog.<catalog_name>: The method that Spark SQL uses to configure the data source. Set the value to org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog. Note: You can specify a custom value for catalog_name. In this example, odps is used.
spark.hadoop.odps.access.id: The AccessKey ID of the Alibaba Cloud account or RAM user that is used to access MaxCompute. For information about how to obtain your AccessKey ID, see Obtain an AccessKey pair.
spark.hadoop.odps.access.key: The AccessKey secret of the Alibaba Cloud account or RAM user that is used to access MaxCompute. For information about how to obtain your AccessKey secret, see Obtain an AccessKey pair.
spark.hadoop.odps.endpoint: The Virtual Private Cloud (VPC) endpoint of the MaxCompute project. For information about the VPC endpoint of each region, see Endpoints.
spark.hadoop.odps.project: The name of the MaxCompute project.
spark.adb.connectors: The data source of the Spark job. Set the value to odps.
In the message that appears, click Proceed.
Step 2: Use Spark SQL to read data from and write data to MaxCompute
Enter the following statement and click Execute (F8). In the message that appears, click Proceed to create a MaxCompute table.
CREATE TABLE odps.spark_on_maxcompute.spark_test(k1 int, k2 string) partitioned by (part int);
Important: In this example, the odps.spark_on_maxcompute.spark_test table name consists of three parts:
The first part is the catalog_name value that is specified in the spark.sql.catalog.<catalog_name> parameter. In this example, odps is used.
The second part is the name of the MaxCompute project.
The third part is the name of the MaxCompute table.
You can also specify the table name in the <maxcompute_project_name>.<table_name> format. In this case, before you execute the statement to create the table, you must execute the USE <catalog_name>; statement. If you also execute the USE <maxcompute_project_name>; statement, you only need to specify the table name when you create the table.
Enter the following statement and click Execute (F8). In the message that appears, click Proceed to write data to the MaxCompute table.
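For example, the three-part name can be shortened by setting the current catalog and project first. The following is a sketch that uses the catalog and project names from this topic; the spark_test2 table name is illustrative:

```sql
-- Set the current catalog to the catalog_name configured earlier.
USE odps;
-- Optionally, also set the current MaxCompute project.
USE spark_on_maxcompute;
-- The table can now be referenced by its table name alone.
CREATE TABLE spark_test2(k1 int, k2 string) partitioned by (part int);
```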
INSERT INTO odps.spark_on_maxcompute.spark_test values (1, 'aaa', 101), (2, 'bbb', 102);
Enter the following statement and click Execute (F8). In the message that appears, click Proceed to query data in the MaxCompute table.
SELECT * FROM odps.spark_on_maxcompute.spark_test;
Note: In the DataWorks console, you can find the MaxCompute project and check whether a MaxCompute table is created and has data written to it.
Use the interactive mode to access MaxCompute
Step 1: Configure a Spark job
Log on to the AnalyticDB for MySQL console.
In the upper-left corner of the page, select the region where the cluster resides.
In the left-side navigation pane, click Clusters.
On the Data Lakehouse Edition (V3.0) tab, find the cluster and click the cluster ID.
In the left-side navigation pane, choose Clusters > Resource Groups.
Find the job resource group that you want to manage and click Advanced Settings in the Actions column.
Click the icon next to Other Settings and configure the parameters that are described in the following table.
spark.sql.catalog.<catalog_name>: The method that Spark SQL uses to configure the data source. Set the value to org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog. Note: You can specify a custom value for catalog_name. In this example, odps is used.
spark.hadoop.odps.access.id: The AccessKey ID of the Alibaba Cloud account or RAM user that is used to access MaxCompute. For information about how to obtain your AccessKey ID, see Obtain an AccessKey pair.
spark.hadoop.odps.access.key: The AccessKey secret of the Alibaba Cloud account or RAM user that is used to access MaxCompute. For information about how to obtain your AccessKey secret, see Obtain an AccessKey pair.
spark.hadoop.odps.endpoint: The Virtual Private Cloud (VPC) endpoint of the MaxCompute project. For information about the VPC endpoint of each region, see Endpoints.
spark.hadoop.odps.project: The name of the MaxCompute project.
spark.adb.connectors: The data source of the Spark job. Set the value to odps.
Click Start. When Running is displayed on the Advanced Settings (Spark Thrift Server Settings) page, the settings take effect.
Step 2: Use Spark SQL to read data from and write data to MaxCompute
In the left-side navigation pane, choose Job Development > SQL Development.
On the SQLConsole tab, select the Spark engine and the job resource group.
Enter the following statement and click Execute (F8) to create a MaxCompute table.
CREATE TABLE odps.spark_on_maxcompute.spark_test(k1 int, k2 string) partitioned by (part int);
Important: In this example, the odps.spark_on_maxcompute.spark_test table name consists of three parts:
The first part is the catalog_name value that is specified in the spark.sql.catalog.<catalog_name> parameter. In this example, odps is used.
The second part is the name of the MaxCompute project.
The third part is the name of the MaxCompute table.
You can also specify the table name in the <maxcompute_project_name>.<table_name> format. In this case, before you execute the statement to create the table, you must execute the USE <catalog_name>; statement. If you also execute the USE <maxcompute_project_name>; statement, you only need to specify the table name when you create the table.
Enter the following statement and click Execute (F8) to write data to the MaxCompute table.
INSERT INTO odps.spark_on_maxcompute.spark_test values (1, 'aaa', 101), (2, 'bbb', 102);
Enter the following statement and click Execute (F8) to query data in the MaxCompute table.
SELECT * FROM odps.spark_on_maxcompute.spark_test;
Note: In the DataWorks console, you can find the MaxCompute project and check whether a MaxCompute table is created and has data written to it.
Supported SQL statements and data types
Supported SQL statements
SHOW DATABASES
SHOW TABLES
CREATE TABLE
DROP TABLE
SELECT
INSERT INTO
INSERT INTO PARTITION
The INSERT OVERWRITE PARTITION statement is not supported.
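As a sketch of the INSERT INTO PARTITION syntax listed above, the following statement writes a row into a specific partition of the example table from this topic. The partition value 103 is illustrative:

```sql
-- The partition column (part) is specified in the PARTITION clause
-- instead of in the VALUES list.
INSERT INTO odps.spark_on_maxcompute.spark_test PARTITION (part = 103) VALUES (3, 'ccc');
```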
Supported field types
BOOLEAN
DOUBLE
INT
BIGINT
STRING
DECIMAL
DATE
TIMESTAMP
FLOAT
SMALLINT
TINYINT
BINARY
ARRAY
MAP
STRUCT
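The following sketch combines several of the supported field types in one table definition. The type_demo table name and its columns are illustrative, not part of this topic's example:

```sql
CREATE TABLE odps.spark_on_maxcompute.type_demo(
  c_bool boolean,
  c_int int,
  c_big bigint,
  c_str string,
  c_dec decimal(10, 2),
  c_arr array<string>,
  c_map map<string, int>,
  c_struct struct<x: int, y: double>
);
```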
Supported partition field types
INT
BIGINT
STRING
TINYINT
SMALLINT