This topic describes how to use Spark SQL to read data from and write data to MaxCompute in AnalyticDB for MySQL.
Prerequisites
An AnalyticDB for MySQL Data Lakehouse Edition cluster is created.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you only need to create a privileged account. For more information, see the "Create a privileged account" section of the Create a database account topic.
If you use a Resource Access Management (RAM) user, you must create a privileged account and a standard account and associate the standard account with the RAM user. For more information, see Create a database account and Associate or disassociate a database account with or from a RAM user.
A job resource group is created for the AnalyticDB for MySQL cluster. For more information, see Create a resource group.
MaxCompute is activated in the same region as the AnalyticDB for MySQL cluster. For more information, see Activate MaxCompute and DataWorks.
A MaxCompute project is created. For more information, see Create a MaxCompute project.
Procedure
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development. On the SQLConsole tab, select the Spark engine and a job resource group.
Enter the following SQL statements:
SET spark.adb.version=3.3;
SET spark.adb.connectors=odps;
SET spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog;
SET spark.sql.extensions=org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions;
SET spark.hadoop.odps.project.name=spark_on_maxcompute;
SET spark.hadoop.odps.end.point=https://service.cn-hangzhou-vpc.maxcompute.aliyun-inc.com/api;
SET spark.hadoop.odps.access.id=LTAI5tMnzDS5EyqqMsT****;
SET spark.hadoop.odps.access.key=A2kHFzEgFidOKrDKqAbJIPt8****;
SET spark.sql.catalog.odps.tableWriteProvider=tunnel;
SET spark.sql.catalog.odps.tableReadProvider=tunnel;

CREATE TABLE odps.spark_on_maxcompute.spark_test(k1 int, k2 string) PARTITIONED BY (part int);

INSERT INTO odps.spark_on_maxcompute.spark_test PARTITION(part=1) VALUES(1, 'aaa');

SELECT * FROM odps.spark_on_maxcompute.spark_test WHERE part=1;
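In these statements, MaxCompute tables are referenced in the <catalog_name>.<project_name>.<table_name> format: the catalog name (odps) comes from the spark.sql.catalog.<catalog_name> parameter, and the project name (spark_on_maxcompute) comes from the spark.hadoop.odps.project.name parameter. For example, the following variation of the sample query (not part of the original procedure) selects specific columns from the test table:

SELECT k1, k2 FROM odps.spark_on_maxcompute.spark_test WHERE part=1;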
The following table describes the parameters in the SQL statements.
Parameter
Description

spark.adb.version
The Spark version. Only AnalyticDB for MySQL clusters of V3.3 and V3.5 allow you to read data from and write data to MaxCompute. Set this parameter to 3.3 or 3.5.

spark.adb.connectors
The name of the data source. When you read data from or write data to MaxCompute, set this parameter to odps.

spark.sql.catalog.<catalog_name>
The method that is used by Spark SQL to configure data sources. You can specify a custom value for catalog_name. When you read data from or write data to MaxCompute, set this parameter to org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog. Example: spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog.

spark.sql.extensions
The SQL extension module of Spark. When you read data from or write data to MaxCompute, set this parameter to org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions.

spark.hadoop.odps.project.name
The name of the MaxCompute project. In this example, spark_on_maxcompute is used.
Note: This parameter specifies the name of the MaxCompute project, not the name of the workspace. To view the name of the MaxCompute project, log on to the MaxCompute console, select a region in the top navigation bar, and then choose Workspace > Projects in the left-side navigation pane.

spark.hadoop.odps.end.point
The Virtual Private Cloud (VPC) endpoint of the MaxCompute project. For information about the VPC endpoint of each region, see the "Endpoints in different regions (VPC)" section of the Endpoints topic.

spark.hadoop.odps.access.id
The AccessKey ID of the Alibaba Cloud account or Resource Access Management (RAM) user that is used to access MaxCompute. For information about how to obtain the AccessKey ID, see Accounts and permissions.

spark.hadoop.odps.access.key
The AccessKey secret of the Alibaba Cloud account or RAM user that is used to access MaxCompute. For information about how to obtain the AccessKey secret, see Accounts and permissions.

spark.sql.catalog.odps.tableWriteProvider
The type of the API operation that is used to write data to MaxCompute. When you write data to MaxCompute, set this parameter to tunnel, which specifies that data is written to MaxCompute in tunnel mode.
Note: If you only want to read data from MaxCompute, you do not need to specify this parameter.

spark.sql.catalog.odps.tableReadProvider
The type of the API operation that is used to read data from MaxCompute. When you read data from MaxCompute, set this parameter to tunnel, which specifies that data is read from MaxCompute in tunnel mode.
Note: If you only want to write data to MaxCompute, you do not need to specify this parameter.
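For example, if you only need to read data from MaxCompute, a minimal set of statements might look like the following sketch. It reuses the sample project and table from this topic, omits spark.sql.catalog.odps.tableWriteProvider because no data is written, and uses placeholder AccessKey values that you must replace with your own credentials:

SET spark.adb.version=3.3;
SET spark.adb.connectors=odps;
SET spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog;
SET spark.sql.extensions=org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions;
SET spark.hadoop.odps.project.name=spark_on_maxcompute;
SET spark.hadoop.odps.end.point=https://service.cn-hangzhou-vpc.maxcompute.aliyun-inc.com/api;
SET spark.hadoop.odps.access.id=<your_accesskey_id>;
SET spark.hadoop.odps.access.key=<your_accesskey_secret>;
SET spark.sql.catalog.odps.tableReadProvider=tunnel;

SELECT * FROM odps.spark_on_maxcompute.spark_test WHERE part=1;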
Click Execute.
After you run the Spark job, you can view the table data in the logs. On the Applications tab of the Spark JAR Development page, click Logs in the Actions column. For more information, see the "View information about a Spark application" section of the Spark editor topic.
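If you also want to confirm the written data from the MaxCompute side, you can query the table directly in the MaxCompute project, for example in the MaxCompute SQL editor. This is an optional verification step that is not part of the original procedure:

-- Run in the spark_on_maxcompute MaxCompute project:
SELECT * FROM spark_test WHERE part=1;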