This topic describes how to use Spark SQL in AnalyticDB for MySQL to read and write data in MaxCompute.
Prerequisites
An AnalyticDB for MySQL Enterprise Edition, Basic Edition, or Data Lakehouse Edition cluster is created.
A database account is created for the AnalyticDB for MySQL cluster.
If you use an Alibaba Cloud account, you only need to create a privileged account.
If you use a Resource Access Management (RAM) user, you must create both a privileged account and a standard account, and associate the standard account with the RAM user.
A job resource group is created for the AnalyticDB for MySQL cluster.
MaxCompute and AnalyticDB for MySQL are deployed in the same region. For more information, see Activate MaxCompute.
A MaxCompute project is created. For more information, see Create a MaxCompute project.
Procedure
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development.
In the SQLConsole window, select the Spark engine and a job resource group.
Write a Spark SQL job.
SET spark.adb.version=3.3;
SET spark.adb.connectors=odps;
SET spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog;
SET spark.sql.extensions=org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions;
SET spark.hadoop.odps.project.name=spark_on_maxcompute;
SET spark.hadoop.odps.end.point=https://service.cn-hangzhou-vpc.maxcompute.aliyun-inc.com/api;
SET spark.hadoop.odps.access.id=yourAccessKeyID;
SET spark.hadoop.odps.access.key=yourAccessKeySecret;
SET spark.sql.catalog.odps.tableWriteProvider=tunnel;
SET spark.sql.catalog.odps.tableReadProvider=tunnel;

-- In this example, the project name is "spark_on_maxcompute". Make sure that this name is the same as the value of the spark.hadoop.odps.project.name parameter.
CREATE TABLE odps.spark_on_maxcompute.spark_test(k1 INT, k2 STRING) PARTITIONED BY (part INT);

INSERT INTO odps.spark_on_maxcompute.spark_test PARTITION(part=1) VALUES(1, 'aaa');

SELECT * FROM odps.spark_on_maxcompute.spark_test WHERE part=1;

Parameters:
spark.adb.version
The Spark version. Only versions 3.3 and 3.5 support access to MaxCompute data sources. You can select either version.

spark.adb.connectors
The name of the data source. To read and write MaxCompute data, set this parameter to odps.

spark.sql.catalog.<catalog_name>
The method used to configure a data source in Spark SQL. In the parameter name, catalog_name is user-defined. To read and write MaxCompute data, set the value of this parameter to org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog. For example: spark.sql.catalog.odps=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog.

spark.sql.extensions
The Spark SQL extension. To read and write MaxCompute data, set this parameter to org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions.

spark.hadoop.odps.project.name
The name of the MaxCompute project. In this example, the value is spark_on_maxcompute.
Note: This is the name of the MaxCompute project, not the workspace. To view the project name, log on to the MaxCompute console, select the region in the upper-left corner, and then choose Manage Configurations > Projects in the left-side navigation pane.

spark.hadoop.odps.end.point
The Virtual Private Cloud (VPC) endpoint of the region where the MaxCompute project resides. For example, the VPC endpoint of the China (Hangzhou) region is https://service.cn-hangzhou-vpc.maxcompute.aliyun-inc.com/api.

spark.hadoop.odps.access.id
The AccessKey ID of an Alibaba Cloud account or a Resource Access Management (RAM) user that has access permissions on the MaxCompute project.

spark.hadoop.odps.access.key
The AccessKey secret of the Alibaba Cloud account or RAM user.

spark.sql.catalog.odps.tableWriteProvider
The type of interface used to write data. To write data to MaxCompute, set this parameter to tunnel, which writes data to MaxCompute over Tunnel.
Note: If you only need to read data from MaxCompute, you do not need to configure this parameter.

spark.sql.catalog.odps.tableReadProvider
The type of interface used to read data. To read data from MaxCompute, set this parameter to tunnel, which reads data from MaxCompute over Tunnel.
Note: If you only need to write data to MaxCompute, you do not need to configure this parameter.
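For example, a read-only job can register the MaxCompute project under a user-defined catalog name and omit the tableWriteProvider parameter. The following is a minimal sketch; the catalog name mc is arbitrary and chosen for illustration, and the project name, endpoint, and table are reused from the example above:

-- Read-only sketch: the catalog name "mc" is user-defined; replace the
-- project name, endpoint, and AccessKey values with your own.
SET spark.adb.version=3.5;
SET spark.adb.connectors=odps;
SET spark.sql.catalog.mc=org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog;
SET spark.sql.extensions=org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions;
SET spark.hadoop.odps.project.name=spark_on_maxcompute;
SET spark.hadoop.odps.end.point=https://service.cn-hangzhou-vpc.maxcompute.aliyun-inc.com/api;
SET spark.hadoop.odps.access.id=yourAccessKeyID;
SET spark.hadoop.odps.access.key=yourAccessKeySecret;
-- tableWriteProvider is omitted because this job only reads data.
SET spark.sql.catalog.mc.tableReadProvider=tunnel;

-- Tables are referenced as <catalog_name>.<project_name>.<table_name>.
SELECT * FROM mc.spark_on_maxcompute.spark_test WHERE part=1;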
Click Execute.
After the Spark job runs successfully, you can view the table data in the logs on the Applications tab of the Spark JAR Development page. For more information, see Spark development editor.
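In this example, the SELECT statement returns the single row inserted by the preceding INSERT statement, similar to the following:

k1    k2    part
1     aaa   1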