
AnalyticDB for MySQL: Access OSS

Last Updated: Jan 08, 2024

AnalyticDB for MySQL Spark allows you to access Object Storage Service (OSS) data within an Alibaba Cloud account or across Alibaba Cloud accounts. This topic describes how to access OSS data in both scenarios.

Prerequisites

  • An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created in the same region as an OSS bucket.

  • A job resource group is created in the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster. For more information, see Create a resource group.

  • A database account is created for the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster.

  • Authorization is complete. For more information, see Perform authorization for Alibaba Cloud accounts.

    Important

    To access OSS data within an Alibaba Cloud account, you must have the AliyunADBSparkProcessingDataRole permission. To access OSS data across Alibaba Cloud accounts, you must perform authorization for other Alibaba Cloud accounts.

Step 1: Prepare data

  1. Prepare a text file and upload it to the OSS bucket. In this example, the file is named readme.txt and contains the following content. For more information, see Upload objects. A programmatic upload sketch follows this list.

    AnalyticDB for MySQL
    Database service
  2. Compile Python code and upload the code file to the OSS bucket. In this example, the Python code file is named example.py. The code reads the readme.txt file, counts its lines, and displays the first line.

    import sys
    from pyspark.sql import SparkSession
    
    # Initialize the Spark session for the application.
    spark = SparkSession.builder.appName('OSS Example').getOrCreate()
    # Read the text file whose OSS path is passed in by the args parameter (sys.argv[1]).
    textFile = spark.sparkContext.textFile(sys.argv[1])
    # Count and display the number of lines in the text file.
    print("File total lines: " + str(textFile.count()))
    # Display the first line of the text file.
    print("First line is: " + textFile.first())
    
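If you prefer to upload the files programmatically instead of in the OSS console, the following is a minimal sketch that uses the OSS Python SDK (oss2). The endpoint, AccessKey pair, and local file paths are placeholders for this example; replace them with your own values.

    # Minimal upload sketch based on the OSS Python SDK (oss2).
    # The AccessKey pair, endpoint, and local file paths below are
    # placeholders for this example; replace them with your own values.
    import oss2
    
    auth = oss2.Auth('<yourAccessKeyId>', '<yourAccessKeySecret>')
    # Use the OSS endpoint of the region that hosts testBucketName.
    bucket = oss2.Bucket(auth, 'https://oss-cn-hangzhou.aliyuncs.com', 'testBucketName')
    
    # Upload the text file and the Spark application file to the data/ directory.
    bucket.put_object_from_file('data/readme.txt', 'readme.txt')
    bucket.put_object_from_file('data/example.py', 'example.py')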

Step 2: Access OSS data

  1. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the cluster ID.

  2. In the left-side navigation pane, choose Job Development > Spark JAR Development.

  3. In the upper part of the editor, select the job resource group and a Spark application type. In this example, the Batch type is selected.

  4. Run the following Spark code in the editor to display the total number of lines and the content of the first line of the text file. A quick way to sanity-check the JSON configuration is shown after these steps.

    Access OSS data within an Alibaba Cloud account

    {
      "args": ["oss://testBucketName/data/readme.txt"],
      "name": "spark-oss-test",
      "file": "oss://testBucketName/data/example.py",
      "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.executor.resourceSpec": "small",
        "spark.executor.instances": 1
      }
    }

    Access OSS data across Alibaba Cloud accounts

    {
      "args": ["oss://testBucketName/data/readme.txt"],
      "name": "CrossAccount",
      "file": "oss://testBucketName/data/example.py",
      "conf": {
        "spark.adb.roleArn": "acs:ram::<testAccountID>:role/<testUserName>",
        "spark.driver.resourceSpec": "c.medium",
        "spark.executor.instances": 1,
        "spark.executor.resourceSpec": "c.medium"
      }
    }

    The following list describes the parameters.

    • args: The arguments that are passed to the Spark application. Separate multiple arguments with commas (,). In this example, the OSS path of the text file is passed as the first argument (sys.argv[1]) and assigned to textFile.

    • name: The name of the Spark application.

    • file: The path of the main file of the Spark application. The main file can be a JAR package that contains the entry point or an executable file that serves as the entry point for the Python application.

      Important: You must store the main files of Spark applications in OSS.

    • spark.adb.roleArn: The RAM role that is used to access an external data source across Alibaba Cloud accounts. Separate multiple roles with commas (,). Specify the parameter in the acs:ram::<testAccountID>:role/<testUserName> format.

      Note
      • <testAccountID>: the ID of the Alibaba Cloud account that owns the external data source.
      • <testUserName>: the name of the RAM role that is created when you perform authorization across Alibaba Cloud accounts. For more information, see the "Perform authorization across Alibaba Cloud accounts" section of the Perform authorization for Alibaba Cloud accounts topic.

    • conf: The configuration parameters that are required for the Spark application, similar to the configuration parameters of Apache Spark. Specify the parameters in the key: value format. Separate multiple parameters with commas (,). For information about the configuration parameters that differ from those of Apache Spark or that are specific to AnalyticDB for MySQL, see Spark application configuration parameters.

  5. Click Run Now.

    After you run the Spark code, you can click Log in the Actions column on the Applications tab of the Spark JAR Development page to view log information. For more information, see Spark editor.
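The editor parses the application as a JSON document, so a missing comma or a trailing comma in the conf object makes the whole configuration invalid. The following minimal sketch, assuming a local Python 3 environment, validates a configuration before you paste it into the editor:

    # Minimal JSON sanity check for a Spark application configuration.
    # Paste your configuration between the triple quotes before running.
    import json
    
    app_config = """
    {
      "args": ["oss://testBucketName/data/readme.txt"],
      "name": "spark-oss-test",
      "file": "oss://testBucketName/data/example.py",
      "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.executor.resourceSpec": "small",
        "spark.executor.instances": 1
      }
    }
    """
    
    # json.loads raises json.JSONDecodeError with the line and column of the problem.
    parsed = json.loads(app_config)
    print("Valid JSON. Application name:", parsed["name"])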
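As an alternative to the console, a Spark application can also be submitted programmatically. The following sketch assumes the alibabacloud-adb20211201 Python SDK and its SubmitSparkApp operation; the cluster ID, resource group name, endpoint, and credentials are placeholders, and field names may differ across SDK versions, so treat this as an outline rather than a drop-in script.

    # Sketch: submit the Spark application through the AnalyticDB for MySQL OpenAPI.
    # Assumes the alibabacloud-adb20211201 SDK; the cluster ID, resource group,
    # endpoint, and credentials are placeholders for this example.
    from alibabacloud_adb20211201.client import Client
    from alibabacloud_adb20211201 import models as adb_models
    from alibabacloud_tea_openapi import models as open_api_models
    
    config = open_api_models.Config(
        access_key_id='<yourAccessKeyId>',
        access_key_secret='<yourAccessKeySecret>',
        endpoint='adb.cn-hangzhou.aliyuncs.com',  # endpoint of your cluster's region
    )
    client = Client(config)
    
    request = adb_models.SubmitSparkAppRequest(
        dbcluster_id='<yourClusterId>',
        resource_group_name='<yourJobResourceGroup>',
        app_type='BATCH',
        data=app_config,  # the JSON configuration string from the sanity-check sketch above
    )
    response = client.submit_spark_app(request)
    print(response.body)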
