This topic describes how to use the serverless Spark engine of Data Lake Analytics (DLA) to access Object Storage Service (OSS). After you are granted the required permissions, you can access OSS data by executing SQL statements or by submitting Spark code.

Grant permissions to access OSS

Before you access OSS, make sure that you have the permissions to access OSS from DLA.

If you use an Alibaba Cloud account, you have permissions on all OSS data within your account and on the OSS tables that are stored in DLA by default. You can directly access OSS without additional configuration.

If you use a RAM user to access OSS by submitting code, you must grant the RAM user the required permissions. For more information, see Grant permissions to a RAM user (detailed version).

If you use the serverless Spark engine to access OSS tables stored in DLA, make sure that your RAM user is bound to a DLA sub-account. For more information, see Bind a DLA child account with a RAM user. In addition, make sure that the DLA sub-account has the required permissions. You can grant or revoke these permissions by logging on to the DLA console and executing SQL statements that use the GRANT or REVOKE syntax, which is compatible with the MySQL protocol.

Configure spark.dla.connectors

After you are granted the required permissions, you can use the serverless Spark engine of DLA to access OSS. Before you access OSS, you must set spark.dla.connectors to oss in the configuration of your Spark job, because the OSS access feature of DLA does not take effect by default. If you do not want to use this feature, you do not need to set this parameter. In that case, you need only to submit your JAR file and add the required configurations.

Execute SQL statements to access OSS data

The serverless Spark engine of DLA allows you to execute SQL statements to access OSS tables that are stored in DLA. If you use this method, you do not need to submit Spark code. For more information about how to execute SQL statements, see Spark SQL. Sample statements in a Spark job:
{
    "sqls": [
        "select * from `1k_tables`.`table0` limit 100",
        "insert into `1k_tables`.`table0` values(1, 'test')"
    ],
    "name": "sql oss test",
    "conf": {
        "spark.dla.connectors": "oss",
        "spark.driver.resourceSpec": "small",
        "spark.sql.hive.metastore.version": "dla",
        "spark.executor.instances": 10,
        "spark.dla.job.log.oss.uri": "oss://test/spark-logs",
        "spark.executor.resourceSpec": "small"
    }
}

Submit Spark code to access OSS data

You can submit Java, Scala, or Python code to access OSS data. The following sample job configuration runs Scala code:
{  
  "args": ["oss://${oss-buck-name}/data/test/test.csv"],
  "name": "spark-oss-test",
  "file": "oss://${oss-buck-name}/jars/test/spark-examples-0.0.1-SNAPSHOT.jar",
  "className": "com.aliyun.spark.oss.SparkReadOss",
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.executor.instances": 2,
    "spark.dla.connectors": "oss"
  }
}
Note For the source code of the SparkReadOss main class, see DLA Spark OSS demo.
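The demo project contains the actual implementation. As a rough illustration only, a minimal main class of this kind might look like the following sketch, which assumes that the OSS path of the CSV file is passed in through the args field of the job configuration and that no special CSV read options are required:
package com.aliyun.spark.oss

import org.apache.spark.sql.SparkSession

object SparkReadOss {
  def main(args: Array[String]): Unit = {
    // OSS path of the input file, passed in through the "args" field of the job configuration.
    val inputPath = args(0)

    val spark = SparkSession
      .builder()
      .appName("spark-oss-test")
      .getOrCreate()

    // Read the CSV file from OSS and print the first rows to the job log.
    val df = spark.read.option("header", "false").csv(inputPath)
    df.show()

    spark.stop()
  }
}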

OSS FileOutputCommitter

Background information

OSSFileOutputCommitter is a job OutputCommitter that the serverless Spark engine uses when it writes data to OSS. It uses the OSS-based multipart upload feature instead of inefficient rename operations on OSS, which significantly improves the write performance of jobs. It also provides commit algorithms that are developed based on the algorithms of Spark 2.x and Spark 3.x released by the Apache Spark community. By default, the algorithm that is based on Spark 2.x is used.

Limits

Only jobs that write data in the Parquet format are supported. Because these configuration items are applied globally, they cannot be used together with output in other data formats in the same job.

Example
{  
  "args": ["oss://${oss-buck-name}/data/test/test.csv"],
  "name": "spark-oss-test",
  "file": "oss://${oss-buck-name}/jars/test/spark-examples-0.0.1-SNAPSHOT.jar",
  "className": "com.aliyun.spark.oss.WriteParquetFile",
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.executor.instances": 2,
    "spark.dla.connectors": "oss",
    "spark.hadoop.job.oss.fileoutputcommitter.enable": true,
    "spark.sql.parquet.output.committer.class": "com.aliyun.hadoop.mapreduce.lib.output.OSSFileOutputCommitter"
  }
}
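The WriteParquetFile class referenced in className is not shown in this topic. As a rough illustration only, a minimal main class of this kind might look like the following sketch; the input path handling and the Parquet output location are assumptions made for this example, not the actual demo code:
package com.aliyun.spark.oss

import org.apache.spark.sql.SparkSession

object WriteParquetFile {
  def main(args: Array[String]): Unit = {
    // OSS path of the input CSV file, passed in through the "args" field of the job configuration.
    val inputPath = args(0)
    // Hypothetical output location, used only for this illustration.
    val outputPath = inputPath.stripSuffix(".csv") + "-parquet"

    val spark = SparkSession
      .builder()
      .appName("spark-oss-parquet-test")
      .getOrCreate()

    // Read the CSV file and write it back to OSS in the Parquet format.
    // With spark.hadoop.job.oss.fileoutputcommitter.enable set to true in the job
    // configuration, this Parquet write goes through OSSFileOutputCommitter.
    val df = spark.read.option("header", "false").csv(inputPath)
    df.write.mode("overwrite").parquet(outputPath)

    spark.stop()
  }
}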