The OSS-HDFS service supports RootPolicy. With RootPolicy, you can set a custom prefix for the OSS-HDFS service. This feature allows Serverless Spark to directly operate on data in OSS-HDFS without modifying existing jobs that use the hdfs:// prefix.
Prerequisites
-
You have created a Serverless Spark workspace. For more information, see Create a workspace.
-
You have created an EMR on ECS cluster with the OSS-HDFS service enabled. For more information, see Create a cluster.
Configure RootPolicy
If you have already configured RootPolicy for the OSS-HDFS service on your cluster, skip this section and proceed to Using RootPolicy.
-
Configure environment variables.
-
Connect to an ECS instance. For more information, see Connect to an ECS instance.
-
Go to the bin directory of the installed JindoSDK JAR package.
cd jindosdk-x.x.x/bin/Notex.x.x indicates the version number of the JindoSDK JAR package.
-
Create a configuration file named
jindosdk.cfg, and then add the following parameters to the configuration file.[common] Retain the following default configurations. logger.dir = /tmp/jindo/ logger.sync = false logger.consolelogger = false logger.level = 0 logger.verbose = 0 logger.cleaner.enable = true hadoopConf.enable = false [jindosdk] Specify the following parameters. <!-- In this example, the China (Hangzhou) region is used. Specify your actual region. --> fs.oss.endpoint = cn-hangzhou.oss-dls.aliyuncs.com <! -- Configure the AccessKey ID and AccessKey secret that is used to access OSS-HDFS. --> fs.oss.accessKeyId = yourAccessKeyId fs.oss.accessKeySecret = yourAccessKeySecret -
Configure environment variables.
export JINDOSDK_CONF_DIR=<JINDOSDK_CONF_DIR>Set <JINDOSDK_CONF_DIR> to the absolute path of the
jindosdk.cfgconfiguration file.
-
-
Configure RootPolicy.
Run the following SetRootPolicy command to specify a registered address that contains a custom prefix for a bucket:
jindo admin -setRootPolicy oss://<bucket_name>.<dls_endpoint>/ hdfs://<your_ns_name>/The following table describes the parameters in the SetRootPolicy command.
Parameter
Description
bucket_name
The name of the bucket for which OSS-HDFS is enabled.
dls_endpoint
The endpoint of the region in which the bucket for which OSS-HDFS is enabled. Example:
cn-hangzhou.oss-dls.aliyuncs.com.If you do not want to repeatedly add the <dls_endpoint> parameter to the SetRootPolicy command each time you run RootPolicy, you can use one of the following methods to add configuration items to the
core-site.xmlfile of Hadoop:-
Method 1:
<configuration> <property> <name>fs.oss.endpoint</name> <value><dls_endpoint></value> </property> </configuration> -
Method 2:
<configuration> <property> <name>fs.oss.bucket.<bucket_name>.endpoint</name> <value><dls_endpoint></value> </property> </configuration>
your_ns_name
The custom nsname that is used to access OSS-HDFS. A non-empty string is supported, such as
test. The current version supports only the root directory. -
-
Configure Access Policy discovery address and Scheme implementation class.
You must configure the following parameters in the core-site.xml file of Hadoop:
<configuration> <property> <name>fs.accessPolicies.discovery</name> <value>oss://<bucket_name>.<dls_endpoint>/</value> </property> <property> <name>fs.AbstractFileSystem.hdfs.impl</name> <value>com.aliyun.jindodata.hdfs.HDFS</value> </property> <property> <name>fs.hdfs.impl</name> <value>com.aliyun.jindodata.hdfs.JindoHdfsFileSystem</value> </property> </configuration>If you want to configure Access Policy discovery addresses and Scheme implementation classes for multiple buckets, separate the buckets with commas (
,). -
Run the following command to check whether RootPolicy is successfully configured:
hadoop fs -ls hdfs://<your_ns_name>/If the following results are returned, RootPolicy is successfully configured:
drwxr-x--x - hdfs hadoop 0 2023-01-05 12:27 hdfs://<your_ns_name>/apps drwxrwxrwx - spark hadoop 0 2023-01-05 12:27 hdfs://<your_ns_name>/spark-history drwxrwxrwx - hdfs hadoop 0 2023-01-05 12:27 hdfs://<your_ns_name>/tmp drwxrwxrwx - hdfs hadoop 0 2023-01-05 12:27 hdfs://<your_ns_name>/user -
Use a custom prefix to access OSS-HDFS.
After you restart services such as Hive and Spark, you can access OSS-HDFS by using a custom prefix.
-
Optional. Use RootPolicy for other purposes.
-
List all registered addresses that contain a custom prefix specified for a bucket
Run the following listAccessPolicies command to list all registered addresses that contain a custom prefix specified for a bucket:
jindo admin -listAccessPolicies oss://<bucket_name>.<dls_endpoint>/ -
Delete all registered addresses that contain a custom prefix specified for a bucket:
Run the following unsetRootPolicy command to delete all registered addresses that contain a custom prefix specified for a bucket:
jindo admin -unsetRootPolicy oss://<bucket_name>.<dls_endpoint>/ hdfs://<your_ns_name>/
-
Using RootPolicy
Scenario 1: Use in a Notebook session
-
Configure Spark settings.
-
On the EMR Serverless Spark page, click Sessions in the left-side navigation pane.
-
On the Notebook Session page, click Create Notebook Session.
-
On the Create Notebook Session page, configure the following Spark configuration .
spark.hadoop.fs.accessPolicies.discovery oss://<bucketname>.cn-<region>.oss-dls.aliyuncs.com spark.hadoop.fs.AbstractFileSystem.hdfs.impl com.aliyun.jindodata.hdfs.v3.HDFS spark.hadoop.fs.hdfs.impl com.aliyun.jindodata.hdfs.v3.JindoDistributedFileSystem -
-
In a Data Development Notebook task, use the custom prefix set by RootPolicy for the OSS-HDFS service.
-- Create a table spark.sql("""CREATE TABLE default.my_orc_table ( id INT, name STRING, age INT ) location 'hdfs://<ns_name>/user/hive/warehouse/ads_user_info_1d_emr/my_orc_table/'""") -- Insert data spark.sql("""INSERT INTO table default.my_orc_table(id, name, age) VALUES (1, 'Alice', 30)"""); -- Query data spark.sql("SELECT * FROM default.my_orc_table").show()NoteIn the
CREATE TABLEstatement, replace<ns_name>with the custom prefix that you use to access the OSS-HDFS service.
The !hadoop fs command in Notebook does not support RootPolicy.
Scenario 2: Use in an SQL session
-
Configure Spark settings.
-
On the EMR Serverless Spark page, click Sessions in the left-side navigation pane.
-
On the SQL Session page, click Connect to SQL Session.
-
On the Connect to SQL Session page, configure the following Spark Configuration.
spark.hadoop.fs.accessPolicies.discovery oss://<bucketname>.cn-<region>.oss-dls.aliyuncs.com spark.hadoop.fs.AbstractFileSystem.hdfs.impl com.aliyun.jindodata.hdfs.v3.HDFS spark.hadoop.fs.hdfs.impl com.aliyun.jindodata.hdfs.v3.JindoDistributedFileSystem -
-
In a Data Development SparkSQL task, use the custom prefix set by RootPolicy for the OSS-HDFS service.
-- Create a table CREATE TABLE default.my_orc_table1 ( id INT, name STRING, age INT ) location "hdfs://<ns_name>/user/hive/warehouse/ads_user_info_1d_emr/my_orc_table1/"; -- Insert data INSERT INTO table default.my_orc_table1(id, name, age) VALUES (1, 'Alice', 30); -- Query data select * from default.my_orc_table1;NoteIn the
CREATE TABLEstatement, replace<ns_name>with the custom prefix that you use to access the OSS-HDFS service.
Scenario 3: Use in a batch job
-
Upload a file. This topic uses the test.sql file as an example.
-
On the EMR Serverless Spark page, click Artifacts in the left-side navigation pane.
-
On the Managed File Directory page, click Upload File.
-
In the Upload File dialog box, click the upload area to select a local file, or drag the file to the upload area.
-
-
Configure Spark settings.
-
On the EMR Serverless Spark page, click Development in the left-side navigation pane.
-
On the Development tab, click the
(New) icon. -
In the dialog box that appears, enter a Name, select SQL Job under Application, and then click OK.
-
On the Task Configuration page, configure the following Spark Configuration.
spark.hadoop.fs.accessPolicies.discovery oss://<bucketname>.cn-<region>.oss-dls.aliyuncs.com spark.hadoop.fs.AbstractFileSystem.hdfs.impl com.aliyun.jindodata.hdfs.v3.HDFS spark.hadoop.fs.hdfs.impl com.aliyun.jindodata.hdfs.v3.JindoDistributedFileSystem
-
-
On the Data Development page, click Run. Set the engine version to
esr-4.3.0 (Spark 3.5.2, Scala 2.12)and disable Fusion Acceleration. For resource configuration, setdriver.coresto 2 CPU,driver.memoryto 8 GB,executor.coresto 1 CPU, andexecutor.memoryto 4 GB. After you submit the job, check the run history to confirm that the job status is Succeeded.