Access OSS-HDFS using RootPolicy

Prerequisites

You have created a Serverless Spark workspace. For more information, see Create a workspace.
You have created an EMR on ECS cluster with the OSS-HDFS service enabled. For more information, see Create a cluster.

Configure RootPolicy

Note

If you have already configured RootPolicy for the OSS-HDFS service on your cluster, skip this section and proceed to Using RootPolicy.

Configure environment variables.

Connect to an ECS instance. For more information, see Connect to an ECS instance.
Go to the bin directory of the installed JindoSDK JAR package.
```
cd jindosdk-x.x.x/bin/
```
Note
x.x.x indicates the version number of the JindoSDK JAR package.

Create a configuration file named jindosdk.cfg, and then add the following parameters to the configuration file.

[common] Retain the following default configurations. 
logger.dir = /tmp/jindo/
logger.sync = false
logger.consolelogger = false
logger.level = 0
logger.verbose = 0
logger.cleaner.enable = true
hadoopConf.enable = false

[jindosdk] Specify the following parameters. 
<!-- In this example, the China (Hangzhou) region is used. Specify your actual region. -->
fs.oss.endpoint = cn-hangzhou.oss-dls.aliyuncs.com
<!-- Configure the AccessKey ID and AccessKey secret that are used to access OSS-HDFS. -->
fs.oss.accessKeyId = yourAccessKeyId
fs.oss.accessKeySecret = yourAccessKeySecret

Configure environment variables.
```
export JINDOSDK_CONF_DIR=<JINDOSDK_CONF_DIR>
```
Set <JINDOSDK_CONF_DIR> to the absolute path of the directory that contains the jindosdk.cfg configuration file.
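The file creation and the environment variable can also be scripted. The following is a minimal Python sketch, not an official tool: the helper names `render_jindosdk_cfg` and `write_conf_dir` are hypothetical, and it assumes that JINDOSDK_CONF_DIR holds the directory in which jindosdk.cfg is stored.

```python
import os
import tempfile
from pathlib import Path

# Default [common] settings copied from the configuration shown above.
COMMON_DEFAULTS = {
    "logger.dir": "/tmp/jindo/",
    "logger.sync": "false",
    "logger.consolelogger": "false",
    "logger.level": "0",
    "logger.verbose": "0",
    "logger.cleaner.enable": "true",
    "hadoopConf.enable": "false",
}


def render_jindosdk_cfg(endpoint, access_key_id, access_key_secret):
    """Return the text of a jindosdk.cfg file (hypothetical helper)."""
    lines = ["[common]"]
    lines += [f"{key} = {value}" for key, value in COMMON_DEFAULTS.items()]
    lines += [
        "",
        "[jindosdk]",
        f"fs.oss.endpoint = {endpoint}",
        f"fs.oss.accessKeyId = {access_key_id}",
        f"fs.oss.accessKeySecret = {access_key_secret}",
    ]
    return "\n".join(lines) + "\n"


def write_conf_dir(conf_dir, cfg_text):
    """Write jindosdk.cfg into conf_dir and return an environment mapping
    with JINDOSDK_CONF_DIR set (assumption: the variable is a directory)."""
    conf_path = Path(conf_dir).resolve()
    conf_path.mkdir(parents=True, exist_ok=True)
    (conf_path / "jindosdk.cfg").write_text(cfg_text)
    return dict(os.environ, JINDOSDK_CONF_DIR=str(conf_path))


if __name__ == "__main__":
    # Demo in a throwaway directory; the credentials are placeholders.
    with tempfile.TemporaryDirectory() as demo_dir:
        text = render_jindosdk_cfg(
            "cn-hangzhou.oss-dls.aliyuncs.com", "yourAccessKeyId", "yourAccessKeySecret"
        )
        env = write_conf_dir(demo_dir, text)
        print("JINDOSDK_CONF_DIR =", env["JINDOSDK_CONF_DIR"])
```

The returned mapping can be passed as the `env` argument of `subprocess.run` so that a subsequently started jindo command sees the variable.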

Configure RootPolicy.

Run the following SetRootPolicy command to specify a registered address that contains a custom prefix for a bucket:

jindo admin -setRootPolicy oss://<bucket_name>.<dls_endpoint>/ hdfs://<your_ns_name>/

The following table describes the parameters in the SetRootPolicy command.

Parameter	Description
bucket_name	The name of the bucket for which OSS-HDFS is enabled.
dls_endpoint	The endpoint of the region in which the OSS-HDFS-enabled bucket is located. Example: `cn-hangzhou.oss-dls.aliyuncs.com`. If you do not want to specify <dls_endpoint> each time you run a RootPolicy command, add one of the following configurations to the `core-site.xml` file of Hadoop. Method 1: `<configuration> <property> <name>fs.oss.endpoint</name> <value><dls_endpoint></value> </property> </configuration>` Method 2: `<configuration> <property> <name>fs.oss.bucket.<bucket_name>.endpoint</name> <value><dls_endpoint></value> </property> </configuration>`
your_ns_name	The custom namespace name (nsname) that is used to access OSS-HDFS. Any non-empty string is supported, such as `test`. The current version supports only the root directory.
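To show how the three parameters combine, here is a small Python sketch that assembles the SetRootPolicy command line. The function names and the bucket name `examplebucket` are illustrative assumptions; only the command shape is taken from the text above.

```python
import shlex


def bucket_uri(bucket_name, dls_endpoint):
    """Build the oss://<bucket_name>.<dls_endpoint>/ address."""
    return f"oss://{bucket_name}.{dls_endpoint}/"


def set_root_policy_argv(bucket_name, dls_endpoint, ns_name):
    """Return the argument list for the SetRootPolicy command."""
    if not ns_name:
        # The parameter table requires a non-empty namespace name.
        raise ValueError("your_ns_name must be a non-empty string")
    return [
        "jindo", "admin", "-setRootPolicy",
        bucket_uri(bucket_name, dls_endpoint),
        f"hdfs://{ns_name}/",
    ]


if __name__ == "__main__":
    argv = set_root_policy_argv("examplebucket", "cn-hangzhou.oss-dls.aliyuncs.com", "test")
    # Prints: jindo admin -setRootPolicy oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/ hdfs://test/
    print(shlex.join(argv))
```

Keeping the command as a list avoids quoting problems if it is later passed to `subprocess.run`.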

Configure the Access Policy discovery address and the scheme implementation classes.

You must configure the following parameters in the core-site.xml file of Hadoop:

<configuration>
    <property>
        <name>fs.accessPolicies.discovery</name>
        <value>oss://<bucket_name>.<dls_endpoint>/</value>
    </property>
    <property>
        <name>fs.AbstractFileSystem.hdfs.impl</name>
        <value>com.aliyun.jindodata.hdfs.HDFS</value>
    </property>
    <property>
        <name>fs.hdfs.impl</name>
        <value>com.aliyun.jindodata.hdfs.JindoHdfsFileSystem</value>
    </property>
</configuration>

If you want to configure Access Policy discovery addresses and Scheme implementation classes for multiple buckets, separate the buckets with commas (,).
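The properties above can be generated rather than typed by hand. The sketch below uses Python's standard `xml.etree.ElementTree` module. It assumes that multiple buckets are written as one comma-separated value of `fs.accessPolicies.discovery`, which is one reading of the sentence above; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def core_site_properties(bucket_uris,
                         abstract_fs_impl="com.aliyun.jindodata.hdfs.HDFS",
                         fs_impl="com.aliyun.jindodata.hdfs.JindoHdfsFileSystem"):
    """Return the three RootPolicy properties as a name -> value mapping.

    Assumption: several bucket addresses are joined with commas in
    fs.accessPolicies.discovery.
    """
    return {
        "fs.accessPolicies.discovery": ",".join(bucket_uris),
        "fs.AbstractFileSystem.hdfs.impl": abstract_fs_impl,
        "fs.hdfs.impl": fs_impl,
    }


def to_configuration_xml(props):
    """Serialize a mapping into a Hadoop <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available in Python 3.9+
        ET.indent(root, space="    ")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    uris = ["oss://examplebucket.cn-hangzhou.oss-dls.aliyuncs.com/"]
    print(to_configuration_xml(core_site_properties(uris)))
```

Merge the generated `<property>` elements into the existing core-site.xml instead of overwriting the file, because the file normally contains other settings.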

Run the following command to check whether RootPolicy is successfully configured:

hadoop fs -ls hdfs://<your_ns_name>/

If the following results are returned, RootPolicy is successfully configured:

drwxr-x--x   - hdfs  hadoop          0 2023-01-05 12:27 hdfs://<your_ns_name>/apps
drwxrwxrwx   - spark hadoop          0 2023-01-05 12:27 hdfs://<your_ns_name>/spark-history
drwxrwxrwx   - hdfs  hadoop          0 2023-01-05 12:27 hdfs://<your_ns_name>/tmp
drwxrwxrwx   - hdfs  hadoop          0 2023-01-05 12:27 hdfs://<your_ns_name>/user
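If you want to automate this check, the listing can be parsed. The following Python sketch is a heuristic rather than an official verification method: it treats the configuration as working when at least one entry is listed and every listed path starts with the custom prefix. The helper names and the namespace `test` are assumptions.

```python
def listed_paths(ls_output):
    """Extract the path column from `hadoop fs -ls` style output."""
    paths = []
    for line in ls_output.splitlines():
        # Entry lines have 8 columns: permissions, replication, owner,
        # group, size, date, time, path. Header lines have fewer.
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[0][0] in "d-":
            paths.append(fields[7].strip())
    return paths


def root_policy_ok(ls_output, ns_name):
    """Heuristic: at least one entry, all under hdfs://<ns_name>/."""
    prefix = f"hdfs://{ns_name}/"
    paths = listed_paths(ls_output)
    return bool(paths) and all(path.startswith(prefix) for path in paths)


if __name__ == "__main__":
    # Sample output based on the listing above, with <your_ns_name> = test.
    sample = (
        "drwxr-x--x   - hdfs  hadoop          0 2023-01-05 12:27 hdfs://test/apps\n"
        "drwxrwxrwx   - spark hadoop          0 2023-01-05 12:27 hdfs://test/spark-history\n"
    )
    print(listed_paths(sample))
    print(root_policy_ok(sample, "test"))
```

An empty bucket would also produce no entries, so a negative result from this check is not proof of a misconfiguration.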

Use a custom prefix to access OSS-HDFS.

After you restart services such as Hive and Spark, you can access OSS-HDFS by using a custom prefix.

Optional. Use RootPolicy for other purposes.
- List all registered addresses that contain a custom prefix specified for a bucket
  
  Run the following listAccessPolicies command to list all registered addresses that contain a custom prefix specified for a bucket:
```
jindo admin -listAccessPolicies oss://<bucket_name>.<dls_endpoint>/
```
- Delete all registered addresses that contain a custom prefix specified for a bucket
  
  Run the following unsetRootPolicy command to delete all registered addresses that contain a custom prefix specified for a bucket:
```
jindo admin -unsetRootPolicy oss://<bucket_name>.<dls_endpoint>/ hdfs://<your_ns_name>/
```

Using RootPolicy

Scenario 1: Use in a Notebook session

Configure Spark settings.

On the EMR Serverless Spark page, click Sessions in the left-side navigation pane.
On the Notebook Session page, click Create Notebook Session.
On the Create Notebook Session page, configure the following Spark Configuration.

spark.hadoop.fs.accessPolicies.discovery      oss://<bucketname>.cn-<region>.oss-dls.aliyuncs.com
spark.hadoop.fs.AbstractFileSystem.hdfs.impl  com.aliyun.jindodata.hdfs.v3.HDFS
spark.hadoop.fs.hdfs.impl                     com.aliyun.jindodata.hdfs.v3.JindoDistributedFileSystem
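The same three settings are used in every scenario in this topic, so it can help to generate them from the bucket name and region. The Python sketch below does that; the function name is hypothetical, and it assumes that `<region>` is the part of the region ID after `cn-`, such as `hangzhou`.

```python
def root_policy_spark_conf(bucket_name, region):
    """Return the Spark settings for RootPolicy as a name -> value mapping.

    Assumption: `region` is the suffix after "cn-", for example "hangzhou".
    """
    endpoint = f"cn-{region}.oss-dls.aliyuncs.com"
    return {
        "spark.hadoop.fs.accessPolicies.discovery": f"oss://{bucket_name}.{endpoint}",
        "spark.hadoop.fs.AbstractFileSystem.hdfs.impl": "com.aliyun.jindodata.hdfs.v3.HDFS",
        "spark.hadoop.fs.hdfs.impl": "com.aliyun.jindodata.hdfs.v3.JindoDistributedFileSystem",
    }


def as_config_lines(conf):
    """Format the mapping as 'key  value' lines for the Spark Configuration field."""
    return "\n".join(f"{key:<45} {value}" for key, value in conf.items())


if __name__ == "__main__":
    # "examplebucket" is a placeholder bucket name.
    print(as_config_lines(root_policy_spark_conf("examplebucket", "hangzhou")))
```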

In a Data Development Notebook task, use the custom prefix set by RootPolicy for the OSS-HDFS service.

# Create a table
spark.sql("""CREATE TABLE default.my_orc_table (
    id INT,
    name STRING,
    age INT
)
LOCATION 'hdfs://<ns_name>/user/hive/warehouse/ads_user_info_1d_emr/my_orc_table/'""")
# Insert data
spark.sql("""INSERT INTO TABLE default.my_orc_table(id, name, age) VALUES (1, 'Alice', 30)""")
# Query data
spark.sql("SELECT * FROM default.my_orc_table").show()

Note

In the CREATE TABLE statement, replace <ns_name> with the custom prefix that you use to access the OSS-HDFS service.

Note

The !hadoop fs command in Notebook does not support RootPolicy.

Scenario 2: Use in an SQL session

Configure Spark settings.

On the EMR Serverless Spark page, click Sessions in the left-side navigation pane.
On the SQL Session page, click Connect to SQL Session.
On the Connect to SQL Session page, configure the following Spark Configuration.

spark.hadoop.fs.accessPolicies.discovery      oss://<bucketname>.cn-<region>.oss-dls.aliyuncs.com
spark.hadoop.fs.AbstractFileSystem.hdfs.impl  com.aliyun.jindodata.hdfs.v3.HDFS
spark.hadoop.fs.hdfs.impl                     com.aliyun.jindodata.hdfs.v3.JindoDistributedFileSystem

In a Data Development SparkSQL task, use the custom prefix set by RootPolicy for the OSS-HDFS service.

-- Create a table
CREATE TABLE default.my_orc_table1 (
    id INT,
    name STRING,
    age INT
) 
LOCATION "hdfs://<ns_name>/user/hive/warehouse/ads_user_info_1d_emr/my_orc_table1/";
-- Insert data
INSERT INTO TABLE default.my_orc_table1(id, name, age)
VALUES (1, 'Alice', 30);
-- Query data
SELECT * FROM default.my_orc_table1;

Note

In the CREATE TABLE statement, replace <ns_name> with the custom prefix that you use to access the OSS-HDFS service.

Scenario 3: Use in a batch job

Upload a file. This topic uses the test.sql file as an example.
1. On the EMR Serverless Spark page, click Artifacts in the left-side navigation pane.
2. On the Managed File Directory page, click Upload File.
3. In the Upload File dialog box, click the upload area to select a local file, or drag the file to the upload area.

Configure Spark settings.
1. On the EMR Serverless Spark page, click Development in the left-side navigation pane.
2. On the Development tab, click the (New) icon.
3. In the dialog box that appears, enter a Name, select SQL Job under Application, and then click OK.
4. On the Task Configuration page, configure the following Spark Configuration.
```
spark.hadoop.fs.accessPolicies.discovery      oss://<bucketname>.cn-<region>.oss-dls.aliyuncs.com
spark.hadoop.fs.AbstractFileSystem.hdfs.impl  com.aliyun.jindodata.hdfs.v3.HDFS
spark.hadoop.fs.hdfs.impl                     com.aliyun.jindodata.hdfs.v3.JindoDistributedFileSystem
```
On the Data Development page, click Run. Set the engine version to esr-4.3.0 (Spark 3.5.2, Scala 2.12) and disable Fusion Acceleration. For resource configuration, set driver.cores to 2 CPU, driver.memory to 8 GB, executor.cores to 1 CPU, and executor.memory to 4 GB. After you submit the job, check the run history to confirm that the job status is Succeeded.

FAQ

The !hadoop fs command fails when you use a custom prefix set by RootPolicy.

Symptom: When you run the !hadoop fs -ls hdfs://<ns_name>/ command in a Notebook task, the following error message is returned:

init failed, Caused by error 30004: Invalid argument: Neither fs.hdfs.test.dfs.ha.namenodes nor dfs.ha.namenodes.test is configured for HA namenodes
ERROR: code=1002, message=ERROR: failed to init filesystem.

Cause: The !hadoop fs command in a notebook does not currently support RootPolicy.
Solution: When using the !hadoop fs command in a notebook, use the original OSS-HDFS address instead.
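As an illustration of this workaround, the following Python sketch rewrites a path that uses the custom prefix into the original OSS-HDFS address. It assumes that the prefix maps to the bucket root, in line with the root-directory limitation noted for your_ns_name; the function name and the bucket name are hypothetical.

```python
from urllib.parse import urlsplit


def to_oss_hdfs_address(path, ns_name, bucket_name, dls_endpoint):
    """Rewrite hdfs://<ns_name>/... to oss://<bucket_name>.<dls_endpoint>/...

    Assumption: the custom prefix is registered for the bucket root,
    so the path component is carried over unchanged.
    """
    parts = urlsplit(path)
    if parts.scheme != "hdfs" or parts.netloc != ns_name:
        raise ValueError(f"{path!r} does not use the prefix hdfs://{ns_name}/")
    return f"oss://{bucket_name}.{dls_endpoint}{parts.path or '/'}"


if __name__ == "__main__":
    address = to_oss_hdfs_address(
        "hdfs://test/user/hive/warehouse",
        ns_name="test",
        bucket_name="examplebucket",
        dls_endpoint="cn-hangzhou.oss-dls.aliyuncs.com",
    )
    # The result can be used with the !hadoop fs command in a notebook.
    print(address)
```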