How to create an OSS-HDFS data shipping job - Simple Log Service

OSS-HDFS (JindoFS) is a cloud-native data lake storage feature. It provides centralized metadata management and is fully compatible with the Hadoop Distributed File System (HDFS) API. You can use OSS-HDFS to manage data in data lake-based computing scenarios in the big data and AI fields. Simple Log Service allows you to ship data to OSS-HDFS. This topic describes how to create an OSS-HDFS data shipping job.

Prerequisites

A project and a Logstore are created. For more information, see Create a project and a Logstore.
Data is collected. For more information, see Data collection overview.
An Object Storage Service (OSS) bucket is created in the region where the project resides, and OSS-HDFS is enabled for the bucket. For more information, see Create buckets and Enable OSS-HDFS.

Supported regions

Simple Log Service can ship data to OSS-HDFS only within a single region: the project and the OSS bucket must reside in the same region.
You can create an OSS-HDFS shipping job only in the Germany (Frankfurt) region.

Create a data shipping job

Log on to the Simple Log Service console.
In the Projects section, click the project that you want to manage.
On the Log Storage > Logstores tab, click the > icon to the left of the Logstore. Then, choose Data Processing > Export > Object Storage Service.
Move the pointer over Object Storage Service and click the + icon.
In the Create Data Shipping Job dialog box, select OSS-HDFS Export and click OK.

In the OSS-HDFS Shipping panel, configure the parameters and click OK.

Important

After you create an OSS-HDFS data shipping job, you can check whether the job meets your requirements based on the status of the job and the data that is shipped to OSS-HDFS.

The following table describes the parameters.

Parameter	Description
Job Name	The unique name of the data shipping job.
Display Name	The display name of the data shipping job.
Job Description	The description of the data shipping job.
OSS-HDFS Bucket	The name of the OSS bucket to which you want to ship data. Important: You must specify an existing OSS bucket that resides in the same region as the project and for which OSS-HDFS is enabled. You can specify an OSS bucket of the Standard or Infrequent Access (IA) storage class. By default, the storage class of the generated OSS objects that store the shipped data is the same as the storage class of the specified OSS bucket. For more information, see Overview of storage classes. OSS buckets of the IA storage class are subject to a minimum storage period and a minimum billable size. For more information, see Differences between storage classes. OSS buckets of the Archive, Cold Archive, and Deep Cold Archive storage classes do not support OSS-HDFS.
File Delivery Directory	The directory to which you want to ship data in the OSS bucket. The directory name cannot start with a forward slash (/) or a backslash (\). After you create the OSS-HDFS data shipping job, the data in the Logstore is shipped to the directory.
Object Suffix	The suffix of the OSS objects in which the shipped data is stored. If you do not specify an object suffix, Simple Log Service automatically generates an object suffix based on the storage format and compression type that you specify. Example: `.suffix`.
Partition Format	The partition format that is used to generate subdirectories in the OSS bucket. A subdirectory is dynamically generated based on the shipping time. The default partition format is %Y/%m/%d/%H/%M. The partition format cannot start with a forward slash (/). For more information about partition format examples, see Partition formats. For more information about the parameters of partition formats, see strptime API.
OSS-HDFS Write RAM Role	The method that is used to authorize the OSS-HDFS data shipping job to write data to the OSS bucket. Valid values: Default Role: The OSS-HDFS data shipping job assumes the AliyunLogDefaultRole default role to write data to the OSS bucket. For more information, see Access data by using a default role. Custom Role: The OSS-HDFS data shipping job assumes a custom role to write data to the OSS bucket. If you select this option, you must grant the custom role the permissions to write data to the OSS bucket in advance. Then, enter the ARN of the custom role in the OSS-HDFS Write RAM Role field. For more information about how to obtain the ARN, see one of the following topics based on your business scenario: If the Logstore and the OSS bucket belong to the same Alibaba Cloud account, obtain the ARN by following the instructions that are provided in Step 2: Grant the RAM role the permissions to write data to an OSS bucket. If the Logstore and the OSS bucket belong to different Alibaba Cloud accounts, obtain the ARN by following the instructions that are provided in Step 2: Grant the RAM role role-b in Alibaba Cloud Account B the permissions to write data to the OSS bucket.
Logstore Read RAM Role	The method that is used to authorize the OSS-HDFS data shipping job to read data from the Logstore. Valid values: Default Role: The OSS-HDFS data shipping job assumes the AliyunLogDefaultRole default role to read data from the Logstore. For more information, see Access data by using a default role. Custom Role: The OSS-HDFS data shipping job assumes a custom role to read data from the Logstore. If you select this option, you must grant the custom role the permissions to read data from the Logstore in advance. Then, enter the ARN of the custom role in the Logstore Read RAM Role field. For more information about how to obtain the ARN, see one of the following topics based on your business scenario: If the Logstore and the OSS bucket belong to the same Alibaba Cloud account, obtain the ARN by following the instructions that are provided in Step 1: Grant the RAM role the permissions to read data from a logstore. If the Logstore and the OSS bucket belong to different Alibaba Cloud accounts, obtain the ARN by following the instructions that are provided in Step 1: Grant the RAM role role-a in Alibaba Cloud Account A the permissions to read data from the logstore.
Storage Format	The storage format of data. After data is shipped from Simple Log Service to OSS-HDFS, the data can be stored in different formats. For more information, see JSON format, CSV format, Parquet format, and ORC format.
Compression	Specifies whether to compress data that is shipped to OSS-HDFS. Valid values: No Compress(none): Data is not compressed. Compress(snappy): Data is compressed by using the snappy algorithm. This way, less storage space is occupied in the OSS bucket. For more information, see snappy. Compress(zstd): Data is compressed by using the zstd algorithm. This way, less storage space is occupied in the OSS bucket. Compress(gzip): Data is compressed by using the gzip algorithm. This way, less storage space is occupied in the OSS bucket.
Ship Tag	A reserved field in Simple Log Service. For more information, see Reserved fields.
Batch Size	The job starts to ship data when the amount of log data read from a shard reaches the value of this parameter. The value also determines the size of raw data in each OSS-HDFS object. Valid values: 5 to 256. Unit: MB. Note: The Batch Size parameter specifies the amount of log data that is read from a shard, not the amount of log data that is stored in Simple Log Service. The job starts to read and ship data only if the setting of the Batch Interval parameter is met.
Batch Interval	The job starts to ship data when the time difference between the first log obtained from the shard and the nth log reaches or exceeds the value of this parameter. Valid values: 300 to 900. Unit: seconds.
Shipping Latency	The latency of data shipping. For example, if you set the value to 3600, data is shipped 1 hour after it is generated: data that is generated at 10:00:00 on June 5, 2023 is not written to the specified OSS bucket until 11:00:00 on June 5, 2023. For more information about limits, see Configuration items.
Start Time Range	The time when the data shipping job starts to pull data from the Logstore.
Time Zone	The time zone that is used to format the time. If you configure both Time Zone and Partition Format, the system generates subdirectories in the OSS bucket based on your configurations.
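
The limits and time-based behavior in the table can be illustrated with a short Python sketch. It does not call Simple Log Service. The function names are hypothetical, the time zone is modeled as a fixed UTC offset in hours (an assumption), and the subdirectory is rendered with Python's strftime, which may differ in detail from what the service produces.

```python
from datetime import datetime, timedelta, timezone

# Default partition format stated in the parameter table.
DEFAULT_PARTITION_FORMAT = "%Y/%m/%d/%H/%M"


def check_settings(directory, partition_format, batch_size_mb, batch_interval_s):
    """Return a list of violations of the limits listed in the parameter table."""
    problems = []
    if directory.startswith(("/", "\\")):
        problems.append("File Delivery Directory cannot start with / or \\")
    if partition_format.startswith("/"):
        problems.append("Partition Format cannot start with /")
    if not 5 <= batch_size_mb <= 256:
        problems.append("Batch Size must be 5 to 256 MB")
    if not 300 <= batch_interval_s <= 900:
        problems.append("Batch Interval must be 300 to 900 seconds")
    return problems


def partition_subdirectory(shipping_time, utc_offset_hours,
                           partition_format=DEFAULT_PARTITION_FORMAT):
    """Render a subdirectory from a timezone-aware shipping time.

    Assumption: the Time Zone setting is modeled as a fixed UTC offset.
    """
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return shipping_time.astimezone(local_zone).strftime(partition_format)


def earliest_write_time(log_time, shipping_latency_s):
    """Earliest time at which a log can be written, given Shipping Latency in seconds."""
    return log_time + timedelta(seconds=shipping_latency_s)


# A job shipped at 03:00 UTC, with the time zone set to UTC+8.
shipped_at = datetime(2023, 6, 5, 3, 0, tzinfo=timezone.utc)
print(partition_subdirectory(shipped_at, utc_offset_hours=8))  # 2023/06/05/11/00

# The Shipping Latency example from the table: 10:00:00 plus 3600 seconds.
print(earliest_write_time(datetime(2023, 6, 5, 10, 0, 0), 3600))  # 2023-06-05 11:00:00

# Settings that violate all four limits.
print(check_settings("/logs", "/%Y/%m", 4, 1000))
```

With the default partition format and a time zone of UTC+8, a job that ships data at 03:00 UTC would write under a subdirectory such as 2023/06/05/11/00 inside the File Delivery Directory.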

View data

After data is shipped to OSS-HDFS, you can view the data in OSS-HDFS. For more information, see Use the OSS console to access OSS-HDFS.