Apache Flume is a distributed, reliable, and highly available system that collects, aggregates, and moves large amounts of log data into centralized storage. Flume writes data to OSS-HDFS by using JindoSDK and guarantees transactional writes by calling `flush()`: flushed data is immediately visible, and no data is lost.
Prerequisites
An ECS instance is created for the deployment environment. For more information, see Create an instance.
A Hadoop environment is created. For more information, see Create a Hadoop runtime environment.
Apache Flume is deployed. For more information, visit Apache Flume.
OSS-HDFS is enabled, and you have the permissions to access it. For more information, see Enable OSS-HDFS.
Procedure
Connect to an ECS instance. For more information, see Connect to an instance.
Configure JindoSDK.
Download the latest version of the JindoSDK JAR package. For the download link, see GitHub.
Decompress the downloaded installation package.
The following sample code shows how to decompress the jindosdk-x.x.x-linux.tar.gz package. If you use a different version of JindoSDK, replace the package name with the actual name of your package.
```shell
tar -zxvf jindosdk-x.x.x-linux.tar.gz -C /usr/lib
```
Configure JINDOSDK_HOME.
```shell
export JINDOSDK_HOME=/usr/lib/jindosdk-x.x.x-linux
export PATH=$JINDOSDK_HOME/bin:$PATH
```
Configure HADOOP_CLASSPATH.
```shell
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${FLUME_HOME}/lib/*
```
Important: On each node, deploy the JindoSDK installation directory to the `lib` folder in the Flume root directory and set the environment variables.
Configure FLUME_CLASSPATH.
```shell
cp ${FLUME_HOME}/conf/flume-env.sh.template ${FLUME_HOME}/conf/flume-env.sh
echo "FLUME_CLASSPATH=/usr/lib/jindosdk-x.x.x-linux/lib/*" >> ${FLUME_HOME}/conf/flume-env.sh
```
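Before you continue, you can quickly check that Flume can see the JindoSDK JARs and that the environment variables resolve as expected. The following is a minimal sketch; the jindosdk-x.x.x-linux version string is a placeholder for your actual package name.
```shell
# Confirm that the JindoSDK JARs are present in the Flume lib directory.
ls ${FLUME_HOME}/lib | grep -i jindo

# Confirm that the environment variables point to the expected paths.
echo $JINDOSDK_HOME
echo $HADOOP_CLASSPATH
```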
Configure a sink.
The following sample code provides an example of how to configure a sink:
```
# Configure an OSS sink. Replace xxx with the name of your Flume agent, and set your_bucket to a bucket for which OSS-HDFS is enabled.
xxx.sinks.oss_sink.hdfs.path = oss://${your_bucket}/flume_dir/%Y-%m-%d/%H
# The maximum number of events that can be written in one Flume transaction. We recommend that you flush more than 32 MB of data each time. This prevents impacts on the overall performance and prevents a large number of staging files from being generated.
# The batchSize parameter counts events (log entries), so evaluate the average event size before you configure it. For example, if the average event size is 200 bytes and you want to flush 32 MB each time, set batchSize to approximately 160,000 (32 MB / 200 bytes).
xxx.sinks.oss_sink.hdfs.batchSize = 100000
...
# Specifies whether to partition files by time, with the timestamp rounded down. Default value: true.
xxx.sinks.oss_sink.hdfs.round = true
# If you set round to true, you must also configure roundUnit and roundValue. For example, if roundUnit is minute and roundValue is 60, a new file is generated every 60 minutes. The following setting generates a new file every 15 minutes.
xxx.sinks.oss_sink.hdfs.roundValue = 15
# The time unit of the time partition. Default value: minute. Valid values: second, minute, and hour.
xxx.sinks.oss_sink.hdfs.roundUnit = minute
# The fixed prefix of the files that Flume creates in the destination folder.
xxx.sinks.oss_sink.hdfs.filePrefix = your_topic
# The file size that triggers rolling over to a new file. Unit: bytes. A value of 0 disables rolling based on file size.
xxx.sinks.oss_sink.hdfs.rollSize = 3600
# The number of threads per sink instance that perform HDFS I/O operations, such as open and write.
xxx.sinks.oss_sink.hdfs.threadsPoolSize = 30
...
```
For more information about the sink configuration parameters, see Apache Flume.
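For context, a complete minimal agent definition might look like the following sketch. The agent name agent1, the netcat source, and the memory channel are illustrative assumptions and are not part of the original configuration; only the sink corresponds to the example above.
```
# Hypothetical minimal agent. The names agent1, nc_source, and mem_channel are assumptions.
agent1.sources = nc_source
agent1.channels = mem_channel
agent1.sinks = oss_sink

# A netcat source that listens on localhost:44444 (for testing only).
agent1.sources.nc_source.type = netcat
agent1.sources.nc_source.bind = localhost
agent1.sources.nc_source.port = 44444

# An in-memory channel.
agent1.channels.mem_channel.type = memory
agent1.channels.mem_channel.capacity = 100000

# The OSS sink uses the standard HDFS sink type; JindoSDK handles the oss:// scheme.
agent1.sinks.oss_sink.type = hdfs
agent1.sinks.oss_sink.hdfs.path = oss://${your_bucket}/flume_dir/%Y-%m-%d/%H
# Use the local time for the %Y-%m-%d/%H path escapes, because no timestamp interceptor is configured.
agent1.sinks.oss_sink.hdfs.useLocalTimeStamp = true

# Wire the components together.
agent1.sources.nc_source.channels = mem_channel
agent1.sinks.oss_sink.channel = mem_channel
```
You can then start the agent with a command such as `bin/flume-ng agent --conf conf --conf-file conf/oss.conf --name agent1`, where conf/oss.conf is an assumed file name for the configuration above.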
FAQ
What do I do if the "org.apache.flume.conf.ConfigurationException: Component has no type. Cannot configure. user_sink" error message is returned?
To resolve this issue, add the following configuration to the core-site.xml configuration file of Hadoop.
```xml
<!-- Configure the implementation classes of JindoOSS. -->
<property>
    <name>fs.AbstractFileSystem.oss.impl</name>
    <value>com.aliyun.jindodata.oss.OSS</value>
</property>
<property>
    <name>fs.oss.impl</name>
    <value>com.aliyun.jindodata.oss.OssFileSystem</value>
</property>

<!-- Configure the AccessKey pair and the endpoint. -->
<property>
    <name>fs.oss.credentials.provider</name>
    <value>com.aliyun.jindodata.auth.SimpleAliyunCredentialsProvider</value>
</property>
<property>
    <name>fs.oss.accessKeyId</name>
    <value>LTAI********</value>
</property>
<property>
    <name>fs.oss.accessKeySecret</name>
    <value>KZo1********</value>
</property>
<property>
    <name>fs.oss.endpoint</name>
    <value>{regionId}.oss-dls.aliyuncs.com</value>
</property>
```
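After you update core-site.xml, you can verify the configuration with a quick Hadoop shell check before restarting Flume. This is a minimal sketch; replace your_bucket with the name of your bucket.
```shell
# List the bucket root through the oss:// scheme to confirm that
# JindoSDK picks up the credentials and endpoint from core-site.xml.
hadoop fs -ls oss://your_bucket/
```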