
Object Storage Service: Use JindoSDK with Apache Flume to write data to OSS-HDFS

Last Updated: Apr 25, 2024

Apache Flume is a distributed, reliable, and highly available system that you can use to collect, aggregate, and move large amounts of log data into centralized storage. Each Flume transaction is flushed by calling flush() in Apache Flume, and the flushed data is written to OSS-HDFS by using JindoSDK.

Prerequisites

  • An Elastic Compute Service (ECS) instance is created.

  • OSS-HDFS is enabled for the bucket to which you want to write data.

  • Apache Flume is downloaded and installed.

Procedure

  1. Connect to the ECS instance. For more information, see Connect to an instance.
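    For example, you can connect over SSH. The following command is a minimal sketch in which the logon username and IP address are placeholders that you must replace with the values of your instance.

      ssh root@<ecs-instance-public-ip>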

  2. Configure JindoSDK.

    1. Download the latest version of the JindoSDK package. For more information, visit GitHub.

    2. Optional. If Kerberos-related and SASL-related dependencies are not included in your environment, install the following dependencies on all nodes on which JindoSDK is deployed.

      • Ubuntu or Debian

        sudo apt-get install libkrb5-dev krb5-admin-server krb5-kdc krb5-user libsasl2-dev libsasl2-modules libsasl2-modules-gssapi-mit
      • Red Hat Enterprise Linux or CentOS

        sudo yum install krb5-server krb5-workstation cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-plain
      • macOS

        brew install krb5
    3. Decompress the downloaded installation package.

      The following sample code provides an example on how to decompress a package named jindosdk-x.x.x-linux.tar.gz. If you use another version of JindoSDK, replace the package name with the name of the corresponding package.

      tar -zxvf jindosdk-x.x.x-linux.tar.gz -C /usr/lib
    4. Configure JINDOSDK_HOME.

      export JINDOSDK_HOME=/usr/lib/jindosdk-x.x.x-linux
      export PATH=$JINDOSDK_HOME/bin:$PATH
    5. Configure HADOOP_CLASSPATH.

      export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${FLUME_HOME}/lib/*
      Important

      Deploy the JindoSDK installation directory and the environment variables to each node, and make sure that the JindoSDK JAR packages are available in the lib directory under the Flume root directory on each node.
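      The following is a minimal sketch of such a deployment, assuming two hypothetical node names (node1 and node2), that Flume is installed at the same path on every node, and that FLUME_HOME is set in the current shell. Repeat the export commands from the preceding substeps on each node as well.

      # Hypothetical node names. Replace node1 and node2 with the hostnames of your Flume nodes.
      # This assumes that Flume is installed at the same path on every node.
      for node in node1 node2; do
        scp /usr/lib/jindosdk-x.x.x-linux/lib/*.jar ${node}:${FLUME_HOME}/lib/
      done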

    6. Configure FLUME_CLASSPATH.

      cp ${FLUME_HOME}/conf/flume-env.sh.template ${FLUME_HOME}/conf/flume-env.sh
      echo "FLUME_CLASSPATH=/usr/lib/jindosdk-x.x.x-linux/lib/*" >> ${FLUME_HOME}/conf/flume-env.sh
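      To verify the result, you can check that flume-env.sh references the JindoSDK JAR packages. This check is an optional suggestion, not part of the official procedure:

      # Confirm that FLUME_CLASSPATH points to the JindoSDK JAR packages.
      grep FLUME_CLASSPATH ${FLUME_HOME}/conf/flume-env.sh
      ls /usr/lib/jindosdk-x.x.x-linux/lib/*.jar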
  3. Configure a sink.

    The following sample code provides an example on how to configure a sink:

    # Configure an OSS sink. Set your_bucket to the bucket for which OSS-HDFS is enabled. 
    xxx.sinks.oss_sink.hdfs.path = oss://${your_bucket}/flume_dir/%Y-%m-%d/%H
    
    # Specify the maximum number of events that can be written in a Flume transaction. We recommend that you flush more than 32 MB of data each time. Flushing smaller amounts can degrade overall performance and generate a large number of staging files. 
    # The batchSize parameter specifies the number of events, which is the number of log entries. Before you configure this parameter, you must evaluate the average size of events. For example, the average size is 200 bytes. If the size of data that is flushed each time is 32 MB, the value of the batchSize parameter is approximately 160,000 (32 MB/200 bytes). 
    xxx.sinks.oss_sink.hdfs.batchSize = 100000
    
    ...
    # Specify whether to round down the timestamp that is used in time-based escape sequences in the Hadoop Distributed File System (HDFS) file path, which partitions files by time. Default value in Apache Flume: false. 
    xxx.sinks.oss_sink.hdfs.round = true
    # If you set the xxx.sinks.oss_sink.hdfs.round parameter to true, you must also configure the xxx.sinks.oss_sink.hdfs.roundValue and xxx.sinks.oss_sink.hdfs.roundUnit parameters. For example, if roundUnit is set to minute and roundValue is set to 60, all data within a 60-minute window is written to the same file, which is equivalent to generating a file every 60 minutes. In this example, a new file is generated every 15 minutes. 
    xxx.sinks.oss_sink.hdfs.roundValue = 15
    # Specify the time unit that is used for rounding. Valid values: second, minute, and hour. Default value in Apache Flume: second. 
    xxx.sinks.oss_sink.hdfs.roundUnit = minute
    # Specify the fixed prefix of new files generated by Apache Flume in the HDFS directory. 
    xxx.sinks.oss_sink.hdfs.filePrefix = your_topic
    # Specify the file size that triggers the system to create a new file. Each time a file reaches this size, the system creates a new file. Unit: bytes. A value of 0 specifies that files are not split based on size. 
    xxx.sinks.oss_sink.hdfs.rollSize = 3600
    # Specify the number of threads that each sink instance uses to perform HDFS I/O operations, such as open and write. 
    xxx.sinks.oss_sink.hdfs.threadsPoolSize = 30
    ...

    For more information about the parameters that are required to configure a sink, see Apache Flume.
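    For reference, the following is a minimal sketch of a complete agent configuration that places the preceding sink in context. The agent name a1, the netcat source, and the memory channel are illustrative assumptions and are not part of the preceding procedure:

    # Hypothetical agent named a1 with a netcat source, a memory channel, and the OSS sink.
    a1.sources = src1
    a1.channels = ch1
    a1.sinks = oss_sink

    a1.sources.src1.type = netcat
    a1.sources.src1.bind = localhost
    a1.sources.src1.port = 44444
    a1.sources.src1.channels = ch1

    a1.channels.ch1.type = memory
    a1.channels.ch1.capacity = 10000
    a1.channels.ch1.transactionCapacity = 1000

    # The sink type must be set to hdfs. Data is written to OSS-HDFS through the HDFS sink by using JindoSDK.
    a1.sinks.oss_sink.type = hdfs
    a1.sinks.oss_sink.channel = ch1
    a1.sinks.oss_sink.hdfs.path = oss://${your_bucket}/flume_dir/%Y-%m-%d/%H
    # The netcat source does not add a timestamp header, so use the local time for the time-based path.
    a1.sinks.oss_sink.hdfs.useLocalTimeStamp = true

    You can then start the agent with the standard flume-ng command. The configuration file name oss-example.conf is a hypothetical placeholder:

    ${FLUME_HOME}/bin/flume-ng agent --conf ${FLUME_HOME}/conf --conf-file oss-example.conf --name a1 -Dflume.root.logger=INFO,console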

FAQ

What do I do if the "org.apache.flume.conf.ConfigurationException: Component has no type. Cannot configure.user_sink" error message is returned?

Add the following configurations to the core-site.xml configuration file of Hadoop to resolve the issue:

<!-- Configure the implementation classes of JindoOSS. -->
<property>
    <name>fs.AbstractFileSystem.oss.impl</name>
    <value>com.aliyun.jindodata.oss.OSS</value>
</property>
<property>
    <name>fs.oss.impl</name>
    <value>com.aliyun.jindodata.oss.OssFileSystem</value>
</property>
<!-- Configure the AccessKey pair and the endpoint. -->
<property>
    <name>fs.oss.credentials.provider</name>
    <value>com.aliyun.jindodata.auth.SimpleAliyunCredentialsProvider</value>
</property>
<property>
    <name>fs.oss.accessKeyId</name>
    <value>LTAI********</value>
</property>
<property>
    <name>fs.oss.accessKeySecret</name>
    <value>KZo1********</value>
</property>
<property>
    <name>fs.oss.endpoint</name>
    <value>{regionId}.oss-dls.aliyuncs.com</value>
</property>
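
After you update core-site.xml, you can verify access to OSS-HDFS with a list operation. This check is a suggestion that assumes a Hadoop client is installed and that your_bucket is a placeholder for your bucket name:

# List the root directory of the bucket to confirm that JindoSDK can access OSS-HDFS.
hadoop fs -ls oss://your_bucket/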