
Object Storage Service: Use JindoSDK with Flume to write data to OSS-HDFS

Last Updated: Aug 06, 2025

Apache Flume is a distributed, reliable, and highly available system that you can use to collect, aggregate, and move large amounts of log data into centralized storage. Flume writes data to OSS-HDFS through JindoSDK and guarantees transactional writes by calling `flush()`: flushed data is immediately visible, and no data is lost.

Prerequisites

  * An Elastic Compute Service (ECS) instance is created.
  * OSS-HDFS is enabled for a bucket, and permissions to access OSS-HDFS are granted.
  * Apache Flume is installed.

Procedure

  1. Connect to an ECS instance. For more information, see Connect to an instance.

  2. Configure JindoSDK.

    1. Download the latest version of the JindoSDK JAR package. For the download link, see GitHub.
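      For example, you can use wget to download the package. The URL below is a placeholder based on the JindoSDK releases page on GitHub; replace the version and path with the actual release asset link:

      # Hypothetical download link. Replace x.x.x with the actual version.
      wget https://github.com/aliyun/alibabacloud-jindodata/releases/download/vx.x.x/jindosdk-x.x.x-linux.tar.gz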

    2. Decompress the downloaded installation package.

      The following sample code shows how to decompress the jindosdk-x.x.x-linux.tar.gz package. If you use a different version of JindoSDK, replace the package name with the actual name of your package.

      tar -zxvf jindosdk-x.x.x-linux.tar.gz -C /usr/lib
    3. Configure JINDOSDK_HOME.

      export JINDOSDK_HOME=/usr/lib/jindosdk-x.x.x-linux
      export PATH=$JINDOSDK_HOME/bin:$PATH
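      These exports apply only to the current shell session. As one option that this topic does not mandate, you can append them to a shell profile so that they persist across sessions:

      # Optional: persist the environment variables for future sessions.
      echo 'export JINDOSDK_HOME=/usr/lib/jindosdk-x.x.x-linux' >> ~/.bashrc
      echo 'export PATH=$JINDOSDK_HOME/bin:$PATH' >> ~/.bashrc
      source ~/.bashrc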
    4. Configure HADOOP_CLASSPATH.

      export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:${FLUME_HOME}/lib/*
      Important

      On each node, copy the JAR files from the JindoSDK installation directory to the `lib` folder in the Flume root directory, and set the preceding environment variables. A deployment sketch follows this note.
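      For example, a minimal deployment sketch, assuming the decompression location used in the preceding steps, might look like the following:

      # Copy the JindoSDK JAR files into the lib folder of Flume. Run this on each node.
      cp /usr/lib/jindosdk-x.x.x-linux/lib/*.jar ${FLUME_HOME}/lib/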

    5. Configure FLUME_CLASSPATH.

      cp ${FLUME_HOME}/conf/flume-env.sh.template ${FLUME_HOME}/conf/flume-env.sh
      echo "FLUME_CLASSPATH=/usr/lib/jindosdk-x.x.x-linux/lib/*" >>  ${FLUME_HOME}/conf/flume-env.sh
  3. Configure a sink.

    The following sample code provides an example of how to configure a sink:

    # Configure an OSS sink. Set your_bucket to the bucket for which OSS-HDFS is enabled.
    xxx.sinks.oss_sink.hdfs.path = oss://${your_bucket}/flume_dir/%Y-%m-%d/%H
    
    # The maximum number of events written in a single Flume transaction. We recommend that you flush more than 32 MB of data each time to avoid degrading overall performance and generating a large number of staging files.
    # The batchSize parameter specifies the number of events, which is the number of log entries. Before you configure this parameter, evaluate the average event size. For example, if the average event size is 200 bytes and you want to flush 32 MB of data each time, set batchSize to approximately 160,000 (32 MB / 200 bytes).
    xxx.sinks.oss_sink.hdfs.batchSize = 100000
    
    ...
    # Specifies whether to round down the event timestamp for time-based partitioning of Hadoop Distributed File System (HDFS) files. Default value: false.
    xxx.sinks.oss_sink.hdfs.round = true
    # When you set the xxx.sinks.oss_sink.hdfs.round parameter to true, the xxx.sinks.oss_sink.hdfs.roundValue and xxx.sinks.oss_sink.hdfs.roundUnit parameters together specify the rounding granularity. For example, if you set roundUnit to minute and roundValue to 15, timestamps are rounded down to the nearest 15 minutes, so a new partition directory is generated every 15 minutes.
    xxx.sinks.oss_sink.hdfs.roundValue = 15
    # The time unit of the rounding value. Default value: second. Valid values: second, minute, and hour.
    xxx.sinks.oss_sink.hdfs.roundUnit = minute
    # The fixed prefix of new files generated by Apache Flume in the HDFS folder.
    xxx.sinks.oss_sink.hdfs.filePrefix = your_topic
    # The file size that triggers the sink to roll to a new file. Each time this size is reached, a new file is created. Unit: bytes. A value of 0 specifies that files are not rolled based on size.
    xxx.sinks.oss_sink.hdfs.rollSize = 3600
    # The number of threads per sink instance that perform HDFS I/O operations, such as open and write.
    xxx.sinks.oss_sink.hdfs.threadsPoolSize = 30
    ...

    For more information about the sink configuration parameters, see Apache Flume.
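    For context, the following is a minimal, hypothetical agent definition that wires the sink above to a source and a channel. The agent name agent1, the netcat source, and the memory channel are illustrative assumptions and are not part of this topic; the hdfs.* parameters from the preceding example apply unchanged:

    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = oss_sink

    # Illustrative source: reads newline-separated events from a local TCP port.
    agent1.sources.src1.type = netcat
    agent1.sources.src1.bind = localhost
    agent1.sources.src1.port = 44444
    agent1.sources.src1.channels = ch1

    # Illustrative channel: buffers events in memory.
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000
    agent1.channels.ch1.transactionCapacity = 1000

    # OSS-HDFS sink. Add the hdfs.* parameters from the preceding example here.
    agent1.sinks.oss_sink.type = hdfs
    agent1.sinks.oss_sink.channel = ch1
    agent1.sinks.oss_sink.hdfs.path = oss://${your_bucket}/flume_dir/%Y-%m-%d/%H
    # The netcat source does not set a timestamp header, so use the local time for the time-based path escapes.
    agent1.sinks.oss_sink.hdfs.useLocalTimeStamp = true

    You can then start the agent with the flume-ng command, for example:

    ${FLUME_HOME}/bin/flume-ng agent --conf ${FLUME_HOME}/conf --conf-file oss_sink.conf --name agent1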

FAQ

What do I do if the "org.apache.flume.conf.ConfigurationException: Component has no type. Cannot configure. user_sink" error message is returned?

To resolve this issue, add the following configuration to the core-site.xml configuration file of Hadoop:

<configuration>
    <!-- Configure the implementation classes of JindoOSS. -->
    <property>
        <name>fs.AbstractFileSystem.oss.impl</name>
        <value>com.aliyun.jindodata.oss.OSS</value>
    </property>
    <property>
        <name>fs.oss.impl</name>
        <value>com.aliyun.jindodata.oss.OssFileSystem</value>
    </property>
    <!-- Configure the AccessKey pair and the endpoint. -->
    <property>
        <name>fs.oss.credentials.provider</name>
        <value>com.aliyun.jindodata.auth.SimpleAliyunCredentialsProvider</value>
    </property>
    <property>
        <name>fs.oss.accessKeyId</name>
        <value>LTAI********</value>
    </property>
    <property>
        <name>fs.oss.accessKeySecret</name>
        <value>KZo1********</value>
    </property>
    <property>
        <name>fs.oss.endpoint</name>
        <value>{regionId}.oss-dls.aliyuncs.com</value>
    </property>
</configuration>
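
After you update core-site.xml, you can verify that Hadoop can reach OSS-HDFS, for example by listing the bucket with the Hadoop CLI. This assumes that the Hadoop client and the JindoSDK JAR files are on the classpath:

hadoop fs -ls oss://${your_bucket}/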