SmartData versions earlier than 3.4.X support Jindo OSS Committer. SmartData 3.4.X and later add Jindo OSS Direct Committer, which optimizes job commit performance when bucket versioning is enabled for Object Storage Service (OSS).

Background information

Job Committer is a basic component of distributed computing frameworks, such as MapReduce and Spark. It is used to ensure the consistency of data written by distributed tasks.

Jindo Job Committer is an efficient Job Committer developed by the Alibaba Cloud E-MapReduce (EMR) team and dedicated to jobs that write output to OSS. Jindo Job Committer is built on the multipart upload feature of OSS and makes use of the file system customization feature of OSS. When Jindo Job Committer is used, the output data of tasks is written directly to the destination directory, but the intermediate data remains invisible to users until the job is committed. No rename operations are performed during the job commit process, which ensures data consistency.

Notice
  • The data copy performance of OSS may be affected by the OSS bandwidth and by whether some advanced features are enabled. Therefore, the data copy performance may vary among users and buckets. If you have questions, contact the technical support personnel of OSS.
  • After all tasks are completed, MapReduce Application Master or Spark Driver commits the job. During the commit, there is a short time window in which only some of the result files are visible in the destination directory. The length of the time window is positively correlated with the number of files. You can set the fs.oss.committer.threads parameter to a larger value to speed up concurrent processing.
  • Hive and Presto jobs do not use Hadoop Job Committer.
  • In EMR clusters, Jindo OSS Committer is enabled by default.

Use Jindo OSS Committer in MapReduce jobs

  1. Go to the mapred-site tab for the YARN service.
    1. Log on to the Alibaba Cloud EMR console.
    2. In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
    3. Click the Cluster Management tab.
    4. On the Cluster Management page, find your cluster and click Details in the Actions column.
    5. In the left-side navigation pane, choose Cluster Service > YARN.
    6. Click the Configure tab.
    7. In the Service Configuration section, click the mapred-site tab.
  2. Configure the Job Committer parameter based on your Hadoop version, as shown in the sample snippet after this list:
    • Hadoop 2.X

      On the mapred-site tab, set the mapreduce.outputcommitter.class parameter to com.aliyun.emr.fs.oss.commit.JindoOssCommitter.

    • Hadoop 3.X

      On the mapred-site tab, set the mapreduce.outputcommitter.factory.scheme.oss parameter to com.aliyun.emr.fs.oss.commit.JindoOssCommitterFactory.

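    For reference, the equivalent entries in a mapred-site.xml file would look like the following sketch. This assumes that you maintain the configuration file directly; in EMR clusters, make the change in the console as described above so that it persists.

      <!-- Hadoop 2.X: set the committer class directly. -->
      <property>
        <name>mapreduce.outputcommitter.class</name>
        <value>com.aliyun.emr.fs.oss.commit.JindoOssCommitter</value>
      </property>

      <!-- Hadoop 3.X: register a committer factory for the oss:// scheme instead. -->
      <property>
        <name>mapreduce.outputcommitter.factory.scheme.oss</name>
        <value>com.aliyun.emr.fs.oss.commit.JindoOssCommitterFactory</value>
      </property>
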
  3. Save the configuration.
    1. In the upper-right corner of the Service Configuration section, click Save.
    2. In the Confirm Changes dialog box, specify Description and turn on Auto-update Configuration.
    3. Click OK.
  4. Go to the smartdata-site tab for the SmartData service.
    1. In the left-side navigation pane, choose Cluster Service > SmartData.
    2. Click the Configure tab.
    3. In the Service Configuration section, click the smartdata-site tab.
  5. On the smartdata-site tab, set fs.oss.committer.magic.enabled to true.
  6. Save the configuration.
    1. In the upper-right corner of the Service Configuration section, click Save.
    2. In the Confirm Changes dialog box, specify Description and turn on Auto-update Configuration.
    3. Click OK.
Note After you set the mapreduce.outputcommitter.class parameter to com.aliyun.emr.fs.oss.commit.JindoOssCommitter, you can use the fs.oss.committer.magic.enabled parameter to determine which Job Committer is used. If you set this parameter to true, MapReduce jobs use Jindo OSS Magic Committer, which does not require rename operations. If you set this parameter to false, Jindo OSS Committer functions the same as File Output Committer.
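
For reference, the corresponding entry in a smartdata-site.xml file would look like the following sketch (in EMR clusters, set the parameter in the console as described above):

  <property>
    <name>fs.oss.committer.magic.enabled</name>
    <!-- true: use Jindo OSS Magic Committer, which requires no rename operations.
         false: Jindo OSS Committer behaves the same as File Output Committer. -->
    <value>true</value>
  </property>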

Use Jindo OSS Committer in Spark jobs

  1. Go to the spark-defaults tab for the Spark service.
    1. In the left-side navigation pane, choose Cluster Service > Spark.
    2. Click the Configure tab.
    3. In the Service Configuration section, click the spark-defaults tab.
  2. On the spark-defaults tab, set each of the following parameters to com.aliyun.emr.fs.oss.commit.JindoOssCommitter, as shown in the sample snippet after this list:
    • spark.sql.sources.outputCommitterClass: specifies the Job Committer that is used to write data to a data source table in Spark.
    • spark.sql.parquet.output.committer.class: specifies the Job Committer that is used to write data to a data source table in the Parquet format in Spark.
    • spark.sql.hive.outputCommitterClass: specifies the Job Committer that is used to write data to a Hive table.

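    For reference, the same settings expressed as entries in a spark-defaults.conf file would look like the following sketch. This assumes that you maintain the file directly; in EMR clusters, make the changes in the console as described above so that they persist.

      # Job Committer for Spark data source tables
      spark.sql.sources.outputCommitterClass     com.aliyun.emr.fs.oss.commit.JindoOssCommitter
      # Job Committer for Parquet data source tables
      spark.sql.parquet.output.committer.class   com.aliyun.emr.fs.oss.commit.JindoOssCommitter
      # Job Committer for Hive tables
      spark.sql.hive.outputCommitterClass        com.aliyun.emr.fs.oss.commit.JindoOssCommitter
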
  3. Save the configuration.
    1. In the upper-right corner of the Service Configuration section, click Save.
    2. In the Confirm Changes dialog box, specify Description and turn on Auto-update Configuration.
    3. Click OK.
  4. Go to the smartdata-site tab for the SmartData service.
    1. In the left-side navigation pane, choose Cluster Service > SmartData.
    2. Click the Configure tab.
    3. In the Service Configuration section, click the smartdata-site tab.
  5. On the smartdata-site tab, set fs.oss.committer.magic.enabled to true.
    Note You can use the fs.oss.committer.magic.enabled parameter to determine which Job Committer is used. If you set this parameter to true, Spark jobs use Jindo OSS Magic Committer, which does not require rename operations. If you set this parameter to false, Jindo OSS Committer functions the same as File Output Committer.
  6. Save the configuration.
    1. In the upper-right corner of the Service Configuration section, click Save.
    2. In the Confirm Changes dialog box, specify Description and turn on Auto-update Configuration.
    3. Click OK.

Optimize the performance of Jindo OSS Committer

If your MapReduce or Spark tasks write a large number of files, you can increase the number of threads that are concurrently used for job commit operations. This helps improve job commit performance.

  1. Go to the smartdata-site tab for the SmartData service.
    1. In the left-side navigation pane, choose Cluster Service > SmartData.
    2. Click the Configure tab.
    3. In the Service Configuration section, click the smartdata-site tab.
  2. On the smartdata-site tab, set the fs.oss.committer.threads parameter to a larger value. A sample snippet follows these steps.
    The default value is 8.
  3. Save the configuration.
    1. In the upper-right corner of the Service Configuration section, click Save.
    2. In the Confirm Changes dialog box, specify Description and turn on Auto-update Configuration.
    3. Click OK.
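
For reference, the following sketch shows the setting in a smartdata-site.xml file. The value 16 is only an illustrative example; choose a value based on the number of files that your jobs write.

  <property>
    <name>fs.oss.committer.threads</name>
    <!-- Example value. The default is 8; a larger value speeds up commits of jobs that write many files. -->
    <value>16</value>
  </property>

For a single Spark job, the setting could also be passed on the command line through the spark.hadoop. prefix, for example --conf spark.hadoop.fs.oss.committer.threads=16.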

Optimize the performance of Jindo OSS Magic Committer

In data lake scenarios, you can enable the bucket versioning feature for OSS to prevent data from being deleted by mistake. If the bucket versioning feature is enabled and files are frequently created or deleted in a directory, the performance of list operations on the directory deteriorates. Jindo OSS Magic Committer is optimized in SmartData 3.4.0 to fix this issue. When Jindo OSS Magic Committer is used, temporary directories and files are deleted together with their historical versions. This ensures that the performance of list operations is not degraded by redundant temporary directories and files. By default, the feature of automatically clearing historical temporary directories is enabled when Jindo OSS Magic Committer is used.

The fs.jfs.cache.oss.delete-marker.dirs parameter specifies the temporary directories to clear. The default value is temporary,.staging,.hive-staging,__magic. If you want to disable the feature, leave this parameter empty in the code of your job or on the smartdata-site tab of the SmartData service page.
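
For reference, a smartdata-site.xml entry that disables the feature would look like the following sketch; the empty value turns off the automatic clearing:

  <property>
    <name>fs.jfs.cache.oss.delete-marker.dirs</name>
    <!-- An empty value disables automatic clearing of historical temporary directories. -->
    <value></value>
  </property>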

Use Jindo OSS Direct Committer

If the bucket versioning feature is enabled, you can use Jindo OSS Direct Committer to directly write output data to the destination directory. No temporary files are generated during the data write process. To use Jindo OSS Direct Committer, you must set the parameters related to Job Committer to com.aliyun.emr.fs.oss.commit.direct.JindoOssDirectCommitter.

Configure the parameters based on your business requirements:
  • If you use the YARN service, set the mapreduce.outputcommitter.class parameter to com.aliyun.emr.fs.oss.commit.direct.JindoOssDirectCommitter on the mapred-site tab of the YARN service page.
  • If you use the Spark service, set each of the following parameters to com.aliyun.emr.fs.oss.commit.direct.JindoOssDirectCommitter on the spark-defaults tab of the Spark service page, as shown in the sample snippet after this list:
    • mapreduce.outputcommitter.class
    • spark.sql.parquet.output.committer.class
    • spark.sql.hive.outputCommitterClass
    • spark.sql.sources.outputCommitterClass
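
For reference, the Spark settings expressed as entries in a spark-defaults.conf file would look like the following sketch. The YARN setting takes the same form as the mapred-site.xml sketch shown earlier, with the JindoOssDirectCommitter class as the value.

  # Route all Job Committer entry points to Jindo OSS Direct Committer
  mapreduce.outputcommitter.class            com.aliyun.emr.fs.oss.commit.direct.JindoOssDirectCommitter
  spark.sql.parquet.output.committer.class   com.aliyun.emr.fs.oss.commit.direct.JindoOssDirectCommitter
  spark.sql.hive.outputCommitterClass        com.aliyun.emr.fs.oss.commit.direct.JindoOssDirectCommitter
  spark.sql.sources.outputCommitterClass     com.aliyun.emr.fs.oss.commit.direct.JindoOssDirectCommitter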