emr-core package: implements interaction between Hadoop/Spark and OSS data sources. It is present by default in the cluster’s runtime environment, so you do not need to include emr-core when packaging a job. If you do include it, keep its version consistent with the emr-core version in the cluster.
emr-sdk_2.10 package: implements interaction between Spark and other Alibaba Cloud data sources, such as Log Service, MNS, ONS, and ODPS. You must include emr-sdk_2.10 when packaging a job; otherwise, class-not-found errors are reported.
Fix dependency conflicts between the MNS and Spark/Hadoop packages.
Fix null pointer exceptions in Spark Streaming + MNS scenarios.
Fix several bugs in the Python SDK.
Spark Streaming + LogHub supports a custom time and location.
Fix the problem that Hadoop could not read native Snappy files. E-MapReduce can now process files that Log Service archives to OSS in Snappy format.
Fix the problem that Spark could not handle Snappy-compressed files.
Fix the problem that OSS did not support the two algorithms of the Hadoop 2.7.2 OutputCommitter.
Improve OSS read/write performance for Hadoop/Spark.
Fix abnormal Log4j output printed by Spark jobs.
Fix the “ConnectionClosedException” that occurred when a job read from or wrote to OSS slowly.
Fix the problem that some Hadoop commands were unavailable for OSS data sources.
Fix the “java.text.ParseException: Unparseable date” problem.
Optimize emr-core to support local debugging and running.
Stay compatible with the “_$folder$” files generated by earlier versions by interpreting them as directories instead of normal files.
Add a retry mechanism for failed OSS reads and writes in Hadoop/Spark.
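The idea behind a retry mechanism for transient read/write failures can be pictured with a minimal sketch. This is not the emr-core implementation: the helper name `with_retries`, the attempt count, the fixed delay, and the simulated failing read are all assumptions made for illustration.

```python
import time


def with_retries(operation, max_attempts=3, delay_seconds=0.0, retry_on=(IOError,)):
    """Call operation() until it succeeds or max_attempts is reached.

    Hypothetical helper: the name, defaults, and retry policy are
    illustrative assumptions, not the SDK's actual settings.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            time.sleep(delay_seconds)  # wait before the next attempt


def make_flaky_read(failures_before_success):
    """Build a stand-in for a remote read that fails a fixed number of times."""
    state = {"calls": 0}

    def read():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise IOError("simulated transient failure")
        return b"payload"

    return read, state
```

For example, `with_retries(read, max_attempts=3)` succeeds on a read that fails twice before returning data, while a read that keeps failing re-raises the last error once the attempts are used up.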
Fix unbalanced usage across multiple disks when OSS temporary files are written locally.
Remove the “_$folder$” marker file that is created when an OSS directory is created during job execution.
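As a rough illustration of what treating “_$folder$” entries as directories means, the sketch below classifies a list of object keys. It is a simplified model rather than the SDK's code; the function name `classify_keys` and the sample keys are hypothetical.

```python
FOLDER_MARKER_SUFFIX = "_$folder$"


def classify_keys(keys):
    """Split object keys into directories and regular files.

    A key ending in "_$folder$" is treated as a marker for the directory
    named by the rest of the key, not as a file in its own right.
    Hypothetical helper for illustration only.
    """
    directories, files = set(), set()
    for key in keys:
        if key.endswith(FOLDER_MARKER_SUFFIX):
            # Strip the marker and normalise to a trailing slash.
            name = key[: -len(FOLDER_MARKER_SUFFIX)].rstrip("/")
            directories.add(name + "/")
        else:
            files.add(key)
    return sorted(directories), sorted(files)
```

With the keys `logs_$folder$` and `logs/part-00000`, the first is reported as the directory `logs/` and only the second as a file.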
Upgrade the LogHub SDK to version 0.6.2; remove the Client DB mode and use Server DB instead.
Upgrade the OSS SDK to version 2.2.0 to fix runtime exceptions caused by OSS SDK bugs.
Add MNS support.
- For the 1.0.x series SDKs
- Incompatible: The package structure is adjusted. The package name com.aliyun is replaced with com.aliyun.emr.
- For the 1.0.x series SDKs
Modify the project groupId from com.aliyun to com.aliyun.emr. The modified POM dependency is:
Optimize the LoghubUtils interface and its input parameters.
Optimize the LogStore data output format and add the “topic” and “source” fields.
Add a configurable time interval for pulling data from LogStore. The parameter is “spark.logservice.fetch.interval.millis”, and the default value is 200 milliseconds.
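One way to override this interval is to pass the parameter through the standard `--conf` option of `spark-submit`. The sketch below only assembles the command line; the helper name `spark_submit_args`, the JAR file name, and the main class in the usage note are hypothetical placeholders.

```python
FETCH_INTERVAL_KEY = "spark.logservice.fetch.interval.millis"
DEFAULT_FETCH_INTERVAL_MS = 200  # default value stated in the release note


def spark_submit_args(app_jar, main_class, fetch_interval_ms=DEFAULT_FETCH_INTERVAL_MS):
    """Build a spark-submit argument list that sets the fetch interval.

    Hypothetical helper: it only formats arguments and does not launch Spark.
    """
    if fetch_interval_ms <= 0:
        raise ValueError("fetch_interval_ms must be positive")
    return [
        "spark-submit",
        "--class", main_class,
        "--conf", f"{FETCH_INTERVAL_KEY}={fetch_interval_ms}",
        app_jar,
    ]
```

For instance, `spark_submit_args("my-job.jar", "com.example.MyJob", 500)` yields a command that pulls data every 500 milliseconds instead of the default 200.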
Update the ODPS SDK dependency to version 0.20.7-public.
Downgrade the Guava dependency to version 11.0.2 to avoid conflicts with the Guava version in Hadoop.
Computing tasks support files larger than 5 GB.
- Add configuration parameters for the OSS Client.
- Fix the OSS URI parsing bug.
Optimize OSS URI settings.
Add MQ support.
Add Log Service support.
Support the OSS append write feature.
Support uploading data to OSS using the multipart method.
- Support copying OSS data using the upload part copy method.
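Multipart upload and upload part copy both handle an object as a sequence of numbered parts that the service assembles afterwards. The sketch below shows only the planning step, that is, how a byte range might be divided into parts. It does not call OSS, and the function name `plan_parts` and the 128 MiB part size in the usage note are assumptions, not SDK defaults.

```python
def plan_parts(total_size, part_size):
    """Return (part_number, offset, length) tuples covering total_size bytes.

    Part numbers start at 1. Hypothetical planning helper; the actual
    transfer of each part is out of scope here.
    """
    if total_size < 0:
        raise ValueError("total_size must not be negative")
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    parts = []
    offset = 0
    part_number = 1
    while offset < total_size:
        # The final part may be shorter than part_size.
        length = min(part_size, total_size - offset)
        parts.append((part_number, offset, length))
        offset += length
        part_number += 1
    return parts
```

Planning a 6 GiB object with an assumed 128 MiB part size, `plan_parts(6 * 1024**3, 128 * 1024**2)`, gives 48 equal parts; each tuple can then drive one part upload or one part copy.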
This document describes how to use the SDK in Spark to read and write data in Alibaba Cloud OSS, ODPS, Log Service, and MQ.