This topic describes how to use the data synchronization feature of DataWorks to migrate data from Hadoop Distributed File System (HDFS) to MaxCompute. DataWorks supports data synchronization between MaxCompute and Hadoop or Spark.
Prerequisites
- MaxCompute is activated. A MaxCompute project is created.
In this example, the bigdata_DOC project in the China (Hangzhou) region is used. For more information, see Activate MaxCompute.
- A Hadoop cluster is created.
Before data migration, make sure that your Hadoop cluster works properly. You can use Alibaba Cloud E-MapReduce to create a Hadoop cluster. For more information, see Create a cluster.
In this example, the following configurations are specified for the Hadoop cluster:
- E-MapReduce version: E-MapReduce V3.11.0
- Cluster type: Hadoop
- Required software services: HDFS 2.7.2, YARN 2.7.2, Hive 2.3.3, Ganglia 3.7.2, Spark 2.2.1, Hue 4.1.0, Zeppelin 0.7.3, Tez 0.9.1, Sqoop 1.4.6, Pig 0.14.0, ApacheDS 2.0.0, and Knox 0.13.0
The Hadoop cluster is deployed on the classic network in the China (Hangzhou) region with the high availability (HA) mode disabled. A public IP address and an internal IP address are configured for the Elastic Compute Service (ECS) instance in the primary instance group.
Procedure
Result
- On the left-side navigation submenu of the DataStudio page, click Ad-Hoc Query.
- On the Ad-Hoc Query tab, move the pointer over the Create icon and choose the type of node that you want to create.
- In the Create Node dialog box, set the parameters and click Commit. In the code editor of the created ad hoc query node, execute the following SQL statement to view the data that is synchronized to the hive_doc_good_sale table:
--Check whether the data is written to MaxCompute.
select * from hive_doc_good_sale where pt=1;
You can also run the select * from hive_doc_good_sale where pt=1; statement by using the odpscmd tool to query the synchronized data.
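For reference, the MaxCompute table that is queried in this example contains the same columns as those in the sync node configuration below. The following DDL statement is only a minimal sketch; the column types are assumptions derived from the column mapping in the writer configuration, so adjust them to your actual schema:
-- A minimal sketch of the MaxCompute table hive_doc_good_sale.
-- Column types are assumed from the sync node configuration below.
CREATE TABLE IF NOT EXISTS hive_doc_good_sale (
    create_time  string,
    category     string,
    brand        string,
    buyer_id     string,
    trans_num    bigint,
    trans_amount double,
    click_cnt    bigint
)
PARTITIONED BY (pt string);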
If you want to migrate data from MaxCompute to Hadoop, you can create a similar sync node and configure it in the code editor. The following code shows a sample configuration of the sync node:
{
    "configuration": {
        "reader": {
            "plugin": "odps",
            "parameter": {
                "partition": "pt=1",
                "isCompress": false,
                "datasource": "odps_first",
                "column": [
                    "create_time",
                    "category",
                    "brand",
                    "buyer_id",
                    "trans_num",
                    "trans_amount",
                    "click_cnt"
                ],
                "table": "hive_doc_good_sale"
            }
        },
        "writer": {
            "plugin": "hdfs",
            "parameter": {
                "path": "/user/hive/warehouse/hive_doc_good_sale",
                "fileName": "pt=1",
                "datasource": "HDFS_data_source",
                "column": [
                    {
                        "name": "create_time",
                        "type": "string"
                    },
                    {
                        "name": "category",
                        "type": "string"
                    },
                    {
                        "name": "brand",
                        "type": "string"
                    },
                    {
                        "name": "buyer_id",
                        "type": "string"
                    },
                    {
                        "name": "trans_num",
                        "type": "BIGINT"
                    },
                    {
                        "name": "trans_amount",
                        "type": "DOUBLE"
                    },
                    {
                        "name": "click_cnt",
                        "type": "BIGINT"
                    }
                ],
                "defaultFS": "hdfs://47.99.162.100:9000",
                "writeMode": "append",
                "fieldDelimiter": ",",
                "encoding": "UTF-8",
                "fileType": "text"
            }
        },
        "setting": {
            "errorLimit": {
                "record": "1000"
            },
            "speed": {
                "throttle": false,
                "concurrent": 1,
                "mbps": "1"
            }
        }
    },
    "type": "job",
    "version": "1.0"
}
Before you run the sync node to migrate data from MaxCompute to Hadoop, you must configure the Hadoop cluster. For more information, see HDFS Writer. After the sync node is run, you can copy the synchronized file.
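For example, after you copy the synchronized file to the corresponding partition directory of the Hive table, you can register the partition and verify the data in Hive. The following HiveQL statements are a minimal sketch and assume a partitioned Hive table named hive_doc_good_sale with a pt partition column; adjust the table name, partition value, and directory to your environment:
-- A minimal sketch: register the copied file as a Hive partition and verify
-- the data. Assumes the synchronized file has been copied to
-- /user/hive/warehouse/hive_doc_good_sale/pt=1/.
ALTER TABLE hive_doc_good_sale ADD IF NOT EXISTS PARTITION (pt = '1');
SELECT COUNT(*) FROM hive_doc_good_sale WHERE pt = '1';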