By Lin En Shu, Solution Architect
MaxCompute (previously known as ODPS) is a general-purpose, fully managed, multi-tenant data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and ensure data security.
Alibaba Cloud offers a platform called DataWorks for users to perform data ingestion, data processing, and data management in MaxCompute. It provides fully hosted workflow services and a one-stop development and management interface to help enterprises mine and explore the full value of their data. DataWorks uses MaxCompute as its core computing and storage engine to provide massive offline data processing, analysis, and mining capabilities.
Currently, data from the following data sources can be imported to or exported from the workspace through the data integration function: RDS, MySQL, SQL Server, PostgreSQL, MaxCompute, ApsaraDB for Memcache, DRDS, OSS, Oracle, FTP, dm, HDFS, and MongoDB.
This document focuses on data ingestion from Alibaba Cloud's Object Storage Service (OSS).
Visit the OSS console and select Create Bucket.
Fill in the bucket information (sample values for this tutorial below).
The new bucket name is visible on the left panel of the console.
Go to Files and Create Directory.
Note that the source file cannot be at the root of the bucket, so a directory must be created.
Go into the directory and upload the source CSV file. This tutorial uses a file called sample_telco_calls.csv.
After the file is successfully uploaded, it is visible in the OSS console.
For DataWorks to access files in the OSS bucket, a security token must be authorized from OSS.
Press Security Token in the OSS console.
Press Start Authorization, and the OSS security token for sub-account access through RAM and STS will be configured.
Go to DataWorks and then Data Integration.
On the Data Integration main page, press New Source to create a data source that syncs from OSS.
Select OSS as the data source.
Configure the OSS data source information (sample values in this tutorial below).
After that, press Test Connectivity to check whether DataWorks can connect to the OSS bucket.
If it is successful, a green box pops up at the top right corner saying "connectivity test successfully".
In DataWorks Data Integration, click Data Sources on the left navigation panel, and the newly created OSS data source will be visible there.
Go to Sync Tasks on the left panel in Data Integration.
Press Wizard Mode to set up data ingestion from OSS.
The data source is the OSS data source that was created earlier. The object prefix is the full path of the source file within the OSS bucket. With the setup used in this tutorial, it is "sample_dataset/sample_telco_calls.csv".
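The object prefix can be assembled from the directory and file name chosen earlier. The following is a minimal sketch, not part of DataWorks or any OSS SDK; the helper name `build_object_prefix` is hypothetical, and it only encodes the rule stated above that the source file must sit inside a directory rather than at the bucket root.

```python
import posixpath


def build_object_prefix(directory: str, filename: str) -> str:
    """Join an OSS directory and a file name into an object prefix."""
    directory = directory.strip("/")
    filename = filename.lstrip("/")
    if not directory:
        # The tutorial requires the source file to live inside a directory.
        raise ValueError("the source file must not sit at the bucket root")
    # OSS object keys use forward slashes, so posixpath is used on any OS.
    return posixpath.join(directory, filename)


# Values taken from this tutorial.
print(build_object_prefix("sample_dataset", "sample_telco_calls.csv"))
# prints: sample_dataset/sample_telco_calls.csv
```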
Select the version/type of the source file as "csv", and set the delimiter to "," (comma).
If the source data file in OSS has a header row, select header "Yes".
Press "data preview" to preview the data and check that it is correct.
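The same settings (comma delimiter, header row, a short preview) can be checked locally before uploading the file. This is a rough sketch using Python's standard `csv` module; the function `preview_csv` and the sample column names and values are hypothetical, since the actual contents of sample_telco_calls.csv are not shown in this tutorial.

```python
import csv
import io
from itertools import islice


def preview_csv(text, delimiter=",", has_header=True, limit=5):
    """Return (header, first rows) of CSV text, similar in spirit to a data preview."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # With has_header=True the first line is treated as column names, not data.
    header = next(reader, None) if has_header else None
    rows = list(islice(reader, limit))
    return header, rows


# Hypothetical sample content, for illustration only.
SAMPLE = "caller_id,call_minutes\nA001,12\nA002,7\n"

header, rows = preview_csv(SAMPLE)
print(header)
for row in rows:
    print(row)
```

If the header option is set incorrectly, the column names show up as the first data row, which is easy to spot in a preview like this.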
Choose odps_first (odps) as the data ingestion target. odps_first is the default data repository for MaxCompute.
Before data can be ingested into MaxCompute, a table has to be created in MaxCompute.
Press Create New Target Table
Enter the table creation statement. (Sample table in this tutorial)
Press Next after the newly created table is selected.
It is important to ensure that the columns in the source data file are correctly mapped to the columns of the created table. The recommended approach is to keep the columns in the source data in the same order as the columns in the ODPS table.
Press "peer mapping" to map source to target.
Press Next after the column mapping is done correctly. In this tutorial the columns of the source data file and the columns of the ODPS table are the same, so the straight lines mapped from source to target are correct.
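To make the idea of mapping by column order concrete, here is a small sketch. It assumes that the mapping pairs the first source column with the first target column, the second with the second, and so on, which matches the straight lines described above. The function `map_by_position` and the column names are hypothetical and are not part of DataWorks.

```python
def map_by_position(source_columns, target_columns):
    """Pair source and target columns by their position in each list."""
    if len(source_columns) != len(target_columns):
        raise ValueError(
            f"column count mismatch: {len(source_columns)} source "
            f"vs {len(target_columns)} target"
        )
    return list(zip(source_columns, target_columns))


# Hypothetical column names, for illustration only.
source = ["caller_id", "call_minutes"]
target = ["caller_id", "call_minutes"]

for src, dst in map_by_position(source, target):
    # A name mismatch at the same position is worth a second look.
    flag = "" if src == dst else "  <- check this pair"
    print(f"{src} -> {dst}{flag}")
```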
Select Maximum Operating Rate and Concurrent Jobs. Press Next.
Verify the configuration and, if everything is correctly configured, press Save.
Enter a name for this data ingestion task to save it.
After saving, press the "operation" button to initiate data ingestion from OSS.
Monitor the log in the bottom panel to check the status of the data synchronization task.
If the data synchronization task ends with return code: , the task was successful.
Go to Data Development on the top panel, and query the table into which the data has been ingested.
Run "select * from xxxxx_demo.telco_call_mins_oss ;"
The result is displayed in the log tab. Validate the data queried from the MaxCompute ODPS table against the source file.
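For larger files, checking by eye becomes impractical. One possible way to validate is to compare the rows of the source file with the rows returned by the query, ignoring order. The sketch below assumes the query result has already been exported or copied into a list of rows; `load_source_rows`, `compare_rows`, and the sample values are hypothetical, and values are compared as strings.

```python
import csv
import io
from collections import Counter


def load_source_rows(csv_text, has_header=True):
    """Read CSV text into a list of row tuples, skipping the header if present."""
    reader = csv.reader(io.StringIO(csv_text))
    if has_header:
        next(reader, None)
    return [tuple(row) for row in reader]


def compare_rows(source_rows, table_rows):
    """Report rows present on one side only, ignoring order but counting duplicates."""
    source = Counter(tuple(str(v) for v in row) for row in source_rows)
    table = Counter(tuple(str(v) for v in row) for row in table_rows)
    return {
        "missing_in_table": sorted((source - table).elements()),
        "unexpected_in_table": sorted((table - source).elements()),
    }


# Hypothetical data, for illustration only.
source_rows = load_source_rows("caller_id,call_minutes\nA001,12\nA002,7\n")
table_rows = [("A001", 12), ("A002", 7)]  # rows as returned by the query

print(compare_rows(source_rows, table_rows))
```

Two empty lists in the result indicate that every source row was found in the table and that the table holds no extra rows.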
Ingesting data from OSS in the DataWorks IDE is user-friendly and can be done end to end through a web-based interface. This enables customers, especially business users, to do it quickly and simply, so that they can focus their time and effort on more important tasks such as running big data computations.