MaxCompute (previously known as ODPS) is a general-purpose, fully managed, multi-tenant data processing platform for large-scale data warehousing. MaxCompute supports various data import solutions and distributed computing models, enabling users to query massive datasets efficiently, reduce production costs, and ensure data security.
Alibaba Cloud offers a platform called DataWorks for users to perform data ingestion, data processing, and data management in MaxCompute. It provides fully hosted workflow services and a one-stop development and management interface to help enterprises mine and explore the full value of their data. DataWorks uses MaxCompute as its core computing and storage engine to provide massive offline data processing, analysis, and mining capabilities.
Currently, data from the following data sources can be imported to or exported from the workspace through the data integration function: RDS, MySQL, SQL Server, PostgreSQL, MaxCompute, ApsaraDB for Memcache, DRDS, OSS, Oracle, FTP, dm, HDFS, and MongoDB.
This document focuses on data ingestion from OSS.
In this solution architecture, the user ingests data from OSS into a MaxCompute (ODPS) table through the web-based DataWorks platform.
- An Alibaba Cloud Account.
- A sample dataset.
Define a sample database table.
|Component|Configuration|Values (sample values)|
|---|---|---|
|OSS|Bucket Name|xxx-bigdata-demo|
|OSS|Region|Asia Pacific SE 3 (Kuala Lumpur)|
|OSS|ACL|Public Read|
|OSS|Sub-directory|sample_dataset|
|OSS|Source csv file|sample_telco_calls.csv|
|DataWorks|Data source name|oss_demo_xxx|
|DataWorks|Endpoint| |
|DataWorks|Bucket|xxx-bigdata-demo|
|DataWorks|Access id|**|
|DataWorks|Accesskey|**|
|DataWorks|Object prefix|sample_dataset/sample_telco_calls.csv|
|ODPS|Table|create table if not exists telco_call_mins_oss ( state string, area_code string, phone_num string, day_min double, night_min double, intl_min double, cust_call bigint) ;|
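The ODPS table definition above assigns a type to each of its seven columns. As a rough illustration, the Python sketch below mirrors that column list and converts one CSV record into typed values, which is one way to spot fields that would not fit the table before running a sync task. The sample record, and the choice to represent double as float and bigint as int, are assumptions made for this sketch; the tutorial does not show the contents of the file.

```python
# Column names and types copied from the tutorial's create table statement.
# Mapping double -> float and bigint -> int is an assumption of this sketch.
COLUMNS = [
    ("state", str),
    ("area_code", str),
    ("phone_num", str),
    ("day_min", float),
    ("night_min", float),
    ("intl_min", float),
    ("cust_call", int),
]


def convert_row(fields):
    """Convert one list of CSV strings into a dict of typed values."""
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return {name: cast(value) for (name, cast), value in zip(COLUMNS, fields)}


# Hypothetical sample record, used only for illustration.
sample = ["KS", "415", "382-4657", "265.1", "244.7", "10.0", "1"]
row = convert_row(sample)
print(row)
```

A record with the wrong number of fields, or with text in a numeric column, raises ValueError here instead of reaching the table.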
Go to OSS and click Create Bucket.
Enter the bucket information (sample values for this tutorial are shown as follows).
- Bucket Name: xxx-bigdata-demo
- Region: Asia Pacific SE 3 (Kuala Lumpur)
- ACL: Public Read
The new bucket name is visible on the left panel of the console.
Go to Files and click Create Directory.
Note: The source file cannot be at the root of the bucket, hence a directory must be created.
Go into the directory and upload the source csv file.
After the file is uploaded successfully, it is visible in the OSS console.
For DataWorks to access files in the OSS bucket, a security token must be authorized from OSS.
Click Security Token in the OSS console.
Click Start Authorization to configure the OSS security token for sub-account access through RAM and STS.
Go to DataWorks > Data Integration.
On the Data Integration main page, click New Source to create a data source that syncs from OSS.
Select OSS as data source.
Configure the OSS data source information (sample values in this tutorial are shown as follows).
- Data source name: oss_demo_xxx
- Endpoint:
- Bucket: xxx-bigdata-demo
- AccessID: XXXXXXXX
- Accesskey: XXXXXXXX
Note: Do not change the AccessID and AccessKey, and do not disclose them to anyone.
Next, click test connectivity to check whether DataWorks can connect to the OSS bucket.
If the test succeeds, a green box appears in the upper-right corner with the message “connectivity test successfully”.
Then click Complete.
In DataWorks Data Integration, click Data Sources in the left navigation panel; the newly created OSS data source is listed there.
Go to Sync Tasks in the left panel of Data Integration.
Click Wizard Mode to set up data ingestion from OSS.
For the data source, select the OSS data source created in the preceding steps.
The object prefix is the path of the source object within the OSS bucket.
In this tutorial, following the setup above, it is “sample_dataset/sample_telco_calls.csv”.
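To make the role of the object prefix concrete, here is a minimal Python sketch. It assumes the prefix is compared against the start of each object key in the bucket, which is what the field name suggests but is not confirmed by this tutorial. The list of keys is hypothetical, and `select_objects` is an illustrative helper, not a DataWorks or OSS function.

```python
def select_objects(keys, prefix):
    """Return the object keys that start with the given prefix."""
    return [key for key in keys if key.startswith(prefix)]


# Hypothetical bucket listing, used only for illustration.
bucket_keys = [
    "sample_dataset/sample_telco_calls.csv",
    "sample_dataset/other_file.csv",
    "archive/sample_telco_calls.csv",
]

# The full path used in this tutorial selects exactly one object.
selected = select_objects(bucket_keys, "sample_dataset/sample_telco_calls.csv")
print(selected)

# Under the same assumption, a directory-level prefix would select every object in it.
print(select_objects(bucket_keys, "sample_dataset/"))
```

Under that assumption, specifying the full path of the file, as this tutorial does, avoids picking up other objects in the same directory.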
Select csv as the type of the source file, and set the delimiter to “,” (comma).
If the source data file in OSS has a header row, select Yes for the header option.
Click data preview to check that the data is correct.
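The same check can be made locally before uploading the file. The sketch below uses Python's standard csv module to read comma-delimited text, treat the first row as a header, and return the first few records, roughly what the preview step shows. The sample text and the header names are hypothetical; the tutorial does not list the contents of sample_telco_calls.csv.

```python
import csv
import io
import itertools

# Hypothetical file contents, used only for illustration.
SAMPLE_CSV = """state,area_code,phone_num,day_min,night_min,intl_min,cust_call
KS,415,382-4657,265.1,244.7,10.0,1
OH,415,371-7191,161.6,254.4,13.7,1
"""


def preview(text, has_header=True, delimiter=",", limit=5):
    """Return (header, rows) for the first `limit` records of CSV text."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Consume the first row as the header only when the file has one.
    header = next(reader, None) if has_header else None
    rows = list(itertools.islice(reader, limit))
    return header, rows


header, rows = preview(SAMPLE_CSV)
print(header)
for record in rows:
    print(record)
```

Calling `preview(SAMPLE_CSV, has_header=False)` returns the header line as the first data record, which is the kind of mistake the header option is there to prevent.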
Choose odps_first(odps) as the data ingestion target.
odps_first is the default data repository for MaxCompute.
Before data can be ingested into MaxCompute, a table must be created in MaxCompute.
Click Create New Target Table.
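Enter the table creation statement (the sample table for this tutorial):
create table if not exists telco_call_mins_oss (
  state string,
  area_code string,
  phone_num string,
  day_min double,
  night_min double,
  intl_min double,
  cust_call bigint
);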
Click next after the newly created table is selected.
The columns in the source data file must be mapped to the correct columns of the target table.
The recommended approach is to keep the columns of the source data in the same order as the columns of the ODPS table.
Click peer mapping to map source to target.
Click next after the column mapping is correct.
- In this tutorial, the columns of the source data file and the columns of the ODPS table are in the same order, so the straight lines mapped from source to target are correct.
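The mapping described above can be sketched in Python as pairing columns by position. This assumes that peer mapping links the first source column to the first target column, the second to the second, and so on, which is what the straight lines suggest. The source header names are hypothetical, and `peer_mapping` and `mismatches` are illustrative helpers rather than DataWorks functions.

```python
# Target columns in the order given by the tutorial's create table statement.
TARGET_COLUMNS = [
    "state", "area_code", "phone_num",
    "day_min", "night_min", "intl_min", "cust_call",
]

# Hypothetical header row of the source file, used only for illustration.
SOURCE_HEADER = [
    "state", "area_code", "phone_num",
    "day_min", "night_min", "intl_min", "cust_call",
]


def peer_mapping(source_columns, target_columns):
    """Pair source and target columns by position."""
    if len(source_columns) != len(target_columns):
        raise ValueError("source and target have different numbers of columns")
    return list(zip(source_columns, target_columns))


def mismatches(mapping):
    """Return the pairs whose source and target names differ."""
    return [(source, target) for source, target in mapping if source != target]


mapping = peer_mapping(SOURCE_HEADER, TARGET_COLUMNS)
print(mismatches(mapping))
```

An empty result means every line runs straight across. If two source columns were swapped, both pairs would be reported, and the data would otherwise load into the wrong columns without an error whenever the types happen to be compatible.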
Select Maximum Operating Rate and Concurrent Jobs.
Verify the configuration and, if everything is configured correctly, click Save.
Enter a name for this data ingestion task to save it.
After saving, click the operation button to start data ingestion from OSS.
Monitor the log in the bottom panel to check the status of the data synchronization task.
If the data synchronization ends with return code: , the task was successful.
Go to Data Development in the upper panel, and query the table into which the data has been ingested.
Run select * from xxx_demo.telco_call_mins_oss ;.
The result is displayed in the log tab.
Validate the data queried from the MaxCompute table against the source file.
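For more than a handful of rows, comparing by eye is unreliable. One possible approach, sketched below in Python, is to export the query result and compare it with the source records as multisets, so that missing, extra, and duplicated rows all show up. Both record lists here are hypothetical, and the type conversion assumes double maps to float and bigint to int.

```python
from collections import Counter

# One cast per column, following the tutorial's table definition
# (string -> str, double -> float, bigint -> int is an assumption).
CASTS = (str, str, str, float, float, float, int)


def typed(fields):
    """Convert one record of CSV strings into a typed tuple."""
    return tuple(cast(value) for cast, value in zip(CASTS, fields))


def compare(source_records, table_records):
    """Report records present on one side but not on the other."""
    source = Counter(typed(record) for record in source_records)
    table = Counter(tuple(record) for record in table_records)
    return {
        "missing_in_table": sorted((source - table).elements()),
        "unexpected_in_table": sorted((table - source).elements()),
    }


# Hypothetical source records (as read from the CSV) and query results.
source_records = [
    ["KS", "415", "382-4657", "265.1", "244.7", "10.0", "1"],
    ["OH", "415", "371-7191", "161.6", "254.4", "13.7", "1"],
]
table_records = [
    ("KS", "415", "382-4657", 265.1, 244.7, 10.0, 1),
    ("OH", "415", "371-7191", 161.6, 254.4, 13.7, 1),
]

report = compare(source_records, table_records)
print(report)
```

Two empty lists indicate that the table holds exactly the records in the source file.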
Ingesting data from OSS in the DataWorks IDE is user-friendly and can be done end to end through the web-based interface. This enables customers, especially business users, to complete the ingestion quickly and focus their time and effort on more important tasks, such as running big data computations.