
Data Ingestion from OSS

Last Updated: Apr 10, 2018

Introduction

MaxCompute (previously known as ODPS) is a general-purpose, fully managed, multi-tenant data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and ensure data security.

Alibaba Cloud offers a platform called DataWorks for users to perform data ingestion, data processing, and data management in MaxCompute. It provides fully hosted workflow services and a one-stop development and management interface to help enterprises mine and explore the full value of their data. DataWorks uses MaxCompute as its core computing and storage engine to provide massive offline data processing, analysis, and mining capabilities.

Currently, data from the following data sources can be imported to or exported from the workspace through the data integration function: RDS, MySQL, SQL Server, PostgreSQL, MaxCompute, ApsaraDB for Memcache, DRDS, OSS, Oracle, FTP, dm, HDFS, and MongoDB.

In this document, the focus is on data ingestion from OSS.

Solution Architecture

In this solution architecture, the user ingests data from OSS into a MaxCompute (ODPS) table through the web-based DataWorks platform.

Prerequisites and Preparation

  • An Alibaba Cloud Account.
  • A sample dataset.
  • Define a sample database table.

    Component    Configuration      Values (sample values)
    OSS          Bucket Name        xxx-bigdata-demo
                 Region             Asia Pacific SE 3 (Kuala Lumpur)
                 ACL                Public Read
                 Sub-directory      sample_dataset
                 Source csv file    sample_telco_calls.csv
    DataWorks    Data source name   oss_demo_xxx
                 Endpoint           http://xxxx.aliyuncs.com
                 Bucket             xxx-bigdata-demo
                 Access id          **
                 Access key         **
                 Object prefix      sample_dataset/sample_telco_calls.csv
                 ODPS Table         create table if not exists telco_call_mins_oss (state string, area_code string, phone_num string, day_min double, night_min double, intl_min double, cust_call bigint);

Getting Started

Setting up OSS bucket

  • Go to OSS and click Create Bucket.


  • Enter the bucket information (sample values for this tutorial are shown as follows).

    Bucket Name   xxx-bigdata-demo
    Region        Asia Pacific SE 3 (Kuala Lumpur)
    ACL           Public Read

  • The new bucket name is visible on the left panel of the console.


  • Go to Files and click Create Directory. Enter the directory name (in this tutorial, sample_dataset).

    Note: The source file cannot be at the root of the bucket, hence a directory must be created.

  • Go into the directory and upload the source csv file. A sample can be downloaded here.


  • After the file is successfully uploaded, it is visible in the OSS console.

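    For reference, the upload can also be scripted with oss2, the official Alibaba Cloud OSS Python SDK. This is a minimal sketch, not part of the console workflow; the credentials, endpoint, and local file path are placeholder assumptions to replace with your own values.

    # pip install oss2 -- the official Alibaba Cloud OSS Python SDK
    import oss2

    # Placeholder credentials and region endpoint (assumed values).
    auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
    bucket = oss2.Bucket(auth, 'http://oss-ap-southeast-3.aliyuncs.com',
                         'xxx-bigdata-demo')

    # Upload the sample csv under the sample_dataset/ directory, since the
    # source file cannot sit at the bucket root.
    bucket.put_object_from_file('sample_dataset/sample_telco_calls.csv',
                                'sample_telco_calls.csv')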

OSS Security Token Authorization

  • For DataWorks to access files in the OSS bucket, a security token has to be authorized in OSS.

  • Click Security Token in the OSS console.


  • Click Start Authorization to configure the OSS security token for sub-account access through RAM and STS.

Set up OSS as a Data Source in DataWorks

  • Go to DataWorks > Data Integration.


  • On the Data Integration main page, click New Source to create a data source sync from OSS.


  • Select OSS as data source.

  • Configure the OSS data source information (sample values in this tutorial are shown as follows).

    Data source name   oss_demo_xxx
    Endpoint           http://xxxx.aliyuncs.com
    Bucket             xxx-bigdata-demo
    Access id          **
    Access key         **

    Note: Do not change the Access id and Access key, and do not disclose them to other people.

  • After that, click test connectivity to check whether the OSS bucket can be connected from DataWorks.

    If the connectivity test is successful, click Complete.

  • When it succeeds, a green box displays at the upper right corner confirming the connectivity test was successful.


  • In DataWorks Data Integration, click Data Sources in the left navigation panel; the newly created OSS data source is visible there.

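    As an optional sanity check outside DataWorks, the same credentials can be verified with a short oss2 script that lists the objects the sync task will read. This is a sketch; the credentials and endpoint are placeholder assumptions.

    # pip install oss2
    import oss2

    # Placeholder credentials and region endpoint (assumed values).
    auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
    bucket = oss2.Bucket(auth, 'http://oss-ap-southeast-3.aliyuncs.com',
                         'xxx-bigdata-demo')

    # Listing under the object prefix confirms the bucket is reachable and
    # the credentials can read the source file.
    for obj in oss2.ObjectIterator(bucket, prefix='sample_dataset/'):
        print(obj.key, obj.size)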

Data Ingestion: Data Source from OSS

  • Go to Sync Tasks in the left panel of Data Integration.

  • Click Wizard Mode to set up data ingestion from OSS.


Step 1: Configure Data Ingestion Source

  • The data source is the OSS data source created in the preceding section.

  • The object prefix is the path of the source object within the OSS bucket, without the bucket name.

    In this tutorial, according to the preceding setup, it is “sample_dataset/sample_telco_calls.csv”.

  • Select csv as the version/type of the source file, with “,” (comma) as the delimiter.

  • If the source data file in OSS has a header row, select Yes for header.

  • Click data preview to validate whether the data is correct.

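    A local equivalent of data preview, sketched here with placeholder credentials and endpoint, is to fetch the object with oss2 and print its first few lines.

    # pip install oss2
    import oss2

    # Placeholder credentials and region endpoint (assumed values).
    auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
    bucket = oss2.Bucket(auth, 'http://oss-ap-southeast-3.aliyuncs.com',
                         'xxx-bigdata-demo')

    # Read the source object and show the header plus the first rows.
    body = bucket.get_object('sample_dataset/sample_telco_calls.csv').read()
    for line in body.decode('utf-8').splitlines()[:5]:
        print(line)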

Step 2: Configure Data Ingestion Target

  • Choose odps_first(odps) as the data ingestion target.

    odps_first is the default data repository for MaxCompute.

  • Before data can be ingested into MaxCompute, a table has to be created in MaxCompute.

  • Click Create New Target Table.


  • Enter the table creation statement. (Sample table in this tutorial)

    ODPS Table:

    create table if not exists telco_call_mins_oss (
        state string,
        area_code string,
        phone_num string,
        day_min double,
        night_min double,
        intl_min double,
        cust_call bigint
    );


  • Click next after the newly created table is selected.
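    If you prefer to create the table outside the DataWorks console, the same DDL can be submitted with PyODPS, the MaxCompute Python SDK. This is a minimal sketch; the credentials, project name, and endpoint are placeholder assumptions.

    # pip install pyodps -- the MaxCompute (ODPS) Python SDK
    from odps import ODPS

    # Placeholder credentials, project, and public endpoint (assumed values).
    o = ODPS('<access_key_id>', '<access_key_secret>', 'xxx_demo',
             endpoint='http://service.odps.aliyun.com/api')

    # Submit the same DDL used in the wizard to create the target table.
    o.execute_sql('''
        create table if not exists telco_call_mins_oss (
            state string, area_code string, phone_num string,
            day_min double, night_min double, intl_min double,
            cust_call bigint
        )
    ''')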

Step 3: Configure Source and Target Column Mapping

  • It is important to ensure the order of columns in the source data file is correctly mapped to the columns of the table created.

    The recommended approach is to keep the column order of the source data file the same as the column order of the ODPS table.

  • Click peer mapping to map source to target.


  • Click next after the column mapping is done correctly.

    • In this tutorial the columns of the source data file and the columns of the ODPS table are the same, hence the straight-line mapping from source to target is correct.

Step 4: Configure Channel Control

  • Select the Maximum Operating Rate and the number of Concurrent Jobs.

  • Click Next.


Step 5: Configuration Preview

  • Verify the configuration and, if everything is correctly configured, click Save.

  • Name this data ingestion task and save it.

  • After saving, click the operation button to initiate data ingestion from OSS.

  • Monitor the log at the bottom panel to check the status of the data synchronization task.

  • If the data synchronization finishes with return code: [0], it is successful.

Data Validation: Validate the Ingested Data through MaxCompute Data Development

  • Go to Data Development at the upper panel and query the table into which the data has been ingested.

  • Run select * from xxx_demo.telco_call_mins_oss ;.

  • The result is displayed at the log tab.

  • Validate the data queried from the MaxCompute ODPS table against the source csv file.
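    The same check can also be run outside the console with PyODPS. The sketch below assumes placeholder credentials and the public MaxCompute endpoint, and reuses the project name xxx_demo from the query above.

    # pip install pyodps
    from odps import ODPS

    # Placeholder credentials, project, and public endpoint (assumed values).
    o = ODPS('<access_key_id>', '<access_key_secret>', 'xxx_demo',
             endpoint='http://service.odps.aliyun.com/api')

    # Read back a few ingested rows to compare with the source csv file.
    with o.execute_sql('select * from telco_call_mins_oss limit 10').open_reader() as reader:
        for record in reader:
            print(record)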

Conclusion

Ingesting data from OSS through the DataWorks IDE is user friendly and easy, and the whole process can be done end to end in a web browser. This enables customers, especially business users, to get it done quickly and simply, allowing them to focus their time and effort on the more important task: running computations on their big data.

Products     Product Links for Reference
OSS          https://www.alibabacloud.com/product/oss
DataWorks    https://www.alibabacloud.com/product/ide
MaxCompute   https://www.alibabacloud.com/product/maxcompute