
How to Ingest Data from Object Storage Service

DataWorks makes data ingestion quick and simple, allowing you to focus on big data computation.

Data ingestion from OSS with DataWorks is user friendly and easy. The whole process can be completed end to end through a web-based console, which lets customers, especially business users, finish it quickly and simply and focus their time and effort on more important tasks, such as big data computation.

In this article, we will show you how to perform data ingestion from Alibaba Cloud's Object Storage Service (OSS) with DataWorks.

After you have prepared the OSS bucket, follow the procedure below to ingest data from OSS.

  1. Log on to DataWorks and go to Data Integration.
  2. On the Data Integration main page, click New Source to create a data source for the sync from OSS.
  3. Select OSS as the data source type.
  4. Configure the OSS data source information.
  5. Click Test Connectivity to check whether the OSS bucket can be reached from DataWorks. If the test succeeds, a green message box appears in the upper-right corner indicating that the connectivity test was successful. (An independent way to pre-check bucket access from code is sketched after these steps.)
  6. In DataWorks Data Integration, click Data Sources in the left navigation panel; the newly created OSS data source is listed there.
  7. Go to Sync Tasks in the left panel of Data Integration.
  8. Click Wizard Mode to set up data ingestion from OSS.
  9. Configure the data ingestion source, the data ingestion target, the source-to-target column mapping, and channel control, then preview the settings.
  10. Name the data ingestion task and save it. After saving, click the Operation button to start data ingestion from OSS.
  11. Monitor the log in the bottom panel to check the status of the data synchronization task. If the synchronization ends with return code [0], it completed successfully.
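
Before or after configuring the data source, you can also verify bucket access from outside DataWorks. Below is a minimal sketch using the Alibaba Cloud OSS SDK for Python (oss2); the endpoint, bucket name, credentials, and prefix are placeholders that you would replace with your own values.

```python
import itertools
import oss2  # Alibaba Cloud OSS SDK for Python: pip install oss2

# Placeholder credentials, endpoint, and bucket name.
auth = oss2.Auth("<access-key-id>", "<access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-ap-southeast-1.aliyuncs.com", "<bucket-name>")

# List the first few objects under the prefix that the sync task will read,
# to confirm the bucket is reachable and the source files exist.
for obj in itertools.islice(oss2.ObjectIterator(bucket, prefix="data/"), 10):
    print(obj.key, obj.size)
```

If this listing succeeds but the connectivity test in DataWorks fails, a likely cause is a mismatch in the endpoint or access key configured in the data source.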

For more detailed information about how to prepare the OSS bucket for data ingestion and about the configuration in DataWorks Data Integration, see MaxCompute Data Ingestion from OSS.

Related Blog Posts

Drilling into Big Data – Data Interpretation (3)

In this blog series, we will walk you through the entire cycle of Big Data analytics. Now that we are familiar with the basics and with cluster creation, it is time to understand the data acquired from various sources and the most suitable data format for ingesting it into the Big Data environment. For the batch scenario discussed in the post, the pipeline looks like this (a minimal PySpark sketch of the storage-to-querying steps follows the list):

  1. Source: The source sheet is based on various events that are registered over time and can be processed in batches.
  2. Sort of data: The data is either extracted into Excel sheets or captured in databases; either way, it is structured.
  3. Tool to ingest: Since this falls under batch processing, the tool used is Sqoop.
  4. Storage: With the help of Sqoop, the data is moved into HDFS.
  5. Tool to process: Spark
  6. Querying: Hive
  7. Analysis: Zeppelin/Quick BI
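
Here is a minimal PySpark sketch of the storage, processing, and querying steps above. It assumes Sqoop has already landed the structured data in HDFS as CSV files; the HDFS path, column names, and table name are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("batch-event-processing")
    .enableHiveSupport()          # so the result can be queried from Hive
    .getOrCreate()
)

# Read the Sqoop output from HDFS (structured, batch data).
events = spark.read.option("header", "true").csv("hdfs:///user/sqoop/events/")

# A simple processing step in Spark: count events per type.
summary = events.groupBy("event_type").count()

# Persist the result as a Hive table so it can be queried with Hive
# or visualized in Zeppelin / Quick BI downstream.
summary.write.mode("overwrite").saveAsTable("analytics.event_summary")
```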

Drilling into Big Data – Data Ingestion (4)

In this article, we will take a closer look into the concepts and usage of HDFS and Sqoop for data ingestion.

We will take a deep dive into HDFS, the storage layer of Hadoop and one of the world's most reliable storage systems. Distributed storage and data replication are the major features of HDFS that make it a fault-tolerant storage system. The features that make HDFS suitable for running large datasets on commodity hardware are fault tolerance, high availability, reliability, and scalability.
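
As a small illustration of working with HDFS from Python, the sketch below uses the third-party hdfs package (a WebHDFS client) to upload a file and read back its replication factor. The NameNode address, user, and paths are placeholders, and in the pipeline described above Sqoop would normally write the data for you.

```python
from hdfs import InsecureClient  # pip install hdfs

client = InsecureClient("http://namenode-host:9870", user="hadoop")

# Upload a local file into HDFS.
client.upload("/user/hadoop/events/events.csv", "events.csv", overwrite=True)

# Inspect the file status; the replication factor shows how many copies
# HDFS keeps of each block, which is what provides fault tolerance.
status = client.status("/user/hadoop/events/events.csv")
print("replication factor:", status["replication"])
```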

Related Documentation

Import or export data using Data Integration

Use the Data Integration function of DataWorks to create data synchronization tasks and to import and export MaxCompute data. (A minimal programmatic alternative for reading MaxCompute data is sketched after the notes below.)

Note:

  1. Only the project administrator can create a data source. Other roles can only view the data source.
  2. If the data source you want to add is the current MaxCompute project, skip this operation. After the project is created, it automatically appears in Data Integration as a MaxCompute data source named odps_first by default.
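
For reference, the sketch below shows one way to read a MaxCompute table programmatically with PyODPS, the MaxCompute SDK for Python. The credentials, endpoint, project, and table name are placeholders, and Data Integration remains the usual, UI-based route.

```python
from odps import ODPS  # pip install pyodps

o = ODPS(
    "<access-key-id>",
    "<access-key-secret>",
    project="<your-maxcompute-project>",   # the project exposed as odps_first
    endpoint="https://service.<region>.maxcompute.aliyun.com/api",
)

table = o.get_table("ods_oss_events")      # hypothetical table name
with table.open_reader() as reader:
    for record in reader[:5]:              # preview a few records
        print(record.values)
```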

Create a Log Service source table

This topic describes how to create a Log Service source table in Realtime Compute. It also describes the attribute fields, WITH parameters, and field type mapping involved in the table creation process.

Log Service is an all-in-one real-time data logging service that Alibaba Group has developed and tested in many big data scenarios. Based on Log Service, you can quickly finish tasks such as data ingestion, consumption, delivery, query, and analysis without any extra development work. This can help you improve O&M and operational efficiency, and build up the capability to process large amounts of logs in the data technology era.
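
As an illustration of consuming Log Service data outside Realtime Compute, the sketch below queries a logstore with the aliyun-log-python-sdk. The endpoint, project, logstore, and query string are placeholders, and the exact request parameters may differ between SDK versions.

```python
import time
from aliyun.log import LogClient, GetLogsRequest  # pip install aliyun-log-python-sdk

client = LogClient(
    "ap-southeast-1.log.aliyuncs.com",   # region endpoint (placeholder)
    "<access-key-id>",
    "<access-key-secret>",
)

# Query the last 15 minutes of logs from a logstore.
now = int(time.time())
request = GetLogsRequest(
    project="<your-project>",
    logstore="<your-logstore>",
    fromTime=now - 900,
    toTime=now,
    query="*",
)
response = client.get_logs(request)
for log in response.get_logs():
    print(log.get_time(), log.get_contents())
```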

Related Products

Alibaba Cloud Elasticsearch

Alibaba Cloud Elasticsearch is a cloud-based service that offers built-in integrations such as Kibana, commercial features, and Alibaba Cloud VPC, Cloud Monitor, and Resource Access Management. It can securely ingest data from any source and search, analyze, and visualize it in real time. With Pay-As-You-Go billing, Alibaba Cloud Elasticsearch costs 30% less than self-built solutions and saves you the hassle of maintaining and scaling your platform.
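
As a small, hypothetical example of that ingest-and-search workflow, the sketch below indexes one document and searches it back using the official Elasticsearch client for Python; the cluster URL, credentials, and index name are placeholders.

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("https://<your-es-endpoint>:9200",
                   basic_auth=("elastic", "<password>"))

# Ingest one document.
es.index(index="oss-events", document={"event_type": "upload", "size": 1024})
es.indices.refresh(index="oss-events")   # make it visible to search immediately

# Search it back.
result = es.search(index="oss-events", query={"match": {"event_type": "upload"}})
print(result["hits"]["total"])
```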

DataWorks

DataWorks is a Big Data platform product launched by Alibaba Cloud. It provides one-stop Big Data development, data permission management, offline job scheduling, and other features. It supports data integration, MaxCompute SQL, MaxCompute MR, machine learning, and shell tasks.
