Develop and Manage Data Offline with DataWorks

Welcome

In this tutorial, we’ll show you how to get started with Alibaba Cloud DataWorks.

What Is DataWorks?

DataWorks is a web front end that provides a fully hosted Integrated Development Environment (IDE) for your big data processing requirements. It is a one-stop development and management interface that helps you extract more value from your business data.

DataWorks allows you to input and output large amounts of data from and to multiple sources, and provides computational tools for analyzing that data.

DataWorks uses the Alibaba Cloud product MaxCompute as its core computational engine. We present MaxCompute in another tutorial.

DataWorks supports time-based and dependency-based task scheduling, running tens of millions of tasks accurately and on time, so you don't need to worry about the low-level details.

Preparatory Steps

• You’ll need an Alibaba Cloud account – we’ll assume you already have one.
• We also assume you have an Access Key and any Role Access Management set up that may be required by the service.
• You’ll have to Buy and Activate MaxCompute – we’ll show you how to do that.
• If we need to pick a region, we’ll use the European region EU Central 1 (Frankfurt).

Setting up DataWorks

Before we can use DataWorks, we have to activate MaxCompute. Go to the Alibaba Cloud MaxCompute product home https://www.alibabacloud.com/product/maxcompute and click Buy Now.

Choose your region and click Buy Now.

Confirm the Service Agreement and click Activate.

Wait for the success notification.

After a few minutes, click through to the console and create a project. Select the correct region and service.

For this quick start tutorial, we have selected Data Development, O&M Center, and Data Management. We will look at Data Integration later on. Click Next step.

Fill out the project details.

Select the Advanced Settings and click Create Project.

You now have a project in your DataWorks IDE project list.

You can also see your project on the DataWorks Overview tab.

DataWorks Data Development

Let’s do some data development. First, we have to create a table and upload some data into it. Go to the DataWorks console and Enter Project.

This takes you to the DataWorks Integrated Development Environment (IDE).

Click New and Create Table.

The SQL editor appears.

The following example SQL will create a table called bank_data with a number of fields and data types.

CREATE TABLE IF NOT EXISTS bank_data
(
    age BIGINT COMMENT 'age',
    job STRING COMMENT 'job type',
    marital STRING COMMENT 'marital status',
    education STRING COMMENT 'educational level',
    default STRING COMMENT 'credit card ownership',
    housing STRING COMMENT 'mortgage',
    loan STRING COMMENT 'loan',
    contact STRING COMMENT 'contact information',
    month STRING COMMENT 'month',
    day_of_week STRING COMMENT 'day of the week',
    duration STRING COMMENT 'duration',
    campaign BIGINT COMMENT 'contact times during the campaign',
    pdays DOUBLE COMMENT 'time interval from the last contact',
    previous DOUBLE COMMENT 'previous contact times with the customer',
    poutcome STRING COMMENT 'marketing result',
    emp_var_rate DOUBLE COMMENT 'employment change rate',
    cons_price_idx DOUBLE COMMENT 'consumer price index',
    cons_conf_idx DOUBLE COMMENT 'consumer confidence index',
    euribor3m DOUBLE COMMENT 'euro deposit rate',
    nr_employed DOUBLE COMMENT 'number of employees',
    y BIGINT COMMENT 'has time deposit or not'
);

Watch out for formatting errors, which the editor will show as red dots; just delete any stray characters you see. Copy and paste the SQL into the editor and click OK.

Click Table query in the left-hand column to see your table; any tables in the project are listed there.

Let’s create a second table. Follow the same steps, add the following SQL to the editor, and click OK.

CREATE TABLE IF NOT EXISTS result_table
(
    education STRING COMMENT 'educational level',
    num BIGINT COMMENT 'number of people'
);

Your project has two tables now.

We are now going to upload some data to the project.

Upload a File

As an example, we will use a local text file, keeping in mind that DataWorks allows data imports from multiple sources; we will see examples of those later on. Download the banking.txt file here. In the project, click Import and Import local data.

Import the banking.txt file to the DataWorks IDE.

Click Next to import data.

Add the table details. The table name will autocomplete. Click Import.

The system will let you know if you have successfully imported the data or if there were any failures.

One of the ways you can check and preview the imported data is with the Table Query, Preview tab in the DataWorks IDE.
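Before importing, it can also help to sanity-check a delimited file locally. The following Python sketch previews the first columns of a couple of rows; the sample string, comma delimiter, and column order here are assumptions for illustration, as the real banking.txt may differ.

```python
import csv
import io

# Hypothetical two-row sample in the same column order as bank_data
# (age, job, marital, education); the actual file may use a different
# delimiter or column set.
sample = "44,blue-collar,married,basic.4y\n53,technician,married,unknown\n"

rows = list(csv.reader(io.StringIO(sample)))
for row in rows:
    print(row[:2])  # preview the first two columns: age, job
```

For a real file, you would replace the `io.StringIO(sample)` wrapper with `open("banking.txt")`.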

You can also import data saved in multiple source types, including RDS, MySQL, SQL Server, PostgreSQL, MaxCompute, ApsaraDB for Memcache, DRDS, OSS, Oracle, FTP, dm, HDFS, and MongoDB. We will look at integrating such data shortly.

You can also use the MaxCompute client console to tunnel data to DataWorks. We will look at that in the MaxCompute tutorial.

Now that we have imported some data, we can analyze it by creating a task. (References to Task and Flow are used interchangeably on the site and mean the same thing.)

Create a Task

In your project, click New and Create task.

We are going to create a Workflow task with manual scheduling. The scheduling setting cannot be changed later. Click Create.

Your task is created.

Double-click Virtual to create a virtual node we’ll name start.

Then create an ODPS_SQL node we’ll call insert_data.

You have two nodes created now and you can click and drag them around your IDE window.

Move the start node above the insert_data node, then click the start node and draw a line from it to the insert_data node.

This means that the insert_data node depends on the start node. Let’s add some SQL to the insert_data node. This script counts, for each education level, how many single people in the bank_data table have a mortgage. The results go into our result_table.

INSERT OVERWRITE TABLE result_table  --Insert data into result_table
SELECT education
    , COUNT(marital) AS num
FROM bank_data
WHERE housing = 'yes'
    AND marital = 'single'
GROUP BY education;
To add the SQL to the node, double click the insert_data node and copy/paste the SQL.

Click Run and watch the logs for success.

You can then query the results table to see the output.
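To make the query’s logic concrete, here is a small Python sketch (not part of DataWorks or MaxCompute) that reproduces the same filter and per-education count over a few made-up rows; the row values are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical sample rows mirroring three bank_data columns:
# (education, marital, housing)
rows = [
    ("university.degree", "single",  "yes"),
    ("high.school",       "single",  "yes"),
    ("university.degree", "single",  "yes"),
    ("high.school",       "married", "yes"),
    ("basic.9y",          "single",  "no"),
]

# Equivalent of: WHERE housing = 'yes' AND marital = 'single'
#                GROUP BY education, counting rows per group
counts = Counter(
    education
    for education, marital, housing in rows
    if housing == "yes" and marital == "single"
)

print(dict(counts))  # {'university.degree': 2, 'high.school': 1}
```

Each key/value pair corresponds to one row that the real query would write into result_table.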

Now go back to your Workflow panel and Save and Submit your task. Tasks must be submitted before they can be used in other projects.

Create a Data Sync Job

Data syncing periodically imports business data to the DataWorks workspace and periodically exports the computational results to the data source you specify for further display or operation.

This section shows how to export data to a MySQL database. Back in the DataWorks console https://workbench.data.aliyun.com/consolenew#/ go to Data Integration.

Here you have the option of importing and exporting data from multiple data sources.

We will leave the details of setting up your data source to you. A data source can be locally hosted, be hosted at public or private IPs, as well as with Alibaba Cloud products.

Scheduling and Dependence Settings

For your tasks in DataWorks, scheduling and dependency settings are available. We have already seen dependency in action: our insert_data node depended on our start node in the task we set up.

Let’s go back to our task and click the Scheduling configuration option to see what’s available.

If you created your task with ‘Periodic Scheduling’ option selected, you will be able to see the Scheduling attributes.

These attributes allow you to set the time and date and frequency for your tasks.

Scrolling down, you will see options to set dependencies on tasks, so that even if the scheduled time is reached, a task will not run until its upstream dependencies have completed.
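The rule can be sketched in a few lines of Python. This is a minimal illustration of the scheduling logic described above, not the DataWorks API: a task fires only when its trigger time has passed and every upstream dependency has finished.

```python
def should_run(now, scheduled_time, upstream_done):
    """Return True only when the time trigger has fired AND
    every upstream task has reported completion."""
    return now >= scheduled_time and all(upstream_done.values())

# Scheduled time has arrived, but the upstream 'start' node is not done:
print(should_run(now=900, scheduled_time=830, upstream_done={"start": False}))  # False

# Once every upstream task completes, the task is released to run:
print(should_run(now=900, scheduled_time=830, upstream_done={"start": True}))   # True
```

The times here are arbitrary integers standing in for timestamps; in DataWorks you configure the equivalent behavior through the Scheduling attributes panel.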

Summary

DataWorks is a centralized big data analysis environment allowing for input and output of data from multiple sources as well as analysis on that data.

We set up our DataWorks Integrated Development Environment and created some tables which we populated with data coming from a local text file. We could do the same with a wide range of data sources with the relevant configuration settings.

We showed how easy it is to do SQL analysis on the data coming into the tables and how we might like to schedule our tasks and workflows dependent on other tasks or the date and time.

We gave you a good idea of the breadth of DataWorks and how easy it is to set up reliable compute and analysis tasks on massive amounts of data.

Next, we will take a closer look at the MaxCompute computational engine and see how much of what we did in the DataWorks front end can also be done from the command-line client console.

Be sure to check online for more whitepapers, blogs, and tutorials on other Alibaba Cloud products.