Process Big Data with MaxCompute

Welcome

In this tutorial, we’ll show you how to get started with the Alibaba Cloud MaxCompute service.

What is MaxCompute?

MaxCompute is the computational engine behind DataWorks. DataWorks is a front-end Integrated Development Environment (IDE) that relies on MaxCompute for its back-end computational processing.

Much of what you can do on the front end with DataWorks, you can also do directly with MaxCompute via a client console. We’ll show you how in this tutorial.

Although we will briefly touch upon DataWorks in this tutorial, you can learn the full details of the DataWorks IDE in another tutorial.

Prerequisites

• You need an Alibaba Cloud account. We will assume you already have one.
• We assume you have an access key and any Resource Access Management (RAM) roles set up that the service may require.
• You’ll have to buy and activate MaxCompute – we’ll show you how to do that.
• If we need to pick a region, we’ll use the European region EU Central 1 (Frankfurt).

Setting up MaxCompute

Buy and activate the MaxCompute product from Alibaba Cloud. Go to the Alibaba Cloud MaxCompute product home page https://www.alibabacloud.com/product/maxcompute and click Buy Now.

Choose your region and click Buy Now.

Confirm the Service Agreement and click Activate.

Wait for the success notification.

After a few minutes, click through to the console. This is the DataWorks console. Here you will create the project that you will access with the MaxCompute client. Select the region and service and click Next step.

We picked the Data Development, O&M Center, and Data Management services for this quick-start tutorial. Add the details for your project and click Create.

You will see your new project in the project list in DataWorks.

And in the DataWorks IDE console overview.

Console

Download and unzip the latest version of the client console from here:

http://repo.aliyun.com/download/odpscmd/latest/odpscmd_public.zip

Open the extracted folder and you will see the following directory structure.
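
It typically looks like this (a sketch; exact contents may vary by client version):

bin/   -- the odpscmd launchers (odpscmd and odpscmd.bat)
conf/  -- configuration files, including odps_config.ini
lib/   -- Java libraries used by the client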

You have to fill in the details in the odps_config.ini file. Open it in your favorite text editor and complete the following information (a sample configuration is sketched after this list):

• The project name you already created.
• Your access key details (the ID and the secret).
• The endpoint URL details – you can find the URLs per region at this link: https://www.alibabacloud.com/help/doc-detail/34951.htm?spm=a3c0i.o27804en.b99.25.15da2cfaFTI2M5
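
As an illustration, a completed odps_config.ini might look like the following (placeholder values; the endpoint shown is the EU Central 1 API endpoint that also appears in the log output later in this tutorial):

project_name=maxcomputeproject
access_id=<your-access-key-id>
access_key=<your-access-key-secret>
end_point=http://service.eu-central-1.maxcompute.aliyun.com/api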

Now you can run the client from the bin directory. On Windows, run the batch file bin/odpscmd.bat. On Linux, open a terminal in the bin directory and run ./odpscmd.

Let’s run a few SQL queries and check the logs.

odps@ maxcomputeproject>create table tbl1(id bigint);
ID = 20180201151708390g91mrcj
OK

odps@ maxcomputeproject>insert overwrite table tbl1 select count(*) from tbl1;
ID = 20180201151725671gcrh4dj
Log view:
http://logview.odps.aliyun.com/logview/?h=http://service.eu-central-1.maxcompute.aliyun.com/api&p=maxcomputeproject&i=20180201151725671gcrh4dj&token=eXpQRG9OaS9QN2Rtc0RqT2sxVzYzRHlyLzNBPSxPRFBTX09CTzo1NjM5NjA3NzM5MjQ5NzY2LDE1MTgxMDMwNDUseyJTdGF0ZW1lbnQiOlt7IkFjdGlvbiI6WyJvZHBzOlJlYWQiXSwiRWZmZWN0IjoiQWxsb3ciLCJSZXNvdXJjZSI6WyJhY3M6b2RwczoqOnByb2plY3RzL21heGNvbXB1dGVwcm9qZWN0L2luc3RhbmNlcy8yMDE4MDIwMTE1MTcyNTY3MWdjcmg0ZGoiXX1dLCJWZXJzaW9uIjoiMSJ9
Job Queueing.
----------------------------------------------------------------------------------------------
STAGES STATUS TOTAL COMPLETED RUNNING PENDING BACKUP
M1_job_0 ................. TERMINATED 1 1 0 0 0
R2_1_job_0 ............... TERMINATED 1 1 0 0 0
----------------------------------------------------------------------------------------------
STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 15.42 s
----------------------------------------------------------------------------------------------
Summary:
resource cost: cpu 0.02 Core * Min, memory 0.02 GB * Min
inputs:
maxcomputeproject.tbl1: 0 (0 bytes)
outputs:
maxcomputeproject.tbl1: 1 (384 bytes)


Job run time: 11.000
Job run mode: fuxi job
Job run engine: execution engine
M1:
instance count: 1
run time: 5.000
instance time:
min: 0.000, max: 0.000, avg: 0.000
input records:
TableScan_REL70487: 0 (min: 0, max: 0, avg: 0)
output records:
StreamLineWrite_REL70488: 0 (min: 0, max: 0, avg: 0)
writer dumps:
StreamLineWrite_REL70488: (min: 0, max: 0, avg: 0)
R2_1:
instance count: 1
run time: 11.000
instance time:
min: 1.000, max: 1.000, avg: 1.000
input records:
StreamLineRead_REL70489: 0 (min: 0, max: 0, avg: 0)
output records:
TableSink_REL70492: 1 (min: 1, max: 1, avg: 1)
reader dumps:
StreamLineRead_REL70489: (min: 0, max: 0, avg: 0)

OK

odps@ maxcomputeproject>select 'welcome to MaxCompute!' from tbl1;
ID = 2018020115175675g16hpcj
Log view:
http://logview.odps.aliyun.com/logview/?h=http://service.eu-central-1.maxcompute.aliyun.com/api&p=maxcomputeproject&i=2018020115175675g16hpcj&token=UzNMN1pxSnNKLzR1ZWhwUk9WMnpmR05MbFRjPSxPRFBTX09CTzo1NjM5NjA3NzM5MjQ5NzY2LDE1MTgxMDMwNzYseyJTdGF0ZW1lbnQiOlt7IkFjdGlvbiI6WyJvZHBzOlJlYWQiXSwiRWZmZWN0IjoiQWxsb3ciLCJSZXNvdXJjZSI6WyJhY3M6b2RwczoqOnByb2plY3RzL21heGNvbXB1dGVwcm9qZWN0L2luc3RhbmNlcy8yMDE4MDIwMTE1MTc1Njc1ZzE2aHBjaiJdfV0sIlZlcnNpb24iOiIxIn0=
Job Queueing.
----------------------------------------------------------------------------------------------
STAGES STATUS TOTAL COMPLETED RUNNING PENDING BACKUP
M1_job_0 ................. TERMINATED 1 1 0 0 0
----------------------------------------------------------------------------------------------
STAGES: 01/01 [==========================>>] 100% ELAPSED TIME: 10.29 s
----------------------------------------------------------------------------------------------
Summary:
resource cost: cpu 0.00 Core * Min, memory 0.00 GB * Min
inputs:
maxcomputeproject.tbl1: 1 (384 bytes)
outputs:
Job run time: 5.000
Job run mode: fuxi job
Job run engine: execution engine

M1:
instance count: 1
run time: 5.000
instance time:
min: 0.000, max: 0.000, avg: 0.000
input records:
TableScan_REL106043: 1 (min: 1, max: 1, avg: 1)
output records:
ADHOC_SINK_106045: 1 (min: 1, max: 1, avg: 1)

+------------------------+
| _c0                    |
+------------------------+
| welcome to MaxCompute! |
+------------------------+
odps@ maxcomputeproject>

Now let’s check the DataWorks front-end to see if the table was created.

Adding and Authorizing Users

If you need other users to have access to your project, you must add them to the project and assign them the required roles. The commands to add and remove users are straightforward. The following console command adds a user with the email address bob@aliyun.com to the project:

add user bob@aliyun.com;

Before Bob can do anything, we have to grant him the privileges he needs. In the following command, we give Bob the CreateTable privilege:

grant CreateTable on PROJECT $maxcomputeproject to USER bob@aliyun.com;
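
To check what Bob is now allowed to do, you can list his permissions with the show grants command:

show grants for bob@aliyun.com;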

When Bob leaves the company, we first revoke his privileges and then remove him as a user from our project.

revoke CreateTable on PROJECT $maxcomputeproject from USER bob@aliyun.com;
remove user bob@aliyun.com;

If you have a number of users who need the same privileges, you can create a role, grant the privileges to the role, and then grant the role to each user. Please see the user guide for more details of privileges, roles, and the user authorization syntax.
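
As a sketch of that workflow, using a hypothetical role named dev:

create role dev;
grant CreateTable on PROJECT $maxcomputeproject to ROLE dev;
grant dev to bob@aliyun.com;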

As the account owner, you have full authorization rights to all the projects you create. Let’s do some data development now.

Creating, Manipulating, and Dropping Tables

Tables are the operational objects of MaxCompute. To process data with MaxCompute, we must create tables to hold the data.

The syntax for creating a new table is as follows:

CREATE TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[LIFECYCLE days]
[AS select_statement]

Or by copying a table that already exists:

CREATE TABLE [IF NOT EXISTS] table_name
LIKE existing_table_name

There are some restrictions to the syntax, data types and size. Please see the documentation for details.

Let’s create some tables.

create table test1 (key string);
-- Create a non-partitioned table.
create table test2 (key bigint) partitioned by (pt string, ds string);
-- Create a partitioned table.
create table test3 (key boolean) partitioned by (pt string, ds string) lifecycle 100;
-- Create a table with a lifecycle of 100 days.
create table test4 like test3;
-- test4 gets the same schema as test3 (column types and partition layout); only the lifecycle property is not copied.
create table test5 as select * from test2;
-- test5 is created from the data in test2, but the partition and lifecycle settings are not carried over to the new table.

You will see all the tables in DataWorks too.

Here’s a command to create a table containing user information:

CREATE TABLE user 
(user_id BIGINT, gender BIGINT COMMENT '0 unknown,1 male,2 female', age BIGINT)
PARTITIONED BY (region string, dt string) LIFECYCLE 365;

Use desc to get a description of a table.

Use drop to delete a table.
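
For example, to describe and then delete the test1 table we created earlier:

desc test1;
drop table if exists test1;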

Let’s add a partition to the user table.

Alter table user add if not exists partition(region='hangzhou',dt='20150923');

To drop a partition, swap add ... if not exists for drop ... if exists.

Alter table user drop if exists partition(region='hangzhou',dt='20150923');
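
To confirm the change, you can list a table’s partitions with the show partitions command:

show partitions user;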

Check the documentation for more table queries.

Data Channel

To import data into and export data out of tables with the MaxCompute console, we use the tunnel command. I’ve saved some offline data in an example.txt file inside the MaxCompute client’s bin directory.

Let’s import the file into a new table. First, create a table to hold the data. The table wc_in has a single column, word, of the string data type.

CREATE TABLE wc_in (word string);

Upload the offline data file to the table with the tunnel command.

First, make sure you have the tunnel endpoint details for the correct region in the configuration file. These details can be found here: https://www.alibabacloud.com/help/doc-detail/34951.htm?spm=a3c0i.o27804en.b99.25.5a29d257Sg7r1f

Then run the command to tunnel in the data.

tunnel upload example.txt wc_in;

Now you can query the data in the table.
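
For example, a quick sanity check that the upload worked:

select * from wc_in;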

We will look at the Java SDK tunneling operations and any other methods for tunneling data into MaxCompute in further tutorials.

SQL

Please see the documentation for the full MaxCompute SQL syntax rules, as MaxCompute SQL differs in places from other common SQL dialects.

UDF, MapReduce, Graph

User-defined functions (UDFs), MapReduce, and Graph programs for MaxCompute are currently supported only via the Java SDK and Maven plugins. We will examine using the Java SDK for MaxCompute and other Alibaba Cloud products and services in future tutorials.

Summary

After creating a project in DataWorks, we can access it via the command line with the MaxCompute client console.

We showed you how to download, configure, and execute the client. We showed you how to create and manipulate tables and add and remove users to your project. We also showed you how to build data input and output tunnels, populate tables with the tunnel command, and query your data tables from the console.

We showed you that anything you do with the MaxCompute client is reflected in the DataWorks IDE console.

Some of us prefer the command line, and Alibaba Cloud products cater to all data science tastes. Head over to your Alibaba Cloud account and use this tutorial as a guide to get started with MaxCompute. You will quickly see what this product can do for your business data.

Be sure to check online for more whitepapers, blogs, and tutorials on MaxCompute with the Java SDK and Maven as well as other Alibaba Cloud products.