Community Blog Friday Q&A - Week 9 - Questions From The Vault

Friday Q&A - Week 9 - Questions From The Vault

Come explore top questions about Alibaba Cloud's Big Data platform with us, as we delve deep into our question bank!

By Jeremy Pedersen

Welcome back for the ninth installment in our weekly Q&A blog series! We're taking a look at some questions from our last Big Data training. Interested in MaxCompute or DataWorks? Read closely.

What Alibaba Cloud Regions do MaxCompute and DataWorks support?

Most regions are well supported by both MaxCompute and DataWorks. In fact they are always deployed together. Here’s the full list as I write this blog post (May 2021):

  1. China (Hangzhou)
  2. China (Beijing)
  3. China (Shanghai)
  4. China (Chengdu)
  5. China (Shenzhen)
  6. China (Zhangjiakou)
  7. China (Hong Kong)
  8. US (Silicon Valley)
  9. US (Virginia)
  10. Singapore
  11. Australia (Sydney)
  12. Germany (Frankfurt)
  13. Malaysia (Kuala Lumpur)
  14. Indonesia (Jakarta)
  15. India (Mumbai)
  16. Japan (Tokyo)
  17. UK (London)
  18. UAE (Dubai)

Can data in OSS be imported into MaxCompute?

Absolutely. You can freely import and export data to/from MaxCompute using DataWorks' Data Integration console. More details here.

If you have archived data from MaxCompute by exporting it to OSS, can you then delete your MaxCompute tables?

Yes, of course! Let's look at this in a little more detail to see when, how, and why you might want to remove data from MaxCompute.

1: You are done with the data and do not need to keep a copy

In this case, you can use MaxCompute table lifecycle rules to automatically delete old tables. When a table's lifetime (in days) has passed, the table will be automatically dropped (deleted) from the system.

This is an especially good way to save costs on intermediate tables, which are needed only when you are computing results which will be stored in some other, final table.

2: You are done with the data (for now), but want an archival copy

In this case, setting up a scheduled task to export data to OSS using Data Integration is a good idea. Although MaxCompute tables are cheap, OSS is even cheaper. It's a good option for long term storage.

Better yet, if for some reason you need to run a query on your archived data, you may not even need to re-import it into MaxCompute! Depending on what you need to do with your data, you can directly run queries over OSS using our fully serverless data analytics tool, DLA (DataLake Analytics).

3: You are done with your data for now, but might need to pull it back into MaxCompute again later

No problem. Once again Data Integration comes to the rescue here. Simply set up new MaxCompute tables using the DataWorks GUI (or MaxCompute CLI), and pull the data back in!

I am synchronizing data from a MySQL database to MaxCompute. If the data stored in MySQL is changing, how do I update my MaxCompute tables?

As with so many import/export questions, the answer is Data Integration. You can use Data Integration to create a "data source" that points to your MySQL database. Once you've done that, you can set up a workflow in DataWorks that contains a batch synchronization node that runs on a regular schedule. That node will compare the contents of your MaxCompute table with MySQL, and import any new data.

One common way to do this is to include a datetime column in your MySQL database tables, and have DataWorks import any data where datetime in your MySQL table matches a particular date. It's a common best-practice to have a DataWorks job run once a day, and import all the data from the previous day.

How are SQL queries paid for, if I'm using MaxCompute + DataWorks in Pay-As-You-Go mode?

Understanding MaxCompute pricing is so tricky that Alibaba Cloud has created a PDF document just to explain it. No, really.

There's also this page in the documentation which explains how the charges work for different types of jobs.

Basically, if you are running a standard Pay-As-You-Go MaxCompute SQL job, you'll have to pay for two things:

  1. The amount of data your SQL query will need to scan (in GB)
  2. The "complexity" of your SQL query

As you might expect, there's a formula for calculating the complexity of a SQL query. You start by finding the number of SQL Keywords, like this:

Number of SQL keywords = Number of JOIN clauses + Number of GROUP BY clauses + Number of ORDER BY clauses + Number of DISTINCT clauses + Number of window functions + MAX(Number of INSERT statements - 1, 1)

But we're still not done! This just computes the raw number of SQL keywords. We must now convert that into a complexity score using the following set of rules:

  1. If the number of keywords is less than or equal to 3, the complexity of a SQL job is 1.
  2. If the number of keywords is less than or equal to 6 but greater than or equal to 4, the complexity of a SQL job is 1.5.
  3. If the number of keywords is less than or equal to 19 but greater than or equal to 7, the complexity of a SQL job is 2.
  4. If the number of keywords is greater than or equal to 20, the complexity of a SQL job is 4.

Luckily for us, if we use the DataWorks web GUI, it will give us a cost estimate before it runs our queries, and we can cancel a query if it looks like it's going to be too expensive.

If you're more of a CLI person, the command-line MaxCompute tool offers the COST function, which also allows you to get a cost estimate up front:

COST SQL <SQL Sentence>;

Note that the fixed cost (in USD) is 0.0438 USD. We need to multiply this by the number of Gigabytes of data our query will scan, and the SQL complexity score we calculated. Confused yet?

OK, let's have an example

Let's start with a very basic full table scan:

SELECT * FROM mytable

We will assume here that the table contains 10 GB of data. How much will this cost us to run? This query appears to have no SQL keywords at all, according to the formula above. But don't forget: there's a MAX(Number of INSERT statements - 1, 1) in there, so it actually has a total SQL keyword count of one.

Thus, the complexity score for this function is also one. The total cost will therefore be:

1 (SQL complexity) x 10 (GB) x 0.0438 USD = 0.438 USD

So this query costs us about 0.44 USD (44 cents) to run! Not bad.

In MaxCompute, what's the difference between a UDF, a UDTF, and a UDAF?

Glad you asked! UDF stands for User Defined Function. A UDF is a custom function written in Python or Java, which you can use inside your MaxCompute SQL statements, just like it was a regular SQL window function (thing SUM, or MAX). Custom functions can help you to extend the functionality offered by MaxCompute SQL.

Need to do some type of processing to split, merge, or replace text? Use a UDF.

Need to convert from an IP address to a location using a GeoIP lookup? Use a UDF.

There are three different types of UDF function, depending on the number of inputs and outputs you'll consume/produce:

  1. A UDF takes one input and produces one output (one value in, one value out)
  2. A UDTF takes one input and produces multiple outputs (it produces a table)
  3. A UDAF takes multiple inputs and produces one output (it is an aggregator)

You can write all 3 types of function using Python 2, Python 3, or Java. There are some Java examples here.

When might you use these functions?

  1. You could use a UDF to do a one-to-one replacement: say, calling an API to convert an IP address to a location (San Francisco, USA, for instance)
  2. You could use a UDAF to compute a sum or average over a range of inputs
  3. You could use a UDTF to split a comma-separated string into separate rows in a table

These are trivial examples...you could do a lot more! Take a look at this development guide / blog for more.

I've Got A Question!

Great! Reach out to me at jierui.pjr@alibabacloud.com and I'll do my best to answer in a future Friday Q&A blog.

You can also follow the Alibaba Cloud Academy LinkedIn Page. We'll re-post these blogs there each Friday.

Not a LinkedIn person? We're also on Twitter and YouTube.

0 0 0
Share on


71 posts | 152 followers

You may also like