By Jeremy Pedersen
Welcome back for the ninth installment in our weekly Q&A blog series! We're taking a look at some questions from our last Big Data training. Interested in MaxCompute or DataWorks? Read closely.
Most regions are well supported by both MaxCompute and DataWorks. In fact, the two are always deployed together. Here’s the full list as I write this blog post (May 2021):
Absolutely. You can freely import and export data to/from MaxCompute using DataWorks' Data Integration console. More details here.
Yes, of course! Let's look at this in a little more detail to see when, how, and why you might want to remove data from MaxCompute.
In this case, you can use MaxCompute table lifecycle rules to automatically delete old tables. Once a table's lifetime (in days) has passed without its data being modified, the table is automatically dropped (deleted) from the system.
This is an especially good way to save costs on intermediate tables, which are needed only while you compute results that will be stored in some other, final table.
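Here's a minimal sketch of what that looks like in MaxCompute SQL. The table name and columns are placeholders I've made up; the LIFECYCLE clause is the part that matters:

-- Create an intermediate table whose data expires after 7 days
CREATE TABLE IF NOT EXISTS staging_daily_results (
    record_id BIGINT,
    record_date DATETIME,
    payload STRING
)
LIFECYCLE 7;

-- You can also add or change a lifecycle on an existing table
ALTER TABLE staging_daily_results SET LIFECYCLE 7;

Once 7 days pass without the table's data being modified, MaxCompute reclaims it automatically, so you never pay to store stale intermediate results.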
In this case, setting up a scheduled task to export data to OSS using Data Integration is a good idea. Although MaxCompute table storage is cheap, OSS is even cheaper, which makes it a good option for long-term storage.
Better yet, if for some reason you need to run a query on your archived data, you may not even need to re-import it into MaxCompute! Depending on what you need to do with your data, you can run queries directly over OSS using our fully serverless data analytics tool, Data Lake Analytics (DLA).
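As a rough sketch (the exact DDL depends on your file format and how your DLA schema is set up; the bucket, path, and columns here are made up), you could expose a CSV export sitting in OSS to DLA as an external table and query it in place:

-- Hypothetical external table over CSV files exported to OSS
CREATE EXTERNAL TABLE archived_orders (
    order_id BIGINT,
    order_date STRING,
    amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'oss://my-archive-bucket/maxcompute-exports/orders/';

-- Query the archived data without moving it anywhere
SELECT order_date, SUM(amount) AS total
FROM archived_orders
GROUP BY order_date;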
No problem. Once again Data Integration comes to the rescue here. Simply set up new MaxCompute tables using the DataWorks GUI (or MaxCompute CLI), and pull the data back in!
As with so many import/export questions, the answer is Data Integration. You can use Data Integration to create a "data source" that points to your MySQL database. Once you've done that, you can set up a workflow in DataWorks that contains a batch synchronization node running on a regular schedule. On each run, that node pulls the latest rows from MySQL into your MaxCompute table.
One common way to do this is to include a datetime column in your MySQL tables and have DataWorks import only the rows whose datetime falls on a particular date. It's common best practice to have the DataWorks job run once a day and import all of the previous day's data.
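As a hedged example, the data filter on the MySQL reader side of such a daily sync node might look like the fragment below. Here gmt_create is a hypothetical datetime column, and ${bizdate} is the built-in DataWorks scheduling parameter that resolves to the previous day's date (in yyyymmdd format) each time the job runs:

-- Only pull rows created on this run's business date
DATE(gmt_create) = STR_TO_DATE('${bizdate}', '%Y%m%d')

Scheduled once per day, this picks up exactly one day's worth of new rows on each execution.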
Understanding MaxCompute pricing is so tricky that Alibaba Cloud has created a PDF document just to explain it. No, really.
There's also this page in the documentation which explains how the charges work for different types of jobs.
Basically, if you are running a standard Pay-As-You-Go MaxCompute SQL job, you'll have to pay for two things: the amount of data your query scans (in GB), and the complexity of the SQL statement itself.
As you might expect, there's a formula for calculating the complexity of a SQL query. You start by finding the number of SQL Keywords, like this:
Number of SQL keywords = Number of JOIN clauses + Number of GROUP BY clauses + Number of ORDER BY clauses + Number of DISTINCT clauses + Number of window functions + MAX(Number of INSERT statements - 1, 1)
But we're still not done! This just computes the raw number of SQL keywords. We must now convert that into a complexity score using the following set of rules:
Luckily for us, if we use the DataWorks web GUI, it will give us a cost estimate before it runs our queries, and we can cancel a query if it looks like it's going to be too expensive.
If you're more of a CLI person, the MaxCompute command-line client offers the COST SQL command, which also lets you get a cost estimate up front:
COST SQL <SQL statement>;
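For example, to estimate the cost of the full table scan we'll look at below (mytable is just a placeholder table name):

COST SQL SELECT * FROM mytable;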
Note that the fixed price is 0.0438 USD per GB scanned, per unit of complexity. We need to multiply this by the number of gigabytes of data our query will scan and by the SQL complexity score we calculated. Confused yet?
Let's start with a very basic full table scan:
SELECT * FROM mytable
We will assume here that the table contains 10 GB of data. How much will this cost us to run? According to the formula above, this query appears to have no SQL keywords at all. But don't forget: there's a MAX(Number of INSERT statements - 1, 1) term in there, so it actually has a total SQL keyword count of one.
Thus, the complexity score for this query is also one. The total cost will therefore be:
1 (SQL complexity) x 10 (GB) x 0.0438 USD = 0.438 USD
So this query costs us about 0.44 USD (44 cents) to run! Not bad.
Glad you asked! UDF stands for User Defined Function. A UDF is a custom function, written in Python or Java, that you can use inside your MaxCompute SQL statements as if it were a regular built-in function (think SUM or MAX). Custom functions let you extend the functionality offered by MaxCompute SQL.
Need to do some type of processing to split, merge, or replace text? Use a UDF.
Need to convert from an IP address to a location using a GeoIP lookup? Use a UDF.
There are three different types of UDF, depending on the number of inputs and outputs you consume/produce:
A scalar UDF takes the values from a single row and returns a single value.
A UDAF (user-defined aggregate function) takes many rows and aggregates them into a single output value.
A UDTF (user-defined table-valued function) takes a single row and can return multiple rows and columns.
You can write all 3 types of function using Python 2, Python 3, or Java. There are some Java examples here.
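To give you a feel for how a UDF fits into your SQL, here's a hedged sketch. Assume we've already written and uploaded a Python resource called geoip.py containing a class IpToCountry with an evaluate() method (all of these names are made up for this example):

-- Register the Python class as a SQL function
CREATE FUNCTION ip_to_country AS 'geoip.IpToCountry' USING 'geoip.py';

-- Then call it like any built-in function
SELECT ip_to_country(client_ip) AS country, COUNT(*) AS hits
FROM access_log
GROUP BY ip_to_country(client_ip);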
When might you use these functions? The two cases above are good examples: a scalar UDF could split, merge, or clean up a text field, or turn an IP address into a location via a GeoIP lookup.
These are trivial examples...you could do a lot more! Take a look at this development guide / blog for more.
Great! Reach out to me at jierui.pjr@alibabacloud.com and I'll do my best to answer in a future Friday Q&A blog.
You can also follow the Alibaba Cloud Academy LinkedIn Page. We'll re-post these blogs there each Friday.