MaxCompute Unleashed - Part 5: SELECT TRANSFORM

By Peng Qiu, senior technical expert of Alibaba Cloud

MaxCompute (formerly known as ODPS) is a leading distributed big data processing platform developed by Alibaba Cloud. It is widely utilized, especially within the Alibaba Group, and supports the core businesses of multiple business units (BUs). ODPS V2.0, continuously optimizing performance, aims to enhance the user experience and expression capabilities of SQL, as well as improve the productivity of ODPS developers.

Building upon the SQL engine of ODPS V2.0, MaxCompute simplifies SQL compilation and enhances the language's expressiveness. We present a series of articles titled Unleash the Power of MaxCompute to explore the capabilities of MaxCompute (ODPS V2.0).

In the previous article, I introduced CTE, VALUES, SEMIJOIN. This article introduces the support of MaxCompute for other scripting languages - SELECT TRANSFORM.

• Scenario 1

I needed to migrate my system to the MaxCompute platform, which originally relied on scripts like python, shell, and ruby. Converting these scripts into UDFs, UDAFs, and UDTFs for MaxCompute migration was time-consuming and required extensive testing. I wished for a simpler migration approach.

• Scenario 2

While SQL is adept at set operations, I required more intricate calculations on individual data records. The existing built-in functions and UDF framework were not flexible enough, and I lacked familiarity with Java/Python. I preferred writing scripts and wanted to input data into my script, perform custom calculations, and output results. MaxCompute could handle data splitting, distributed execution of my scripts, and management of input/output tables and relational operations like JOIN and UNION.

To achieve these features, you can utilize SELECT TRANSFORM.

The Introduction to SELECT TRANSFORM

This article uses MaxCompute Studio as an example. First, please install MaxCompute Studio, connect to a MaxCompute project, and create a new MaxCompute script file, as follows.

Submitting a job allows you to view the execution plan (expanded view):

SELECT TRANSFORM enables SQL users to execute a shell command on the server. It separates input data fields with tabs, passes each line as stdin to the shell command, and reads data from stdout as output for downstream processing. The shell command is essentially a Unix utility that can also invoke other script interpreters such as python, java, php, awk, and ruby.

This command is compatible with Hive's Transform feature. For more information, refer to the Hive document.

Some key points to note are::

The USING clause specifies the command to be executed, instead of a resource list, to maintain compatibility with Hive syntax.
Input is received from stdin, and output is sent to stdout.
The delimiter can be configured. The default is t for column separation and line break for row separation.
Customized reader/writer can be used, but using the built-in reader/writer provides better performance.
Custom resources (script files, data files, etc.) can be specified using the set odps.sql.session.resources=foo.sh,bar.txt; command. Multiple resource files can be specified, separated by commas (,). Alternatively, the resources clause can be used after the using clause to specify resources ('foo.sh', 'bar.txt'). Both methods are equivalent (see the example of run tests with odps).
Resource files will be downloaded to the working directory where the specified command is executed. You can use file interfaces to access files like ./bar.txt.

Currently, odps select transform is fully compatible with Hive syntax, features, and behaviors. Most Hive scripts can be directly added to the SELECT TRANSFORM statement, while others may require minor changes. Additionally, many performance optimizations have been implemented using a more efficient language (C++) compared to Hive.

Examples of Application Scenarios

In theory, all features that can be implemented by UDTFs can also be implemented by SELECT TRANSFORM. However, SELECT TRANSFORM is much more flexible than UDTFs. It supports not only Java and Python but also other scripts and tools like shell and perl. The process of writing SELECT TRANSFORM is simple, making it particularly suitable for implementing adhoc features. Here are a few examples:

1. Generating data out of nothing.

select transform(script) using 'sh' as (data) from
(
    select 'for i in `seq 1 50`; do echo $i; done' as script
) t;

Or use python

select transform('for i in xrange(1, 51):  print i;') using 'python' as (data);

The provided statement creates a data table with 50 rows, containing values from 1 to 50. Generating test data becomes much easier with this feature. Although it may seem simple, it used to be a pain point in ODPS without a convenient way to create data, which made testing and exploration inconvenient for beginners. While this functionality can also be implemented using UDTFs, it requires a complex process: go to the IDE -> write UDTFs -> package -> add jar/python -> create function -> execute -> drop function -> drop resource.

2. awk users will find this feature very useful

select transform(*) using "awk '//{print $2}'" as (data) from src;

The above statement only outputs value as it is, but users who are familiar with awk have lived a life of writing awk scripts instead of SQL.

3. Run tests with ODPS.

select transform(key, value) 
   using 'java -cp a.jar org.junit.runner.JUnitCore MyTestClass' 
   resources 'a.jar' 
from testdata;

set odps.sql.session.resources=a.jar;
select transform(key, value) 
   using 'java -cp a.jar org.junit.runner.JUnitCore MyTestClass' 
from testdata;

This example is to illustrate that a lot of utility of Java can be run directly. Although Java and Python have existing UDTF frameworks, it is simpler to write with SELECT TRANSFORM, and there is no additional dependency or format requirement. Even offline scripts can be used directly.

4. Support other scripting languages.

select transform (key, value) using "perl -e 'while($input = <STDIN>){print $input;}'" from src;

The above example uses Perl. This is not just a language support extension. For some simple functionalities, awk, python, perl, and shell all support writing scripts directly in the command without the need for script files and resource uploads, making the development process simpler. Additionally, PHP and Ruby are not currently supported on our computing cluster, so these two scripting languages are not supported.

5. It can be used in chaining operations, preprocessing input data with distribute by and sort by.

select transform(key, value) using 'cmd2' from 
(
    select transform(*) using 'cmd1' from 
    (
        select * from data distribute by col2 sort by col1
    ) t distribute by key sort by value
) t2;

Or you can use the keywords of the map and reduce to make the logic clear.

@a := select * from data distribute by col2 sort by col1;
@b := map * using 'cmd1' distribute by col1 sort by col2 from @a;
reduce * using 'cmd2' from @b;

In theory, the OpenMR models can be mapped to the calculation process described above. It's important to note that the syntax of map, reduce, and select transform is essentially the same. The choice of keywords or writing method does not affect the process or the final result.

Performance

Both SELECT TRANSFORM and UDTFs have their own advantages in terms of performance. After comparing and testing in various scenarios, SELECT TRANSFORM tends to perform better in most scenarios when dealing with smaller data volumes. On the other hand, UDTFs have an advantage when dealing with larger data volumes. SELECT TRANSFORM is particularly suitable for adhoc data analysis due to its simpler development process.

Advantages of UDTFs

UDTFs have types, while Transform's child process transmits data using stdin/stdout, treating all data as strings. This means that Transform requires an additional step for type conversion.
The data transmission in Transform relies on the pipeline of the operating system, which typically has a buffer size of only 4 KB and cannot be adjusted. If the Transform process reads or writes to an empty or full pipe, the process may be suspended.
Constant parameters in UDTFs do not need to be transmitted, but Transform cannot take advantage of this optimization.

Benefits of SELECT TRANSFORM:

The child process and the parent process are separate entities, whereas UDTFs are single-threaded. If the computational workload is high and the data throughput is low, you can leverage the multi-core capabilities of the server.
SELECT TRANSFORM utilizes underlying systems to read and write data, providing higher performance compared to Java.
Some tools supported by SELECT TRANSFORM, such as awk, are implemented using native code, which theoretically offers performance advantages over Java.

Summary

The MaxCompute SQL engine based on ODPS V2.0 provides the SELECT TRANSFORM feature, which significantly simplifies the reference to script code while improving performance. We highly recommend using SELECT TRANSFORM whenever possible.

>> Next article

Annotations

• [1]: The string after USING is used to start a subprocess directly in the background to invoke commands. This does not involve invoking a shell, so certain shell syntaxes, such as input/output redirection and pipelines, are not supported. If you need to use a shell command and treat the actual command as data input, refer to the Generating data out of nothing section.
• [2]: The actual paths for Java and Python can be obtained from the JAVA_HOME and PYTHON_HOME environment variables.

Community

MaxCompute Unleashed - Part 5: SELECT TRANSFORM

The Introduction to SELECT TRANSFORM

Examples of Application Scenarios

Performance

Advantages of UDTFs

Benefits of SELECT TRANSFORM:

Summary

Annotations

Read previous post:

Read next post:

Alibaba Cloud MaxCompute

You may also like

Comments

Alibaba Cloud MaxCompute

Related Products

MaxCompute

Big Data Consulting for Data Technology Solution

Big Data Consulting Services for Retail Solution

Realtime Compute for Apache Flink