
MaxCompute:SELECT TRANSFORM

Last Updated:Oct 10, 2023

The SELECT TRANSFORM statement allows you to start a specified child process, feed it input data in the required format through standard input, and then parse the standard output of the child process to obtain the output data. This way, you can run scripts in other programming languages without the need to write user-defined table-valued functions (UDTFs).
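
Conceptually, the child process is an ordinary program that reads rows from standard input and writes rows to standard output. The following Python 3 sketch is only a local illustration of that contract, not MaxCompute-specific code; the per-row summing logic is an invented example:

```python
import io

def transform(in_stream, out_stream):
    """Read tab-separated rows from in_stream and write the per-row sum of
    the two fields to out_stream -- the same read-stdin/write-stdout contract
    that a SELECT TRANSFORM child process fulfils."""
    for line in in_stream:
        fields = line.rstrip("\n").split("\t")
        out_stream.write(str(int(fields[0]) + int(fields[1])) + "\n")

# On a cluster, the child process would call transform(sys.stdin, sys.stdout);
# here we demonstrate the contract with an in-memory stream instead.
out = io.StringIO()
transform(io.StringIO("1\t4\n2\t5\n3\t6\n"), out)
print(out.getvalue())  # prints 5, 7, and 9 on separate lines
```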

Description

The performance of SELECT TRANSFORM and UDTFs varies by scenario. Comparative tests show that SELECT TRANSFORM outperforms UDTFs on small amounts of data, whereas UDTFs outperform SELECT TRANSFORM on large amounts of data. SELECT TRANSFORM is also easier to develop with, which makes it well suited to ad hoc queries.

SELECT TRANSFORM is compatible with a variety of programming languages. For simple logic, you can write the script inline in the command, which languages such as AWK, Python, Perl, and shell support. This saves you from writing script files and uploading resources, and simplifies development. To implement complex logic, you can upload script files instead. For more information, see Example of calling Python scripts and Example of calling Java scripts.

The following comparison describes SELECT TRANSFORM and UDTFs across several dimensions.

  • Data type
    • SELECT TRANSFORM: The child process transmits data through standard input and output, so all data is handled as strings and data type conversions are required.
    • UDTF: The input parameters and output results of UDTFs support multiple data types.

  • Data transmission
    • SELECT TRANSFORM: Data transmission is based on an operating system pipe, whose buffer is fixed at 4 KB and cannot be resized. A read blocks while the pipe is empty, and a write blocks while the pipe is full. However, SELECT TRANSFORM reads and writes data through the underlying system, which delivers higher transmission performance than Java programs.
    • UDTF: No limit is imposed on the transmission buffer.

  • Transmission of constant parameters
    • SELECT TRANSFORM: Constant parameters must also be transmitted.
    • UDTF: Constant parameters do not need to be transmitted.

  • Process
    • SELECT TRANSFORM: The parent and child run as separate processes. If computing resource usage is high but data throughput is low, SELECT TRANSFORM can take advantage of multi-core servers.
    • UDTF: A single process is used.

  • Performance
    • SELECT TRANSFORM: Native code of tools such as AWK runs directly, which delivers higher performance than Java programs.
    • UDTF: Performance is lower.

Limits

PHP and Ruby are not deployed on MaxCompute compute clusters. Therefore, you cannot call PHP or Ruby scripts in MaxCompute.

Syntax

select transform(<arg1>, <arg2> ...) 
[row format delimited [fields terminated by <field_delimiter> [escaped by <character_escape>]] [null defined as <null_value>]]
using '<unix_command_line>' 
[resources '<res_name>' [,'<res_name>' ...]]
[as <col1>, <col2> ...]
[row format delimited [fields terminated by <field_delimiter> [escaped by <character_escape>]] [null defined as <null_value>]]
  • SELECT TRANSFORM keyword: required. You can replace this keyword with the map or reduce keyword, which has the same semantics. For clarity, we recommend that you use SELECT TRANSFORM.

  • arg1, arg2...: required. These parameters specify the input data, in the same format as in SELECT statements. By default, the result of the expression for each parameter is implicitly converted into a STRING value, and the values are then joined with \t and passed to the specified child process.

  • ROW FORMAT clause: optional. This clause allows you to customize the format of input and output data.

    Two ROW FORMAT clauses are used in the syntax. The first clause specifies the format of input data, whereas the second clause specifies the format of output data. By default, \t is used as a column delimiter, \n is used as a row delimiter, and \N is used to represent null values.

    Note
    • Only one character can be used as field_delimiter or character_escape. If you specify a string for these parameters, the first character in the string is used.

    • MaxCompute supports Apache Hive syntax, such as inputRecordReader, outputRecordReader, and SerDe. To use this syntax, you must enable the Hive-compatible data type edition by adding set odps.sql.hive.compatible=true; before your SQL statements. For more information about the supported syntax, see the Hive documentation.

    • If you use Hive syntax such as inputRecordReader or outputRecordReader, SQL statements may execute more slowly.

  • USING clause: required. This clause specifies the command that is used to start a child process.

    • In most MaxCompute SQL statements, the USING clause specifies resources. In the SELECT TRANSFORM statement, however, it specifies the command that starts the child process. This usage is retained for compatibility with the syntax of Apache Hive.

    • The syntax of the USING clause resembles a shell command. However, instead of running a shell, MaxCompute creates the child process directly from the command that you specify. Therefore, shell features such as input and output redirection, pipelines, and loops are not available. If you need these features, write a shell script and use that script as the command that starts the child process.

  • RESOURCES clause: optional. This clause specifies the resources that a child process can access. You can use one of the following methods to specify the resources that a child process can access:

    • Use the RESOURCES clause to specify resources, such as using 'sh foo.sh bar.txt' resources 'foo.sh','bar.txt'.

    • Use set odps.sql.session.resources to specify resources. For example, you can add the set odps.sql.session.resources=foo.sh,bar.txt; flag before SQL statements to specify resources.

      After this global configuration takes effect, all SELECT TRANSFORM statements can access the resources. Separate multiple resource files with commas (,).

  • AS clause: optional. This clause specifies the output columns and their data types, such as as(col1 bigint, col2 boolean).

    • If you do not specify the data types for output columns, the default data type STRING is used. For example, AS(col1, col2) indicates that the output columns are of the STRING type.

    • The output data is obtained by parsing the standard output of the child process. If a column is declared with a data type other than STRING, MaxCompute implicitly calls the CAST function to convert the STRING output to that type. A runtime exception may occur during the conversion.

    • You cannot specify data types for only some of the output columns. For example, as(col1, col2:bigint) is not allowed.

    • If you omit the AS clause, the field before the first \t in the standard output data is the key and all the following parts are the value. This is equivalent to AS(key, value).
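
The default framing described above can be sketched in Python 3. The helper names below (encode_row, decode_default_as) are illustrative, not part of any MaxCompute API; the sketch only demonstrates the \t, \n, and \N conventions and the default AS(key, value) split:

```python
def encode_row(values):
    """Encode one input row the way MaxCompute feeds it to the child process
    by default: fields joined with \t, row terminated by \n, NULL as \\N."""
    return "\t".join("\\N" if v is None else str(v) for v in values) + "\n"

def decode_default_as(line):
    """Without an AS clause, the text before the first \t in an output line
    is the key and everything after it is the value."""
    stripped = line.rstrip("\n")
    if "\t" in stripped:
        key, value = stripped.split("\t", 1)
    else:
        key, value = stripped, None
    return key, value

stdin_payload = "".join(encode_row(r) for r in [(1, 4), (2, None), (3, 6)])
print(repr(stdin_payload))             # '1\t4\n2\t\\N\n3\t6\n'
print(decode_default_as("a\tb\tc\n"))  # ('a', 'b\tc')
```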

Example of calling shell commands

Run a shell command that generates 50 rows containing the numbers 1 to 50, returned in the data column. The script text itself is passed to the child process as the input of SELECT TRANSFORM, and sh executes it. Sample statement:

select transform(script) using 'sh' as (data) 
from (
        select  'for i in `seq 1 50`; do echo $i; done' as script
      ) t
;
-- The preceding statement is equivalent to the following statement: 
select transform('for i in `seq 1 50`; do echo $i; done') using 'sh' as (data);

The following result is returned:

+------------+
| data       |
+------------+
| 1          |
| 2          |
| 3          |
| 4          |
| 5          |
| 6          |
| 7          |
| 8          |
| 9          |
| 10         |
| 11         |
| 12         |
| 13         |
| 14         |
| 15         |
| 16         |
| 17         |
| 18         |
| 19         |
| 20         |
| 21         |
| 22         |
| 23         |
| 24         |
| 25         |
| 26         |
| 27         |
| 28         |
| 29         |
| 30         |
| 31         |
| 32         |
| 33         |
| 34         |
| 35         |
| 36         |
| 37         |
| 38         |
| 39         |
| 40         |
| 41         |
| 42         |
| 43         |
| 44         |
| 45         |
| 46         |
| 47         |
| 48         |
| 49         |
| 50         |
+------------+

Example of calling Python commands

Use a Python command to generate 50 rows containing the numbers 1 to 50, returned in the data column. The script text itself is passed to the child process as the input of SELECT TRANSFORM. Note that the inline script uses Python 2 syntax (xrange and the print statement). Sample statement:

select transform(script) using 'python' as (data) 
from (
        select  'for i in xrange(1, 51):  print i;' as script
      ) t
;
-- The preceding statement is equivalent to the following statement: 
select transform('for i in xrange(1, 51):  print i;') using 'python' as (data);

The following result is returned:

+------------+
| data       |
+------------+
| 1          |
| 2          |
| 3          |
| 4          |
| 5          |
| 6          |
| 7          |
| 8          |
| 9          |
| 10         |
| 11         |
| 12         |
| 13         |
| 14         |
| 15         |
| 16         |
| 17         |
| 18         |
| 19         |
| 20         |
| 21         |
| 22         |
| 23         |
| 24         |
| 25         |
| 26         |
| 27         |
| 28         |
| 29         |
| 30         |
| 31         |
| 32         |
| 33         |
| 34         |
| 35         |
| 36         |
| 37         |
| 38         |
| 39         |
| 40         |
| 41         |
| 42         |
| 43         |
| 44         |
| 45         |
| 46         |
| 47         |
| 48         |
| 49         |
| 50         |
+------------+

Example of calling AWK commands

Create a test table. Run an AWK command that outputs the second column of the test table, returned in the data column. Sample statement:

-- Create a test table. 
create table testdata(c1 bigint,c2 bigint);
-- Insert test data into the test table.  
insert into table testdata values (1,4),(2,5),(3,6); 
-- Execute the SELECT TRANSFORM statement.  
select transform(*) using "awk '//{print $2}'" as (data) from testdata;

The following result is returned:

+------------+
| data       |
+------------+
| 4          |
| 5          |
| 6          |
+------------+

Example of calling Perl commands

Create a test table. Run a Perl command that echoes every input row of the test table. Because no AS clause is specified, the output columns default to key and value. Sample statement:

-- Create a test table. 
create table testdata(c1 bigint,c2 bigint);
-- Insert test data into the test table.  
insert into table testdata values (1,4),(2,5),(3,6); 
-- Execute the SELECT TRANSFORM statement.  
select transform(testdata.c1, testdata.c2) using "perl -e 'while($input = <STDIN>){print $input;}'" from testdata;

The following result is returned:

+------------+------------+
| key        | value      |
+------------+------------+
| 1          | 4          |
| 2          | 5          |
| 3          | 6          |
+------------+------------+

Example of calling Python scripts

  1. Create the myplus.py file. Sample statements:

    #!/usr/bin/env python
    import sys
    line = sys.stdin.readline()
    while line:
        # Strip the trailing row delimiter before splitting the fields.
        token = line.rstrip('\n').split('\t')
        if (token[0] == '\\N') or (token[1] == '\\N'):
            print '\\N'
        else:
            print token[0] + '\t' + token[1]
        line = sys.stdin.readline()
  2. Add the Python script file as a resource to MaxCompute.

    add py ./myplus.py -f;
    Note

    You can also use the DataWorks console to add the Python file as a resource. For more information, see Create and use MaxCompute resources.

  3. Execute the SELECT TRANSFORM statement to call this file.

    -- Create a test table. 
    create table testdata(c1 bigint,c2 bigint);
    -- Insert test data into the test table.  
    insert into table testdata values (1,4),(2,5),(3,6); 
    -- Execute the SELECT TRANSFORM statement.  
    select 
    transform (testdata.c1, testdata.c2) 
    using 'python myplus.py' resources 'myplus.py' 
    as (result1,result2) 
    from testdata;
    -- The preceding statements are equivalent to the following statements: 
    set odps.sql.session.resources=myplus.py;
    select transform (testdata.c1, testdata.c2) 
    using 'python myplus.py' 
    as (result1,result2) 
    from testdata;

    The following result is returned:

    +------------+------------+
    | result1    | result2    |
    +------------+------------+
    | 1          | 4          |
    | 2          | 5          |
    | 3          | 6          |
    +------------+------------+

Example of calling Java scripts

  1. Write a Java script and export it as the Sum.jar file. Sample Java code:

    package com.aliyun.odps.test;
    import java.util.Scanner;
    public class Sum {
        public static void main(String[] args) {
            Scanner sc = new Scanner(System.in);
            while (sc.hasNext()) {
                String s = sc.nextLine();
                String[] tokens = s.split("\t");
                if (tokens.length < 2) {
                    throw new RuntimeException("illegal input");
                }
                if (tokens[0].equals("\\N") || tokens[1].equals("\\N")) {
                    System.out.println("\\N");
                } else {
                    // Only parse and sum the fields when neither of them is NULL.
                    System.out.println(Long.parseLong(tokens[0]) + Long.parseLong(tokens[1]));
                }
            }
        }
    }
  2. Add the file as a resource to MaxCompute.

    add jar ./Sum.jar -f;
  3. Execute the SELECT TRANSFORM statement to call this file.

    -- Create a test table. 
    create table testdata(c1 bigint,c2 bigint); 
    -- Insert test data into the test table. 
    insert into table testdata values (1,4),(2,5),(3,6); 
    -- Execute the SELECT TRANSFORM statement. 
    select transform(testdata.c1, testdata.c2) using 'java -cp Sum.jar com.aliyun.odps.test.Sum' resources 'Sum.jar' as cnt from testdata;
    -- The preceding statements are equivalent to the following statements: 
    set odps.sql.session.resources=Sum.jar; 
    select transform(testdata.c1, testdata.c2) using 'java -cp Sum.jar com.aliyun.odps.test.Sum' as cnt from testdata;

    The following result is returned:

    +-----+
    | cnt |
    +-----+
    | 5   |
    | 7   |
    | 9   |
    +-----+
Note

Java and Python provide ready-made UDTF frameworks, but writing scripts for SELECT TRANSFORM is often easier: no additional dependencies are required, there are no format requirements, and existing scripts can even be used directly. The directories of the on-cluster Java and Python environments can be obtained from the JAVA_HOME and PYTHON_HOME environment variables.

Example of calling scripts in series

You can execute SELECT TRANSFORM statements in series, and use the DISTRIBUTE BY and SORT BY clauses to preprocess the input data of each stage. Sample statement:

select transform(key, value) using '<cmd2>' from 
(
    select transform(*) using '<cmd1>' from 
    (
        select * from testdata distribute by c2 sort by c1 
    ) t distribute by key sort by value 
) t2;

cmd1 and cmd2 are the commands that are used to start the child processes.
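
A local way to picture this serial execution, assuming a POSIX-style environment, is to connect two child processes so that the output of the first becomes the input of the second. The two inline Python scripts below are hypothetical stand-ins for cmd1 and cmd2, not MaxCompute commands:

```python
import subprocess
import sys

rows = "1\t4\n2\t5\n3\t6\n"

# Stand-in for cmd1: swap the two tab-separated fields of each row.
cmd1 = [sys.executable, "-c",
        "import sys\n"
        "for line in sys.stdin:\n"
        "    a, b = line.rstrip('\\n').split('\\t')\n"
        "    sys.stdout.write(b + '\\t' + a + '\\n')"]

# Stand-in for cmd2: sum the two fields of each row.
cmd2 = [sys.executable, "-c",
        "import sys\n"
        "for line in sys.stdin:\n"
        "    a, b = line.rstrip('\\n').split('\\t')\n"
        "    sys.stdout.write(str(int(a) + int(b)) + '\\n')"]

# Feed the output of stage 1 into stage 2, like two chained transforms.
stage1 = subprocess.run(cmd1, input=rows, capture_output=True, text=True, check=True)
stage2 = subprocess.run(cmd2, input=stage1.stdout, capture_output=True, text=True, check=True)
print(stage2.stdout)  # prints 5, 7, and 9 on separate lines
```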

You can also use the map and reduce keywords to execute the SELECT TRANSFORM statements in series.

@a := select * from data distribute by col2 sort by col1;
@b := map * using 'cmd1' distribute by col1 sort by col2 from @a;
reduce * using 'cmd2' from @b;