The SELECT TRANSFORM statement starts a specified child process, feeds the input data to the child process through standard input in the required format, and then parses the standard output of the child process to obtain the output data. SELECT TRANSFORM allows MaxCompute SQL statements to call other scripting languages, without the need to compile user-defined functions (UDFs).

Syntax overview

The following code shows the statement syntax:
SELECT TRANSFORM(arg1, arg2 ...) 
(ROW FORMAT DELIMITED (FIELDS TERMINATED BY field_delimiter (ESCAPED BY character_escape))
(LINES SEPARATED BY line_separator)
(NULL DEFINED AS null_value))
USING 'unix_command_line' 
(RESOURCES 'res_name' (',' 'res_name')*)
( AS col1, col2 ...)
(ROW FORMAT DELIMITED (FIELDS TERMINATED BY field_delimiter (ESCAPED BY character_escape))
(LINES SEPARATED BY line_separator) (NULL DEFINED AS null_value))
where:
  • The SELECT TRANSFORM keyword can be replaced with the MAP or REDUCE keyword. They have the same semantics. We recommend that you use SELECT TRANSFORM to make the syntax clearer.
  • arg1, arg2, ... specifies the input parameters. The format of the TRANSFORM clause is similar to the format of the SELECT clause. By default, the result of each parameter expression is implicitly converted into a value of the STRING type. The values are then joined with \t and passed to the specified child process.
  • The USING clause specifies the commands that are used to start a child process.
    • In most MaxCompute SQL statements, the USING clause specifies resources. However, in the SELECT TRANSFORM statement, the USING clause specifies the commands to start a child process. The USING clause is used to ensure compatibility with the Apache Hive syntax.
    • The syntax of the USING clause is similar to the syntax of a shell script. However, instead of running the shell script, the USING clause creates a child process based on the input commands. Therefore, some shell features such as input and output redirection, pipe, and loop are unavailable. A shell script can be used as the commands to start a child process if necessary.
  • The RESOURCES clause specifies the resources that the specified child process can access. You can use the following methods to specify resources:
    • Use the RESOURCES clause to specify resources. For example, USING 'sh foo.sh bar.txt' RESOURCES 'foo.sh','bar.txt'.
    • Use flags to specify resources. Add set odps.sql.session.resources=foo.sh,bar.txt; before the SQL statement.

      This is a global configuration. It allows all SELECT TRANSFORM statements to access the specified resources. Separate multiple resource files with commas (,).

  • The ROW FORMAT clause specifies the input and output formats.
    The syntax includes two ROW FORMAT clauses. The first clause specifies the format of the input data, and the second clause specifies the format of the output data. By default, \t is used as a column delimiter, \n is used as a row delimiter, and \N is used to represent NULL values.
    Note
    • Only one character is accepted by field_delimiter, character_escape, and line_separator. If you specify a string, the first character in the string is used.
    • MaxCompute supports syntaxes in the formats that are specified by Apache Hive. For example, inputRecordReader, outputRecordReader, and SerDe. You must enable the Hive compatibility mode to use these syntaxes. To enable the Hive compatibility mode, add set odps.sql.hive.compatible=true; before SQL statements. For more information about the syntaxes that are supported by Apache Hive, see Hive documentation.
    • If you specify a syntax that is supported by Apache Hive, such as inputRecordReader or outputRecordReader, SQL statements may be executed at lower speed.
  • The AS clause specifies the output columns. You can specify the data types for output columns. For example, AS(col1:bigint, col2:boolean).
    • If you do not specify the data types for output columns, the default data type STRING is used. For example, AS(col1, col2).
    • The output data is obtained by parsing the standard output of the child process. If the specified data is not of the STRING type, the system implicitly calls the CAST function to convert the data to the STRING type. Runtime exceptions may occur during the conversion process.
    • You cannot specify data types for only some of the specified columns, for example, AS(col1, col2:BIGINT).
    • If you omit the AS clause, the field before the first \t in the standard output data is the key and all the following parts are the value by default. This is equivalent to AS(key, value).
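The default wire format described above can be summarized with a short Python sketch. This is only an illustration of the protocol, not MaxCompute internals: fields are joined with \t, each row ends with \n, and NULL is written as \N.

```python
# Illustration of the default SELECT TRANSFORM wire format (not MaxCompute
# internals): fields joined with \t, rows ending with \n, NULL written as \N.

def serialize_row(values):
    """Build the line a child process would read on standard input."""
    return '\t'.join('\\N' if v is None else str(v) for v in values) + '\n'

def parse_output_line(line):
    """Split one line of child standard output back into column values."""
    return [None if f == '\\N' else f for f in line.rstrip('\n').split('\t')]
```

For example, serialize_row([1, None]) yields '1\t\N\n', which parse_output_line maps back to ['1', None]. Note that every parsed value is a string unless the AS clause declares another type.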

Call shell scripts

Assume that you want to use a shell script to generate 50 rows of data that are numbered 1 to 50 and return the numbers as the data column. You can execute the following statement:
SELECT TRANSFORM(script) USING 'sh' AS (data) 
FROM (
        SELECT  'for i in `seq 1 50`; do echo $i; done' AS script
      ) t
;

In this example, the shell commands themselves are the input data for the TRANSFORM clause: the script text is a column value that is fed to the sh child process through standard input.
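This behavior can be mimicked locally with a small Python sketch, assuming a Unix sh with coreutils is available: the column value, which is itself a shell script, is written to the child's standard input, and the child's standard output becomes the result rows.

```python
# Mimic the example locally: feed the script text (the column value) to an
# 'sh' child process on stdin and collect its stdout as the output rows.
import subprocess

script = 'for i in `seq 1 5`; do echo $i; done'  # 5 rows instead of 50, for brevity
out = subprocess.run(['sh'], input=script, capture_output=True, text=True).stdout
rows = out.splitlines()  # ['1', '2', '3', '4', '5']
```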

SELECT TRANSFORM is more than a language extension. For supported interpreters such as AWK, Python, Perl, and shell, you can write the script inline in the command to implement simple features. You do not need to compile independent script files or upload resources. This simplifies the development process. To implement complex features, you can upload script files. For more information, see Call Python scripts.

Call Python scripts

  1. Compile a Python script file. In this example, the file name is myplus.py.
    #!/usr/bin/env python
    import sys
    line = sys.stdin.readline()
    while line:
        # Strip the trailing newline before splitting. Otherwise, the last
        # field contains the newline and the comparison with \N fails.
        token = line.rstrip('\n').split('\t')
        if (token[0] == '\\N') or (token[1] == '\\N'):
            print '\\N'
        else:
            print str(token[0]) + '\t' + str(token[1])
        line = sys.stdin.readline()
  2. Add the Python script file as a resource to MaxCompute.
    add py ./myplus.py -f;
    Note You can also use the DataWorks console to add resources.
  3. Execute the SELECT TRANSFORM statement to call the resource.
    -- Create a test table.
    CREATE TABLE testdata(c1 bigint, c2 bigint);
    -- Insert test data into the test table. 
    INSERT INTO TABLE testdata VALUES (1,4),(2,5),(3,6); 
    -- Execute the following SELECT TRANSFORM statement: 
    SELECT 
    TRANSFORM (testdata.c1, testdata.c2) 
    USING 'python myplus.py' RESOURCES 'myplus.py' 
    AS (result1, result2) 
    FROM testdata;
    -- The preceding statement is equivalent to the following statement:
    set odps.sql.session.resources=myplus.py;
    SELECT TRANSFORM (testdata.c1, testdata.c2) 
    USING 'python myplus.py' 
    AS (result1, result2) 
    FROM testdata;
    The SELECT TRANSFORM statement returns the following result:
    +------------+------------+
    | result1    | result2    |
    +------------+------------+
    | 1          | 4          |
    | 2          | 5          |
    | 3          | 6          |
    +------------+------------+
    Note In MaxCompute, Python commands can also be used directly as the TRANSFORM command, in the same way as shell commands. For example:
    SELECT TRANSFORM('for i in xrange(1, 50):  print i;') USING 'python' AS (data);
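Because a transform script only reads standard input and writes standard output, you can sanity-check its per-line logic locally by piping sample rows to it before you upload it as a resource. The following sketch feeds the rows that MaxCompute would serialize to an inline script that mirrors myplus.py's logic:

```python
# Sanity-check the transform logic locally by piping sample rows through a
# child process, the same way MaxCompute drives the script. The inline
# script below mirrors myplus.py's per-line logic.
import subprocess
import sys
import textwrap

SCRIPT = textwrap.dedent('''
    import sys
    for line in sys.stdin:
        tokens = line.rstrip("\\n").split("\\t")
        if tokens[0] == "\\\\N" or tokens[1] == "\\\\N":
            sys.stdout.write("\\\\N\\n")
        else:
            sys.stdout.write(tokens[0] + "\\t" + tokens[1] + "\\n")
''')

# Rows as MaxCompute would serialize them: \t between fields, \N for NULL.
sample = '1\t4\n\\N\t5\n3\t6\n'
out = subprocess.run([sys.executable, '-c', SCRIPT],
                     input=sample, capture_output=True, text=True).stdout
# out == '1\t4\n\\N\n3\t6\n'
```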

Call Java scripts

The method to call Java scripts is similar to the method to call Python scripts. In this example, you must compile a Java script file, export the file as a JAR package, and run the add command to add the JAR package as a resource to MaxCompute. Then, execute the SELECT TRANSFORM statement to call the resource.
  1. Compile a Java script file and export it as a JAR package. In this example, the name of the JAR package is Sum.jar.
    package com.aliyun.odps.test;
    import java.util.Scanner;
    public class Sum {
        public static void main(String[] args) {
            Scanner sc = new Scanner(System.in);
            while (sc.hasNext()) {
                String s = sc.nextLine();
                String[] tokens = s.split("\t");
                if (tokens.length < 2) {
                    throw new RuntimeException("illegal input");
                }
                if (tokens[0].equals("\\N") || tokens[1].equals("\\N")) {
                    System.out.println("\\N");
                    continue; // skip the sum for NULL input
                }
                System.out.println(Long.parseLong(tokens[0]) + Long.parseLong(tokens[1]));
            }
        }
    }
  2. Add the JAR package as a resource to MaxCompute.
    add jar ./Sum.jar -f;
  3. Execute the SELECT TRANSFORM statement to call the resource.
    -- Create a test table.
    CREATE TABLE testdata(c1 bigint,c2 bigint); 
    -- Insert test data into the test table.
    INSERT INTO TABLE testdata VALUES (1,4),(2,5),(3,6); 
    -- Execute the following SELECT TRANSFORM statement:
    SELECT TRANSFORM(testdata.c1, testdata.c2) USING 'java -cp Sum.jar com.aliyun.odps.test.Sum' RESOURCES 'Sum.jar' FROM testdata;
    -- The preceding statement is equivalent to the following statement:
    set odps.sql.session.resources=Sum.jar;
    SELECT TRANSFORM(testdata.c1, testdata.c2) USING 'java -cp Sum.jar com.aliyun.odps.test.Sum' FROM testdata;
    The SELECT TRANSFORM statement returns the following result:
    +-----+
    | cnt |
    +-----+
    | 5   |
    | 7   |
    | 9   |
    +-----+
You can run most Java utilities by using the preceding method.
Note Although user-defined table-valued function (UDTF) frameworks are provided for Java and Python, the SELECT TRANSFORM statement allows you to write code in an easier manner. It simplifies the development process because it does not require additional dependencies, has no format requirements, and can directly reuse existing scripts. The JAVA_HOME and PYTHON_HOME environment variables point to the Java and Python runtimes that execute these scripts.

Call scripts of other languages

SELECT TRANSFORM also supports commonly used UNIX commands and script interpreters such as AWK and Perl.
Note PHP and Ruby are not deployed in MaxCompute clusters. Therefore, you cannot call PHP or Ruby scripts in MaxCompute.
  • The following example shows how to call AWK to display the second column:
    SELECT TRANSFORM(*) USING "awk '//{print $2}'" AS (data) FROM testdata;
  • The following example shows how to call Perl:
    SELECT TRANSFORM (testdata.c1, testdata.c2) USING "perl -e 'while($input = <STDIN>){print $input;}'" FROM testdata;

Call scripts in tandem

SELECT TRANSFORM allows you to call scripts in tandem. For example, you can use DISTRIBUTE BY and SORT BY to preprocess data.
SELECT TRANSFORM(key, value) USING 'cmd2' FROM 
(
    SELECT TRANSFORM(*) USING 'cmd1' FROM 
    (
        SELECT * FROM data DISTRIBUTE BY col2 SORT BY col1
    ) t DISTRIBUTE BY key SORT BY value
) t2;
You can also use the MAP or REDUCE keyword to obtain the same results.
@a := SELECT * FROM data DISTRIBUTE BY col2 SORT BY col1;
@b := MAP * USING 'cmd1' DISTRIBUTE BY col1 SORT BY col2 FROM @a;
REDUCE * USING 'cmd2' FROM @b;
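Conceptually, each nested SELECT TRANSFORM pipes all rows through one child process, and DISTRIBUTE BY/SORT BY reshuffles the rows between stages. The following Python sketch models this flow on a single machine; the two inline commands are hypothetical stand-ins for cmd1 and cmd2, and a plain sort stands in for the shuffle:

```python
# Model of chained transforms: rows -> child 1 -> shuffle/sort -> child 2.
# The two inline commands are stand-ins for 'cmd1' and 'cmd2'.
import subprocess
import sys

def transform(rows, code):
    """Pipe rows through a child Python process, one line per row."""
    out = subprocess.run([sys.executable, '-c', code],
                         input=''.join(r + '\n' for r in rows),
                         capture_output=True, text=True).stdout
    return out.splitlines()

rows = ['3\tc', '1\ta', '2\tb']
# Stage 1 ('cmd1'): upper-case every row.
stage1 = transform(rows, 'import sys\nfor l in sys.stdin: sys.stdout.write(l.upper())')
# The DISTRIBUTE BY / SORT BY between the stages is modeled as a plain sort.
# Stage 2 ('cmd2'): prefix each row with its position.
stage2 = transform(sorted(stage1),
                   'import sys\n'
                   'for i, l in enumerate(sys.stdin): sys.stdout.write(str(i) + "\\t" + l)')
# stage2 == ['0\t1\tA', '1\t2\tB', '2\t3\tC']
```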

Benefits and scenarios of SELECT TRANSFORM

  • Scenarios

    SELECT TRANSFORM and UDTFs deliver different performance benefits in different scenarios. In general, SELECT TRANSFORM provides better performance for small data queries, whereas UDTFs provide better performance for large data queries. SELECT TRANSFORM facilitates a simpler development process. Therefore, SELECT TRANSFORM is more suitable for ad hoc data analytics.

  • Benefits of UDTFs
    • For UDTFs, the output results and input parameters support multiple data types. For SELECT TRANSFORM, the child process transfers data based on the standard input and standard output and processes all data as the STRING type. Therefore, SELECT TRANSFORM requires one more step of data type conversion than UDTFs.
    • For SELECT TRANSFORM, data transfer depends on the operating system pipe, which has a fixed buffer of only 4 KB that cannot be resized. If the pipe is empty, the reader waits; if the pipe is full, the writer waits. In either case, SELECT TRANSFORM cannot make progress. UDTFs are not affected in such cases.
    • For UDTFs, the constant parameters do not need to be transmitted. However, SELECT TRANSFORM does not support this feature.
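The pipe limitation above has a practical consequence: a transform script should process input in a streaming fashion and flush its output regularly, rather than buffering all input before writing anything, so that neither side of the pipe fills up while the other waits. A minimal sketch of this pattern (the echo child is a stand-in for a real transform):

```python
# A child process that streams: it handles one line at a time and flushes,
# so neither side of the pipe fills up while the other side waits.
import subprocess
import sys

CHILD = ('import sys\n'
         'for line in sys.stdin: sys.stdout.write(line); sys.stdout.flush()')

child = subprocess.Popen([sys.executable, '-c', CHILD],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
responses = []
for row in ['1\t4', '2\t5']:
    child.stdin.write(row + '\n')
    child.stdin.flush()          # push the row through the pipe immediately
    responses.append(child.stdout.readline().rstrip('\n'))
child.stdin.close()
child.wait()
# responses == ['1\t4', '2\t5']
```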
  • Benefits of SELECT TRANSFORM
    • SELECT TRANSFORM runs two processes, the parent and the child, whereas a UDTF runs in a single process. If the computing resource usage is high and the data throughput is low, SELECT TRANSFORM can take advantage of the multiple cores of the server.
    • SELECT TRANSFORM calls underlying systems to read and write data during data transmission. This allows SELECT TRANSFORM to provide higher performance in data transfer than Java programs.
    • SELECT TRANSFORM supports tools such as AWK that can run native code. This allows SELECT TRANSFORM to deliver more performance benefits than Java programs.