The SELECT TRANSFORM statement starts a specified child process, feeds input data to the child process in the required format through standard input, and then parses the standard output of the child process to obtain the output data. SELECT TRANSFORM allows MaxCompute SQL statements to call other scripting languages, so you do not need to write user-defined functions (UDFs).
Syntax overview
```sql
SELECT TRANSFORM(arg1, arg2 ...)
(ROW FORMAT DELIMITED (FIELDS TERMINATED BY field_delimiter (ESCAPED BY character_escape))
 (LINES SEPARATED BY line_separator)
 (NULL DEFINED AS null_value))
USING 'unix_command_line'
(RESOURCES 'res_name' (',' 'res_name')*)
(AS col1, col2 ...)
(ROW FORMAT DELIMITED (FIELDS TERMINATED BY field_delimiter (ESCAPED BY character_escape))
 (LINES SEPARATED BY line_separator) (NULL DEFINED AS null_value))
```
Clauses enclosed in parentheses are optional.
- The SELECT TRANSFORM keyword can be replaced with the `MAP` or `REDUCE` keyword. They have the same semantics. We recommend that you use `SELECT TRANSFORM` to make the syntax clearer.
- `arg1, arg2 ...` specifies the input parameters. The format of the TRANSFORM clause is similar to that of the SELECT clause. By default, the result of each parameter expression is implicitly converted into a value of the STRING type, and the values are then joined with `\t` and passed to the specified child process.
- The USING clause specifies the command that is used to start the child process.
  - In most MaxCompute SQL statements, the USING clause specifies resources. In the SELECT TRANSFORM statement, however, the USING clause specifies the command that starts the child process. This usage ensures compatibility with the Apache Hive syntax.
  - The syntax of the USING clause is similar to that of a shell script. However, instead of running a shell script, the USING clause creates a child process based on the input command. Therefore, shell features such as input and output redirection, pipes, and loops are unavailable. If necessary, you can use a shell script itself as the command that starts the child process.
- The RESOURCES clause specifies the resources that the specified child process can access. You can use one of the following methods to specify resources:
  - Use the RESOURCES clause, for example, `USING 'sh foo.sh bar.txt' RESOURCES 'foo.sh','bar.txt'`.
  - Use flags. Run `set odps.sql.session.resources=foo.sh,bar.txt;` before the SQL statement. This is a global configuration that allows all `SELECT TRANSFORM` statements to access the specified resources. Separate multiple resource files with commas (,).
- The ROW FORMAT clause specifies the input and output formats. The syntax includes two ROW FORMAT clauses: the first specifies the format of the input data, and the second specifies the format of the output data. By default, `\t` is used as the column delimiter, `\n` is used as the row delimiter, and `\N` represents NULL values.
  Note:
  - field_delimiter, character_escape, and line_separator accept only a single character. If you specify a string, the first character of the string is used.
  - MaxCompute supports the formats that are specified by Apache Hive, such as `inputRecordReader`, `outputRecordReader`, and `SerDe`. To use these formats, you must enable the Hive compatibility mode by running `set odps.sql.hive.compatible=true;` before the SQL statement. For more information about the formats that Apache Hive supports, see the Hive documentation.
  - If you specify a Hive-supported format such as `inputRecordReader` or `outputRecordReader`, SQL statements may be executed at a lower speed.
- The AS clause specifies the output columns. You can specify data types for the output columns, for example, `AS (col1:bigint, col2:boolean)`.
  - If you do not specify data types for the output columns, the default STRING type is used, for example, `AS (col1, col2)`.
  - The output data is obtained by parsing the standard output of the child process. If a specified column is not of the STRING type, the system implicitly calls the `CAST` function to convert the STRING output to the specified type. Runtime exceptions may occur during the conversion.
  - You cannot specify data types for only some of the columns. For example, `AS (col1, col2:BIGINT)` is invalid.
  - If you omit the `AS` clause, the field before the first `\t` in the standard output is the key and everything that follows is the value. This is equivalent to `AS (key, value)`.

A short example that combines these clauses appears after this list.
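Putting these clauses together, the following minimal sketch inlines its input data and uses `cat` as the child process command so that it is self-contained. The delimiters, column names, and values are illustrative assumptions, not part of the syntax:

```sql
-- Illustrative sketch of the full clause layout.
SELECT TRANSFORM(t.a, t.b)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' NULL DEFINED AS 'NULL'  -- format of the data fed to the child process
    USING 'cat'                                                           -- the child process echoes its input unchanged
    AS (x, y)                                                             -- output columns default to the STRING type
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' NULL DEFINED AS 'NULL'  -- format expected from the child output
FROM (SELECT 1 AS a, 'foo' AS b) t;
```

Each input row is serialized as `1,foo`, passed to `cat` through standard input, and the echoed output is parsed back into the columns x and y.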
Call shell scripts
Use shell commands as the input for the `TRANSFORM` clause. The following example generates the numbers 1 through 50 and returns them in the `data` field:

```sql
SELECT TRANSFORM(script) USING 'sh' AS (data)
FROM (
    SELECT 'for i in `seq 1 50`; do echo $i; done' AS script
) t;
```

`SELECT TRANSFORM` is more than a language extension. With the AWK, Python, Perl, and shell support of SELECT TRANSFORM, you can write scripts directly in the command to implement simple features, without compiling independent script files or uploading resources. This simplifies the development process. To implement complex features, you can upload script files. For more information, see Call Python scripts.
Call Python scripts
- Compile a Python script file. In this example, the file name is myplus.py. The script is written for Python 2; a Python 3 sketch appears at the end of this section.

  ```python
  #!/usr/bin/env python
  import sys

  line = sys.stdin.readline()
  while line:
      token = line.split('\t')
      if (token[0] == '\\N') or (token[1] == '\\N'):
          print '\\N'
      else:
          print str(token[0]) + '\t' + str(token[1])
      line = sys.stdin.readline()
  ```
- Add the Python script file as a resource to MaxCompute.

  ```
  add py ./myplus.py -f;
  ```
  Note: You can also use the DataWorks console to add resources.
- Execute the `SELECT TRANSFORM` statement to call the resource.

  ```sql
  -- Create a test table.
  CREATE TABLE testdata (c1 BIGINT, c2 BIGINT);
  -- Insert test data into the test table.
  INSERT INTO TABLE testdata VALUES (1,4),(2,5),(3,6);
  -- Execute the following SELECT TRANSFORM statement:
  SELECT TRANSFORM(testdata.c1, testdata.c2)
  USING 'python myplus.py' RESOURCES 'myplus.py'
  AS (result1, result2)
  FROM testdata;
  -- The preceding statement is equivalent to the following statements:
  set odps.sql.session.resources=myplus.py;
  SELECT TRANSFORM(testdata.c1, testdata.c2)
  USING 'python myplus.py'
  AS (result1, result2)
  FROM testdata;
  ```
  Note: In MaxCompute, Python commands can also be used directly as the input for the `TRANSFORM` clause, in the same way as shell commands. For example:

  ```sql
  SELECT TRANSFORM('for i in xrange(1, 50): print i;') USING 'python' AS (data);
  ```

The SELECT TRANSFORM statement that calls myplus.py returns the following result:

```
+------------+------------+
| result1    | result2    |
+------------+------------+
| 1          | 4          |
|            | NULL       |
| 2          | 5          |
|            | NULL       |
| 3          | 6          |
|            | NULL       |
+------------+------------+
```
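The myplus.py script above uses Python 2 syntax. If the `python` command in your environment resolves to Python 3, which is an assumption that depends on your cluster configuration, a minimal equivalent sketch is:

```python
#!/usr/bin/env python3
# Minimal Python 3 sketch equivalent to myplus.py: echoes two
# tab-delimited columns, emitting \N when either column is NULL.
import sys

for line in sys.stdin:
    token = line.rstrip('\n').split('\t')
    if token[0] == '\\N' or token[1] == '\\N':
        print('\\N')
    else:
        print(token[0] + '\t' + token[1])
```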
Call Java scripts
The procedure for calling Java scripts is similar to that for calling Python scripts: compile a script file, run the `add` command to add the JAR package as a resource to MaxCompute, and then execute the `SELECT TRANSFORM` statement to call the resource.
- Compile a Java script file and export it as a JAR package. In this example, the name of the JAR package is Sum.jar.

  ```java
  package com.aliyun.odps.test;

  import java.util.Scanner;

  public class Sum {
      public static void main(String[] args) {
          Scanner sc = new Scanner(System.in);
          while (sc.hasNext()) {
              String s = sc.nextLine();
              String[] tokens = s.split("\t");
              if (tokens.length < 2) {
                  throw new RuntimeException("illegal input");
              }
              if (tokens[0].equals("\\N") || tokens[1].equals("\\N")) {
                  System.out.println("\\N");
                  continue; // skip the sum for NULL input to avoid a NumberFormatException
              }
              System.out.println(Long.parseLong(tokens[0]) + Long.parseLong(tokens[1]));
          }
      }
  }
  ```
- Add the JAR package as a resource to MaxCompute.

  ```
  add jar ./Sum.jar -f;
  ```
- Execute the `SELECT TRANSFORM` statement to call the resource.

  ```sql
  -- Create a test table.
  CREATE TABLE testdata (c1 BIGINT, c2 BIGINT);
  -- Insert test data into the test table.
  INSERT INTO TABLE testdata VALUES (1,4),(2,5),(3,6);
  -- Execute the following SELECT TRANSFORM statement:
  SELECT TRANSFORM(testdata.c1, testdata.c2)
  USING 'java -cp Sum.jar com.aliyun.odps.test.Sum' RESOURCES 'Sum.jar'
  FROM testdata;
  -- The preceding statement is equivalent to the following statements:
  set odps.sql.session.resources=Sum.jar;
  SELECT TRANSFORM(testdata.c1, testdata.c2)
  USING 'java -cp Sum.jar com.aliyun.odps.test.Sum'
  FROM testdata;
  ```

  The SELECT TRANSFORM statement returns the following result:

  ```
  +------------+
  | cnt        |
  +------------+
  | 5          |
  | 7          |
  | 9          |
  +------------+
  ```
The `SELECT TRANSFORM` statement allows you to write code in a simpler manner. It simplifies the development process because it does not require additional dependencies, has no format requirements, and can even directly reuse offline scripts. The `JAVA_HOME` and `PYTHON_HOME` environment variables provide the paths of the offline versions of Java and Python.
Call scripts of other languages
`SELECT TRANSFORM` also supports commonly used UNIX commands and script interpreters, such as AWK and Perl.
- The following example shows how to call AWK to display the second column:

  ```sql
  SELECT TRANSFORM(*) USING "awk '//{print $2}'" AS (data) FROM testdata;
  ```
- The following example shows how to call Perl:

  ```sql
  SELECT TRANSFORM(testdata.c1, testdata.c2) USING "perl -e 'while($input = <STDIN>){print $input;}'" FROM testdata;
  ```
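Plain UNIX commands work the same way. The following sketch, which assumes the testdata table from the previous examples, uses `wc` to count the rows that arrive on standard input. Because the command runs once per worker, a parallel job may return one count per worker instance:

```sql
-- Count the rows that each worker receives on standard input.
SELECT TRANSFORM(*) USING 'wc -l' AS (cnt) FROM testdata;
```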
Call scripts in tandem
`SELECT TRANSFORM` allows you to call scripts in tandem. For example, you can use `DISTRIBUTE BY` and `SORT BY` to preprocess data:

```sql
SELECT TRANSFORM(key, value) USING 'cmd2' FROM
(
    SELECT TRANSFORM(*) USING 'cmd1' FROM
    (
        SELECT * FROM data DISTRIBUTE BY col2 SORT BY col1
    ) t DISTRIBUTE BY key SORT BY value
) t2;
```
You can also use the `MAP` or `REDUCE` keyword to obtain the same results:

```sql
@a := SELECT * FROM data DISTRIBUTE BY col2 SORT BY col1;
@b := MAP * USING 'cmd1' DISTRIBUTE BY col1 SORT BY col2 FROM @a;
REDUCE * USING 'cmd2' FROM @b;
```
Benefits and scenarios of SELECT TRANSFORM
- Scenarios
  `SELECT TRANSFORM` and UDTFs deliver different performance benefits in different scenarios. In general, `SELECT TRANSFORM` performs better for small data queries, whereas UDTFs perform better for large data queries. Because `SELECT TRANSFORM` facilitates a simpler development process, it is more suitable for ad hoc data analytics.
- Benefits of UDTFs
  - For UDTFs, the output results and input parameters support multiple data types (see the sketch after this list). For `SELECT TRANSFORM`, the child process transfers data through standard input and standard output and processes all data as the STRING type. Therefore, `SELECT TRANSFORM` requires one more step of data type conversion than UDTFs.
  - For `SELECT TRANSFORM`, data transfer depends on the operating system pipe, which has only a 4 KB cache that cannot be resized. If the pipe is empty or fully occupied, the `SELECT TRANSFORM` child process cannot respond. UDTFs are not affected in such cases.
  - For UDTFs, constant parameters do not need to be transmitted. `SELECT TRANSFORM` does not support this feature.
- Benefits of SELECT TRANSFORM
  - SELECT TRANSFORM involves two processes, the parent process and the child process, whereas a UDTF runs in a single process. If the computing resource usage is high and the data throughput is low, SELECT TRANSFORM can take advantage of the multiple cores of the server.
  - SELECT TRANSFORM calls the underlying system to read and write data during data transmission. This allows SELECT TRANSFORM to provide higher data transfer performance than Java programs.
  - SELECT TRANSFORM supports tools such as AWK that can run native code, which allows it to deliver more performance benefits than Java programs.
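For comparison with the Sum.jar example above, the following is a minimal sketch of a MaxCompute Java UDTF that sums two BIGINT columns with typed inputs and outputs. The class name is illustrative, the odps-sdk-udf dependency is assumed, and the exact `@Resolve` signature format may vary across SDK versions:

```java
package com.aliyun.odps.test;

// Minimal UDTF sketch (assumes the odps-sdk-udf dependency).
import com.aliyun.odps.udf.UDFException;
import com.aliyun.odps.udf.UDTF;
import com.aliyun.odps.udf.annotation.Resolve;

@Resolve("bigint,bigint->bigint")
public class SumUdtf extends UDTF {
    @Override
    public void process(Object[] args) throws UDFException {
        // Inputs arrive with their declared types; no STRING round-trip
        // through standard input and output as in SELECT TRANSFORM.
        Long a = (Long) args[0];
        Long b = (Long) args[1];
        if (a == null || b == null) {
            forward((Object) null); // emit NULL while keeping the output typed
        } else {
            forward(a + b);
        }
    }
}
```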