SELECT TRANSFORM pipes rows from a MaxCompute table through an external script or shell command and maps the script's standard output back to SQL columns. This lets you run AWK, Python, Perl, Shell, or Java logic inline in a SQL query without writing a user-defined table-valued function (UDTF).
When to use SELECT TRANSFORM
SELECT TRANSFORM and UDTFs both process rows with custom code, but they suit different situations:
| Dimension | SELECT TRANSFORM | UDTF |
|---|---|---|
| Development effort | Write a shell command or inline script directly in SQL — no packaging required | Requires writing and deploying a Java or Python class |
| Ad hoc analysis | Fast iteration; works with AWK, Python, Perl, Shell without uploading files | Better suited to production pipelines |
| Data types | All input and output is treated as STRING; explicit conversion required | Full data type support for input and output |
| Data transmission | OS pipeline-based; 4 KB pipeline buffer (fixed); lower overhead than JVM-based code | No pipeline buffer limit |
| Constant parameters | Always transmitted to the child process | Optional |
| Process model | Spawns a child process; leverages multi-core servers when compute is high and throughput is low | Runs in a single process |
| Performance at small data volumes | Higher — native tools such as AWK bypass JVM overhead | Lower |
| Performance at large data volumes | Lower | Higher |
Bottom line: Use SELECT TRANSFORM for ad hoc analysis and lightweight scripting. For high-throughput production workloads, use UDTFs.
Limitations
PHP and Ruby are not deployed on MaxCompute compute clusters. You cannot call PHP or Ruby scripts in MaxCompute.
Syntax
```sql
SELECT TRANSFORM(<arg1>, <arg2>, ...)
[ROW FORMAT DELIMITED
  [FIELDS TERMINATED BY '<field_delimiter>' [ESCAPED BY '<character_escape>']]
  [NULL DEFINED AS '<null_value>']]
USING '<unix_command_line>'
[RESOURCES '<res_name>' [, '<res_name>'...]]
[AS (<col1> [<type>], <col2> [<type>], ...)]
[ROW FORMAT DELIMITED
  [FIELDS TERMINATED BY '<field_delimiter>' [ESCAPED BY '<character_escape>']]
  [NULL DEFINED AS '<null_value>']]
```

The `map` and `reduce` keywords are aliases for `SELECT TRANSFORM` with identical semantics; use `SELECT TRANSFORM` for clarity.
Parameters
| Parameter | Required | Description |
|---|---|---|
| `arg1, arg2, ...` | Yes | Input columns or expressions. Values are implicitly cast to STRING, joined with `\t`, and written to the child process's standard input. |
| First `ROW FORMAT` | No | Format of data sent to the child process. Defaults: `\t` (field delimiter), `\n` (row delimiter), `\N` (null). |
| `USING '<unix_command_line>'` | Yes | Command used to start the child process. Creates a subprocess, not a shell session. Shell features such as pipes, loops, and I/O redirection are not available in the command string itself. To use those features, pass a shell script as the command (for example, `USING 'sh myscript.sh'`). |
| `RESOURCES '<res_name>'` | No | Resource files the child process can access. Alternative: set `odps.sql.session.resources` globally (see Attach script files). |
| `AS (<col1>, <col2>, ...)` | No | Output column names and optional types. If omitted, output defaults to `key STRING, value STRING`, where `key` is everything before the first `\t` and `value` is the rest. You cannot specify types for only some columns: either type all columns or none. |
| Second `ROW FORMAT` | No | Format of data read from the child process's standard output. Same defaults as the first `ROW FORMAT`. |
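The restriction on `USING` is easy to rehearse locally: the command string is launched as a plain subprocess, so shell pipes only work inside a script that the command invokes. A minimal sketch on any POSIX shell (the script name `to_csv.sh` is an illustration, not part of MaxCompute):

```shell
# Pipes are unavailable in the USING string itself, but work inside
# a script started via 'sh to_csv.sh'.
cat > to_csv.sh <<'EOF'
#!/bin/sh
# The pipe below is legal because this script runs under a real shell.
cat | tr '\t' ','
EOF

# Simulate MaxCompute feeding three tab-separated rows to the child process.
printf '1\t4\n2\t5\n3\t6\n' | sh to_csv.sh
```

On MaxCompute, the equivalent call would be `USING 'sh to_csv.sh'` with the script attached through `RESOURCES`.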
ROW FORMAT constraints
`field_delimiter` and `character_escape` are single characters. If you specify a multi-character string, only the first character is used.

`ROW FORMAT` with `inputRecordReader`, `outputRecordReader`, or `SerDe` requires Hive-compatible mode: add `set odps.sql.hive.compatible=true;` before the SQL statement. These Hive-specific formats may reduce execution speed.
For the full Apache Hive ROW FORMAT syntax, see Hive LanguageManual Transform.
Data format behavior
Understanding how data moves between MaxCompute and the child process prevents silent errors.
Input to the child process:

- All column values are cast to `STRING`.
- Columns are joined with `\t` and written to standard input, one row per line.

Output from the child process:

- Each line of standard output becomes one output row.
- Fields are split on `\t` (or the delimiter set in the second `ROW FORMAT`).
- If the `AS` clause is omitted, output maps to `key STRING, value STRING`: the field before the first `\t` becomes `key`, and everything after it becomes `value`. This is equivalent to `AS (key, value)`.
- If output columns are not `STRING`, MaxCompute calls `CAST` implicitly to convert values. A runtime exception may occur if the conversion fails.
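The default `key`/`value` split can be mimicked with standard Unix tools; a local sketch, nothing MaxCompute-specific:

```shell
# One line of child-process output with three tab-separated fields.
line=$(printf 'a\tb\tc')

# key   = everything before the first tab
# value = everything after it (embedded tabs preserved)
key=$(printf '%s\n' "$line" | cut -f1)
value=$(printf '%s\n' "$line" | cut -f2-)

printf 'key=%s\n' "$key"
printf 'value=%s\n' "$value"
```

Only the first tab splits the line, matching the implicit `AS (key, value)` mapping described above.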
Inline script examples
For simple transformations, write the script directly in the SQL statement. No resource upload is needed.
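Because the child process only ever sees tab-separated lines on standard input, an inline script can be rehearsed on a workstation before it goes into SQL; for example, feeding sample rows to an awk one-liner:

```shell
# Feed tab-separated sample rows to the script exactly as
# SELECT TRANSFORM would write them to the child's stdin.
printf '1\t4\n2\t5\n3\t6\n' | awk '{print $2}'
```

If the local output looks right, the same command can be dropped into `USING` unchanged.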
Shell
Generate 50 rows numbered 1 to 50:
```sql
SELECT TRANSFORM('for i in `seq 1 50`; do echo $i; done') USING 'sh' AS (data);
```

Output:

```
+------+
| data |
+------+
| 1    |
| 2    |
| ...  |
| 50   |
+------+
```

Python
Generate 50 rows numbered 1 to 50:
```sql
SELECT TRANSFORM('for i in xrange(1, 51): print(i);') USING 'python' AS (data);
```

Note that `xrange` exists only in Python 2, so this statement assumes the cluster-side `python` command resolves to a Python 2 interpreter.

AWK
Extract the second column from a table:
```sql
-- Create and populate a test table
CREATE TABLE testdata (c1 BIGINT, c2 BIGINT);
INSERT INTO TABLE testdata VALUES (1, 4), (2, 5), (3, 6);

-- Extract c2 using AWK
SELECT TRANSFORM(*) USING "awk '//{print $2}'" AS (data) FROM testdata;
```

Output:

```
+------+
| data |
+------+
| 4    |
| 5    |
| 6    |
+------+
```

Perl
Echo all columns back (demonstrating the default key/value output when AS is omitted):
```sql
CREATE TABLE testdata (c1 BIGINT, c2 BIGINT);
INSERT INTO TABLE testdata VALUES (1, 4), (2, 5), (3, 6);

SELECT TRANSFORM(testdata.c1, testdata.c2)
USING "perl -e 'while($input = <STDIN>){print $input;}'"
FROM testdata;
```

Output (no AS clause, so columns default to key and value):

```
+-----+-------+
| key | value |
+-----+-------+
| 1   | 4     |
| 2   | 5     |
| 3   | 6     |
+-----+-------+
```

Attach script files
For complex logic, upload a script file as a MaxCompute resource, then reference it in USING.
Using the RESOURCES clause
Specify the resource inline in the statement:
```sql
SELECT TRANSFORM(...) USING 'python myplus.py' RESOURCES 'myplus.py' AS (...) FROM ...;
```

Using a session-level setting
Set resources once for all SELECT TRANSFORM statements in the session:
```sql
set odps.sql.session.resources=myplus.py,bar.txt;
```

Multiple resource files are separated by commas. After this setting takes effect, all subsequent SELECT TRANSFORM statements in the session can access the listed resources without specifying RESOURCES in each statement.
Python script example
Create `myplus.py`:

```python
#!/usr/bin/env python
import sys

line = sys.stdin.readline()
while line:
    token = line.split('\t')
    if (token[0] == '\\N') or (token[1] == '\\N'):
        print('\\N')
    else:
        print(str(token[0]) + '\t' + str(token[1]))
    line = sys.stdin.readline()
```

Upload the file as a MaxCompute resource:
You can also upload the file through the DataWorks console. For details, see Create and use MaxCompute resources.
```
add py ./myplus.py -f;
```

Run the transform:
```sql
CREATE TABLE testdata (c1 BIGINT, c2 BIGINT);
INSERT INTO TABLE testdata VALUES (1, 4), (2, 5), (3, 6);

SELECT TRANSFORM(testdata.c1, testdata.c2)
USING 'python myplus.py'
RESOURCES 'myplus.py'
AS (result1, result2)
FROM testdata;
```

Output:
```
+---------+---------+
| result1 | result2 |
+---------+---------+
| 1       | 4       |
|         | NULL    |
| 2       | 5       |
|         | NULL    |
| 3       | 6       |
|         | NULL    |
+---------+---------+
```
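The interleaved rows with an empty `result1` and a NULL `result2` come from the script, not from MaxCompute: `readline()` keeps the trailing newline, so `token[1]` is `'4\n'`, and `print` appends a second newline, producing an extra empty output line per row. A local rehearsal shows this (run with `python3` here; the script behaves the same under Python 2):

```shell
cat > myplus.py <<'EOF'
#!/usr/bin/env python
import sys
line = sys.stdin.readline()
while line:
    token = line.split('\t')
    if (token[0] == '\\N') or (token[1] == '\\N'):
        print('\\N')
    else:
        # token[1] still carries the newline from readline(), so this
        # print emits TWO lines per input row.
        print(str(token[0]) + '\t' + str(token[1]))
    line = sys.stdin.readline()
EOF

# One input row comes back as two output lines: the row, then an empty line.
printf '1\t4\n' | python3 myplus.py | grep -c ''
```

Splitting a stripped line instead (`token = line.rstrip('\n').split('\t')`) would remove the spurious rows.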
Java script example
Write and compile a Java class, then export it as `Sum.jar` (note the `continue` after the null check; without it, `Long.parseLong("\N")` would throw):

```java
package com.aliyun.odps.test;

import java.util.Scanner;

public class Sum {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        while (sc.hasNext()) {
            String s = sc.nextLine();
            String[] tokens = s.split("\t");
            if (tokens.length < 2) {
                throw new RuntimeException("illegal input");
            }
            if (tokens[0].equals("\\N") || tokens[1].equals("\\N")) {
                System.out.println("\\N");
                continue; // skip the sum when either input is null
            }
            System.out.println(Long.parseLong(tokens[0]) + Long.parseLong(tokens[1]));
        }
    }
}
```

Upload the JAR file as a MaxCompute resource:
```
add jar ./Sum.jar -f;
```

Run the transform:

```sql
CREATE TABLE testdata (c1 BIGINT, c2 BIGINT);
INSERT INTO TABLE testdata VALUES (1, 4), (2, 5), (3, 6);

SELECT TRANSFORM(testdata.c1, testdata.c2)
USING 'java -cp Sum.jar com.aliyun.odps.test.Sum'
RESOURCES 'Sum.jar'
AS cnt
FROM testdata;
```

Output:
```
+-----+
| cnt |
+-----+
| 5   |
| 7   |
| 9   |
+-----+
```
Java and Python each have a UDTF framework, but `SELECT TRANSFORM` is often faster to write: no dependencies, no class structure required. Offline Java scripts are resolved from `JAVA_HOME`; offline Python scripts from `PYTHON_HOME`.
Chain multiple transforms
Pass the output of one transform as the input to the next by nesting SELECT TRANSFORM statements. Use DISTRIBUTE BY and SORT BY to control how rows are partitioned and ordered before each transform.
```sql
SELECT TRANSFORM(key, value) USING '<cmd2>' FROM
(
  SELECT TRANSFORM(*) USING '<cmd1>' FROM
  (
    SELECT * FROM testdata DISTRIBUTE BY c2 SORT BY c1
  ) t DISTRIBUTE BY key SORT BY value
) t2;
```

The `map` and `reduce` aliases make the map-reduce pattern explicit:
```sql
@a := SELECT * FROM data DISTRIBUTE BY col2 SORT BY col1;
@b := map * USING 'cmd1' DISTRIBUTE BY col1 SORT BY col2 FROM @a;
reduce * USING 'cmd2' FROM @b;
```

`cmd1` and `cmd2` are the commands used to start each child process.
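Conceptually, each chained transform is another stage of a pipeline, with `DISTRIBUTE BY`/`SORT BY` acting as the shuffle between stages. A rough single-machine analogy (with `sort` standing in for the shuffle; this is an illustration, not how MaxCompute schedules the job):

```shell
# stage 1 (cmd1): sum the two tab-separated fields of each row
# shuffle:        DISTRIBUTE BY / SORT BY, approximated by a plain sort
# stage 2 (cmd2): label each incoming row
printf '3\t6\n1\t4\n2\t5\n' \
  | awk '{print $1 + $2}' \
  | sort -n \
  | awk '{print "sum=" $0}'
```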