
MaxCompute:SELECT TRANSFORM

Last Updated: Mar 25, 2026

SELECT TRANSFORM pipes rows from a MaxCompute table through an external script or shell command and maps the script's standard output back to SQL columns. This lets you run AWK, Python, Perl, Shell, or Java logic inline in a SQL query without writing a user-defined table-valued function (UDTF).

When to use SELECT TRANSFORM

SELECT TRANSFORM and UDTFs both process rows with custom code, but they suit different situations:

| Dimension | SELECT TRANSFORM | UDTF |
| --- | --- | --- |
| Development effort | Write a shell command or inline script directly in SQL; no packaging required | Requires writing and deploying a Java or Python class |
| Ad hoc analysis | Fast iteration; works with AWK, Python, Perl, and Shell without uploading files | Better suited to production pipelines |
| Data types | All input and output is treated as STRING; explicit conversion required | Full data type support for input and output |
| Data transmission | OS pipeline-based; fixed 4 KB pipeline buffer; lower overhead than JVM-based code | No pipeline buffer limit |
| Constant parameters | Always transmitted to the child process | Optional |
| Process model | Spawns a child process; exploits multi-core servers when computation is heavy and throughput is low | Runs in a single process |
| Performance at small data volumes | Higher; native tools such as AWK bypass JVM overhead | Lower |
| Performance at large data volumes | Lower | Higher |

Bottom line: Use SELECT TRANSFORM for ad hoc analysis and lightweight scripting. For high-throughput production workloads, use UDTFs.

Limitations

PHP and Ruby are not deployed on MaxCompute compute clusters. You cannot call PHP or Ruby scripts in MaxCompute.

Syntax

SELECT TRANSFORM(<arg1>, <arg2>, ...)
  [ROW FORMAT DELIMITED
    [FIELDS TERMINATED BY '<field_delimiter>' [ESCAPED BY '<character_escape>']]
    [NULL DEFINED AS '<null_value>']]
USING '<unix_command_line>'
  [RESOURCES '<res_name>' [, '<res_name>'...]]
[AS (<col1> [<type>], <col2> [<type>], ...)]
  [ROW FORMAT DELIMITED
    [FIELDS TERMINATED BY '<field_delimiter>' [ESCAPED BY '<character_escape>']]
    [NULL DEFINED AS '<null_value>']]

The keywords map and reduce are aliases of SELECT TRANSFORM with identical semantics. Prefer SELECT TRANSFORM for clarity.

Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| arg1, arg2, ... | Yes | Input columns or expressions. Values are implicitly cast to STRING, joined with \t, and written to the child process's standard input. |
| First ROW FORMAT | No | Format of data sent to the child process. Defaults: \t (field delimiter), \n (row delimiter), \N (NULL). |
| USING '<unix_command_line>' | Yes | Command used to start the child process. Creates a subprocess, not a shell session, so shell features such as pipes, loops, and I/O redirection are not available in the command string itself. To use them, pass a shell script as the command (for example, 'sh myscript.sh'). |
| RESOURCES '<res_name>' | No | Resource files the child process can access. Alternative: set odps.sql.session.resources globally (see Attach script files). |
| AS (<col1>, <col2>, ...) | No | Output column names and optional types. If omitted, output defaults to key STRING, value STRING, where key is everything before the first \t and value is the rest. Type all columns or none; you cannot type only some. |
| Second ROW FORMAT | No | Format of data read from the child process's standard output. Same defaults as the first ROW FORMAT. |
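
The "subprocess, not a shell session" distinction in USING can be illustrated locally with Python's subprocess module. This is only a local analogy (seq, tac, and sh are assumed to be on the PATH of a POSIX system; MaxCompute itself is not involved):

```python
import subprocess

# An exec-style launch, like USING: the whole string must name one
# executable, so shell syntax such as pipes is not interpreted.
try:
    subprocess.run(["seq 1 3 | tac"])
except FileNotFoundError:
    print("no executable named 'seq 1 3 | tac'")

# Wrapping the command in a shell restores pipes, loops, and redirection,
# the same way passing a shell script to USING would.
out = subprocess.run(["sh", "-c", "seq 1 3 | tac"],
                     capture_output=True, text=True)
print(out.stdout.splitlines())  # ['3', '2', '1']
```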

ROW FORMAT constraints

  • field_delimiter and character_escape are single characters. If you specify a multi-character string, only the first character is used.

  • ROW FORMAT with inputRecordReader, outputRecordReader, or SerDe requires Hive-compatible mode. Add set odps.sql.hive.compatible=true; before the SQL statement. These Hive-specific formats may reduce execution speed.

For the full Apache Hive ROW FORMAT syntax, see Hive LanguageManual Transform.

Data format behavior

Understanding how data moves between MaxCompute and the child process prevents silent errors.

Input to child process:

  • All column values are cast to STRING.

  • Columns are joined with \t and written to standard input, one row per line.

Output from child process:

  • Each line of standard output becomes one output row.

  • Fields are split on \t (or the delimiter you set in the second ROW FORMAT).

  • If the AS clause is omitted, output maps to key STRING, value STRING — the field before the first \t becomes key and all the following parts become value. This is equivalent to AS(key, value).

  • If output columns are not STRING, MaxCompute calls CAST implicitly to convert values. A runtime exception may occur if the conversion fails.
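
The rules above can be sketched as a local round trip, using cat as the child process (a simulation on a POSIX system; MaxCompute itself is not involved):

```python
import subprocess

rows = [(1, 4), (2, 5), (3, 6)]

# Input side: cast every value to STRING, join columns with \t,
# and write one row per line to the child's standard input.
stdin_data = "".join("\t".join(str(v) for v in row) + "\n" for row in rows)

# Child process: 'cat' simply echoes its input back.
child = subprocess.run(["cat"], input=stdin_data,
                       capture_output=True, text=True)

# Output side with no AS clause: split each line on the FIRST \t only,
# yielding the default (key STRING, value STRING) pair.
out = [line.split("\t", 1) for line in child.stdout.splitlines()]
print(out)  # [['1', '4'], ['2', '5'], ['3', '6']]
```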

Inline script examples

For simple transformations, write the script directly in the SQL statement. No resource upload is needed.

Shell

Generate 50 rows numbered 1 to 50:

SELECT TRANSFORM('for i in `seq 1 50`; do echo $i; done') USING 'sh' AS (data);

Output:

+------+
| data |
+------+
| 1    |
| 2    |
| ...  |
| 50   |
+------+

Python

Generate 50 rows numbered 1 to 50. The python command on MaxCompute clusters invokes Python 2, which is why xrange is used:

SELECT TRANSFORM('for i in xrange(1, 51):  print(i);') USING 'python' AS (data);

AWK

Extract the second column from a table:

-- Create and populate a test table
CREATE TABLE testdata (c1 BIGINT, c2 BIGINT);
INSERT INTO TABLE testdata VALUES (1, 4), (2, 5), (3, 6);

-- Extract c2 using AWK
SELECT TRANSFORM(*) USING "awk '//{print $2}'" AS (data) FROM testdata;

Output:

+------+
| data |
+------+
| 4    |
| 5    |
| 6    |
+------+

Perl

Echo all columns back (demonstrating the default key/value output when AS is omitted):

CREATE TABLE testdata (c1 BIGINT, c2 BIGINT);
INSERT INTO TABLE testdata VALUES (1, 4), (2, 5), (3, 6);

SELECT TRANSFORM(testdata.c1, testdata.c2)
USING "perl -e 'while($input = <STDIN>){print $input;}'"
FROM testdata;

Output (no AS clause — defaults to key, value):

+-----+-------+
| key | value |
+-----+-------+
| 1   | 4     |
| 2   | 5     |
| 3   | 6     |
+-----+-------+

Attach script files

For complex logic, upload a script file as a MaxCompute resource, then reference it in USING.

Using the RESOURCES clause

Specify the resource inline in the statement:

SELECT TRANSFORM(...) USING 'python myplus.py' RESOURCES 'myplus.py' AS (...) FROM ...;

Using a session-level setting

Set resources once for all SELECT TRANSFORM statements in the session:

set odps.sql.session.resources=myplus.py,bar.txt;

Multiple resource files are separated by commas. After this setting takes effect, all subsequent SELECT TRANSFORM statements in the session can access the listed resources without specifying RESOURCES in each statement.

Python script example

  1. Create myplus.py:

    #!/usr/bin/env python
    import sys
    
    line = sys.stdin.readline()
    while line:
        token = line.split('\t')
        if (token[0] == '\\N') or (token[1] == '\\N'):
            print('\\N')
        else:
            print(str(token[0]) + '\t' + str(token[1]))
        line = sys.stdin.readline()
  2. Upload the file as a MaxCompute resource:

    add py ./myplus.py -f;

    You can also upload the file through the DataWorks console. For details, see Create and use MaxCompute resources.
  3. Run the transform:

    CREATE TABLE testdata (c1 BIGINT, c2 BIGINT);
    INSERT INTO TABLE testdata VALUES (1, 4), (2, 5), (3, 6);
    
    SELECT TRANSFORM(testdata.c1, testdata.c2)
    USING 'python myplus.py' RESOURCES 'myplus.py'
    AS (result1, result2)
    FROM testdata;

    Output (the extra rows with NULL appear because readline keeps the trailing \n in token[1], so each print emits an additional blank line):

    +---------+---------+
    | result1 | result2 |
    +---------+---------+
    | 1       | 4       |
    |         | NULL    |
    | 2       | 5       |
    |         | NULL    |
    | 3       | 6       |
    |         | NULL    |
    +---------+---------+
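
The interleaved blank/NULL rows come from readline keeping the trailing \n: token[1] is "4\n", so each print emits an extra empty line, and the comparison of token[1] against "\N" never matches. A pure-Python rerun of the script's loop (local sketch, no MaxCompute involved) shows the raw stream the child writes:

```python
import io

def myplus(stdin):
    # Same loop as myplus.py, writing to a string instead of stdout.
    out = io.StringIO()
    line = stdin.readline()
    while line:
        token = line.split('\t')
        if (token[0] == '\\N') or (token[1] == '\\N'):
            print('\\N', file=out)
        else:
            print(str(token[0]) + '\t' + str(token[1]), file=out)
        line = stdin.readline()
    return out.getvalue()

raw = myplus(io.StringIO("1\t4\n2\t5\n3\t6\n"))
# token[1] still ends in \n, so every print produces a trailing blank line.
print(repr(raw))  # '1\t4\n\n2\t5\n\n3\t6\n\n'
# MaxCompute parses each blank line as a row with an empty key and a
# NULL value, which is exactly the 6-row result shown above.
```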

Java script example

  1. Write and compile a Java class, then export it as Sum.jar:

    package com.aliyun.odps.test;
    import java.util.Scanner;
    
    public class Sum {
        public static void main(String[] args) {
            Scanner sc = new Scanner(System.in);
            while (sc.hasNext()) {
                String s = sc.nextLine();
                String[] tokens = s.split("\t");
                if (tokens.length < 2) {
                    throw new RuntimeException("illegal input");
                }
                if (tokens[0].equals("\\N") || tokens[1].equals("\\N")) {
                    System.out.println("\\N");
                    continue; // parseLong would throw on "\N"
                }
                System.out.println(Long.parseLong(tokens[0]) + Long.parseLong(tokens[1]));
            }
        }
    }
  2. Upload the JAR file as a MaxCompute resource:

    add jar ./Sum.jar -f;
  3. Run the transform:

    CREATE TABLE testdata (c1 BIGINT, c2 BIGINT);
    INSERT INTO TABLE testdata VALUES (1, 4), (2, 5), (3, 6);
    
    SELECT TRANSFORM(testdata.c1, testdata.c2)
    USING 'java -cp Sum.jar com.aliyun.odps.test.Sum' RESOURCES 'Sum.jar'
    AS cnt
    FROM testdata;

    Output:

    +-----+
    | cnt |
    +-----+
    | 5   |
    | 7   |
    | 9   |
    +-----+
Java and Python each have a UDTF framework, but SELECT TRANSFORM is often faster to write: no dependencies and no class structure are required. The java and python commands are resolved from the JAVA_HOME and PYTHON_HOME environment variables on the cluster, respectively.

Chain multiple transforms

Pass the output of one transform as the input to the next by nesting SELECT TRANSFORM statements. Use DISTRIBUTE BY and SORT BY to control how rows are partitioned and ordered before each transform.

SELECT TRANSFORM(key, value) USING '<cmd2>' FROM
(
    SELECT TRANSFORM(*) USING '<cmd1>' FROM
    (
        SELECT * FROM testdata DISTRIBUTE BY c2 SORT BY c1
    ) t DISTRIBUTE BY key SORT BY value
) t2;

The map and reduce aliases make the map-reduce pattern explicit:

@a := SELECT * FROM data DISTRIBUTE BY col2 SORT BY col1;
@b := map * USING 'cmd1' DISTRIBUTE BY col1 SORT BY col2 FROM @a;
reduce * USING 'cmd2' FROM @b;

cmd1 and cmd2 are the commands used to start each child process.
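
The nesting above behaves like a Unix pipeline with a shuffle between stages. Locally, two child processes can be chained by feeding one's output to the other (a sketch assuming awk on the PATH; here cmd1 projects the second column and cmd2 doubles it, both stand-ins for real transform commands):

```python
import subprocess

rows = "1\t4\n2\t5\n3\t6\n"

# Stage 1 (cmd1): project the second column, as in the AWK example above.
s1 = subprocess.run(["awk", "{print $2}"], input=rows,
                    capture_output=True, text=True)

# Stage 2 (cmd2): a second transform consuming stage 1's output.
s2 = subprocess.run(["awk", "{print $1 * 2}"], input=s1.stdout,
                    capture_output=True, text=True)

print(s2.stdout.splitlines())  # ['8', '10', '12']
```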