
MaxCompute:SELECT TRANSFORM

Last Updated: Mar 25, 2026

SELECT TRANSFORM pipes rows from a MaxCompute table through an external script or shell command and maps the script's standard output back to SQL columns. This lets you run AWK, Python, Perl, Shell, or Java logic inline in a SQL query without writing a user-defined table-valued function (UDTF).

When to use SELECT TRANSFORM

SELECT TRANSFORM and UDTFs both process rows with custom code, but they suit different situations:

| Dimension | SELECT TRANSFORM | UDTF |
| --- | --- | --- |
| Development effort | Write a shell command or inline script directly in SQL; no packaging required | Requires writing and deploying a Java or Python class |
| Ad hoc analysis | Fast iteration; works with AWK, Python, Perl, and Shell without uploading files | Better suited to production pipelines |
| Data types | All input and output is treated as STRING; explicit conversion required | Full data type support for input and output |
| Data transmission | OS pipeline-based; fixed 4 KB pipeline buffer; lower overhead than JVM-based code | No pipeline buffer limit |
| Constant parameters | Always transmitted to the child process | Optional |
| Process model | Spawns a child process; exploits multi-core servers when computation is heavy and throughput is low | Runs in a single process |
| Performance at small data volumes | Higher; native tools such as AWK bypass JVM overhead | Lower |
| Performance at large data volumes | Lower | Higher |

Bottom line: Use SELECT TRANSFORM for ad hoc analysis and lightweight scripting. For high-throughput production workloads, use UDTFs.

Limitations

PHP and Ruby are not deployed on MaxCompute compute clusters. You cannot call PHP or Ruby scripts in MaxCompute.

Syntax

SELECT TRANSFORM(<arg1>, <arg2>, ...)
  [ROW FORMAT DELIMITED
    [FIELDS TERMINATED BY '<field_delimiter>' [ESCAPED BY '<character_escape>']]
    [NULL DEFINED AS '<null_value>']]
USING '<unix_command_line>'
  [RESOURCES '<res_name>' [, '<res_name>'...]]
[AS (<col1> [<type>], <col2> [<type>], ...)]
  [ROW FORMAT DELIMITED
    [FIELDS TERMINATED BY '<field_delimiter>' [ESCAPED BY '<character_escape>']]
    [NULL DEFINED AS '<null_value>']]

The keywords map and reduce are aliases of SELECT TRANSFORM with identical semantics. Prefer SELECT TRANSFORM for clarity.

Parameters

| Parameter | Required | Description |
| --- | --- | --- |
| arg1, arg2, ... | Yes | Input columns or expressions. Values are implicitly cast to STRING, joined with \t, and written to the child process's standard input. |
| First ROW FORMAT | No | Format of data sent to the child process. Defaults: \t (field delimiter), \n (row delimiter), \N (NULL). |
| USING '<unix_command_line>' | Yes | Command used to start the child process. Creates a subprocess, not a shell session, so shell features such as pipes, loops, and I/O redirection are not available in the command string itself. To use them, pass a shell script as the command (for example, 'sh myscript.sh'). |
| RESOURCES '<res_name>' | No | Resource files the child process can access. Alternative: set odps.sql.session.resources globally (see Attach script files). |
| AS (<col1>, <col2>, ...) | No | Output column names and optional types. If omitted, output defaults to key STRING, value STRING, where key is everything before the first \t and value is the rest. Type all columns or none; you cannot type only some. |
| Second ROW FORMAT | No | Format of data read from the child process's standard output. Same defaults as the first ROW FORMAT. |
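
The "subprocess, not a shell session" distinction in USING can be illustrated locally with Python's subprocess module. This is only a local analogy (seq, tac, and sh are assumed to be on the PATH of a POSIX system; MaxCompute itself is not involved):

```python
import subprocess

# An exec-style launch, like USING: the whole string must name one
# executable, so shell syntax such as pipes is not interpreted.
try:
    subprocess.run(["seq 1 3 | tac"])
except FileNotFoundError:
    print("no executable named 'seq 1 3 | tac'")

# Wrapping the command in a shell restores pipes, loops, and redirection,
# the same way passing a shell script to USING would.
out = subprocess.run(["sh", "-c", "seq 1 3 | tac"],
                     capture_output=True, text=True)
print(out.stdout.splitlines())  # ['3', '2', '1']
```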

ROW FORMAT constraints

  • field_delimiter and character_escape are single characters. If you specify a multi-character string, only the first character is used.

  • ROW FORMAT with inputRecordReader, outputRecordReader, or SerDe requires Hive-compatible mode. Add set odps.sql.hive.compatible=true; before the SQL statement. These Hive-specific formats may reduce execution speed.

For the full Apache Hive ROW FORMAT syntax, see Hive LanguageManual Transform.

Data format behavior

Understanding how data moves between MaxCompute and the child process prevents silent errors.

Input to child process:

  • All column values are cast to STRING.

  • Columns are joined with \t and written to standard input, one row per line.

Output from child process:

  • Each line of standard output becomes one output row.

  • Fields are split on \t (or the delimiter you set in the second ROW FORMAT).

  • If the AS clause is omitted, output maps to key STRING, value STRING — the field before the first \t becomes key and all the following parts become value. This is equivalent to AS(key, value).

  • If output columns are not STRING, MaxCompute calls CAST implicitly to convert values. A runtime exception may occur if the conversion fails.
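
The rules above can be sketched as a local round trip, using cat as the child process (a simulation on a POSIX system; MaxCompute itself is not involved):

```python
import subprocess

rows = [(1, 4), (2, 5), (3, 6)]

# Input side: cast every value to STRING, join columns with \t,
# and write one row per line to the child's standard input.
stdin_data = "".join("\t".join(str(v) for v in row) + "\n" for row in rows)

# Child process: 'cat' simply echoes its input back.
child = subprocess.run(["cat"], input=stdin_data,
                       capture_output=True, text=True)

# Output side with no AS clause: split each line on the FIRST \t only,
# yielding the default (key STRING, value STRING) pair.
out = [line.split("\t", 1) for line in child.stdout.splitlines()]
print(out)  # [['1', '4'], ['2', '5'], ['3', '6']]
```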

Inline script examples

For simple transformations, write the script directly in the SQL statement. No resource upload is needed.

Shell

Generate 50 rows numbered 1 to 50:

SELECT TRANSFORM('for i in `seq 1 50`; do echo $i; done') USING 'sh' AS (data);

Output:

+------+
| data |
+------+
| 1    |
| 2    |
| ...  |
| 50   |
+------+

Python

Generate 50 rows numbered 1 to 50. The python command on MaxCompute clusters invokes Python 2, which is why xrange is used:

SELECT TRANSFORM('for i in xrange(1, 51):  print(i);') USING 'python' AS (data);

AWK

Extract the second column from a table:

-- Create and populate a test table
CREATE TABLE testdata (c1 BIGINT, c2 BIGINT);
INSERT INTO TABLE testdata VALUES (1, 4), (2, 5), (3, 6);

-- Extract c2 using AWK
SELECT TRANSFORM(*) USING "awk '//{print $2}'" AS (data) FROM testdata;

Output:

+------+
| data |
+------+
| 4    |
| 5    |
| 6    |
+------+

Perl

Echo all columns back (demonstrating the default key/value output when AS is omitted):

CREATE TABLE testdata (c1 BIGINT, c2 BIGINT);
INSERT INTO TABLE testdata VALUES (1, 4), (2, 5), (3, 6);

SELECT TRANSFORM(testdata.c1, testdata.c2)
USING "perl -e 'while($input = <STDIN>){print $input;}'"
FROM testdata;

Output (no AS clause — defaults to key, value):

+-----+-------+
| key | value |
+-----+-------+
| 1   | 4     |
| 2   | 5     |
| 3   | 6     |
+-----+-------+

Attach script files

For complex logic, upload a script file as a MaxCompute resource, then reference it in USING.

Using the RESOURCES clause

Specify the resource inline in the statement:

SELECT TRANSFORM(...) USING 'python myplus.py' RESOURCES 'myplus.py' AS (...) FROM ...;

Using a session-level setting

Set resources once for all SELECT TRANSFORM statements in the session:

set odps.sql.session.resources=myplus.py,bar.txt;

Multiple resource files are separated by commas. After this setting takes effect, all subsequent SELECT TRANSFORM statements in the session can access the listed resources without specifying RESOURCES in each statement.

Python script example

  1. Create myplus.py:

    #!/usr/bin/env python
    import sys
    
    line = sys.stdin.readline()
    while line:
        token = line.split('\t')
        if (token[0] == '\\N') or (token[1] == '\\N'):
            print('\\N')
        else:
            print(str(token[0]) + '\t' + str(token[1]))
        line = sys.stdin.readline()
  2. Upload the file as a MaxCompute resource:

    add py ./myplus.py -f;

    You can also upload the file through the DataWorks console. For details, see Create and use MaxCompute resources.
  3. Run the transform:

    CREATE TABLE testdata (c1 BIGINT, c2 BIGINT);
    INSERT INTO TABLE testdata VALUES (1, 4), (2, 5), (3, 6);
    
    SELECT TRANSFORM(testdata.c1, testdata.c2)
    USING 'python myplus.py' RESOURCES 'myplus.py'
    AS (result1, result2)
    FROM testdata;

    Output (the extra rows with NULL appear because readline keeps the trailing \n in token[1], so each print emits an additional blank line):

    +---------+---------+
    | result1 | result2 |
    +---------+---------+
    | 1       | 4       |
    |         | NULL    |
    | 2       | 5       |
    |         | NULL    |
    | 3       | 6       |
    |         | NULL    |
    +---------+---------+
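
The interleaved blank/NULL rows come from readline keeping the trailing \n: token[1] is "4\n", so each print emits an extra empty line, and the comparison of token[1] against "\N" never matches. A pure-Python rerun of the script's loop (local sketch, no MaxCompute involved) shows the raw stream the child writes:

```python
import io

def myplus(stdin):
    # Same loop as myplus.py, writing to a string instead of stdout.
    out = io.StringIO()
    line = stdin.readline()
    while line:
        token = line.split('\t')
        if (token[0] == '\\N') or (token[1] == '\\N'):
            print('\\N', file=out)
        else:
            print(str(token[0]) + '\t' + str(token[1]), file=out)
        line = stdin.readline()
    return out.getvalue()

raw = myplus(io.StringIO("1\t4\n2\t5\n3\t6\n"))
# token[1] still ends in \n, so every print produces a trailing blank line.
print(repr(raw))  # '1\t4\n\n2\t5\n\n3\t6\n\n'
# MaxCompute parses each blank line as a row with an empty key and a
# NULL value, which is exactly the 6-row result shown above.
```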

Java script example

  1. Write and compile a Java class, then export it as Sum.jar:

    package com.aliyun.odps.test;
    import java.util.Scanner;
    
    public class Sum {
        public static void main(String[] args) {
            Scanner sc = new Scanner(System.in);
            while (sc.hasNext()) {
                String s = sc.nextLine();
                String[] tokens = s.split("\t");
                if (tokens.length < 2) {
                    throw new RuntimeException("illegal input");
                }
                if (tokens[0].equals("\\N") || tokens[1].equals("\\N")) {
                    System.out.println("\\N");
                    continue; // parseLong would throw on "\N"
                }
                System.out.println(Long.parseLong(tokens[0]) + Long.parseLong(tokens[1]));
            }
        }
    }
  2. Upload the JAR file as a MaxCompute resource:

    add jar ./Sum.jar -f;
  3. Run the transform:

    CREATE TABLE testdata (c1 BIGINT, c2 BIGINT);
    INSERT INTO TABLE testdata VALUES (1, 4), (2, 5), (3, 6);
    
    SELECT TRANSFORM(testdata.c1, testdata.c2)
    USING 'java -cp Sum.jar com.aliyun.odps.test.Sum' RESOURCES 'Sum.jar'
    AS cnt
    FROM testdata;

    Output:

    +-----+
    | cnt |
    +-----+
    | 5   |
    | 7   |
    | 9   |
    +-----+
Java and Python each have a UDTF framework, but SELECT TRANSFORM is often faster to write: no dependencies and no class structure are required. The java and python commands are resolved from the JAVA_HOME and PYTHON_HOME environment variables on the cluster, respectively.

Chain multiple transforms

Pass the output of one transform as the input to the next by nesting SELECT TRANSFORM statements. Use DISTRIBUTE BY and SORT BY to control how rows are partitioned and ordered before each transform.

SELECT TRANSFORM(key, value) USING '<cmd2>' FROM
(
    SELECT TRANSFORM(*) USING '<cmd1>' FROM
    (
        SELECT * FROM testdata DISTRIBUTE BY c2 SORT BY c1
    ) t DISTRIBUTE BY key SORT BY value
) t2;

The map and reduce aliases make the map-reduce pattern explicit:

@a := SELECT * FROM data DISTRIBUTE BY col2 SORT BY col1;
@b := map * USING 'cmd1' DISTRIBUTE BY col1 SORT BY col2 FROM @a;
reduce * USING 'cmd2' FROM @b;

cmd1 and cmd2 are the commands used to start each child process.
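
The nesting above behaves like a Unix pipeline with a shuffle between stages. Locally, two child processes can be chained by feeding one's output to the other (a sketch assuming awk on the PATH; here cmd1 projects the second column and cmd2 doubles it, both stand-ins for real transform commands):

```python
import subprocess

rows = "1\t4\n2\t5\n3\t6\n"

# Stage 1 (cmd1): project the second column, as in the AWK example above.
s1 = subprocess.run(["awk", "{print $2}"], input=rows,
                    capture_output=True, text=True)

# Stage 2 (cmd2): a second transform consuming stage 1's output.
s2 = subprocess.run(["awk", "{print $1 * 2}"], input=s1.stdout,
                    capture_output=True, text=True)

print(s2.stdout.splitlines())  # ['8', '10', '12']
```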