MaxCompute Unleashed - Part 6: User Defined Type

By Haiqing

MaxCompute (formerly known as ODPS) is a leading distributed big data processing platform developed by Alibaba Cloud. It is widely utilized, especially within the Alibaba Group, and supports the core businesses of multiple business units (BUs). ODPS V2.0, continuously optimizing performance, aims to enhance the user experience and expression capabilities of SQL, as well as improve the productivity of ODPS developers.

Building upon the SQL engine of ODPS V2.0, MaxCompute simplifies SQL compilation and enhances the language's expressiveness. We present a series of articles titled Unleash the Power of MaxCompute to explore the capabilities of MaxCompute (ODPS V2.0).

The previous article explained how to incorporate scripts from various languages into MaxCompute. SELECT TRANSFORM's main advantage is the ability to execute scripts in different languages without creating functions or uploading resources. Even if resources must be written, MaxCompute imposes no format or dependency requirements. This article introduces another advancement that elevates this benefit: User Defined Type, or UDT.

Scenario 1

A task that can be easily accomplished in other languages—like a single method call from a built-in Java class—may lack a corresponding built-in function in MaxCompute. Writing a UDF for such a simple task feels unnecessarily complex and offers a poor experience.

Scenario 2

SELECT TRANSFORM allows me to directly embed scripts into SQL statements, which enhances code readability (as opposed to the "black box" nature of UDFs, scripts within SQL are immediately understandable) and maintainability (since there is no need to commit multiple code files, particularly when SQL and script files are in separate repositories). However, this does not work for languages like Java, which need to be compiled before execution. Is there a way for these languages to enjoy the same benefits?

Scenario 3

There's a need to invoke third-party libraries within SQL to perform certain functions. Ideally, this could be done directly in SQL without wrapping the functionality in a UDF layer.

UDTs offer viable solutions to these scenarios. The following sections will introduce how to use UDTs in detail.

Many examples in this article are demonstrated using MaxCompute Studio. For users who haven't installed MaxCompute Studio, please refer to the installation guide to import and test MaxCompute projects and create new projects.

Note: The provided examples are for reference and require the installation of MaxCompute Studio.

Feature Introduction

UDTs (User Defined Types) in MaxCompute allow you to directly reference classes or objects from third-party languages in SQL statements to retrieve data or invoke methods.

While other SQL engines also have the concept of UDTs, there are significant differences compared to MaxCompute. In many SQL engines, UDTs resemble the struct composite type in MaxCompute. Some languages provide features to call third-party libraries, such as the CREATE TYPE statement in Oracle databases. In contrast, UDTs in MaxCompute are more aligned with the concept of CREATE TYPE. A MaxCompute UDT not only includes data fields but can also contain methods. Additionally, MaxCompute simplifies the usage by allowing direct usage in SQL, without the need for special DDL syntax to define type mapping.

A simple example is as follows:

set odps.sql.type.system.odps2=true;    -- Open the new type, because the following operation will use Integer, that is, the INT type

SELECT java.lang.Integer.MAX_VALUE;

The output of the above example:

+-----------+
| max_value |
+-----------+
| 2147483647 |
+-----------+

Like the Java language, the java.lang package can be omitted. So the above example can be abbreviated as:

set odps.sql.type.system.odps2=true;

SELECT Integer.MAX_VALUE;

As you can see, the above example directly writes an expression similar to a Java expression in the select list, and this expression is indeed executed according to the semantics of Java. This example demonstrates the capabilities of MaxCompute UDTs.

You can use UDFs to implement all features provided by UDTs. For example, if you use a UDF to implement the preceding example, you need to perform the following operations.

First, define a UDF class.

package com.aliyun.odps.test;

public class IntegerMaxValue extends com.aliyun.odps.udf.UDF {
  public Integer evaluate() {
    return Integer.MAX_VALUE;
  } 
}

Then, compile the UDF as a jar package. Then, upload the jar package and create a function.

add jar odps-test.jar;
create function integer_max_value as 'com.aliyun.odps.test.IntegerMaxValue' using 'odps-test.jar';

Finally, it can be used in SQL.

select integer_max_value();

UDTs simplify the preceding procedures and allow developers to easily extend SQL features in other languages.

The preceding example shows the ability of Java to access static domains, while the ability of UDTs is far more powerful. For example,

-- Sample data
@table1 := select * from values ('100000000000000000000') as t(x);
@table2 := select * from values (100L) as t(y);

-- Code logic
@a := select new java.math.BigInteger(x) x from @table1;          -- Create an object by using the new method.
@b := select java.math.BigInteger.valueOf(y) y from @table2;      -- Call a static method.
select /*+mapjoin(b)*/ x.add(y).toString() from @a a join @b b;   -- Call an instance method.

The output of the preceding example is 100000000000000000100.

This example also shows a feature that is not easy to implement with UDFs: the results of subqueries allow columns of the UDT. For example, the x column retrieved by variable a is of the java.math.BigInteger type instead of a built-in type. You can pass the UDT data to another operator and then call its method. You can also use the UDT data in data shuffling. In the preceding example, the execution diagram in MaxCompute Studio is as follows:

It can be seen that there are three STAGEs in the figure: M1, R2, and J3. Users who are familiar with the MapReduce principle will know that multiple stages will appear because the existence of a join requires a data reshuffle. Typically, data processing at different stages is performed in different processes or different physical machines. Double-click the box representing M1 to display the following:

It can be seen that M1 only performs the new java.math.BigInteger(x) operation. Similarly, if you click the box representing J3, you can see that J3 performs the java.math.BigInteger.valueOf(y) operation and the x.add(y).toString() operation at different stages:

These operations are not only executed in stages but even on different processes and different physical machines. However, UDTs encapsulate this process to make it look almost the same as if it is executed in the same JVM.

UDTs also allow you to upload JAR packages and directly reference the packages. It is like the preceding UDF jar package. Use UDTs to:

set odps.sql.type.system.odps2=true;
set odps.sql.session.resources=odps-test.jar; -- Specify the jar to be referenced. These jars must be uploaded to the project in advance and need to be jar-type resources.

select new com.aliyun.odps.test.IntegerMaxValue().evaluate();

If you find it troublesome to write the full path of the package, you can use the flag to specify the default package, like the import of Java.

set odps.sql.type.system.odps2=true;
set odps.sql.session.resources=odps-test.jar;
set odps.sql.session.java.imports=com.aliyun.odps.test.*;  -- Specify the default package

select new IntegerMaxValue().evaluate();

Detailed Description

Currently, UDTs support only the Java language.
Provide some flags to improve usage efficiency:
- odps.sql.session.resources: Specify the resources to be referenced. Separate multiple resources with commas (,). set odps.sql.session.resources=foo.sh,bar.txt; Note that this flag is the same as the flag of the resources specified in SELECT TRANSFORM. So, this flag affects both the features of SELECT TRANSFORM and UDTs.
- odps.sql.session.java.imports: Specify the default package. You can specify multiple packages and separate them with commas (,). Similar to the Java import statement, you can provide the full class path, such as java.math.BigInteger, or you can use *. Currently, static import is not supported.
UDTs support the following operations:
- Create objects by using the new method.
- Create objects by using the new method, including creating an array with an initialization list, as new Integer[] { 1, 2, 3 }.
- Call methods, including static methods. You can create objects in the factory method pattern.
- Access fields, including static fields.
- Note:
  - Only public methods and common domain access are supported.
  - Identifiers in UDTs are case-sensitive, including package, class, method, and field names.
  - UDTs support data type conversions in SQL statements, such as cast(1 as java.lang.Object). Java type conversion is not supported, such as (Object)1.
  - Anonymous classes and lambda expressions are not supported. They may be supported in later versions.
  - Currently, function calls that do not return values are not supported (This is because that UDTs are used in expressions, while function calls that do not return values cannot be embedded in expressions. This problem may be solved in later versions).
The classes of the Java SDK are available by default. However, note that the JDK version used by the runtime is JDK V1.8. JDK features newer than this version may not be supported.
Note that all operators are the semantics of MaxCompute SQL, not UDTs. For example, the result of String.valueOf(1) + String.valueOf(2) is 3 (string is implicitly converted to double, and double is added), not '12' (string addition is the semantics of concatenate in Java). In addition to the string addition operation, another operation that is easy to confuse is the = operation. In SQL,= is not an assignment, but is used to determine equality. For Java objects, the equals method should be used to determine equality. The equality determined by the equal sign cannot guarantee its behavior (in UDT scenarios, the concept of the same object cannot be guaranteed. For specific reasons, see the following No.8).
There is a one-to-one mapping between built-in types and specific Java types. This mapping also works in UDTs:
- The data of a built-in type can directly call the methods of the Java type to which it is mapped, such as '123'.length() , 1L.hashCode().
- UDTs can be directly involved in the operations of built-in functions or UDFs, such as chr(Long.valueOf('100')). Long.valueOf returns data of the java.lang.Long type, while the built-in function chr accepts data of the BIGINT type.
- The primitive type of Java can be automatically converted to its BOXING type, and apply the above two rules.
- Note: ome built-in types require set odps.sql.type.system.odps2=true; in order to be used. Otherwise, an error will occur.
UDTs have relatively complete support for generic types. Take java.util.Arrays.asList(new java.math.BigInteger('1')) as an example. The compiler can know that the return value of the method is java.util.List<java.math.BigInteger> type based on the parameter type. Note that constructor functions need to specify type parameters. Otherwise, they use java.lang.Object, which is consistent with Java:

The result of the new java.util.ArrayList(java.util.Arrays.asList('1', '2')) is the java.util.ArrayList<Object> type; And the new java.util.ArrayList<String>(java.util.Arrays.asList('1', '2')) result is java.util.ArrayList<String> type.

UDTs do not have a clear definition of equal objects. This is caused by data reshuffling. As can be seen from the join example in the first part above, objects may be transmitted between different processes and different physical machines. During the transmission, two references of the same object may refer to different objects (for example, the object is shuffled to two machines first, and then shuffled back together next time).
- When using UDTs, you should avoid using the = operator to determine equality but use equals methods instead.
- Objects in the same row or column are correlated in some way. However, a correlation between objects in different rows or columns cannot be ensured.
UDTs cannot be used as shuffle keys, such as join, group by, distribute by, sort by, order by, and cluster by. This is not to say that UDTs cannot be used in these structures. UDTs can be used at any stage in the middle of expression, but cannot be used as the final output. For example, although it cannot group by new java.math.BigInteger('123'), it can group by new java.math.BigInteger('123').hashCode(). As the return value of hashCode is INT., the class type can be used as a built-in type INT (corresponding to the above rule of "mapping built-in types and specific Java types").

Note: This restriction is planned to be removed in future versions.

The following type conversion rules apply in UDTs:
- UDT objects can be implicitly converted to the objects of their base classes.
- UDT objects can be forcibly converted to the objects of their base classes or subclasses.
- Data type conversion for two objects without inheritance applies the native conversion rules. The conversion may change the data. For example, java.lang.Long type data can be forcibly converted to java.lang.Integer type data. It is a process of converting built-in BIGINT type data to INT type data. This process may cause data changes or even data precision changes.
Currently, UDT objects cannot be stored. This means that UDT objects cannot be inserted into tables (in fact, DDL does not support UDTs and cannot create such tables), except for the implicit type conversion to built-in types. At the same time, the final result of the PRINT cannot be of UDTs. For PRINT scenarios, all Java classes have the toString() method, so the java.lang.String type is legal. You can use this method to check UDT data during debugging.
- You can configure set odps.sql.udt.display.tostring=true; MaxCompute can automatically put java.util.Objects.toString(...) on the column wrap that uses all UDTs as the final output to facilitate debugging. This flag is typically used for debugging because it can be applied only to PRINT statements. It cannot be applied to INSERT statements.
- Built-in data types support the BINARY or STRING type. So you can customize the serialize process to store byte[] data on disks. The next time you read it out, you revert it. See the following examples.
- Some classes may have their own serialization and deserialization methods, such as protobuffer. At present, UDTs still support data storage on disks. You still need to call the serialization and deserialization method to change the data type to BINARY to achieve data storage on disks.
UDTs not only implement the features of the scalar function but also implement the features of aggregator and table functions together with the collect_list and explode(doc) built-in functions.

More Examples

Example of Using Java Arrays

set odps.sql.type.system.odps2=true;
set odps.sql.udt.display.tostring=true;

select
    new Integer[10],    -- Create an array with 10 elements.
    new Integer[] {c1, c2, c3},  --Initialize an array that contains three elements.
    new Integer[][] { new Integer[] {c1, c2}, new Integer[] {c3, c4} },  -- Create a multi-dimensional array.
    new Integer[] {c1, c2, c3} [2], -- Access the elements in the array using indexes.
    java.util.Arrays.asList(c1, c2, c3);    -- This creates a List<Integer>, which can also be used as an array< int >, so this is another way to create built-in array data.
from values (1,2,3,4) as t(c1, c2, c3, c4);

The Benefits for JSON Users

The runtime of UDTs comes with a gson dependency (V2.2.4). So users can directly use gson.

set odps.sql.type.system.odps2=true;
set odps.sql.session.java.imports=java.util.*,java,com.google.gson.*; -- Import multiple packages at the same time and separate them by commas.

@a := select new Gson() gson;   -- Build a GSON object

select 
  gson.toJson(new ArrayList<Integer>(Arrays.asList(1, 2, 3))),       -- Converts any object into a JSON string
  cast(gson.fromJson('["a","b","c"]', List.class) as array<string>)  -- Deserialize the JSON string, pay attention to the GSON interface, directly deserialize it, and you get the List<Object> type. So there is a strong conversion into List<String>, convenient for the following use
from @a;

Compared with get_json_object, the benefit of the above usage is more convenient to use. When the content of multiple parts of the JSON string needs to be extracted, the GSON string is first deserialized into formatted data, which is much more efficient.

In addition to GSON, the built-in dependencies of MaxCompute runtime include commons-logging (V1.1.1), commons-lang (V2.5), commons-io (V2.4), and protobuf-java (V2.4.1).

Complex Type Operations

Built-in types of array and map are mapped to java.util.List and java.util.Map, respectively. The result is:

Objects of types that implement java.util.List or java.util.Map interfaces in Java can participate in complex type operations of MaxCompute SQL.
You can directly call the List or Map interface for array and map data in MaxCompute.

set odps.sql.type.system.odps2=true;
set odps.sql.session.java.imports=java.util.*;

select
    size(new ArrayList<Integer>()),        -- Call built-in function size for ArrayList data
size
    array(1,2,3).size(),                   -- Call the List method on the built-in type array
    sort_array(new ArrayList<Integer>()),  -- Sort the data in the ArrayList
    al[1],                                 -- The Java List method does not support indexing. However, the array type supports indexing.
    Objects.toString(a)),        -- With this method, you can convert array type to string type data.
    array(1,2,3).subList(1, 2)             -- Get a subList
from (select new ArrayList<Integer>(array(1,2,3)) as al, array(1,2,3) as a) t;

It can also implement some special features, such as the distinct of the array.

select cast (new java.util.ArrayList<Long>(new java.util.HashSet<Long>(array(1L, 2L, 2L))) as array<bigint>);   -- Output [1,2]

The Implementation of Aggregation Operations

The principle of UDT aggregation is to use the COLLECT_SET or COLLECT_LIST function to convert data into a List, and then apply the scalar method of UDT to the list to obtain the aggregate value of this group of data.

For example, the following example is used to find the median of BigInteger (as the data is of the java.math.BigInteger type, the built-in median function cannot be directly used).

set odps.sql.session.java.imports=java.math.*;
@test_data := select * from values (1),(2),(3),(5) as t(value);

@a := select collect_list(new BigInteger(value)) values from @test_data;  -- Aggregate the data to a list
@b := select sort_array(values) as values, values.size() cnt from @a;  -- To obtain the median, first sort the data.
@c := select if(cnt % 2 == 1, new BigDecimal(values[cnt div 2]), new BigDecimal(values[cnt div 2 - 1].add(values[cnt div 2])).divide(new BigDecimal(2))) med from @b;

-- The final result.
select med.toString() from @c;

As collect_list collects all data into one piece first, there is no way to implement partial aggregate. Thus, the efficiency of this method is lower than that of the built-in aggregator or UDAFs. Therefore, we should try to use the built-in aggregator when it can be implemented. Aggregating all data in a group increases the risk of data skew.

However, if the logic of UDAFs is to collect all data together (for example, a feature similar to wm_concat), the preceding method may be more efficient than UDAFs (note that it is not a built-in aggregator).

The Implementation of Table-valued Functions

Table-valued functions allow you to specify multiple input rows and columns and can generate multiple output rows and columns. It can be implemented according to the following principles:

For more information about how to input multiple rows or columns, see the example of using aggregate functions.
To implement multi-row output, you can use the UDT method to output a Collection type data (List or Map), and then call the explode function to expand the Collections into multiple rows.
UDTs can contain multiple data fields. You can call different getter methods to obtain the content of each field to expand it into multiple columns.

The following example shows how to split a json string and output the splitting result in multiple columns:

@a := select '[{"a":"1","b":"2"},{"a":"1","b":"2"}]' str; -- Sample data
@b := select new com.google.gson.Gson().fromJson(str, java.util.List.class) l from @a; -- Deserialize the JSON string.
@c := select cast(e as java.util.Map<Object,Object>) m from @b lateral view explode(l) t as e;  -- Use explode to type multiple lines
@d := select m.get('a') as a, m.get('b') as b from @c; -- Output the splitting result in multiple columns

select a.toString() a, b.toString() b from @d; -- The final result output. Columns a and b in variable are of the Object type.

Read Resource Files

We know that resource files can be read by ExecutionContext objects in UDFs. Now UDTs can also be used by com.aliyun.odps.udt.UDTExecutionContext.get() method or such an ExecutionContext object.

The following example reads the resource file 1.txt into a STRING object and outputs:

set odps.sql.session.resources=1.txt;

select new String(com.aliyun.odps.udt.UDTExecutionContext.get().readResourceFile('1.txt')) text;

Persist UDT Objects

UDT objects do not support data storage on disks by default. However, there are methods to persist UDT objects. The basic idea is to serialize data into BINARY or STRING for persistence or expand UDT objects to persist key data that can be converted into built-in types.

The UDT is defined as follows:

public class Shape
{
    public List<Point> points;
    public Shape(List<Point> points)
    {
        this.points = points;
    }
}

public class Point
{
    public int x;
    public int y;

    public Point(int x, int y)
    {
        this.x = x;
        this.y = y;
    }
}

To expand an object into a built-in type:

@data := select key, shape from ...;

@exploded := select key, point from @data lateral view explode(shape.points) t as point;
@expanded := select key, point.x, point.y from @exploded;

insert into table points select * from @expanded;

Reconstruct when needed:

select key, new Shape(collect_list(new Point(x, y))) as shape from points group by key;

Or serialize the object into BINARY.

The biggest problem with flat expansion is in serialization and deserialization. Of course, it can be directly converted to BINARY. For example, transform the Shape class:

-- Transform the Shape class.
  public com.aliyun.odps.data.Binary serialize() {
    ByteBuffer buffer = ByteBuffer.allocate(points.size() * 8 + 4);
    buffer.putInt(points.size());
    for (Point point : points) {
      buffer.putInt(point.x);
      buffer.putInt(point.y);
    }
    return new com.aliyun.odps.data.Binary(buffer.array());
  }

  public static Shape deserialize(com.aliyun.odps.data.Binary bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes.data());
    int size = buffer.getInt();
    List<Point> points = new ArrayList<>(size);
    for (int i = 0; i < size; i++) {
      points.add(new Point(buffer.getInt(), buffer.getInt()));
    }
    return new Shape(points);
  }

-- Call serialize() when persistence is required.
select key, shape.serialize() data from ...

-- Call the deserialize method when you need to read data.
select key, Shape.deserialize(data) as Shape from ...

It may be more convenient to use the existing framework directly. For example, Shape is defined with ProtoBuffer

-- The definition of Shape
message Point
{
    required int32 x = 1;
    required int32 y = 2;
}

message Shape
{
    repeated Point points = 1;
}

Method of directly calling pb in SQL

select key, new com.aliyun.odps.data.Binary(shape.toByteArray()) from ...

The Support of MaxCompute Studio

This feature can be used together with MaxCompute Studio to maximize its value.

The intelligent prompts of MaxCompute Studio can greatly improve coding efficiency.

The type derivation process of MaxCompute Studio allows you to know what type an expression is.

The real-time syntax check of MaxCompute Studio can quickly locate problematic syntax issues.

Feature, Performance, and Security

When it comes to features, UDTs offer several advantages:

Easy to use. No need to define additional functions.
Full access to JDK features, expanding the capabilities of SQL.
Directly reference objects and classes from other languages within SQL statements.
Directly utilize libraries from other languages, promoting code reuse.
Ability to incorporate object-oriented design principles.

In terms of performance, the execution process of UDTs is very similar to that of UDFs, with comparable performance. Additional optimizations have been implemented for UDTs, resulting in slightly better performance in certain scenarios:

Deserialization is not required for objects in only one process. Deserialization is required only when the objects are transmitted among processes. This means that UDTs do not incur any serialization or deserialization overhead when no data reshuffling is performed, such as calling the join or aggregator function.
UDTs suffer no performance loss from reflection because the Runtime of UDTs is based on Codegen, rather than based on reflection.
In fact, multiple constant UDT operations can be merged and executed together in a FunctionCall. For example, values[x].add(values[y]).divide(java.math.BigInteger.valueOf(2)) is an operation that seems to have multiple UDT method calls, but in fact, there is only one call. UDTs focus on small-granularity data processing. This does not incur additional overhead for the API where multiple functions are called.

In terms of security control, UDTs are subject to the same sandbox policy as UDFs. Restricted operations can be performed by enabling sandbox isolation or applying to join a sandbox whitelist.

Summary

This article introduces the features of UDTs, which allow the direct inclusion of Java expressions in SQL and the referencing of JDK classes. This feature greatly extends the functionality of SQL.

There are still several areas for improvement in UDTs, as mentioned in the article:

Supporting function calls that do not return values or where the returned value is ignored, allowing operations based solely on the passed-in data (e.g., invoking the add method of a List interface that returns the modified List itself).
Supporting anonymous classes and lambda expressions.
Allowing the use of UDTs as shuffle keys.
Expanding language support beyond Java, such as Python.

Community

MaxCompute Unleashed - Part 6: User Defined Type

Feature Introduction

Detailed Description

More Examples

Example of Using Java Arrays

The Benefits for JSON Users

Complex Type Operations

The Implementation of Aggregation Operations

The Implementation of Table-valued Functions

Read Resource Files

Persist UDT Objects

The Support of MaxCompute Studio

Feature, Performance, and Security

Summary

Read previous post:

Read next post:

Alibaba Cloud MaxCompute

You may also like

Comments

Alibaba Cloud MaxCompute

Related Products

MaxCompute

Big Data Consulting for Data Technology Solution

Big Data Consulting Services for Retail Solution

Realtime Compute for Apache Flink