MaxCompute Unleashed - Part 2: Basic Data Types and Built-in Functions

By Zhenyu

MaxCompute (formerly known as ODPS) is a leading distributed big data processing platform developed by Alibaba Cloud. It is widely utilized, especially within the Alibaba Group, and supports the core businesses of multiple business units (BUs). ODPS V2.0, continuously optimizing performance, aims to enhance the user experience and expression capabilities of SQL, as well as improve the productivity of ODPS developers.

Building upon the SQL engine of ODPS V2.0, MaxCompute simplifies SQL compilation and enhances the language's expressiveness. We present a series of articles titled Unleash the Power of MaxCompute to explore the capabilities of MaxCompute (ODPS V2.0).

MaxCompute Unleashed - Part 1: Harnessing Compiler Errors and Warnings Effectively

In the previous article, I introduced the usability improvements of the MaxCompute compiler. In this article, I will introduce the new basic data types and built-in functions.

The original ODPS had only six basic data types: BIGINT, DOUBLE, DECIMAL, STRING, DATETIME, and BOOLEAN. These types are generally sufficient, but may not be enough in certain scenarios.

Scenario 1

A project needed to migrate an ETL system that was originally running on SQL SERVER to MaxCompute due to a significant increase in data volume. It was discovered that some tables used VARCHAR while others used INT. These types were also used in multiple SQL scripts in the system and were involved in operations. After migrating to ODPS, VARCHAR was replaced with STRING and INT was replaced with BIGINT. [1]

After the migration, it was found that the data did not match the original system. Was this due to the truncation of VARCHAR and the overflow of INT? Or were there other reasons? Faced with the existing system, there was no choice but to check the code, run the data, and analyze it step by step. What was thought to be a relatively easy project ended up taking a couple of weeks to resolve. . .

Scenario 2

My project required storing binary data in a table for a speech recognition project. If each small segment of collected audio was stored as a field, it would be convenient to process it with a UDF. However, ODPS does not have a BINARY type. Okay, I decided to save it as STRING. However, writing the UDF became quite troublesome. In order to save the data, the byte[] had to be encoded into a STRING and decoded when reading it. A significant amount of code had to be written, resulting in slower execution speed. . .

MaxCompute uses an SQL engine based on ODPS V2.0, which significantly expands the basic data types and provides supporting built-in functions to address the aforementioned problems.

The Expansion of Basic Types

This article uses MaxCompute Studio as an example. First, please install MaxCompute Studio, connect to a MaxCompute project, and create a new MaxCompute script file, as follows.

After running, create another file to insert data, as follows.

After running, you can find the newly created table in the Project Explorer of MaxCompute Studio. View the details of the table and preview the data, as shown in the following figure.

The results can be seen here.

• When you create a table, you must first specify the MaxCompute new type system. For compatibility reasons, you must enable this setting. It can also be specified by default in MaxCompute Studio, as shown in the following figure.

If you do not use MaxCompute Studio, you can specify set odps.sql.type.system.odps2=true; in the script. Studio actually uses this switch in the backend to control whether new types are enabled. When the odps.sql.type.system.odps2 is set to true, in addition to using the new type, it also controls some behavior changes in other aspects. It will be explained in the relevant section.

If you want to enable the feature in a MaxCompute project by default, contact your project administrator and configure the feature in the project template.

• The following table lists the basic data types supported by MaxCompute after the expansion. The newly added data types include TINYINT, SMALLINT, INT, FLOAT, VARCHAR, TIMESTAMP, and BINARY.

Type	Newly added or not	Constant definition	Description
TINYINT	Yes	1Y, -127Y	8-bit signed integer, ranging from -128 to 127.
SMALLINT	Yes	32767S, -100S	16-bit signed integer, ranging from -32,768 to 32,767.
INT	Yes	1000, -15645787 [1]	32-bit signed integer, ranging from -2^31 to 2^31 - 1
BIGINT	No	100000000000L, -1L	64-bit signed integer, ranging from -2^63 + 1 to 2^63 - 1
FLOAT	Yes	No	32-bit binary floating point type
DOUBLE	No	3.1415926 1E+7	64-bit binary floating point type
DECIMAL	No	3.5BD, 99999999999.9999999BD	The precise numeric type based on the decimal system. • precision: indicates the number of digits in a value. Valid values: 1 to 38. • scale: indicates the number of digits to the right of the decimal point in a value. Valid values: 0 to 18.
VARCHAR	Yes	None [2]	n is the length. Range from 1 to 65535.
STRING	No	"abc", 'bcd', "alibaba", and 'inc' [3]	The string type. The maximum length is 8 MB.
BINARY	Yes	No	A binary number. The maximum length is 8 MB.
DATETIME	No	DATETIME '2017-11-11 00:00:00'	The DATETIME type. Valid values: 0001-01-01 00:00:00.000 to 9999-12-31 23:59:59.999, accurate to the millisecond.
TIMESTAMP	Yes	TIMESTAMP '2017-11-11 00:00:00.123456789'	The TIMESTAMP type. Valid values: 0001-01-01 00:00:00.000000000 to 9999-12-31 23:59:59.999999999, accurate to the nanosecond. [4]
BOOLEAN	No	TRUE,FALSE	Valid values: TRUE or FALSE

The following table describes the new implicit conversion rules. [5]

	boolean	tinyint	smallint	int	bigint	float	double	decimal	string	varchar	timestamp	binary
boolean to	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE
tinyint to	FALSE	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	FALSE	FALSE
smallint to	FALSE	FALSE	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	FALSE	FALSE
int to	FALSE	FALSE	FALSE	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	FALSE	FALSE
bigint to	FALSE	FALSE	FALSE	FALSE	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	FALSE	FALSE
float to	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	TRUE	TRUE	TRUE	TRUE	FALSE	FALSE
double to	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	TRUE	TRUE	TRUE	FALSE	FALSE
decimal to	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	TRUE	TRUE	FALSE	FALSE
string to	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	TRUE	TRUE	TRUE	FALSE	FALSE
varchar to	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	TRUE	TRUE	TRUE	FALSE	FALSE
timestamp to	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	TRUE	TRUE	FALSE
binary to	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE

In addition, the definition of constants of the DECIMAL type and the DATETIME type is added. 100BD is DECIMAL with a value of 100, and datetime '2017-11-11 00:00:00 'is a constant of the DATETIME type. The convenience of constant definition is that it can be directly used in the values clause and the values table, which will be described separately later.

The Expansion of Built-in Functions

For any programming language, including SQL, no matter how powerful the language itself is, if there are no rich functions supported by the latter class library, it will still be very inconvenient when it is applied. MaxCompute with new data types greatly enriches the built-in functions, as follows.

• Mathematical functions

log2, log10, bin, hex, unhex, degrees, radians, sign, e, pi, factorial, cbrt, shiftleft, shiftright, shiftrightunsigned

• Date functions

unix_timestamp, year, quarter, month, day, dayofmonth, hour, minute, second, millisecond, nanosecond, from_utc_timestamp, current_timestamp, add_months, last_day, next_day, months_between

• String functions

concat_ws, lpad, rpad, replace, soundex, substring_index, base64, unbase64

• Aggregate functions

corr

Most of these functions are compatible with the built-in functions of Hive. For more information, see The document of Hive. Different from Hive, these functions provided by MaxCompute are efficient versions implemented by using local code.

The new TIMESTAMP type supports nanosecond-level precision. To be compatible with it, MaxCompute adds specific millisecond and nanosecond functions to retrieve milliseconds of TIMESTAMP and DATETIME and nanoseconds of TIMESTAMP.

As mentioned in the previous article in this series, MaxCompute supports new coercion writing methods. For example, if you want to force a BIGINT variable to be converted to STRING, you can directly write string(a_bigint), which is equivalent to cast(a_bigint as string). Which form to use depends entirely on your preference.

It should be noted that all functions used to support new types, such as current_timestamp, also need set odps.sql.type.system.odps2=true;. Otherwise, compilation errors will be reported.

The Expansion of Partition Types

The support for partition types has also been expanded. Currently, partition types support TINYINT, SMALLINT, INT, BIGINT, VARCHAR, and STRING [6].

In addition, when the original ODPS is used for dynamic partitioning, if the type of partition column is not strictly consistent with the type of the column in the relevant SELECT list, an error will be reported. MaxCompute supports the implicit conversions of data types. For example,

set odps.sql.type.system.odps2=true;
create table parttable(a int, b double) partitioned by (p string);
insert into parttable partition(p) (p, a) select key, value, current_timestmap() from src;
select * from parttable;

Return

a	b	p
0	NULL	2017-01-23 22:30:47.130406621
0	NULL	2017-01-23 22:30:47.130406621

You can see that the value of the partition column p is implicitly converted from the TIMESTAMP type.

Use a UDF

Currently, Java UDFs of MaxCompute V2.0 support new types. Python UDFs will be implemented as soon as possible. Java UDFs use the new type as follows.

UDAFs and UDTFs use the @Resolve annotation to obtain signatures. MaxCompute V2.0 supports using new types in annotations, such as @Resolve("smallint->varchar(10)")
The UDF obtains the signature through the reflection analysis of evaluate. In this case, the built-in type of max compute and the JAVA type meet the one-to-one mapping relationship.

max compute type	java type
tinyint	java.lang.Byte
smallint	java.lang.Short
int	java.lang.Integer
bigint	java.lang.Long
float	java.lang.Float
double	java.lang.Double
decimal	java.math.BigDecimal
boolean	java.lang.Boolean
string	java.lang.String
varchar	com.aliyun.odps.data.Varchar
binary	com.aliyun.odps.data.Binary
datetime	java.util.Date
timestamp	java.sql.Timestamp
array	java.util.List
map	java.util.Map
struct	com.aliyun.odps.data.Struct

It should be noted that the java type corresponding to the ARRAY type is List.

Summary

MaxCompute significantly enhances basic data types and built-in functions to cater to diverse application scenarios, yet relying solely on basic types can still be cumbersome in complex situations. Please look forward to the next article.

Annotations

● [1]:

If an INT constant exceeds the value range of INT, it will be converted into BIGINT. If it exceeds the value range of BIGINT, it will be converted into DOUBLE.
In the original ODPS, due to historical reasons, all INT types in SQL scripts are converted to BIGINT, for example,

create table a_bigint_table(a int); -- INT here is actually treated as BIGINT.
select cast(id as int) from mytable; -- INT here is actually treated as BIGINT

To be compatible with the original ODPS mode, MaxCompute retains this conversion even if the odps.sql.type.system.odps2 is not set to true. But a warning is reported, indicating that INT is treated as BIGINT. If the case occurs in your script, we recommend that you rewrite all your scripts to BIGINT to avoid confusion.

● [2]:

The VARCHAR constant can be represented by implicit conversion of the STRING constant.

● [3]:

The STRING constant supports the connection. For example, 'abc' 'xyz' will be parsed as 'abcxyz'. Different parts can be written on different lines.

● [4]:

Due to the limitations of the underlying system, the current call current_timestamp does not reach nanosecond precision. For example,

meta_dev>set odps.sql.type.system.odps2=true;select nanosecond(current_timestamp());

The output is similar to

+------+
| _c0  |
+------+
| 877000000 |
+------+

The TIMASTAMP constant and external data imports can support nanosecond precision.

● [5]:

In the original ODPS, due to historical reasons, DOUBLE can be implicitly converted to BIGINT. This conversion may cause data loss, which is not allowed by general database systems. To be compatible with the original ODPS mode, MaxCompute still allows this conversion even if the odps.sql.type.system.odps2 is not set to true. But a warning is reported. If the odps.sql.type.system.odps2 is set to true, this implicit type conversion is not allowed.

● [6]:

In the original ODPS, due to historical reasons, although the partition type can be specified as BIGINT, all other cases are processed as STRING except that the schema of the table indicates that it is BIGINT. For example,

create table parttest (a bigint) partitioned by (pt bigint);
insert into parttest partition(pt) select 1, 2 from dual;
insert into parttest partition(pt) select 1, 10 from dual;
select * from parttest where pt >= 2;

The returned result has only one line because 10 is compared by string and 2. It can not go back. To be compatible with the original ODPS mode, MaxCompute still treats it in this way even if the odps.sql.type.system.odps2 is not set to true. If odps.sql.type.system.odps2 is set to true, partitions of the BIGINT type are processed in strict accordance with the BIGINT type.

Community

MaxCompute Unleashed - Part 2: Basic Data Types and Built-in Functions

The Expansion of Basic Types

The Expansion of Built-in Functions

The Expansion of Partition Types

Use a UDF

Summary

Annotations

Read previous post:

Read next post:

Alibaba Cloud MaxCompute

You may also like

Comments

Alibaba Cloud MaxCompute

Related Products

MaxCompute

Big Data Consulting for Data Technology Solution

Big Data Consulting Services for Retail Solution

Realtime Compute for Apache Flink