All Products
Search
Document Center

MaxCompute:Column operations

Last Updated:Mar 26, 2026

The PyODPS DataFrame API supports a wide range of column operations, including null handling, conditional logic, math, strings, datetime, collections, and custom functions. Operations on a sequence apply element-wise to every value in the column.

Prerequisites

Before running the examples, initialize the datasets:

from odps.df import DataFrame

iris = DataFrame(o.get_table('pyodps_iris'))
lens = DataFrame(o.get_table('pyodps_ml_100k_lens'))

Null-related functions

Three built-in functions handle null values: isnull, notnull, and fillna.

  • isnull — returns True if the field value is null.

  • notnull — returns True if the field value is not null.

  • fillna — replaces null values with specified values.

>>> iris.sepallength.isnull().head(5)
   sepallength
0        False
1        False
2        False
3        False
4        False

Logic functions

ifelse

ifelse applies to Boolean fields. When the condition is True, it returns the first argument; otherwise, it returns the second.

>>> (iris.sepallength > 5).ifelse('gt5', 'lte5').rename('cmp5').head(5)
   cmp5
0   gt5
1  lte5
2  lte5
3  lte5
4  lte5

switch

switch handles multiple conditions, matching each value against a sequence of condition-result pairs and returning the corresponding result. Use the default parameter to specify the fallback value when no condition matches.

>>> iris.sepallength.switch(4.9, 'eq4.9', 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
   equalness
0       noeq
1      eq4.9
2       noeq
3       noeq
4      eq5.0

Import switch from odps.df to apply it as a standalone function across multiple columns:

>>> from odps.df import switch
>>> switch(iris.sepallength == 4.9, 'eq4.9',   # condition 1: exact match 4.9
...        iris.sepallength == 5.0, 'eq5.0',   # condition 2: exact match 5.0
...        default='noeq').rename('equalness').head(5)
   equalness
0       noeq
1      eq4.9
2       noeq
3       noeq
4      eq5.0

Conditional assignment (PyODPS V0.7.8 and later)

Modify column values based on conditions using assignment syntax:

>>> iris[iris.sepallength > 5, 'cmp5'] = 'gt5'    # set 'gt5' where sepallength > 5
>>> iris[iris.sepallength <= 5, 'cmp5'] = 'lte5'  # set 'lte5' where sepallength <= 5
>>> iris.head(5)
   cmp5
0   gt5
1  lte5
2  lte5
3  lte5
4  lte5

Mathematical operations

Numeric sequences support the standard arithmetic operators (+, -, *, /) and a set of math functions.

>>> (iris.sepallength * 10).log().head(5)
   sepallength
0     3.931826
1     3.891820
2     3.850148
3     3.828641
4     3.912023
>>> fields = [
...     iris.sepallength,
...     (iris.sepallength / 2).rename('sepallength divided by 2'),
...     (iris.sepallength ** 2).rename('sepallength squared'),
... ]
>>> iris[fields].head(5)
   sepallength  sepallength divided by 2  sepallength squared
0          5.1                      2.55                26.01
1          4.9                      2.45                24.01
2          4.7                      2.35                22.09
3          4.6                      2.30                21.16
4          5.0                      2.50                25.00

The following table lists all supported math functions.

FunctionDescription
absReturns the absolute value.
sqrtReturns the square root.
sinReturns the sine.
sinhReturns the hyperbolic sine.
cosReturns the cosine.
coshReturns the hyperbolic cosine.
tanReturns the tangent.
tanhReturns the hyperbolic tangent.
arccosReturns the arc cosine.
arccoshReturns the inverse hyperbolic cosine.
arcsinReturns the arc sine.
arcsinhReturns the inverse hyperbolic sine.
arctanReturns the arc tangent.
arctanhReturns the inverse hyperbolic tangent.
expReturns e raised to the given power.
expm1Returns e raised to the given power, minus 1.
logReturns the logarithm using the supplied base.
log2Returns the base-2 logarithm.
log10Returns the base-10 logarithm.
log1pReturns log(1 + x).
radiansConverts the values in radians to degrees.
degreesConverts the values in degrees to radians.
ceilReturns the smallest integer greater than or equal to the given number.
floorReturns the largest integer less than or equal to the given number.
truncReturns the number truncated to the specified decimal place.

Comparison and range checks

Compare a sequence against another sequence or a scalar:

>>> (iris.sepallength < 5).head(5)
   sepallength
0        False
1         True
2         True
3         True
4        False

Sequential comparisons such as 3 <= iris.sepallength <= 5 are not supported. Use between instead:

>>> iris.sepallength.between(3, 5).head(5)   # inclusive by default
   sepallength
0        False
1         True
2         True
3         True
4         True

To exclude endpoints, set inclusive=False:

>>> iris.sepallength.between(3, 5, inclusive=False).head(5)
   sepallength
0        False
1         True
2         True
3         True
4        False

String operations

The DataFrame API provides 35+ string functions for sequence and scalar objects. Functions are accessed directly on a string column — no separate namespace import is required. The following example shows upper and extract:

>>> fields = [
...     iris.name.upper().rename('upper_name'),
...     iris.name.extract('Iris(.*)', group=1),
... ]
>>> iris[fields].head(5)
    upper_name     name
0  IRIS-SETOSA  -setosa
1  IRIS-SETOSA  -setosa
2  IRIS-SETOSA  -setosa
3  IRIS-SETOSA  -setosa
4  IRIS-SETOSA  -setosa

The following table describes all supported string functions.

FunctionDescription
capitalizeConverts the first character to uppercase and the rest to lowercase.
containsReturns whether the string contains a substring. The regex parameter is True by default, so the substring is treated as a regular expression.
countReturns the number of occurrences of the specified substring.
endswithReturns whether the string ends with the specified suffix.
startswithReturns whether the string starts with the specified prefix.
extractExtracts matches of a regular expression. Without the group parameter, returns substrings that match the pattern. With group, returns the specified capture group.
findSearches left-to-right and returns the index of the first occurrence of the substring. Returns -1 if not found.
rfindSearches right-to-left and returns the index of the first occurrence of the substring. Returns -1 if not found.
replaceReplaces substrings matching a regular expression pattern. Specify n to limit the number of replacements.
getReturns the character at the specified position.
lenReturns the length of the string.
ljustPads the string on the right with fillchar (default: space) until it reaches width characters.
rjustPads the string on the left with fillchar (default: space) until it reaches width characters.
lowerConverts the string to lowercase.
upperConverts the string to uppercase.
lstripRemoves leading whitespace, including blank lines.
rstripRemoves trailing whitespace, including blank lines.
stripRemoves leading and trailing whitespace, including blank lines.
splitSplits the string at the specified delimiter and returns a LIST<STRING> value.
padPads the string with fillchar (default: space) at the specified side: left, right, or both.
repeatRepeats the string n times.
slicePerforms a slice operation on the string.
swapcaseSwaps the case of all characters: uppercase becomes lowercase and vice versa.
titleConverts the string to title case, where each word starts with an uppercase character. Equivalent to str.title.
zfillPads the string with 0 on the left until it reaches width characters.
isalnumReturns True if all characters are alphanumeric. Equivalent to str.isalnum.
isalphaReturns True if all characters are alphabetic. Equivalent to str.isalpha.
isdigitReturns True if all characters are digits. Equivalent to str.isdigit.
isspaceReturns True if all characters are whitespace. Equivalent to str.isspace.
islowerReturns True if all cased characters are lowercase. Equivalent to str.islower.
isupperReturns True if all cased characters are uppercase. Equivalent to str.isupper.
istitleReturns True if the string is in title case. Equivalent to str.istitle.
isnumericReturns True if all characters are numeric. Equivalent to str.isnumeric.
isdecimalReturns True if all characters are decimal characters. Equivalent to str.isdecimal.
todictSplits the string into a DICT<STRING, STRING> value. Takes two parameters: the entry delimiter and the key-value delimiter.
strptimeParses a string into a datetime using the same format codes as the Python standard library. See Basic date and time types.

Datetime operations

Datetime functions apply to sequence and scalar objects of DATETIME type.

The following example extracts year, month, day, and hour from a unix timestamp column:

>>> df = lens[[lens.unix_timestamp.astype('datetime').rename('dt')]]
>>> df[df.dt,
...    df.dt.year.rename('year'),
...    df.dt.month.rename('month'),
...    df.dt.day.rename('day'),
...    df.dt.hour.rename('hour')].head(5)
                    dt  year  month  day  hour
0  1998-04-08 11:02:00  1998      4    8    11
1  1998-04-08 10:57:55  1998      4    8    10
2  1998-04-08 10:45:26  1998      4    8    10
3  1998-04-08 10:25:52  1998      4    8    10
4  1998-04-08 10:44:19  1998      4    8    10

The following table describes the supported datetime attributes.

AttributeDescription
yearReturns the year component of the datetime.
monthReturns the month component (1-12).
dayReturns the day of the month.
hourReturns the hour component (0-23).
minuteReturns the minute component (0-59).
secondReturns the second component (0-59).
weekofyearReturns the ISO week number of the year. Monday is the first day of the week.
weekdayReturns a number representing the day of the week where the provided date falls.
dayofweekReturns a number representing the day of the week where the provided date falls.
strftimeFormats the datetime as a string using the same format codes as the Python standard library. See Basic date and time types.

Converts the given string representing a time to the specified format. The time format is the same as the time in the standard Python library. For more information about the time formats in Python, see Basic date and time types.

Converts the given string representing a time to the specified format. The time format is the same as the time format in the standard Python library. For more information about the time formats in Python, see Basic date and time types.

Datetime arithmetic

PyODPS supports date arithmetic. Use the datetime unit types from odps.df to add or subtract time intervals. Subtracting two datetime columns returns the difference in milliseconds as int64.

>>> from odps.df import day
>>> df
                           a                          b
0 2016-12-06 16:43:12.460001 2016-12-06 17:43:12.460018
1 2016-12-06 16:43:12.460012 2016-12-06 17:43:12.460021
2 2016-12-06 16:43:12.460015 2016-12-06 17:43:12.460022

>>> df.a - day(3)                          # subtract 3 days from column a
                           a
0 2016-12-03 16:43:12.460001
1 2016-12-03 16:43:12.460012
2 2016-12-03 16:43:12.460015

>>> (df.b - df.a).dtype                    # difference between two datetime columns
int64
>>> (df.b - df.a).rename('a')              # result is in milliseconds
         a
0  3600000
1  3600000
2  3600000

The following table lists the available datetime unit types for arithmetic.

TypeDescription
yearYear-level interval.
monthMonth-level interval.
dayDay-level interval.
hourHour-level interval.
minuteMinute-level interval.
secondSecond-level interval.
millisecondMillisecond-level interval.

Collection operations

PyODPS supports LIST and DICT collection types. Use subscripts to retrieve individual items, and len to get the number of items.

>>> df
   id         a                            b
0   1  [a1, b1]  {'a2': 0, 'b2': 1, 'c2': 2}
1   2      [c1]           {'d2': 3, 'e2': 4}

>>> df[df.id, df.a[0], df.b['b2']]        # access by subscript
   id   a    b
0   1  a1    1
1   2  c1  NaN

>>> df[df.id, df.a.len(), df.b.len()]     # get collection size
   id  a  b
0   1  2  3
1   2  1  2

explode

Use explode when you need to expand a collection column into individual rows — one row per element. This is useful for downstream aggregations or joins that expect flat row-level data.

For LIST columns, explode returns one column by default. Set pos=True to also return the element's position index (similar to Python's enumerate):

>>> df.a.explode()
    a
0  a1
1  b1
2  c1

>>> df.a.explode(pos=True)                # include position index
   a_pos   a
0      0  a1
1      1  b1
2      0  c1

>>> df.a.explode(['pos', 'value'], pos=True)   # specify column names
   pos value
0    0    a1
1    1    b1
2    0    c1

For DICT columns, explode returns two columns: keys and values.

>>> df.b.explode()
  b_key  b_value
0    a2        0
1    b2        1
2    c2        2
3    d2        3
4    e2        4

>>> df.b.explode(['key', 'value'])         # specify column names
  key  value
0  a2      0
1  b2      1
2  c2      2
3  d2      3
4  e2      4

Use explode with column selection to combine exploded columns with the original columns:

>>> df[df.id, df.a.explode()]
   id   a
0   1  a1
1   1  b1
2   2  c1

>>> df[df.id, df.a.explode(), df.b.explode()]   # cross-product of both collections
   id   a b_key  b_value
0   1  a1    a2        0
1   1  a1    b2        1
2   1  a1    c2        2
3   1  b1    a2        0
4   1  b1    b2        1
5   1  b1    c2        2
6   2  c1    d2        3
7   2  c1    e2        4

LIST-specific methods

In addition to len and explode, LIST columns support the following methods.

MethodDescription
contains(v)Returns True if the list contains the specified element.
sortSorts the list and returns a LIST value.

DICT-specific methods

In addition to len and explode, DICT columns support the following methods.

MethodDescription
keysReturns all keys as a LIST value.
valuesReturns all values as a LIST value.

Other operations

isin and notin

isin checks whether each element exists in a specified collection. notin checks the opposite.

>>> iris.sepallength.isin([4.9, 5.1]).rename('sepallength').head(5)
   sepallength
0         True
1         True
2        False
3        False
4        False

cut

cut divides sequence values into discrete segments (bins).

>>> iris.sepallength.cut(range(6), labels=['0-1', '1-2', '2-3', '3-4', '4-5']).rename('sepallength_cut').head(5)
   sepallength_cut
0             None
1              4-5
2              4-5
3              4-5
4              4-5

You can use the include_under and include_over operations to specify the maximum and minimum values, respectively.

>>> labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-']
>>> iris.sepallength.cut(range(6), labels=labels, include_over=True).rename('sepallength_cut').head(5)
   sepallength_cut
0               5-
1              4-5
2              4-5
3              4-5
4              4-5

Call built-in functions or UDFs in MaxCompute

Use the func function to call MaxCompute built-in functions or user-defined functions (UDFs) to create columns. The return type defaults to STRING; use the rtype parameter to override it.

from odps.df import func

iris[iris.name, func.rand(rtype='float').rename('rand')][:4]
iris[iris.name, func.rand(10, rtype='float').rename('rand')][:4]

# Call a UDF defined in MaxCompute. Specify the column name if it cannot be inferred automatically.
iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float').rename('new_col')]

# Call a UDF from another project using the project parameter.
iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float', project='udf_project', name='new_col')]

The Pandas backend does not support expressions that contain the func function.