The PyODPS DataFrame API supports a wide range of column operations, including null handling, conditional logic, math, strings, datetime, collections, and custom functions. Operations on a sequence apply element-wise to every value in the column.
Prerequisites
Before running the examples, initialize the datasets:
from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))
lens = DataFrame(o.get_table('pyodps_ml_100k_lens'))Null-related functions
Three built-in functions handle null values: isnull, notnull, and fillna.
isnull— returnsTrueif the field value is null.notnull— returnsTrueif the field value is not null.fillna— replaces null values with specified values.
>>> iris.sepallength.isnull().head(5)
sepallength
0 False
1 False
2 False
3 False
4 FalseLogic functions
ifelse
ifelse applies to Boolean fields. When the condition is True, it returns the first argument; otherwise, it returns the second.
>>> (iris.sepallength > 5).ifelse('gt5', 'lte5').rename('cmp5').head(5)
cmp5
0 gt5
1 lte5
2 lte5
3 lte5
4 lte5switch
switch handles multiple conditions, matching each value against a sequence of condition-result pairs and returning the corresponding result. Use the default parameter to specify the fallback value when no condition matches.
>>> iris.sepallength.switch(4.9, 'eq4.9', 5.0, 'eq5.0', default='noeq').rename('equalness').head(5)
equalness
0 noeq
1 eq4.9
2 noeq
3 noeq
4 eq5.0Import switch from odps.df to apply it as a standalone function across multiple columns:
>>> from odps.df import switch
>>> switch(iris.sepallength == 4.9, 'eq4.9', # condition 1: exact match 4.9
... iris.sepallength == 5.0, 'eq5.0', # condition 2: exact match 5.0
... default='noeq').rename('equalness').head(5)
equalness
0 noeq
1 eq4.9
2 noeq
3 noeq
4 eq5.0Conditional assignment (PyODPS V0.7.8 and later)
Modify column values based on conditions using assignment syntax:
>>> iris[iris.sepallength > 5, 'cmp5'] = 'gt5' # set 'gt5' where sepallength > 5
>>> iris[iris.sepallength <= 5, 'cmp5'] = 'lte5' # set 'lte5' where sepallength <= 5
>>> iris.head(5)
cmp5
0 gt5
1 lte5
2 lte5
3 lte5
4 lte5Mathematical operations
Numeric sequences support the standard arithmetic operators (+, -, *, /) and a set of math functions.
>>> (iris.sepallength * 10).log().head(5)
sepallength
0 3.931826
1 3.891820
2 3.850148
3 3.828641
4 3.912023>>> fields = [
... iris.sepallength,
... (iris.sepallength / 2).rename('sepallength divided by 2'),
... (iris.sepallength ** 2).rename('sepallength squared'),
... ]
>>> iris[fields].head(5)
sepallength sepallength divided by 2 sepallength squared
0 5.1 2.55 26.01
1 4.9 2.45 24.01
2 4.7 2.35 22.09
3 4.6 2.30 21.16
4 5.0 2.50 25.00The following table lists all supported math functions.
| Function | Description |
|---|---|
abs | Returns the absolute value. |
sqrt | Returns the square root. |
sin | Returns the sine. |
sinh | Returns the hyperbolic sine. |
cos | Returns the cosine. |
cosh | Returns the hyperbolic cosine. |
tan | Returns the tangent. |
tanh | Returns the hyperbolic tangent. |
arccos | Returns the arc cosine. |
arccosh | Returns the inverse hyperbolic cosine. |
arcsin | Returns the arc sine. |
arcsinh | Returns the inverse hyperbolic sine. |
arctan | Returns the arc tangent. |
arctanh | Returns the inverse hyperbolic tangent. |
exp | Returns e raised to the given power. |
expm1 | Returns e raised to the given power, minus 1. |
log | Returns the logarithm using the supplied base. |
log2 | Returns the base-2 logarithm. |
log10 | Returns the base-10 logarithm. |
log1p | Returns log(1 + x). |
radians | Converts the values in radians to degrees. |
degrees | Converts the values in degrees to radians. |
ceil | Returns the smallest integer greater than or equal to the given number. |
floor | Returns the largest integer less than or equal to the given number. |
trunc | Returns the number truncated to the specified decimal place. |
Comparison and range checks
Compare a sequence against another sequence or a scalar:
>>> (iris.sepallength < 5).head(5)
sepallength
0 False
1 True
2 True
3 True
4 FalseSequential comparisons such as 3 <= iris.sepallength <= 5 are not supported. Use between instead:
>>> iris.sepallength.between(3, 5).head(5) # inclusive by default
sepallength
0 False
1 True
2 True
3 True
4 TrueTo exclude endpoints, set inclusive=False:
>>> iris.sepallength.between(3, 5, inclusive=False).head(5)
sepallength
0 False
1 True
2 True
3 True
4 FalseString operations
The DataFrame API provides 35+ string functions for sequence and scalar objects. Functions are accessed directly on a string column — no separate namespace import is required. The following example shows upper and extract:
>>> fields = [
... iris.name.upper().rename('upper_name'),
... iris.name.extract('Iris(.*)', group=1),
... ]
>>> iris[fields].head(5)
upper_name name
0 IRIS-SETOSA -setosa
1 IRIS-SETOSA -setosa
2 IRIS-SETOSA -setosa
3 IRIS-SETOSA -setosa
4 IRIS-SETOSA -setosaThe following table describes all supported string functions.
| Function | Description |
|---|---|
capitalize | Converts the first character to uppercase and the rest to lowercase. |
contains | Returns whether the string contains a substring. The regex parameter is True by default, so the substring is treated as a regular expression. |
count | Returns the number of occurrences of the specified substring. |
endswith | Returns whether the string ends with the specified suffix. |
startswith | Returns whether the string starts with the specified prefix. |
extract | Extracts matches of a regular expression. Without the group parameter, returns substrings that match the pattern. With group, returns the specified capture group. |
find | Searches left-to-right and returns the index of the first occurrence of the substring. Returns -1 if not found. |
rfind | Searches right-to-left and returns the index of the first occurrence of the substring. Returns -1 if not found. |
replace | Replaces substrings matching a regular expression pattern. Specify n to limit the number of replacements. |
get | Returns the character at the specified position. |
len | Returns the length of the string. |
ljust | Pads the string on the right with fillchar (default: space) until it reaches width characters. |
rjust | Pads the string on the left with fillchar (default: space) until it reaches width characters. |
lower | Converts the string to lowercase. |
upper | Converts the string to uppercase. |
lstrip | Removes leading whitespace, including blank lines. |
rstrip | Removes trailing whitespace, including blank lines. |
strip | Removes leading and trailing whitespace, including blank lines. |
split | Splits the string at the specified delimiter and returns a LIST<STRING> value. |
pad | Pads the string with fillchar (default: space) at the specified side: left, right, or both. |
repeat | Repeats the string n times. |
slice | Performs a slice operation on the string. |
swapcase | Swaps the case of all characters: uppercase becomes lowercase and vice versa. |
title | Converts the string to title case, where each word starts with an uppercase character. Equivalent to str.title. |
zfill | Pads the string with 0 on the left until it reaches width characters. |
isalnum | Returns True if all characters are alphanumeric. Equivalent to str.isalnum. |
isalpha | Returns True if all characters are alphabetic. Equivalent to str.isalpha. |
isdigit | Returns True if all characters are digits. Equivalent to str.isdigit. |
isspace | Returns True if all characters are whitespace. Equivalent to str.isspace. |
islower | Returns True if all cased characters are lowercase. Equivalent to str.islower. |
isupper | Returns True if all cased characters are uppercase. Equivalent to str.isupper. |
istitle | Returns True if the string is in title case. Equivalent to str.istitle. |
isnumeric | Returns True if all characters are numeric. Equivalent to str.isnumeric. |
isdecimal | Returns True if all characters are decimal characters. Equivalent to str.isdecimal. |
todict | Splits the string into a DICT<STRING, STRING> value. Takes two parameters: the entry delimiter and the key-value delimiter. |
strptime | Parses a string into a datetime using the same format codes as the Python standard library. See Basic date and time types. |
Datetime operations
Datetime functions apply to sequence and scalar objects of DATETIME type.
The following example extracts year, month, day, and hour from a unix timestamp column:
>>> df = lens[[lens.unix_timestamp.astype('datetime').rename('dt')]]
>>> df[df.dt,
... df.dt.year.rename('year'),
... df.dt.month.rename('month'),
... df.dt.day.rename('day'),
... df.dt.hour.rename('hour')].head(5)
dt year month day hour
0 1998-04-08 11:02:00 1998 4 8 11
1 1998-04-08 10:57:55 1998 4 8 10
2 1998-04-08 10:45:26 1998 4 8 10
3 1998-04-08 10:25:52 1998 4 8 10
4 1998-04-08 10:44:19 1998 4 8 10The following table describes the supported datetime attributes.
| Attribute | Description |
|---|---|
year | Returns the year component of the datetime. |
month | Returns the month component (1-12). |
day | Returns the day of the month. |
hour | Returns the hour component (0-23). |
minute | Returns the minute component (0-59). |
second | Returns the second component (0-59). |
weekofyear | Returns the ISO week number of the year. Monday is the first day of the week. |
weekday | Returns a number representing the day of the week where the provided date falls. |
dayofweek | Returns a number representing the day of the week where the provided date falls. |
strftime | Formats the datetime as a string using the same format codes as the Python standard library. See Basic date and time types. Converts the given string representing a time to the specified format. The time format is the same as the time in the standard Python library. For more information about the time formats in Python, see Basic date and time types. Converts the given string representing a time to the specified format. The time format is the same as the time format in the standard Python library. For more information about the time formats in Python, see Basic date and time types. |
Datetime arithmetic
PyODPS supports date arithmetic. Use the datetime unit types from odps.df to add or subtract time intervals. Subtracting two datetime columns returns the difference in milliseconds as int64.
>>> from odps.df import day
>>> df
a b
0 2016-12-06 16:43:12.460001 2016-12-06 17:43:12.460018
1 2016-12-06 16:43:12.460012 2016-12-06 17:43:12.460021
2 2016-12-06 16:43:12.460015 2016-12-06 17:43:12.460022
>>> df.a - day(3) # subtract 3 days from column a
a
0 2016-12-03 16:43:12.460001
1 2016-12-03 16:43:12.460012
2 2016-12-03 16:43:12.460015
>>> (df.b - df.a).dtype # difference between two datetime columns
int64
>>> (df.b - df.a).rename('a') # result is in milliseconds
a
0 3600000
1 3600000
2 3600000The following table lists the available datetime unit types for arithmetic.
| Type | Description |
|---|---|
year | Year-level interval. |
month | Month-level interval. |
day | Day-level interval. |
hour | Hour-level interval. |
minute | Minute-level interval. |
second | Second-level interval. |
millisecond | Millisecond-level interval. |
Collection operations
PyODPS supports LIST and DICT collection types. Use subscripts to retrieve individual items, and len to get the number of items.
>>> df
id a b
0 1 [a1, b1] {'a2': 0, 'b2': 1, 'c2': 2}
1 2 [c1] {'d2': 3, 'e2': 4}
>>> df[df.id, df.a[0], df.b['b2']] # access by subscript
id a b
0 1 a1 1
1 2 c1 NaN
>>> df[df.id, df.a.len(), df.b.len()] # get collection size
id a b
0 1 2 3
1 2 1 2explode
Use explode when you need to expand a collection column into individual rows — one row per element. This is useful for downstream aggregations or joins that expect flat row-level data.
For LIST columns, explode returns one column by default. Set pos=True to also return the element's position index (similar to Python's enumerate):
>>> df.a.explode()
a
0 a1
1 b1
2 c1
>>> df.a.explode(pos=True) # include position index
a_pos a
0 0 a1
1 1 b1
2 0 c1
>>> df.a.explode(['pos', 'value'], pos=True) # specify column names
pos value
0 0 a1
1 1 b1
2 0 c1For DICT columns, explode returns two columns: keys and values.
>>> df.b.explode()
b_key b_value
0 a2 0
1 b2 1
2 c2 2
3 d2 3
4 e2 4
>>> df.b.explode(['key', 'value']) # specify column names
key value
0 a2 0
1 b2 1
2 c2 2
3 d2 3
4 e2 4Use explode with column selection to combine exploded columns with the original columns:
>>> df[df.id, df.a.explode()]
id a
0 1 a1
1 1 b1
2 2 c1
>>> df[df.id, df.a.explode(), df.b.explode()] # cross-product of both collections
id a b_key b_value
0 1 a1 a2 0
1 1 a1 b2 1
2 1 a1 c2 2
3 1 b1 a2 0
4 1 b1 b2 1
5 1 b1 c2 2
6 2 c1 d2 3
7 2 c1 e2 4LIST-specific methods
In addition to len and explode, LIST columns support the following methods.
| Method | Description |
|---|---|
contains(v) | Returns True if the list contains the specified element. |
sort | Sorts the list and returns a LIST value. |
DICT-specific methods
In addition to len and explode, DICT columns support the following methods.
| Method | Description |
|---|---|
keys | Returns all keys as a LIST value. |
values | Returns all values as a LIST value. |
Other operations
isin and notin
isin checks whether each element exists in a specified collection. notin checks the opposite.
>>> iris.sepallength.isin([4.9, 5.1]).rename('sepallength').head(5)
sepallength
0 True
1 True
2 False
3 False
4 Falsecut
cut divides sequence values into discrete segments (bins).
>>> iris.sepallength.cut(range(6), labels=['0-1', '1-2', '2-3', '3-4', '4-5']).rename('sepallength_cut').head(5)
sepallength_cut
0 None
1 4-5
2 4-5
3 4-5
4 4-5You can use the include_under and include_over operations to specify the maximum and minimum values, respectively.
>>> labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-']
>>> iris.sepallength.cut(range(6), labels=labels, include_over=True).rename('sepallength_cut').head(5)
sepallength_cut
0 5-
1 4-5
2 4-5
3 4-5
4 4-5Call built-in functions or UDFs in MaxCompute
Use the func function to call MaxCompute built-in functions or user-defined functions (UDFs) to create columns. The return type defaults to STRING; use the rtype parameter to override it.
from odps.df import func
iris[iris.name, func.rand(rtype='float').rename('rand')][:4]
iris[iris.name, func.rand(10, rtype='float').rename('rand')][:4]
# Call a UDF defined in MaxCompute. Specify the column name if it cannot be inferred automatically.
iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float').rename('new_col')]
# Call a UDF from another project using the project parameter.
iris[iris.name, func.your_udf(iris.sepalwidth, iris.sepallength, rtype='float', project='udf_project', name='new_col')]The Pandas backend does not support expressions that contain the func function.