All Products
Search
Document Center

MaxCompute:Collection

Last Updated:Mar 26, 2026

CollectionExpr represents a two-dimensional dataset in the PyODPS DataFrame API, equivalent to a MaxCompute table or a spreadsheet. DataFrame objects are also CollectionExpr objects. CollectionExpr supports column operations, data filtering, and data transformation.

Prerequisites

Before you begin, ensure that you have:

  • A table named pyodps_iris. For more information, see the "DataFrame data processing" section in Getting started

  • A DataFrame object. For more information, see the "Create a DataFrame object from a MaxCompute table" section in Create a DataFrame object

Retrieve column types

Use the dtypes attribute to get the types of all columns in a CollectionExpr. It returns a Schema type.

print(iris.dtypes)

Output:

odps.Schema {
  sepallength           float64
  sepalwidth            float64
  petallength           float64
  petalwidth            float64
  name                  string
}

Select, add, and delete columns

The following table shows which syntax to use based on your goal:

Goal Recommended syntax PyODPS version
Select specific columns expr[col1, col2] All versions
Exclude columns exclude(col1, col2) All versions
Add a computed column iris['new_col'] = expr 0.7.2+
Overwrite an existing column iris['col'] = new_expr 0.7.2+
Delete a column del iris['col'] 0.7.2+
Conditional column update iris[condition, 'col'] = expr 0.7.2+
Add a constant column iris['col'] = value 0.7.12+
Rename while selecting select(col, new_name=expr) All versions

Select columns

Use expr[columns] to select specific columns from a CollectionExpr.

print(iris['name', 'sepallength'].head(5))

Output:

          name  sepallength
0  Iris-setosa          4.9
1  Iris-setosa          4.7
2  Iris-setosa          4.6
3  Iris-setosa          5.0
4  Iris-setosa          5.4
To select a single column and return a Collection (not a Sequence), add a trailing comma or use double brackets: iris[iris.sepallength,] or iris[[iris.sepallength]]. Without these, a Sequence object is returned instead.

Delete columns

Using `exclude` (all PyODPS versions):

print(iris.exclude('sepallength', 'petallength')[:5].head(5))

Output:

   sepalwidth  petalwidth         name
0         3.0         0.2  Iris-setosa
1         3.2         0.2  Iris-setosa
2         3.1         0.2  Iris-setosa
3         3.6         0.2  Iris-setosa
4         3.9         0.4  Iris-setosa

Using `del` (PyODPS 0.7.2+):

del iris['sepallength']
del iris['petallength']
print(iris[:5].head(5))

Output:

   sepalwidth  petalwidth         name
0         3.0         0.2  Iris-setosa
1         3.2         0.2  Iris-setosa
2         3.1         0.2  Iris-setosa
3         3.6         0.2  Iris-setosa
4         3.9         0.4  Iris-setosa

Add columns

Using assignment syntax (PyODPS 0.7.2+, recommended):

iris['sepalwidthplus1'] = iris.sepalwidth + 1
print(iris.head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name  \
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.7         3.2          1.3         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa
3          5.0         3.6          1.4         0.2  Iris-setosa
4          5.4         3.9          1.7         0.4  Iris-setosa

   sepalwidthplus1
0              4.0
1              4.2
2              4.1
3              4.6
4              4.9

Using `expr[expr, new_sequence]` (all versions): Creates a new Collection with the additional column appended. If the new column has the same name as an existing column, rename it to avoid conflicts.

print(iris[iris, (iris.sepalwidth + 1).rename('sepalwidthplus1')].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name  \
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.7         3.2          1.3         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa
3          5.0         3.6          1.4         0.2  Iris-setosa
4          5.4         3.9          1.7         0.4  Iris-setosa

   sepalwidthplus1
0              4.0
1              4.2
2              4.1
3              4.6
4              4.9

Add and delete columns simultaneously

Overwrite an existing column (PyODPS 0.7.2+):

iris['sepalwidth'] = iris.sepalwidth * 2
print(iris.head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         6.0          1.4         0.2  Iris-setosa
1          4.7         6.4          1.3         0.2  Iris-setosa
2          4.6         6.2          1.5         0.2  Iris-setosa
3          5.0         7.2          1.4         0.2  Iris-setosa
4          5.4         7.8          1.7         0.4  Iris-setosa

Using `exclude` with a new column (all versions): Excludes the original column and appends the transformed version, so no renaming is needed.

print(iris[iris.exclude('sepalwidth'), iris.sepalwidth * 2].head(5))

Output:

   sepallength  petallength  petalwidth         name  sepalwidth
0          4.9          1.4         0.2  Iris-setosa         6.0
1          4.7          1.3         0.2  Iris-setosa         6.4
2          4.6          1.5         0.2  Iris-setosa         6.2
3          5.0          1.4         0.2  Iris-setosa         7.2
4          5.4          1.7         0.4  Iris-setosa         7.8

Using `select` (all versions): Similar to exclude with a new column, but lets you rename the result in one step using keyword arguments.

print(iris.select('name', sepalwidthminus1=iris.sepalwidth - 1).head(5))

Output:

          name  sepalwidthminus1
0  Iris-setosa               2.0
1  Iris-setosa               2.2
2  Iris-setosa               2.1
3  Iris-setosa               2.6
4  Iris-setosa               2.9

Using a lambda expression: Pass a lambda that takes the result of the previous operation as a parameter. PyODPS evaluates the lambda and substitutes valid columns from that Collection.

print(iris['name', 'petallength'][[lambda x: x.name]].head(5))

Output:

          name
0  Iris-setosa
1  Iris-setosa
2  Iris-setosa
3  Iris-setosa
4  Iris-setosa

Conditional column update (PyODPS 0.7.2+): Update a column only for rows that match a condition.

iris[iris.sepallength > 5.0, 'sepalwidth'] = iris.sepalwidth * 2
print(iris.head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.7         3.2          1.3         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa
3          5.0         3.6          1.4         0.2  Iris-setosa
4          5.4         7.8          1.7         0.4  Iris-setosa

Introduce constants and random numbers

Add a constant column

Simple syntax (PyODPS 0.7.12+, recommended): Assign a literal value directly to a new column.

iris['id'] = 1
print(iris.head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name  id
0          4.9         3.0          1.4         0.2  Iris-setosa   1
1          4.7         3.2          1.3         0.2  Iris-setosa   1
2          4.6         3.1          1.5         0.2  Iris-setosa   1
3          5.0         3.6          1.4         0.2  Iris-setosa   1
4          5.4         3.9          1.7         0.4  Iris-setosa   1
The simple syntax cannot infer the type of null values automatically. To add a null column, use NullScalar with an explicit type.
from odps.df import NullScalar
iris['null_col'] = NullScalar('float')
print(iris.head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name null_col
0          4.9         3.0          1.4         0.2  Iris-setosa     None
1          4.7         3.2          1.3         0.2  Iris-setosa     None
2          4.6         3.1          1.5         0.2  Iris-setosa     None
3          5.0         3.6          1.4         0.2  Iris-setosa     None
4          5.4         3.9          1.7         0.4  Iris-setosa     None

Using `Scalar` (all versions): Wrap a constant with Scalar and specify the column name manually.

from odps.df import Scalar
print(iris[iris, Scalar(1).rename('id')][:5].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name  id
0          4.9         3.0          1.4         0.2  Iris-setosa   1
1          4.7         3.2          1.3         0.2  Iris-setosa   1
2          4.6         3.1          1.5         0.2  Iris-setosa   1
3          5.0         3.6          1.4         0.2  Iris-setosa   1
4          5.4         3.9          1.7         0.4  Iris-setosa   1

To add a null column with an explicit type:

from odps.df import NullScalar
print(iris[iris, NullScalar('float').rename('fid')][:5].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name   fid
0          4.9         3.0          1.4         0.2  Iris-setosa  None
1          4.7         3.2          1.3         0.2  Iris-setosa  None
2          4.6         3.1          1.5         0.2  Iris-setosa  None
3          5.0         3.6          1.4         0.2  Iris-setosa  None
4          5.4         3.9          1.7         0.4  Iris-setosa  None

Add a random number column

Use RandomScalar to append a column of random FLOAT values in the range 0–1, with a unique value per row. The optional parameter is a random seed.

from odps.df import RandomScalar
iris[iris, RandomScalar().rename('rand_val')][:5]

Output:

   sepallength  sepalwidth  petallength  petalwidth         name  rand_val
0          4.9         3.0          1.4         0.2  Iris-setosa  0.000471
1          4.7         3.2          1.3         0.2  Iris-setosa  0.799520
2          4.6         3.1          1.5         0.2  Iris-setosa  0.834609
3          5.0         3.6          1.4         0.2  Iris-setosa  0.106921
4          5.4         3.9          1.7         0.4  Iris-setosa  0.763442

Filter data

Filter rows using standard operators, the filter method, the query method, or lambda expressions.

In standard filter expressions, use & for AND | for OR — Python's and and or keywords are not supported. The query method accepts both.

Filter by condition:

print(iris[iris.sepallength > 5].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          5.4         3.9          1.7         0.4  Iris-setosa
1          5.4         3.7          1.5         0.2  Iris-setosa
2          5.8         4.0          1.2         0.2  Iris-setosa
3          5.7         4.4          1.5         0.4  Iris-setosa
4          5.4         3.9          1.3         0.4  Iris-setosa

AND (`&`):

print(iris[(iris.sepallength < 5) & (iris['petallength'] > 1.5)].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth             name
0          4.8         3.4          1.6         0.2      Iris-setosa
1          4.8         3.4          1.9         0.2      Iris-setosa
2          4.7         3.2          1.6         0.2      Iris-setosa
3          4.8         3.1          1.6         0.2      Iris-setosa
4          4.9         2.4          3.3         1.0  Iris-versicolor

OR (`|`):

print(iris[(iris.sepalwidth < 2.5) | (iris.sepalwidth > 4)].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth             name
0          5.7         4.4          1.5         0.4      Iris-setosa
1          5.2         4.1          1.5         0.1      Iris-setosa
2          5.5         4.2          1.4         0.2      Iris-setosa
3          4.5         2.3          1.3         0.3      Iris-setosa
4          5.5         2.3          4.0         1.3  Iris-versicolor

NOT (`~`):

print(iris[~(iris.sepalwidth > 3)].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.4         2.9          1.4         0.2  Iris-setosa
2          4.8         3.0          1.4         0.1  Iris-setosa
3          4.3         3.0          1.1         0.1  Iris-setosa
4          5.0         3.0          1.6         0.2  Iris-setosa

`filter` method with multiple conditions: Passing multiple arguments to filter is equivalent to combining them with &.

print(iris.filter(iris.sepalwidth > 3.5, iris.sepalwidth < 4).head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          5.0         3.6          1.4         0.2  Iris-setosa
1          5.4         3.9          1.7         0.4  Iris-setosa
2          5.4         3.7          1.5         0.2  Iris-setosa
3          5.4         3.9          1.3         0.4  Iris-setosa
4          5.7         3.8          1.7         0.3  Iris-setosa

Lambda expression for chained operations:

print(iris[iris.sepalwidth > 3.8]['name', lambda x: x.sepallength + 1].head(5))

Output:

          name  sepallength
0  Iris-setosa          6.4
1  Iris-setosa          6.8
2  Iris-setosa          6.7
3  Iris-setosa          6.4
4  Iris-setosa          6.2

BOOLEAN column as a filter: When a Collection contains a BOOLEAN column, use that column directly as a filter condition.

# Check the schema
print(df.dtypes)
# odps.Schema {
#   a boolean
#   b int64
# }

# Filter using the boolean column
print(df[df.a])
#       a  b
# 0  True  1
# 1  True  3

The following examples show the difference between retrieving a single column and using it as a filter:

df[df.a, ]       # Retrieve a one-column Collection
df[[df.a]]       # Retrieve a one-column Collection
df.select(df.a)  # Retrieve a one-column Collection explicitly
df[df.a]         # Use column a (BOOLEAN) as a filter condition
df.a             # Retrieve a column from a Collection
df['a']          # Retrieve a column from a Collection

Filter with query

The query method accepts filter conditions as a string expression, similar to Pandas. Reference column names directly in the string; prefix local variables with @.

print(iris.query("(sepallength < 5) and (petallength > 1.5)").head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth             name
0          4.8         3.4          1.6         0.2      Iris-setosa
1          4.8         3.4          1.9         0.2      Iris-setosa
2          4.7         3.2          1.6         0.2      Iris-setosa
3          4.8         3.1          1.6         0.2      Iris-setosa
4          4.9         2.4          3.3         1.0  Iris-versicolor

Use @ to reference a local variable inside a query string:

var = 4
print(iris.query("(sepalwidth < 2.5) | (sepalwidth > @var)").head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth             name
0          5.7         4.4          1.5         0.4      Iris-setosa
1          5.2         4.1          1.5         0.1      Iris-setosa
2          5.5         4.2          1.4         0.2      Iris-setosa
3          4.5         2.3          1.3         0.3      Iris-setosa
4          5.5         2.3          4.0         1.3  Iris-versicolor

The query method supports the following syntax:

Syntax Description
name Column names (no @ prefix). Use @variable to reference a local Python variable.
operator +, -, *, /, //, %, **, ==, !=, <, <=, >, >=, in, not in
bool AND: & or and. OR: ` or or`.
attribute Object attributes
index, slice, subscript Slice operations

Convert a column into rows

Use explode to expand a LIST or MAP column into multiple rows. You can also use the apply method for multi-row output. You can explode one or more columns simultaneously. If an input row produces no output (for example, an empty list), the row is excluded by default. Set keep_nulls=True to retain such rows with null values.

For more information about how to use explode for multi-row output, see the "Collection-related operations" section in Column operations.

Sample data:

print(df)
#    id         a             b
# 0   1  [a1, b1]  [a2, b2, c2]
# 1   2      [c1]      [d2, e2]

Explode one column, keep another as-is:

print(df[df.id, df.a.explode(), df.b])

Output:

   id   a             b
0   1  a1  [a2, b2, c2]
1   1  b1  [a2, b2, c2]
2   2  c1      [d2, e2]

Explode two columns simultaneously:

print(df[df.id, df.a.explode(), df.b.explode()])

Output:

   id   a   b
0   1  a1  a2
1   1  a1  b2
2   1  a1  c2
3   1  b1  a2
4   1  b1  b2
5   1  b1  c2
6   2  c1  d2
7   2  c1  e2

`keep_nulls=True` — retain rows with empty input:

print(df)
#    id         a
# 0   1  [a1, b1]
# 1   2        []

# Without keep_nulls: row 2 is dropped
print(df[df.id, df.a.explode()])
#    id   a
# 0   1  a1
# 1   1  b1

# With keep_nulls=True: row 2 is retained with a null value
print(df[df.id, df.a.explode(keep_nulls=True)])
#    id     a
# 0   1    a1
# 1   1    b1
# 2   2  None

Output limits

Limit the number of rows returned using slice syntax or the limit method.

Slice syntax:

print(iris[:3].execute())

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.7         3.2          1.3         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa

`limit` method:

print(iris.limit(3).execute())

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.7         3.2          1.3         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa
In the MaxCompute SQL backend, slice operations do not support start or step — only limit is supported. Slice operations apply to Collection objects only, not Sequence objects.