Data manipulation operations on a collection - MaxCompute

CollectionExpr represents a two-dimensional dataset in the PyODPS DataFrame API, equivalent to a MaxCompute table or a spreadsheet. DataFrame objects are also CollectionExpr objects. CollectionExpr supports column operations, data filtering, and data transformation.

Prerequisites

Before you begin, ensure that you have:

A table named pyodps_iris. For more information, see the "DataFrame data processing" section in Getting started
A DataFrame object. For more information, see the "Create a DataFrame object from a MaxCompute table" section in Create a DataFrame object

Retrieve column types

Use the dtypes attribute to get the types of all columns in a CollectionExpr. It returns a Schema type.

print(iris.dtypes)

Output:

odps.Schema {
  sepallength           float64
  sepalwidth            float64
  petallength           float64
  petalwidth            float64
  name                  string
}

Select, add, and delete columns

The following table shows which syntax to use based on your goal:

Goal	Recommended syntax	PyODPS version
Select specific columns	`expr[col1, col2]`	All versions
Exclude columns	`exclude(col1, col2)`	All versions
Add a computed column	`iris['new_col'] = expr`	0.7.2+
Overwrite an existing column	`iris['col'] = new_expr`	0.7.2+
Delete a column	`del iris['col']`	0.7.2+
Conditional column update	`iris[condition, 'col'] = expr`	0.7.2+
Add a constant column	`iris['col'] = value`	0.7.12+
Rename while selecting	`select(col, new_name=expr)`	All versions

Select columns

Use expr[columns] to select specific columns from a CollectionExpr.

print(iris['name', 'sepallength'].head(5))

Output:

          name  sepallength
0  Iris-setosa          4.9
1  Iris-setosa          4.7
2  Iris-setosa          4.6
3  Iris-setosa          5.0
4  Iris-setosa          5.4

To select a single column and return a Collection (not a Sequence), add a trailing comma or use double brackets: iris[iris.sepallength,] or iris[[iris.sepallength]]. Without these, a Sequence object is returned instead.

Delete columns

Using `exclude` (all PyODPS versions):

print(iris.exclude('sepallength', 'petallength')[:5].head(5))

Output:

   sepalwidth  petalwidth         name
0         3.0         0.2  Iris-setosa
1         3.2         0.2  Iris-setosa
2         3.1         0.2  Iris-setosa
3         3.6         0.2  Iris-setosa
4         3.9         0.4  Iris-setosa

Using `del` (PyODPS 0.7.2+):

del iris['sepallength']
del iris['petallength']
print(iris[:5].head(5))

Output:

   sepalwidth  petalwidth         name
0         3.0         0.2  Iris-setosa
1         3.2         0.2  Iris-setosa
2         3.1         0.2  Iris-setosa
3         3.6         0.2  Iris-setosa
4         3.9         0.4  Iris-setosa

Add columns

Using assignment syntax (PyODPS 0.7.2+, recommended):

iris['sepalwidthplus1'] = iris.sepalwidth + 1
print(iris.head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name  \
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.7         3.2          1.3         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa
3          5.0         3.6          1.4         0.2  Iris-setosa
4          5.4         3.9          1.7         0.4  Iris-setosa

   sepalwidthplus1
0              4.0
1              4.2
2              4.1
3              4.6
4              4.9

Using `expr[expr, new_sequence]` (all versions): Creates a new Collection with the additional column appended. If the new column has the same name as an existing column, rename it to avoid conflicts.

print(iris[iris, (iris.sepalwidth + 1).rename('sepalwidthplus1')].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name  \
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.7         3.2          1.3         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa
3          5.0         3.6          1.4         0.2  Iris-setosa
4          5.4         3.9          1.7         0.4  Iris-setosa

   sepalwidthplus1
0              4.0
1              4.2
2              4.1
3              4.6
4              4.9

Add and delete columns simultaneously

Overwrite an existing column (PyODPS 0.7.2+):

iris['sepalwidth'] = iris.sepalwidth * 2
print(iris.head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         6.0          1.4         0.2  Iris-setosa
1          4.7         6.4          1.3         0.2  Iris-setosa
2          4.6         6.2          1.5         0.2  Iris-setosa
3          5.0         7.2          1.4         0.2  Iris-setosa
4          5.4         7.8          1.7         0.4  Iris-setosa

Using `exclude` with a new column (all versions): Excludes the original column and appends the transformed version, so no renaming is needed.

print(iris[iris.exclude('sepalwidth'), iris.sepalwidth * 2].head(5))

Output:

   sepallength  petallength  petalwidth         name  sepalwidth
0          4.9          1.4         0.2  Iris-setosa         6.0
1          4.7          1.3         0.2  Iris-setosa         6.4
2          4.6          1.5         0.2  Iris-setosa         6.2
3          5.0          1.4         0.2  Iris-setosa         7.2
4          5.4          1.7         0.4  Iris-setosa         7.8

Using `select` (all versions): Similar to exclude with a new column, but lets you rename the result in one step using keyword arguments.

print(iris.select('name', sepalwidthminus1=iris.sepalwidth - 1).head(5))

Output:

          name  sepalwidthminus1
0  Iris-setosa               2.0
1  Iris-setosa               2.2
2  Iris-setosa               2.1
3  Iris-setosa               2.6
4  Iris-setosa               2.9

Using a lambda expression: Pass a lambda that takes the result of the previous operation as a parameter. PyODPS evaluates the lambda and substitutes valid columns from that Collection.

print(iris['name', 'petallength'][[lambda x: x.name]].head(5))

Output:

          name
0  Iris-setosa
1  Iris-setosa
2  Iris-setosa
3  Iris-setosa
4  Iris-setosa

Conditional column update (PyODPS 0.7.2+): Update a column only for rows that match a condition.

iris[iris.sepallength > 5.0, 'sepalwidth'] = iris.sepalwidth * 2
print(iris.head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.7         3.2          1.3         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa
3          5.0         3.6          1.4         0.2  Iris-setosa
4          5.4         7.8          1.7         0.4  Iris-setosa

Introduce constants and random numbers

Add a constant column

Simple syntax (PyODPS 0.7.12+, recommended): Assign a literal value directly to a new column.

iris['id'] = 1
print(iris.head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name  id
0          4.9         3.0          1.4         0.2  Iris-setosa   1
1          4.7         3.2          1.3         0.2  Iris-setosa   1
2          4.6         3.1          1.5         0.2  Iris-setosa   1
3          5.0         3.6          1.4         0.2  Iris-setosa   1
4          5.4         3.9          1.7         0.4  Iris-setosa   1

The simple syntax cannot infer the type of null values automatically. To add a null column, use NullScalar with an explicit type.

from odps.df import NullScalar
iris['null_col'] = NullScalar('float')
print(iris.head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name null_col
0          4.9         3.0          1.4         0.2  Iris-setosa     None
1          4.7         3.2          1.3         0.2  Iris-setosa     None
2          4.6         3.1          1.5         0.2  Iris-setosa     None
3          5.0         3.6          1.4         0.2  Iris-setosa     None
4          5.4         3.9          1.7         0.4  Iris-setosa     None

Using `Scalar` (all versions): Wrap a constant with Scalar and specify the column name manually.

from odps.df import Scalar
print(iris[iris, Scalar(1).rename('id')][:5].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name  id
0          4.9         3.0          1.4         0.2  Iris-setosa   1
1          4.7         3.2          1.3         0.2  Iris-setosa   1
2          4.6         3.1          1.5         0.2  Iris-setosa   1
3          5.0         3.6          1.4         0.2  Iris-setosa   1
4          5.4         3.9          1.7         0.4  Iris-setosa   1

To add a null column with an explicit type:

from odps.df import NullScalar
print(iris[iris, NullScalar('float').rename('fid')][:5].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name   fid
0          4.9         3.0          1.4         0.2  Iris-setosa  None
1          4.7         3.2          1.3         0.2  Iris-setosa  None
2          4.6         3.1          1.5         0.2  Iris-setosa  None
3          5.0         3.6          1.4         0.2  Iris-setosa  None
4          5.4         3.9          1.7         0.4  Iris-setosa  None

Add a random number column

Use RandomScalar to append a column of random FLOAT values in the range 0–1, with a unique value per row. The optional parameter is a random seed.

from odps.df import RandomScalar
iris[iris, RandomScalar().rename('rand_val')][:5]

Output:

   sepallength  sepalwidth  petallength  petalwidth         name  rand_val
0          4.9         3.0          1.4         0.2  Iris-setosa  0.000471
1          4.7         3.2          1.3         0.2  Iris-setosa  0.799520
2          4.6         3.1          1.5         0.2  Iris-setosa  0.834609
3          5.0         3.6          1.4         0.2  Iris-setosa  0.106921
4          5.4         3.9          1.7         0.4  Iris-setosa  0.763442

Filter data

Filter rows using standard operators, the filter method, the query method, or lambda expressions.

In standard filter expressions, use & for AND | for OR — Python's and and or keywords are not supported. The query method accepts both.

Filter by condition:

print(iris[iris.sepallength > 5].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          5.4         3.9          1.7         0.4  Iris-setosa
1          5.4         3.7          1.5         0.2  Iris-setosa
2          5.8         4.0          1.2         0.2  Iris-setosa
3          5.7         4.4          1.5         0.4  Iris-setosa
4          5.4         3.9          1.3         0.4  Iris-setosa

AND (`&`):

print(iris[(iris.sepallength < 5) & (iris['petallength'] > 1.5)].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth             name
0          4.8         3.4          1.6         0.2      Iris-setosa
1          4.8         3.4          1.9         0.2      Iris-setosa
2          4.7         3.2          1.6         0.2      Iris-setosa
3          4.8         3.1          1.6         0.2      Iris-setosa
4          4.9         2.4          3.3         1.0  Iris-versicolor

OR (`|`):

print(iris[(iris.sepalwidth < 2.5) | (iris.sepalwidth > 4)].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth             name
0          5.7         4.4          1.5         0.4      Iris-setosa
1          5.2         4.1          1.5         0.1      Iris-setosa
2          5.5         4.2          1.4         0.2      Iris-setosa
3          4.5         2.3          1.3         0.3      Iris-setosa
4          5.5         2.3          4.0         1.3  Iris-versicolor

NOT (`~`):

print(iris[~(iris.sepalwidth > 3)].head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.4         2.9          1.4         0.2  Iris-setosa
2          4.8         3.0          1.4         0.1  Iris-setosa
3          4.3         3.0          1.1         0.1  Iris-setosa
4          5.0         3.0          1.6         0.2  Iris-setosa

`filter` method with multiple conditions: Passing multiple arguments to filter is equivalent to combining them with &.

print(iris.filter(iris.sepalwidth > 3.5, iris.sepalwidth < 4).head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          5.0         3.6          1.4         0.2  Iris-setosa
1          5.4         3.9          1.7         0.4  Iris-setosa
2          5.4         3.7          1.5         0.2  Iris-setosa
3          5.4         3.9          1.3         0.4  Iris-setosa
4          5.7         3.8          1.7         0.3  Iris-setosa

Lambda expression for chained operations:

print(iris[iris.sepalwidth > 3.8]['name', lambda x: x.sepallength + 1].head(5))

Output:

          name  sepallength
0  Iris-setosa          6.4
1  Iris-setosa          6.8
2  Iris-setosa          6.7
3  Iris-setosa          6.4
4  Iris-setosa          6.2

BOOLEAN column as a filter: When a Collection contains a BOOLEAN column, use that column directly as a filter condition.

# Check the schema
print(df.dtypes)
# odps.Schema {
#   a boolean
#   b int64
# }

# Filter using the boolean column
print(df[df.a])
#       a  b
# 0  True  1
# 1  True  3

The following examples show the difference between retrieving a single column and using it as a filter:

df[df.a, ]       # Retrieve a one-column Collection
df[[df.a]]       # Retrieve a one-column Collection
df.select(df.a)  # Retrieve a one-column Collection explicitly
df[df.a]         # Use column a (BOOLEAN) as a filter condition
df.a             # Retrieve a column from a Collection
df['a']          # Retrieve a column from a Collection

Filter with `query`

The query method accepts filter conditions as a string expression, similar to Pandas. Reference column names directly in the string; prefix local variables with @.

print(iris.query("(sepallength < 5) and (petallength > 1.5)").head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth             name
0          4.8         3.4          1.6         0.2      Iris-setosa
1          4.8         3.4          1.9         0.2      Iris-setosa
2          4.7         3.2          1.6         0.2      Iris-setosa
3          4.8         3.1          1.6         0.2      Iris-setosa
4          4.9         2.4          3.3         1.0  Iris-versicolor

Use @ to reference a local variable inside a query string:

var = 4
print(iris.query("(sepalwidth < 2.5) | (sepalwidth > @var)").head(5))

Output:

   sepallength  sepalwidth  petallength  petalwidth             name
0          5.7         4.4          1.5         0.4      Iris-setosa
1          5.2         4.1          1.5         0.1      Iris-setosa
2          5.5         4.2          1.4         0.2      Iris-setosa
3          4.5         2.3          1.3         0.3      Iris-setosa
4          5.5         2.3          4.0         1.3  Iris-versicolor

The query method supports the following syntax:

Syntax	Description
name	Column names (no `@` prefix). Use `@variable` to reference a local Python variable.
operator	`+`, `-`, ``, `/`, `//`, `%`, `*`, `==`, `!=`, `<`, `<=`, `>`, `>=`, `in`, `not in`
bool	AND: `&` or `and`. OR: `	`or` or`.
attribute	Object attributes
index, slice, subscript	Slice operations

Convert a column into rows

Use explode to expand a LIST or MAP column into multiple rows. You can also use the apply method for multi-row output. You can explode one or more columns simultaneously. If an input row produces no output (for example, an empty list), the row is excluded by default. Set keep_nulls=True to retain such rows with null values.

For more information about how to use explode for multi-row output, see the "Collection-related operations" section in Column operations.

Sample data:

print(df)
#    id         a             b
# 0   1  [a1, b1]  [a2, b2, c2]
# 1   2      [c1]      [d2, e2]

Explode one column, keep another as-is:

print(df[df.id, df.a.explode(), df.b])

Output:

   id   a             b
0   1  a1  [a2, b2, c2]
1   1  b1  [a2, b2, c2]
2   2  c1      [d2, e2]

Explode two columns simultaneously:

print(df[df.id, df.a.explode(), df.b.explode()])

Output:

   id   a   b
0   1  a1  a2
1   1  a1  b2
2   1  a1  c2
3   1  b1  a2
4   1  b1  b2
5   1  b1  c2
6   2  c1  d2
7   2  c1  e2

`keep_nulls=True` — retain rows with empty input:

print(df)
#    id         a
# 0   1  [a1, b1]
# 1   2        []

# Without keep_nulls: row 2 is dropped
print(df[df.id, df.a.explode()])
#    id   a
# 0   1  a1
# 1   1  b1

# With keep_nulls=True: row 2 is retained with a null value
print(df[df.id, df.a.explode(keep_nulls=True)])
#    id     a
# 0   1    a1
# 1   1    b1
# 2   2  None

Output limits

Limit the number of rows returned using slice syntax or the limit method.

Slice syntax:

print(iris[:3].execute())

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.7         3.2          1.3         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa

`limit` method:

print(iris.limit(3).execute())

Output:

   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.7         3.2          1.3         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa

In the MaxCompute SQL backend, slice operations do not support start or step — only limit is supported. Slice operations apply to Collection objects only, not Sequence objects.

Prerequisites

Retrieve column types

Select, add, and delete columns

Select columns

Delete columns

Add columns

Add and delete columns simultaneously

Introduce constants and random numbers

Add a constant column

Add a random number column

Filter data

Filter with query

Convert a column into rows

Output limits

Filter with `query`