CollectionExpr represents a two-dimensional dataset in the PyODPS DataFrame API, equivalent to a MaxCompute table or a spreadsheet. DataFrame objects are also CollectionExpr objects. CollectionExpr supports column operations, data filtering, and data transformation.
Prerequisites
Before you begin, ensure that you have:
-
A table named
pyodps_iris. For more information, see the "DataFrame data processing" section in Getting started -
A DataFrame object. For more information, see the "Create a DataFrame object from a MaxCompute table" section in Create a DataFrame object
Retrieve column types
Use the dtypes attribute to get the types of all columns in a CollectionExpr. It returns a Schema type.
print(iris.dtypes)
Output:
odps.Schema {
sepallength float64
sepalwidth float64
petallength float64
petalwidth float64
name string
}
Select, add, and delete columns
The following table shows which syntax to use based on your goal:
| Goal | Recommended syntax | PyODPS version |
|---|---|---|
| Select specific columns | expr[col1, col2] |
All versions |
| Exclude columns | exclude(col1, col2) |
All versions |
| Add a computed column | iris['new_col'] = expr |
0.7.2+ |
| Overwrite an existing column | iris['col'] = new_expr |
0.7.2+ |
| Delete a column | del iris['col'] |
0.7.2+ |
| Conditional column update | iris[condition, 'col'] = expr |
0.7.2+ |
| Add a constant column | iris['col'] = value |
0.7.12+ |
| Rename while selecting | select(col, new_name=expr) |
All versions |
Select columns
Use expr[columns] to select specific columns from a CollectionExpr.
print(iris['name', 'sepallength'].head(5))
Output:
name sepallength
0 Iris-setosa 4.9
1 Iris-setosa 4.7
2 Iris-setosa 4.6
3 Iris-setosa 5.0
4 Iris-setosa 5.4
To select a single column and return a Collection (not a Sequence), add a trailing comma or use double brackets:iris[iris.sepallength,]oriris[[iris.sepallength]]. Without these, a Sequence object is returned instead.
Delete columns
Using `exclude` (all PyODPS versions):
print(iris.exclude('sepallength', 'petallength')[:5].head(5))
Output:
sepalwidth petalwidth name
0 3.0 0.2 Iris-setosa
1 3.2 0.2 Iris-setosa
2 3.1 0.2 Iris-setosa
3 3.6 0.2 Iris-setosa
4 3.9 0.4 Iris-setosa
Using `del` (PyODPS 0.7.2+):
del iris['sepallength']
del iris['petallength']
print(iris[:5].head(5))
Output:
sepalwidth petalwidth name
0 3.0 0.2 Iris-setosa
1 3.2 0.2 Iris-setosa
2 3.1 0.2 Iris-setosa
3 3.6 0.2 Iris-setosa
4 3.9 0.4 Iris-setosa
Add columns
Using assignment syntax (PyODPS 0.7.2+, recommended):
iris['sepalwidthplus1'] = iris.sepalwidth + 1
print(iris.head(5))
Output:
sepallength sepalwidth petallength petalwidth name \
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
sepalwidthplus1
0 4.0
1 4.2
2 4.1
3 4.6
4 4.9
Using `expr[expr, new_sequence]` (all versions): Creates a new Collection with the additional column appended. If the new column has the same name as an existing column, rename it to avoid conflicts.
print(iris[iris, (iris.sepalwidth + 1).rename('sepalwidthplus1')].head(5))
Output:
sepallength sepalwidth petallength petalwidth name \
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
sepalwidthplus1
0 4.0
1 4.2
2 4.1
3 4.6
4 4.9
Add and delete columns simultaneously
Overwrite an existing column (PyODPS 0.7.2+):
iris['sepalwidth'] = iris.sepalwidth * 2
print(iris.head(5))
Output:
sepallength sepalwidth petallength petalwidth name
0 4.9 6.0 1.4 0.2 Iris-setosa
1 4.7 6.4 1.3 0.2 Iris-setosa
2 4.6 6.2 1.5 0.2 Iris-setosa
3 5.0 7.2 1.4 0.2 Iris-setosa
4 5.4 7.8 1.7 0.4 Iris-setosa
Using `exclude` with a new column (all versions): Excludes the original column and appends the transformed version, so no renaming is needed.
print(iris[iris.exclude('sepalwidth'), iris.sepalwidth * 2].head(5))
Output:
sepallength petallength petalwidth name sepalwidth
0 4.9 1.4 0.2 Iris-setosa 6.0
1 4.7 1.3 0.2 Iris-setosa 6.4
2 4.6 1.5 0.2 Iris-setosa 6.2
3 5.0 1.4 0.2 Iris-setosa 7.2
4 5.4 1.7 0.4 Iris-setosa 7.8
Using `select` (all versions): Similar to exclude with a new column, but lets you rename the result in one step using keyword arguments.
print(iris.select('name', sepalwidthminus1=iris.sepalwidth - 1).head(5))
Output:
name sepalwidthminus1
0 Iris-setosa 2.0
1 Iris-setosa 2.2
2 Iris-setosa 2.1
3 Iris-setosa 2.6
4 Iris-setosa 2.9
Using a lambda expression: Pass a lambda that takes the result of the previous operation as a parameter. PyODPS evaluates the lambda and substitutes valid columns from that Collection.
print(iris['name', 'petallength'][[lambda x: x.name]].head(5))
Output:
name
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
Conditional column update (PyODPS 0.7.2+): Update a column only for rows that match a condition.
iris[iris.sepallength > 5.0, 'sepalwidth'] = iris.sepalwidth * 2
print(iris.head(5))
Output:
sepallength sepalwidth petallength petalwidth name
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 7.8 1.7 0.4 Iris-setosa
Introduce constants and random numbers
Add a constant column
Simple syntax (PyODPS 0.7.12+, recommended): Assign a literal value directly to a new column.
iris['id'] = 1
print(iris.head(5))
Output:
sepallength sepalwidth petallength petalwidth name id
0 4.9 3.0 1.4 0.2 Iris-setosa 1
1 4.7 3.2 1.3 0.2 Iris-setosa 1
2 4.6 3.1 1.5 0.2 Iris-setosa 1
3 5.0 3.6 1.4 0.2 Iris-setosa 1
4 5.4 3.9 1.7 0.4 Iris-setosa 1
The simple syntax cannot infer the type of null values automatically. To add a null column, use NullScalar with an explicit type.from odps.df import NullScalar
iris['null_col'] = NullScalar('float')
print(iris.head(5))
Output:
sepallength sepalwidth petallength petalwidth name null_col
0 4.9 3.0 1.4 0.2 Iris-setosa None
1 4.7 3.2 1.3 0.2 Iris-setosa None
2 4.6 3.1 1.5 0.2 Iris-setosa None
3 5.0 3.6 1.4 0.2 Iris-setosa None
4 5.4 3.9 1.7 0.4 Iris-setosa None
Using `Scalar` (all versions): Wrap a constant with Scalar and specify the column name manually.
from odps.df import Scalar
print(iris[iris, Scalar(1).rename('id')][:5].head(5))
Output:
sepallength sepalwidth petallength petalwidth name id
0 4.9 3.0 1.4 0.2 Iris-setosa 1
1 4.7 3.2 1.3 0.2 Iris-setosa 1
2 4.6 3.1 1.5 0.2 Iris-setosa 1
3 5.0 3.6 1.4 0.2 Iris-setosa 1
4 5.4 3.9 1.7 0.4 Iris-setosa 1
To add a null column with an explicit type:
from odps.df import NullScalar
print(iris[iris, NullScalar('float').rename('fid')][:5].head(5))
Output:
sepallength sepalwidth petallength petalwidth name fid
0 4.9 3.0 1.4 0.2 Iris-setosa None
1 4.7 3.2 1.3 0.2 Iris-setosa None
2 4.6 3.1 1.5 0.2 Iris-setosa None
3 5.0 3.6 1.4 0.2 Iris-setosa None
4 5.4 3.9 1.7 0.4 Iris-setosa None
Add a random number column
Use RandomScalar to append a column of random FLOAT values in the range 0–1, with a unique value per row. The optional parameter is a random seed.
from odps.df import RandomScalar
iris[iris, RandomScalar().rename('rand_val')][:5]
Output:
sepallength sepalwidth petallength petalwidth name rand_val
0 4.9 3.0 1.4 0.2 Iris-setosa 0.000471
1 4.7 3.2 1.3 0.2 Iris-setosa 0.799520
2 4.6 3.1 1.5 0.2 Iris-setosa 0.834609
3 5.0 3.6 1.4 0.2 Iris-setosa 0.106921
4 5.4 3.9 1.7 0.4 Iris-setosa 0.763442
Filter data
Filter rows using standard operators, the filter method, the query method, or lambda expressions.
In standard filter expressions, use&for AND|for OR — Python'sandandorkeywords are not supported. Thequerymethod accepts both.
Filter by condition:
print(iris[iris.sepallength > 5].head(5))
Output:
sepallength sepalwidth petallength petalwidth name
0 5.4 3.9 1.7 0.4 Iris-setosa
1 5.4 3.7 1.5 0.2 Iris-setosa
2 5.8 4.0 1.2 0.2 Iris-setosa
3 5.7 4.4 1.5 0.4 Iris-setosa
4 5.4 3.9 1.3 0.4 Iris-setosa
AND (`&`):
print(iris[(iris.sepallength < 5) & (iris['petallength'] > 1.5)].head(5))
Output:
sepallength sepalwidth petallength petalwidth name
0 4.8 3.4 1.6 0.2 Iris-setosa
1 4.8 3.4 1.9 0.2 Iris-setosa
2 4.7 3.2 1.6 0.2 Iris-setosa
3 4.8 3.1 1.6 0.2 Iris-setosa
4 4.9 2.4 3.3 1.0 Iris-versicolor
OR (`|`):
print(iris[(iris.sepalwidth < 2.5) | (iris.sepalwidth > 4)].head(5))
Output:
sepallength sepalwidth petallength petalwidth name
0 5.7 4.4 1.5 0.4 Iris-setosa
1 5.2 4.1 1.5 0.1 Iris-setosa
2 5.5 4.2 1.4 0.2 Iris-setosa
3 4.5 2.3 1.3 0.3 Iris-setosa
4 5.5 2.3 4.0 1.3 Iris-versicolor
NOT (`~`):
print(iris[~(iris.sepalwidth > 3)].head(5))
Output:
sepallength sepalwidth petallength petalwidth name
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.4 2.9 1.4 0.2 Iris-setosa
2 4.8 3.0 1.4 0.1 Iris-setosa
3 4.3 3.0 1.1 0.1 Iris-setosa
4 5.0 3.0 1.6 0.2 Iris-setosa
`filter` method with multiple conditions: Passing multiple arguments to filter is equivalent to combining them with &.
print(iris.filter(iris.sepalwidth > 3.5, iris.sepalwidth < 4).head(5))
Output:
sepallength sepalwidth petallength petalwidth name
0 5.0 3.6 1.4 0.2 Iris-setosa
1 5.4 3.9 1.7 0.4 Iris-setosa
2 5.4 3.7 1.5 0.2 Iris-setosa
3 5.4 3.9 1.3 0.4 Iris-setosa
4 5.7 3.8 1.7 0.3 Iris-setosa
Lambda expression for chained operations:
print(iris[iris.sepalwidth > 3.8]['name', lambda x: x.sepallength + 1].head(5))
Output:
name sepallength
0 Iris-setosa 6.4
1 Iris-setosa 6.8
2 Iris-setosa 6.7
3 Iris-setosa 6.4
4 Iris-setosa 6.2
BOOLEAN column as a filter: When a Collection contains a BOOLEAN column, use that column directly as a filter condition.
# Check the schema
print(df.dtypes)
# odps.Schema {
# a boolean
# b int64
# }
# Filter using the boolean column
print(df[df.a])
# a b
# 0 True 1
# 1 True 3
The following examples show the difference between retrieving a single column and using it as a filter:
df[df.a, ] # Retrieve a one-column Collection
df[[df.a]] # Retrieve a one-column Collection
df.select(df.a) # Retrieve a one-column Collection explicitly
df[df.a] # Use column a (BOOLEAN) as a filter condition
df.a # Retrieve a column from a Collection
df['a'] # Retrieve a column from a Collection
Filter with query
The query method accepts filter conditions as a string expression, similar to Pandas. Reference column names directly in the string; prefix local variables with @.
print(iris.query("(sepallength < 5) and (petallength > 1.5)").head(5))
Output:
sepallength sepalwidth petallength petalwidth name
0 4.8 3.4 1.6 0.2 Iris-setosa
1 4.8 3.4 1.9 0.2 Iris-setosa
2 4.7 3.2 1.6 0.2 Iris-setosa
3 4.8 3.1 1.6 0.2 Iris-setosa
4 4.9 2.4 3.3 1.0 Iris-versicolor
Use @ to reference a local variable inside a query string:
var = 4
print(iris.query("(sepalwidth < 2.5) | (sepalwidth > @var)").head(5))
Output:
sepallength sepalwidth petallength petalwidth name
0 5.7 4.4 1.5 0.4 Iris-setosa
1 5.2 4.1 1.5 0.1 Iris-setosa
2 5.5 4.2 1.4 0.2 Iris-setosa
3 4.5 2.3 1.3 0.3 Iris-setosa
4 5.5 2.3 4.0 1.3 Iris-versicolor
The query method supports the following syntax:
| Syntax | Description | |
|---|---|---|
| name | Column names (no @ prefix). Use @variable to reference a local Python variable. |
|
| operator | +, -, *, /, //, %, **, ==, !=, <, <=, >, >=, in, not in |
|
| bool | AND: & or and. OR: ` |
or or`. |
| attribute | Object attributes | |
| index, slice, subscript | Slice operations |
Convert a column into rows
Use explode to expand a LIST or MAP column into multiple rows. You can also use the apply method for multi-row output. You can explode one or more columns simultaneously. If an input row produces no output (for example, an empty list), the row is excluded by default. Set keep_nulls=True to retain such rows with null values.
For more information about how to use explode for multi-row output, see the "Collection-related operations" section in Column operations.
Sample data:
print(df)
# id a b
# 0 1 [a1, b1] [a2, b2, c2]
# 1 2 [c1] [d2, e2]
Explode one column, keep another as-is:
print(df[df.id, df.a.explode(), df.b])
Output:
id a b
0 1 a1 [a2, b2, c2]
1 1 b1 [a2, b2, c2]
2 2 c1 [d2, e2]
Explode two columns simultaneously:
print(df[df.id, df.a.explode(), df.b.explode()])
Output:
id a b
0 1 a1 a2
1 1 a1 b2
2 1 a1 c2
3 1 b1 a2
4 1 b1 b2
5 1 b1 c2
6 2 c1 d2
7 2 c1 e2
`keep_nulls=True` — retain rows with empty input:
print(df)
# id a
# 0 1 [a1, b1]
# 1 2 []
# Without keep_nulls: row 2 is dropped
print(df[df.id, df.a.explode()])
# id a
# 0 1 a1
# 1 1 b1
# With keep_nulls=True: row 2 is retained with a null value
print(df[df.id, df.a.explode(keep_nulls=True)])
# id a
# 0 1 a1
# 1 1 b1
# 2 2 None
Output limits
Limit the number of rows returned using slice syntax or the limit method.
Slice syntax:
print(iris[:3].execute())
Output:
sepallength sepalwidth petallength petalwidth name
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
`limit` method:
print(iris.limit(3).execute())
Output:
sepallength sepalwidth petallength petalwidth name
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
In the MaxCompute SQL backend, slice operations do not supportstartorstep— onlylimitis supported. Slice operations apply to Collection objects only, not Sequence objects.