PyODPS provides a DataFrame API for processing structured data in MaxCompute. A DataFrame object can reference four types of data sources — MaxCompute tables, MaxCompute partitions, Pandas DataFrames, and SQLAlchemy tables — using the same interface. Switch between data sources by changing the input, without rewriting your processing logic. This makes it easy to develop and test code locally with Pandas, then run the same logic at scale on MaxCompute.
Prerequisites
Before you begin, make sure that you have:
- The sample table pyodps_iris available in your MaxCompute project. For setup instructions, see Use DataFrame to process data.
Key concepts
A PyODPS DataFrame object is the only Collection object you need to create manually. The three core object types are:
| Object | Structure | Description |
|---|---|---|
| Collection (DataFrame) | Two-dimensional | Represents a full table |
| Sequence | One-dimensional | Represents a single column |
| Scalar | Single value | Represents a scalar value |
Lazy evaluation: A DataFrame built from a MaxCompute table does not load actual data — it holds a representation of the operations to perform. MaxCompute executes those operations and stores the results only when you trigger computation. By contrast, a DataFrame built from a Pandas DataFrame holds the actual data in memory.
Data source quick reference
All four creation methods use the same DataFrame(source) constructor. The only difference is what you pass as source.
| Data source | Method | Typical use |
|---|---|---|
| MaxCompute table | DataFrame(o.get_table(...)) or .to_df() | Production data processing |
| MaxCompute partition | DataFrame(o.get_table(...).get_partition(...)) or .to_df() | Partition-scoped processing |
| Pandas DataFrame | DataFrame(pandas_df) | Local testing and development |
| SQLAlchemy table | DataFrame(sqlalchemy_table) | Database table access |
Create a DataFrame object from a MaxCompute table
Pass the table object directly to DataFrame, or call to_df() on the table.
```python
from odps.df import DataFrame

# Method 1: Pass the table object to DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))

# Method 2: Use the to_df method on the table
iris2 = o.get_table('pyodps_iris').to_df()
```

Both methods produce equivalent DataFrame objects with the same schema.
Create a DataFrame object from a MaxCompute partition
Pass a partition object to DataFrame, or call to_df() on the partition.
```python
from odps.df import DataFrame

# Create a partitioned table if one does not already exist
o.create_table('partitioned_table', ('num bigint, num2 double', 'pt string'), if_not_exists=True)

# Method 1: Pass the partition object to DataFrame
pt_df = DataFrame(o.get_table('partitioned_table').get_partition('pt=20171111'))

# Method 2: Use the to_df method on the partition
pt_df2 = o.get_table('partitioned_table').get_partition('pt=20171111').to_df()
```

Create a DataFrame object from a Pandas DataFrame
Pass the Pandas DataFrame to the DataFrame constructor.
```python
from odps.df import DataFrame
import pandas as pd
import numpy as np

df = DataFrame(pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('abc')))
```

Data type inference
When converting a Pandas DataFrame, PyODPS infers data types automatically. Use the following parameters to control type resolution when inference fails or produces incorrect results:
| Parameter | Type | Description |
|---|---|---|
| unknown_as_string | bool | When True, converts columns with unrecognizable types (including empty columns) to string instead of raising an error |
| as_type | dict | Explicitly sets the data type for one or more columns; required for list and dict columns, which PyODPS cannot infer automatically |
Example 1: Resolve an empty column and override a data type
This example assumes that df is a Pandas DataFrame containing the iris columns plus two all-empty columns, null_col1 and null_col2.

```python
df2 = DataFrame(df, unknown_as_string=True, as_type={'null_col2': 'float'})
print(df2.dtypes)
```

Output:

```
odps.Schema {
  sepallength  float64
  sepalwidth   float64
  petallength  float64
  petalwidth   float64
  name         string
  null_col1    string   # unknown_as_string=True converted this to string
  null_col2    float64  # as_type forced this to float
}
```

Example 2: Set the type for a list column
PyODPS does not infer types for list or dict columns. Set as_type to a dict mapping column names to their types.
This example assumes that df3 is a Pandas DataFrame with an id column and a list_col column whose values are Python lists.

```python
df4 = DataFrame(df3, as_type={'list_col': 'list<int64>'})
print(df4.dtypes)
```

Output:

```
odps.Schema {
  id        int64
  list_col  list<int64>  # as_type is required for list and dict columns
}
```

Note: PyODPS does not support uploading Object Storage Service (OSS) or Tablestore external tables to MaxCompute.
Create a DataFrame object from an SQLAlchemy table
Connect SQLAlchemy to your MaxCompute project, then pass the table object to DataFrame. For connection parameters, see Connect SQLAlchemy to a MaxCompute project.
```python
import os
import sqlalchemy
from odps.df import DataFrame

# Build the connection string using environment variables for your AccessKey credentials.
# Set ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET before running this code.
conn_string = 'odps://%s:%s@<project>/?endpoint=<endpoint>' % (
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET')
)
engine = sqlalchemy.create_engine(conn_string)
metadata = sqlalchemy.MetaData(bind=engine)  # Bind metadata to the engine (SQLAlchemy 1.x style)
table = sqlalchemy.Table('pyodps_iris', metadata, extend_existing=True, autoload=True)
iris = DataFrame(table)
```

What's next
After creating a DataFrame object, you can perform data operations using the PyODPS DataFrame API. For available operations, see Use DataFrame to process data.