PyODPS provides a DataFrame API for processing structured data in MaxCompute. A DataFrame object can reference four types of data sources — MaxCompute tables, MaxCompute partitions, Pandas DataFrames, and SQLAlchemy tables — using the same interface. Switch between data sources by changing the input, without rewriting your processing logic. This makes it easy to develop and test code locally with Pandas, then run the same logic at scale on MaxCompute.
Prerequisites
Before you begin, make sure that you have:
- The sample table pyodps_iris available in your MaxCompute project. For setup instructions, see Use DataFrame to process data.
Key concepts
A PyODPS DataFrame object is the only Collection object you need to create manually. The three core object types are:
| Object | Structure | Description |
|---|---|---|
| Collection (DataFrame) | Two-dimensional | Represents a full table |
| Sequence | One-dimensional | Represents a single column |
| Scalar | Single value | Represents a scalar value |
Lazy evaluation: A DataFrame built from a MaxCompute table does not load actual data — it holds a representation of the operations to perform. MaxCompute executes those operations and stores the results only when you trigger computation. By contrast, a DataFrame built from a Pandas DataFrame holds the actual data in memory.
Data source quick reference
All four creation methods use the same DataFrame(source) constructor. The only difference is what you pass as source.
| Data source | Method | Typical use |
|---|---|---|
| MaxCompute table | DataFrame(o.get_table(...)) or .to_df() | Production data processing |
| MaxCompute partition | DataFrame(o.get_table(...).get_partition(...)) or .to_df() | Partition-scoped processing |
| Pandas DataFrame | DataFrame(pandas_df) | Local testing and development |
| SQLAlchemy table | DataFrame(sqlalchemy_table) | Database table access |
Create a DataFrame object from a MaxCompute table
Pass the table object directly to DataFrame, or call to_df() on the table.
```python
from odps.df import DataFrame

# Method 1: Pass the table object to DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))

# Method 2: Use the to_df method on the table
iris2 = o.get_table('pyodps_iris').to_df()
```

Both methods produce equivalent DataFrame objects with the same schema.
Create a DataFrame object from a MaxCompute partition
Pass a partition object to DataFrame, or call to_df() on the partition.
```python
from odps.df import DataFrame

# Create a partitioned table if one does not already exist
o.create_table('partitioned_table', ('num bigint, num2 double', 'pt string'), if_not_exists=True)

# Method 1: Pass the partition object to DataFrame
pt_df = DataFrame(o.get_table('partitioned_table').get_partition('pt=20171111'))

# Method 2: Use the to_df method on the partition
pt_df2 = o.get_table('partitioned_table').get_partition('pt=20171111').to_df()
```

Create a DataFrame object from a Pandas DataFrame
Pass the Pandas DataFrame to the DataFrame constructor.
```python
from odps.df import DataFrame
import pandas as pd
import numpy as np

df = DataFrame(pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('abc')))
```

Data type inference
When converting a Pandas DataFrame, PyODPS infers data types automatically. Use the following parameters to control type resolution when inference fails or produces incorrect results:
| Parameter | Type | Description |
|---|---|---|
| unknown_as_string | bool | When True, converts columns with unrecognizable types (including empty columns) to string instead of raising an error |
| as_type | dict | Explicitly sets the data type for one or more columns; required for list and dict columns, which PyODPS cannot infer automatically |
Example 1: Resolve an empty column and override a data type
This example assumes that df is a Pandas DataFrame containing the iris columns plus two all-empty columns, null_col1 and null_col2.

```python
df2 = DataFrame(df, unknown_as_string=True, as_type={'null_col2': 'float'})
print(df2.dtypes)
```

Output:

```
odps.Schema {
  sepallength  float64
  sepalwidth   float64
  petallength  float64
  petalwidth   float64
  name         string
  null_col1    string   # unknown_as_string=True converted this to string
  null_col2    float64  # as_type forced this to float
}
```

Example 2: Set the type for a list column
PyODPS does not infer types for list or dict columns. Set as_type to a dict mapping column names to their types.
This example assumes that df3 is a Pandas DataFrame with an id column and a list_col column whose values are Python lists.

```python
df4 = DataFrame(df3, as_type={'list_col': 'list<int64>'})
print(df4.dtypes)
```

Output:

```
odps.Schema {
  id        int64
  list_col  list<int64>  # as_type is required for list and dict columns
}
```

Note: PyODPS does not support uploading Object Storage Service (OSS) or Tablestore external tables to MaxCompute.
Create a DataFrame object from an SQLAlchemy table
Connect SQLAlchemy to your MaxCompute project, then pass the table object to DataFrame. For connection parameters, see Connect SQLAlchemy to a MaxCompute project.
```python
import os
import sqlalchemy
from odps.df import DataFrame

# Build the connection string using environment variables for your AccessKey credentials.
# Set ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET before running this code.
conn_string = 'odps://%s:%s@<project>/?endpoint=<endpoint>' % (
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET')
)
engine = sqlalchemy.create_engine(conn_string)
metadata = sqlalchemy.MetaData(bind=engine)  # Bind metadata to the engine (SQLAlchemy 1.x style)
table = sqlalchemy.Table('pyodps_iris', metadata, extend_existing=True, autoload=True)
iris = DataFrame(table)
```

What's next
After creating a DataFrame object, you can perform data operations using the PyODPS DataFrame API. For available operations, see Use DataFrame to process data.