All Products
Search
Document Center

MaxCompute:Getting started

Last Updated:Mar 26, 2026

PyODPS DataFrame lets you query and transform MaxCompute tables using a pandas-like API—no raw SQL required. This tutorial walks you through the full workflow: loading data, inspecting schemas, filtering and aggregating, joining tables, and applying user-defined functions (UDFs).

By the end of this tutorial, you will be able to:

  • Create a DataFrame object from a MaxCompute table

  • Inspect schemas and preview rows

  • Select, exclude, and compute columns

  • Group, count, and sort data

  • Join multiple DataFrames and persist the result as a new table

  • Apply row-wise UDFs with the apply() method

This tutorial covers two environments:

  • IPython — interactive exploration using the MovieLens 100K dataset

  • DataWorks PyODPS node — batch processing using the Iris dataset

Prerequisites

Before you begin, ensure that you have:

  • An Alibaba Cloud account with MaxCompute enabled

  • A MaxCompute project with read and write access

  • Your AccessKey ID and AccessKey secret stored as environment variables:

    • ALIBABA_CLOUD_ACCESS_KEY_ID

    • ALIBABA_CLOUD_ACCESS_KEY_SECRET

  • PyODPS installed (pip install pyodps)

  • (For Part 1) IPython installed (pip install IPython)

  • (For Part 2) A DataWorks workspace with PyODPS 3 node support

Part 1: Explore data in IPython

This part uses the MovieLens 100K dataset to demonstrate interactive DataFrame operations. The dataset includes three files:

  • u.user — user profiles (user ID, age, gender, occupation, zip code)

  • u.item — movie metadata

  • u.data — user ratings

Step 1: Set up tables

Create the three MaxCompute tables that will hold the dataset.

Users table

CREATE TABLE IF NOT EXISTS pyodps_ml_100k_users
(
  user_id    BIGINT COMMENT 'User ID',
  age        BIGINT COMMENT 'Age',
  sex        STRING COMMENT 'Gender',
  occupation STRING COMMENT 'Occupation',
  zip_code   STRING COMMENT 'Zip code'
);

Movies table

CREATE TABLE IF NOT EXISTS pyodps_ml_100k_movies
(
    movie_id            BIGINT COMMENT 'Movie ID',
    title              STRING COMMENT 'Movie title',
    release_date       STRING COMMENT 'Release date',
    video_release_date STRING COMMENT 'Video release date',
    IMDb_URL           STRING COMMENT 'IMDb URL',
    unknown            TINYINT COMMENT 'Unknown',
    Action             TINYINT COMMENT 'Action',
    Adventure          TINYINT COMMENT 'Adventure',
    Animation          TINYINT COMMENT 'Animation',
    Children           TINYINT COMMENT 'Children',
    Comedy             TINYINT COMMENT 'Comedy',
    Crime              TINYINT COMMENT 'Crime',
    Documentary        TINYINT COMMENT 'Documentary',
    Drama              TINYINT COMMENT 'Drama',
    Fantasy            TINYINT COMMENT 'Fantasy',
    FilmNoir           TINYINT COMMENT 'Film noir',
    Horror             TINYINT COMMENT 'Horror',
    Musical            TINYINT COMMENT 'Music',
    Mystery            TINYINT COMMENT 'Mystery',
    Romance            TINYINT COMMENT 'Romance',
    SciFi              TINYINT COMMENT 'Science fiction',
    Thriller           TINYINT COMMENT 'Thriller',
    War                TINYINT COMMENT 'War',
    Western            TINYINT COMMENT 'West'
);

Ratings table

CREATE TABLE IF NOT EXISTS pyodps_ml_100k_ratings
(
    user_id    BIGINT COMMENT 'User ID',
    movie_id   BIGINT COMMENT 'Movie ID',
    rating     BIGINT COMMENT 'Score',
    timestamp  BIGINT COMMENT 'Timestamp'
);

Step 2: Upload data

Use the Tunnel upload command to load each file into its table. Replace path_to_file with the directory where you saved the dataset files.

Tunnel upload -fd | path_to_file/u.user pyodps_ml_100k_users;
Tunnel upload -fd | path_to_file/u.item pyodps_ml_100k_movies;
Tunnel upload -fd | path_to_file/u.data pyodps_ml_100k_ratings;

For the full list of Tunnel commands, see Tunnel commands.

Step 3: Connect to MaxCompute

Start IPython and create a MaxCompute connection object. All DataFrame operations in this part use this object.

import os
from odps import ODPS

o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)

Replace your-default-project and your-end-point with your MaxCompute project name and endpoint.

Step 4: Create a DataFrame

Wrap a MaxCompute table as a DataFrame object.

from odps.df import DataFrame

users = DataFrame(o.get_table('pyodps_ml_100k_users'))

Step 5: Inspect the schema

Use the dtypes property to see all columns and their data types.

print(users.dtypes)

Output:

odps.Schema {
  user_id             int64
  age                 int64
  sex                 string
  occupation          string
  zip_code            string
}

Step 6: Preview rows

Use head(N) to fetch the first N rows. This is the fastest way to verify that your data loaded correctly.

print(users.head(10))

Output:

   user_id  age  sex     occupation  zip_code
0        1   24    M     technician     85711
1        2   53    F          other     94043
2        3   23    M         writer     32067
3        4   24    M     technician     43537
4        5   33    F          other     15213
5        6   42    M      executive     98101
6        7   57    M  administrator     91344
7        8   36    M  administrator     05201
8        9   29    M        student     01002
9       10   53    M         lawyer     90703

Step 7: Work with columns

Select specific columns

Pass a list of column names to retrieve only those columns.

print(users[['user_id', 'age']].head(5))

Output:

   user_id  age
0        1   24
1        2   53
2        3   23
3        4   24
4        5   33

Exclude columns

Use exclude() to drop columns you do not need.

print(users.exclude('zip_code', 'age').head(5))

Output:

   user_id  sex  occupation
0        1    M  technician
1        2    F       other
2        3    M      writer
3        4    M  technician
4        5    F       other

Add a computed column

Combine exclude() with select() to drop columns and add computed ones in a single step. The example below drops zip_code and sex, then adds a boolean sex_bool column derived from sex.

print(users.select(users.exclude('zip_code', 'sex'), sex_bool=users.sex == 'M').head(5))

Output:

   user_id  age  occupation  sex_bool
0        1   24  technician      True
1        2   53       other     False
2        3   23      writer      True
3        4   24  technician      True
4        5   33       other     False

Step 8: Aggregate and sort

Count by group

Use groupby() followed by agg() to count records within each group.

print(users.groupby(users.sex).agg(count=users.count()))

Output:

   sex  count
0    F    273
1    M    670

Find the top 10 occupations

Chain groupby(), agg(), sort(), and head() to rank groups by count.

df = users.groupby('occupation').agg(count=users['occupation'].count())
df1 = df.sort(df['count'], ascending=False)
print(df1.head(10))

Output:

      occupation  count
0        student    196
1          other    105
2       educator     95
3  administrator     79
4       engineer     67
5     programmer     66
6      librarian     51
7         writer     45
8      executive     32
9      scientist     31

Alternatively, use value_counts() for a shorter syntax. The number of records returned is controlled by the options.df.odps.sort.limit parameter. For details, see Configurations.

df = users.occupation.value_counts()[:10]
print(df.head(10))

Output:

      occupation  count
0        student    196
1          other    105
2       educator     95
3  administrator     79
4       engineer     67
5     programmer     66
6      librarian     51
7         writer     45
8      executive     32
9      scientist     31

Step 9: Join tables and persist the result

Load the movies and ratings DataFrames, join all three tables, and save the result as a new MaxCompute table using persist().

movies = DataFrame(o.get_table('pyodps_ml_100k_movies'))
ratings = DataFrame(o.get_table('pyodps_ml_100k_ratings'))

o.delete_table('pyodps_ml_100k_lens', if_exists=True)
lens = movies.join(ratings).join(users).persist('pyodps_ml_100k_lens')

print(lens.dtypes)

Output:

odps.Schema {
  movie_id                          int64
  title                             string
  release_date                      string
  ideo_release_date                 string
  imdb_url                          string
  unknown                           int64
  action                            int64
  adventure                         int64
  animation                         int64
  children                          int64
  comedy                            int64
  crime                             int64
  documentary                       int64
  drama                             int64
  fantasy                           int64
  filmnoir                          int64
  horror                            int64
  musical                           int64
  mystery                           int64
  romance                           int64
  scifi                             int64
  thriller                          int64
  war                               int64
  western                           int64
  user_id                           int64
  rating                            int64
  timestamp                         int64
  age                               int64
  sex                               string
  occupation                        string
  zip_code                          string
}

The joined table pyodps_ml_100k_lens contains 31 columns combining all fields from the three source tables.

Part 2: Process data in a DataWorks PyODPS node

This part uses the Iris dataset to demonstrate running DataFrame code in a DataWorks PyODPS 3 node, including row-wise UDFs. For background on PyODPS nodes, see Develop a PyODPS 3 task.

Step 1: Create the Iris table

Use the DataWorks table management feature to create the target table.

  1. In the Business Flow section, click the desired workflow. Right-click MaxCompute and select Create Table. In the Create Table dialog box, select a path from the Path drop-down list, enter a name in the Name field, and click Create.

  2. On the table editing page, click the image.png icon in the upper-left corner.

  3. Run the following statement to create the table:

    CREATE TABLE pyodps_iris (
        sepallength double COMMENT 'sepal length (cm)',
        sepalwidth  double COMMENT 'sepal width (cm)',
        petallength double COMMENT 'petal length (cm)',
        petalwidth  double COMMENT 'petal width (cm)',
        name        string COMMENT 'name'
    );

Step 2: Upload the Iris dataset

  1. Right-click the table name and select Import Data. In the Data Import Wizard dialog box, click Next, then click Browse to upload the Iris CSV file you downloaded.

    image

  2. Select Match by location and click Import Data.

Step 3: Create a PyODPS node

In the Business Flow section, click the desired workflow, right-click MaxCompute, and then choose Create Node > PyODPS 3 to create a node for your code.

Step 4: Write and run the code

Paste the following code into the node editor and click the image.png icon to run it. Results appear in the Runtime Log pane.

The code demonstrates three DataFrame operations on the Iris table: previewing rows, selecting a single column, and applying row-wise UDFs.

from odps import ODPS
from odps.df import DataFrame, output
import os

o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)

# Create a DataFrame from the Iris table
iris = DataFrame(o.get_table('pyodps_iris'))

# Preview the first 10 rows
print(iris.head(10))

# Select a single column
print(iris.sepallength.head(5))

# Apply a lambda UDF to compute the sum of two columns
print(iris.apply(lambda row: row.sepallength + row.sepalwidth, axis=1, reduce=True, types='float').rename('sepaladd').head(3))

# Apply a named UDF that returns multiple output rows per input row
@output(['iris_add', 'iris_sub'], ['float', 'float'])
def handle(row):
    # Each yield call produces one output row
    yield row.sepallength - row.sepalwidth, row.sepallength + row.sepalwidth
    yield row.petallength - row.petalwidth, row.petallength + row.petalwidth

print(iris.apply(handle, axis=1).head(5))

Expected output:

# iris.head(10)
   sepallength  sepalwidth  petallength  petalwidth         name
0          4.9         3.0          1.4         0.2  Iris-setosa
1          4.7         3.2          1.3         0.2  Iris-setosa
2          4.6         3.1          1.5         0.2  Iris-setosa
3          5.0         3.6          1.4         0.2  Iris-setosa
4          5.4         3.9          1.7         0.4  Iris-setosa
5          4.6         3.4          1.4         0.3  Iris-setosa
6          5.0         3.4          1.5         0.2  Iris-setosa
7          4.4         2.9          1.4         0.2  Iris-setosa
8          4.9         3.1          1.5         0.1  Iris-setosa
9          5.4         3.7          1.5         0.2  Iris-setosa


# iris.sepallength.head(5)
   sepallength
0          4.9
1          4.7
2          4.6
3          5.0
4          5.4

# iris.apply(lambda ...).rename('sepaladd').head(3)
   sepaladd
0       7.9
1       7.9
2       7.7

# iris.apply(handle, axis=1).head(5)
   iris_add  iris_sub
0       1.9       7.9
1       1.2       1.6
2       1.5       7.9
3       1.1       1.5
4       1.5       7.7

How the UDF works:

  • axis=1 tells apply() to pass each row as the function argument.

  • The @output decorator specifies the output column names and types.

  • yield returns multiple rows from a single input row. Each yield statement in handle() produces one output row—with two yield statements per call, each input row produces two output rows.

What's next

  • Explore the full DataFrame API: filtering with boolean expressions, window functions, and more advanced aggregations

  • Learn how to write and register permanent UDFs in MaxCompute

  • See how to schedule PyODPS nodes as recurring DataWorks workflows