Merge PyODPS DataFrames using join and union - MaxCompute

PyODPS DataFrame lets you combine tables using join and union operations. Use join to match rows across two Collections on a shared key; use union or concat to stack two Collections with the same schema into one.

Operations covered on this page:

Operation	What it does
`join`	Inner join two Collections on matching column values
`left_join`, `right_join`, `outer_join`	Outer joins; duplicate columns are renamed with suffixes
Self join	Join a Collection against itself using `view()`
`mapjoin`	Perform a mapjoin operation on the right table
`union` / `concat`	Stack two Collections with matching schemas into one

Prerequisites

Before you begin, ensure that you have:

Imported the example tables into MaxCompute. Follow the "Prepare data" section in Getting started
Initialized movies and ratings DataFrames as shown below

from odps.df import DataFrame

movies = DataFrame(o.get_table('pyodps_ml_100k_movies'))
ratings = DataFrame(o.get_table('pyodps_ml_100k_ratings'))

The two tables have the following schemas:

>>> movies.dtypes
odps.Schema {
  movie_id                            int64
  title                               string
  release_date                        string
  video_release_date                  string
  _url                                string
}

>>> ratings.dtypes
odps.Schema {
  user_id                     int64
  movie_id                    int64
  rating                      int64
  unix_timestamp              int64
}

Join operations

PyODPS DataFrame joins two Collection objects and returns a new Collection. The join type and the way you write the join condition together determine how duplicate column names appear in the result.

Inner join on columns with the same name

For an inner join (join), pass the shared column name as a string to on. The result keeps only one copy of the join column:

>>> movies.join(ratings, on='movie_id').head(3)
   movie_id              title  release_date  video_release_date                                            url  user_id  rating  unix_timestamp
0         3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...       49       3       888068877
1         3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...      621       5       881444887
2         3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...      291       3       874833936

Omitting on entirely has the same effect — PyODPS automatically joins on all columns that share the same name:

>>> movies.join(ratings).head(3)
   movie_id              title  release_date  video_release_date                                           url  user_id  rating  unix_timestamp
0         3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...       49       3       888068877
1         3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...      621       5       881444887
2         3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...      291       3       874833936

This single-copy behavior applies to inner joins only. For left_join, right_join, and outer_join, duplicate columns are always renamed with suffixes. See Outer joins.

Join on columns with different names

When the join key has a different name in each table, pass a list of (left_col, right_col) tuples to on:

>>> ratings2 = ratings[ratings.exclude('movie_id'), ratings.movie_id.rename('movie_id2')]
>>> ratings2.dtypes
odps.Schema {
  user_id                     int64
  rating                      int64
  unix_timestamp              int64
  movie_id2                   int64
}

>>> movies.join(ratings2, on=[('movie_id', 'movie_id2')]).head(3)
   movie_id              title  release_date  video_release_date                                           url  user_id  rating  unix_timestamp  movie_id2
0         3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...       49       3       888068877          3
1         3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...      621       5       881444887          3
2         3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...      291       3       874833936          3

You can also pass an equality expression directly — the result is identical:

>>> movies.join(ratings2, on=[movies.movie_id == ratings2.movie_id2]).head(3)
   movie_id              title  release_date  video_release_date                                           url  user_id  rating  unix_timestamp  movie_id2
0         3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...       49       3       888068877          3
1         3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...      621       5       881444887          3
2         3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...      291       3       874833936          3

Outer joins

For left_join, right_join, and outer_join, PyODPS always renames duplicate columns with suffixes to preserve both values. The default suffixes are _x (left table) and _y (right table):

>>> movies.left_join(ratings, on='movie_id').head(3)
   movie_id_x              title  release_date  video_release_date                                           url  user_id  movie_id_y  rating  unix_timestamp
0           3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...       49           3       3       888068877
1           3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...      621           3       5       881444887
2           3  Four Rooms (1995)   01-Jan-1995                      http://example.aliyundoc.com/M/title-exact?Four%20Rooms%...      291           3       3       874833936

Because movie_id appears in both tables, the result has movie_id_x (from movies) and movie_id_y (from ratings).

Custom suffixes: Pass a 2-tuple to suffixes to override the defaults:

movies.left_join(ratings, on='movie_id', suffixes=('_movie', '_rating'))

Suppress duplicate columns: Set merge_columns=True to keep only one copy of each duplicate column. PyODPS selects the non-null value from either side:

movies.left_join(ratings, on='movie_id', merge_columns=True)

Self join

To join a Collection against itself, first create a view with view() so PyODPS can distinguish the left and right sides:

>>> movies2 = movies.view()
>>> movies.join(movies2, movies.movie_id == movies2.movie_id)[movies, movies2.movie_id.rename('movie_id2')].head(3)
   movie_id            title_x release_date_x video_release_date_x  \
0         2   GoldenEye (1995)    01-Jan-1995                 True
1         3  Four Rooms (1995)    01-Jan-1995                 True
2         4  Get Shorty (1995)    01-Jan-1995                 True

                                         url_x  movie_id2
0  http://example.aliyundoc.com/M/title-exact?GoldenEye%20(...          2
1  http://example.aliyundoc.comtitle-exact?Four%20Rooms%...          3
2  http://example.aliyundoc.com/M/title-exact?Get%20Shorty%...          4

Map join

To perform a mapjoin operation, set mapjoin to True. The system then performs the mapjoin operation on the right table:

movies.join(ratings, on='movie_id', mapjoin=True)

PyODPS also supports joining a PyODPS Collection with a pandas DataFrame or a database Collection. In that case, the join runs on PyODPS.

Union operations

Use union or concat to stack two tables vertically into one. Both tables must have the same column names and data types. Column order does not need to match — PyODPS aligns columns by name:

>>> mov1 = movies[movies.movie_id < 3]['movie_id', 'title']
>>> mov2 = movies[(movies.movie_id > 3) & (movies.movie_id < 6)]['title', 'movie_id']
>>> mov1.union(mov2)
   movie_id              title
0         1   Toy Story (1995)
1         2   GoldenEye (1995)
2         4  Get Shorty (1995)
3         5     Copycat (1995)

PyODPS also supports union between a PyODPS Collection and a pandas DataFrame or a database Collection. In that case, the operation runs on PyODPS.