PyODPS DataFrame lets you combine tables using join and union operations. Use join to match rows across two Collections on a shared key; use union or concat to stack two Collections with the same schema into one.
Operations covered on this page:
| Operation | What it does |
|---|---|
join | Inner join two Collections on matching column values |
left_join, right_join, outer_join | Outer joins; duplicate columns are renamed with suffixes |
| Self join | Join a Collection against itself using view() |
mapjoin | Perform a mapjoin operation on the right table |
union / concat | Stack two Collections with matching schemas into one |
Prerequisites
Before you begin, ensure that you have:
Imported the example tables into MaxCompute. Follow the "Prepare data" section in Getting started
Initialized
moviesandratingsDataFrames as shown below
from odps.df import DataFrame
movies = DataFrame(o.get_table('pyodps_ml_100k_movies'))
ratings = DataFrame(o.get_table('pyodps_ml_100k_ratings'))The two tables have the following schemas:
>>> movies.dtypes
odps.Schema {
movie_id int64
title string
release_date string
video_release_date string
_url string
}
>>> ratings.dtypes
odps.Schema {
user_id int64
movie_id int64
rating int64
unix_timestamp int64
}Join operations
PyODPS DataFrame joins two Collection objects and returns a new Collection. The join type and the way you write the join condition together determine how duplicate column names appear in the result.
Inner join on columns with the same name
For an inner join (join), pass the shared column name as a string to on. The result keeps only one copy of the join column:
>>> movies.join(ratings, on='movie_id').head(3)
movie_id title release_date video_release_date url user_id rating unix_timestamp
0 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 49 3 888068877
1 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 621 5 881444887
2 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 291 3 874833936Omitting on entirely has the same effect — PyODPS automatically joins on all columns that share the same name:
>>> movies.join(ratings).head(3)
movie_id title release_date video_release_date url user_id rating unix_timestamp
0 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 49 3 888068877
1 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 621 5 881444887
2 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 291 3 874833936This single-copy behavior applies to inner joins only. Forleft_join,right_join, andouter_join, duplicate columns are always renamed with suffixes. See Outer joins.
Join on columns with different names
When the join key has a different name in each table, pass a list of (left_col, right_col) tuples to on:
>>> ratings2 = ratings[ratings.exclude('movie_id'), ratings.movie_id.rename('movie_id2')]
>>> ratings2.dtypes
odps.Schema {
user_id int64
rating int64
unix_timestamp int64
movie_id2 int64
}
>>> movies.join(ratings2, on=[('movie_id', 'movie_id2')]).head(3)
movie_id title release_date video_release_date url user_id rating unix_timestamp movie_id2
0 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 49 3 888068877 3
1 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 621 5 881444887 3
2 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 291 3 874833936 3You can also pass an equality expression directly — the result is identical:
>>> movies.join(ratings2, on=[movies.movie_id == ratings2.movie_id2]).head(3)
movie_id title release_date video_release_date url user_id rating unix_timestamp movie_id2
0 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 49 3 888068877 3
1 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 621 5 881444887 3
2 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 291 3 874833936 3Outer joins
For left_join, right_join, and outer_join, PyODPS always renames duplicate columns with suffixes to preserve both values. The default suffixes are _x (left table) and _y (right table):
>>> movies.left_join(ratings, on='movie_id').head(3)
movie_id_x title release_date video_release_date url user_id movie_id_y rating unix_timestamp
0 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 49 3 3 888068877
1 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 621 3 5 881444887
2 3 Four Rooms (1995) 01-Jan-1995 http://example.aliyundoc.com/M/title-exact?Four%20Rooms%... 291 3 3 874833936Because movie_id appears in both tables, the result has movie_id_x (from movies) and movie_id_y (from ratings).
Custom suffixes: Pass a 2-tuple to suffixes to override the defaults:
movies.left_join(ratings, on='movie_id', suffixes=('_movie', '_rating'))Suppress duplicate columns: Set merge_columns=True to keep only one copy of each duplicate column. PyODPS selects the non-null value from either side:
movies.left_join(ratings, on='movie_id', merge_columns=True)Self join
To join a Collection against itself, first create a view with view() so PyODPS can distinguish the left and right sides:
>>> movies2 = movies.view()
>>> movies.join(movies2, movies.movie_id == movies2.movie_id)[movies, movies2.movie_id.rename('movie_id2')].head(3)
movie_id title_x release_date_x video_release_date_x \
0 2 GoldenEye (1995) 01-Jan-1995 True
1 3 Four Rooms (1995) 01-Jan-1995 True
2 4 Get Shorty (1995) 01-Jan-1995 True
url_x movie_id2
0 http://example.aliyundoc.com/M/title-exact?GoldenEye%20(... 2
1 http://example.aliyundoc.comtitle-exact?Four%20Rooms%... 3
2 http://example.aliyundoc.com/M/title-exact?Get%20Shorty%... 4Map join
To perform a mapjoin operation, set mapjoin to True. The system then performs the mapjoin operation on the right table:
movies.join(ratings, on='movie_id', mapjoin=True)PyODPS also supports joining a PyODPS Collection with a pandas DataFrame or a database Collection. In that case, the join runs on PyODPS.
Union operations
Use union or concat to stack two tables vertically into one. Both tables must have the same column names and data types. Column order does not need to match — PyODPS aligns columns by name:
>>> mov1 = movies[movies.movie_id < 3]['movie_id', 'title']
>>> mov2 = movies[(movies.movie_id > 3) & (movies.movie_id < 6)]['title', 'movie_id']
>>> mov1.union(mov2)
movie_id title
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 4 Get Shorty (1995)
3 5 Copycat (1995)PyODPS also supports union between a PyODPS Collection and a pandas DataFrame or a database Collection. In that case, the operation runs on PyODPS.