This topic describes the join and union operations that are supported by Alibaba Cloud MaxCompute SDK for Python (PyODPS) DataFrame to merge data in tables.
>>> from odps.df import DataFrame
>>> movies = DataFrame(o.get_table('pyodps_ml_100k_movies'))
>>> ratings = DataFrame(o.get_table('pyodps_ml_100k_ratings'))
>>> movies.dtypes
odps.Schema {
movie_id int64
title string
release_date string
video_release_date string
imdb_url string
}
>>> ratings.dtypes
odps.Schema {
user_id int64
movie_id int64
rating int64
unix_timestamp int64
}
Join operation
join
two Collection objects.
- If you do not specify
join
conditions, the DataFrame API uses the columns of the same name tojoin
the Collection objects.>>> movies.join(ratings).head(3) movie_id title release_date video_release_date imdb_url user_id rating unix_timestamp 0 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 49 3 888068877 1 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 621 5 881444887 2 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 291 3 874833936
- You can specify explicit
join
conditions. For ajoin
operation, if the column names specified in theon
condition for the two Collection objects are the same, the system uses one of the specified columns for the new table.>>> movies.join(ratings, on='movie_id').head(3) movie_id title release_date video_release_date imdb_url user_id rating unix_timestamp 0 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 49 3 888068877 1 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 621 5 881444887 2 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 291 3 874833936
- For other types of
join
operations such asleft_join
, if the column names specified in the on condition for the two Collection objects are the same, the system renames the specified columns for the new table.
In the preceding sample code, the two>>> movies.left_join(ratings, on='movie_id').head(3) movie_id_x title release_date video_release_date imdb_url user_id movie_id_y rating unix_timestamp 0 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 49 3 3 888068877 1 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 621 3 5 881444887 2 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 291 3 3 874833936
movie_id
columns are renamed asmovie_id_x
andmovie_id_y
. The renaming rule depends on thesuffixes
parameter. By default, the value of the suffixes parameter is('_x', '_y')
. When columns of the same name are found, the system renames the columns by using the specified suffixes.>>> ratings2 = ratings[ratings.exclude('movie_id'), ratings.movie_id.rename('movie_id2')] >>> ratings2.dtypes odps.Schema { user_id int64 rating int64 unix_timestamp int64 movie_id2 int64 } >>> movies.join(ratings2, on=[('movie_id', 'movie_id2')]).head(3) movie_id title release_date video_release_date imdb_url user_id rating unix_timestamp movie_id2 0 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 49 3 888068877 3 1 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 621 5 881444887 3 2 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 291 3 874833936 3
- You can specify an expression that uses the equality operator in the
on
condition to rename columns for the new table.>>> movies.join(ratings2, on=[movies.movie_id == ratings2.movie_id2]).head(3) movie_id title release_date video_release_date imdb_url user_id rating unix_timestamp movie_id2 0 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 49 3 888068877 3 1 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 621 5 881444887 3 2 3 Four Rooms (1995) 01-Jan-1995 http://us.imdb.com/M/title-exact?Four%20Rooms%... 291 3 874833936 3
- If you perform a
self-join
operation, you can use theview
method to retrieve columns from the left and right Collection objects.>>> movies2 = movies.view() >>> movies.join(movies2, movies.movie_id == movies2.movie_id)[movies, movies2.movie_id.rename('movie_id2')].head(3) movie_id title_x release_date_x video_release_date_x \ 0 2 GoldenEye (1995) 01-Jan-1995 True 1 3 Four Rooms (1995) 01-Jan-1995 True 2 4 Get Shorty (1995) 01-Jan-1995 True imdb_url_x movie_id2 0 http://us.imdb.com/M/title-exact?GoldenEye%20(... 2 1 http://us.imdb.com/M/title-exact?Four%20Rooms%... 3 2 http://us.imdb.com/M/title-exact?Get%20Shorty%... 4
PyODPS DataFrame supports the
left_join
,right_join
, andouter_join
operations in addition to thejoin
operation. In the left_join, right_join, and outer_join operations, renamed columns are suffixed with_x
and_y
by default. You can use a 2-tuple
to define the suffixes in thesuffixes
parameter. - When you perform the left_join, right_join, or outer_join operation, to avoid duplicate
columns in the new table, set the
merge_columns
option to True. The system then selects non-null values from the duplicate columns as the values in the new column.>>> movies.left_join(ratings, on='movie_id', merge_columns=True)
To use the mapjoin
operation, set mapjoin
to True. The system then performs the mapjoin
operation on the right Collection object. You can join
a Collection object from MaxCompute to a Collection object from pandas, or join
a Collection object from MaxCompute to a Collection object from a database. In both
cases, MaxCompute executes the computation.
Union operation
union
or concat
operation to combine the tables into a single table regardless of how the columns
are ordered.
>>> mov1 = movies[movies.movie_id < 3]['movie_id', 'title']
>>> mov2 = movies[(movies.movie_id > 3) & (movies.movie_id < 6)]['title', 'movie_id']
>>> mov1.union(mov2)
movie_id title
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 4 Get Shorty (1995)
3 5 Copycat (1995)
You can union
a Collection object from MaxCompute to a Collection object from pandas, or union
a Collection object from MaxCompute to a Collection object from a database. In both
cases, MaxCompute executes the computation.