This topic provides examples on how to perform operations on DataFrame objects in typical scenarios by using the SDK for Python.
DataFrame
PyODPS provides a pandas-like API, PyODPS DataFrame, which can make full use of the computing power of MaxCompute. For more information, see DataFrame.
In this example, the following tables exist: pyodps_ml_100k_movies that contains movie-related data, pyodps_ml_100k_users that contains user-related data, and pyodps_ml_100k_ratings that contains rating-related data.
Create an entry object of MaxCompute.
import os from odps import ODPS # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to your AccessKey ID. # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to your AccessKey secret. # We recommend that you do not directly use your AccessKey ID or AccessKey secret. o = ODPS( os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'), os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'), project='your-default-project', endpoint='your-end-point', )Call a table object and create the DataFrame object named users.
from odps.df import DataFrame users = DataFrame(o.get_table('pyodps_ml_100k_users'))Perform the following operations on the DataFrame object:
View the fields of DataFrame and the types of these fields based on the dtypes attribute.
users.dtypesUse the head method to obtain the first N data records for a quick data preview.
users.head(10)The following result is returned:
-
user_id
age
sex
occupation
zip_code
0
1
24
M
technician
85711
1
2
53
F
other
94043
2
3
23
M
writer
32067
3
4
24
M
technician
43537
4
5
33
F
other
15213
5
6
42
M
executive
98101
6
7
57
M
administrator
91344
7
8
36
M
administrator
05201
8
9
29
M
student
01002
9
10
53
M
lawyer
90703
Filter fields.
Filter some fields.
users[['user_id', 'age']].head(5)The following result is returned:
-
user_id
age
0
1
24
1
2
53
2
3
23
3
4
24
4
5
33
Exclude some fields. Example:
>>> users.exclude('zip_code', 'age').head(5)The following result is returned:
-
user_id
sex
occupation
0
1
M
technician
1
2
F
other
2
3
M
writer
3
4
M
technician
4
5
F
other
When you exclude some fields, you may want to obtain new columns based on computation. For example, add the sex_bool attribute and set it to True if sex is M. Otherwise, set the sex_bool attribute to False. Sample code:
>>> users.select(users.exclude('zip_code', 'sex'), sex_bool=users.sex == 'M').head(5)The following result is returned:
-
user_id
age
occupation
sex_bool
0
1
24
technician
True
1
2
53
other
False
2
3
23
writer
True
3
4
24
technician
True
4
5
33
other
False
Obtain the number of users at the age of 20 to 25.
>>> users.age.between(20, 25).count().rename('count') 943Obtain the numbers of male and female users.
>>> users.groupby(users.sex).count()The following result is returned:
-
sex
count
0
F
273
1
M
670
To divide users by occupation, obtain the top 10 occupations with the most population, and sort the occupations in descending order of population.
>>> df = users.groupby('occupation').agg(count=users['occupation'].count()) >>> df.sort(df['count'], ascending=False)[:10]The following result is returned:
-
occupation
count
0
student
196
1
other
105
2
educator
95
3
administrator
79
4
engineer
67
5
programmer
66
6
librarian
51
7
writer
45
8
executive
32
9
scientist
31
The DataFrame API provides the value_counts method to quickly achieve the same result.
>>> users.occupation.value_counts()[:10]The following result is returned:
-
occupation
count
0
student
196
1
other
105
2
educator
95
3
administrator
79
4
engineer
67
5
programmer
66
6
librarian
51
7
writer
45
8
executive
32
9
scientist
31
Show data in a more intuitive graph.
%matplotlib inlineVisualize data on a horizontal bar chart.
users['occupation'].value_counts().plot(kind='barh', x='occupation', ylabel='prefession')
Visualize data on a histogram. Divide users into 30 groups by age and view the histogram of age distribution. Example:
>>> users.age.hist(bins=30, title="Distribution of users' ages", xlabel='age', ylabel='count of users')
Join three tables to obtain a new table and save it.
movies = DataFrame(o.get_table('pyodps_ml_100k_movies')) ratings = DataFrame(o.get_table('pyodps_ml_100k_ratings')) o.delete_table('pyodps_ml_100k_lens', if_exists=True) lens = movies.join(ratings).join(users).persist('pyodps_ml_100k_lens') lens.dtypesThe following result is returned:
odps.Schema { movie_id int64 title string release_date string video_release_date string imdb_url string user_id int64 rating int64 unix_timestamp int64 age int64 sex string occupation string zip_code string }Divide ages from 0 to 79 into eight age groups.
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79'] cut_lens = lens[lens, lens.age.cut(range(0, 80, 10), right=False, labels=labels).rename('age_group')]View the first 10 data records of a single age in a group.
cut_lens['age_group', 'age'].distinct()[:10]The following result is returned:
-
age_group
age
0
0-9
7
1
10-19
10
2
10-19
11
3
10-19
13
4
10-19
14
5
10-19
15
6
10-19
16
7
10-19
17
8
10-19
18
9
10-19
19
View the total rating and average rating of users in each age group. Example:
cut_lens.groupby ('age_group').agg(cut_lens.rating.count().rename ('total_rating'), cut_lens.rating.mean().rename ('average_rating'))The following result is returned:
-
age_group
average_rating
total_rating
0
0-9
3.767442
43
1
10-19
3.486126
8181
2
20-29
3.467333
39535
3
30-39
3.554444
25696
4
40-49
3.591772
15021
5
50-59
3.635800
8704
6
60-69
3.648875
2623
7
70-79
3.649746
197