This topic describes how to create and manage DataFrame objects and how to use DataFrame to process data.
DataFrame object operations
The MovieLens 100K dataset is used in this example. It contains the following tables: pyodps_ml_100k_movies (movie data), pyodps_ml_100k_users (user data), and pyodps_ml_100k_ratings (rating data).
If no MaxCompute objects are available in the runtime environment, create a MaxCompute object.
```python
import os
from odps import ODPS

# Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to your AccessKey ID.
# Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to your AccessKey secret.
# We recommend that you do not write your AccessKey ID or AccessKey secret directly in your code.
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)
```
Create a DataFrame object by specifying a table.
```python
from odps.df import DataFrame

users = DataFrame(o.get_table('pyodps_ml_100k_users'))
```
You can query the dtypes property to view the fields of the DataFrame object and their data types.

```python
print(users.dtypes)
```
Return value
```
odps.Schema {
  user_id     int64
  age         int64
  sex         string
  occupation  string
  zip_code    string
}
```
Use the head method to preview the first N records.

```python
print(users.head(10))
```
Return value
```
   user_id  age  sex     occupation  zip_code
0        1   24    M     technician     85711
1        2   53    F          other     94043
2        3   23    M         writer     32067
3        4   24    M     technician     43537
4        5   33    F          other     15213
5        6   42    M      executive     98101
6        7   57    M  administrator     91344
7        8   36    M  administrator     05201
8        9   29    M        student     01002
9       10   53    M         lawyer     90703
```
If you do not want to view all the fields, perform the following operations:
Specify the fields that you want to query.
```python
print(users[['user_id', 'age']].head(5))
```
Return value
```
   user_id  age
0        1   24
1        2   53
2        3   23
3        4   24
4        5   33
```
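The PyODPS DataFrame API is intentionally pandas-like, so the same column selection can be tried locally with pandas before running it against MaxCompute. The sample rows below are hypothetical:

```python
import pandas as pd

# Hypothetical local sample mirroring pyodps_ml_100k_users.
users_local = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'age': [24, 53, 23, 24, 33],
    'sex': ['M', 'F', 'M', 'M', 'F'],
})

# Select only the columns of interest, then preview the first rows.
subset = users_local[['user_id', 'age']].head(5)
print(subset)
```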
Exclude several fields.
```python
print(users.exclude('zip_code', 'age').head(5))
```
Return value
```
   user_id  sex     occupation
0        1    M     technician
1        2    F          other
2        3    M         writer
3        4    M     technician
4        5    F          other
```
Exclude specific fields and add new computed fields. For example, add a sex_bool field that is True if sex is M and False otherwise.

```python
print(users.select(users.exclude('zip_code', 'sex'), sex_bool=users.sex == 'M').head(5))
```
Return value
```
   user_id  age     occupation  sex_bool
0        1   24     technician      True
1        2   53          other     False
2        3   23         writer      True
3        4   24     technician      True
4        5   33          other     False
```
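The computed sex_bool column is a vectorized boolean comparison, the same idiom pandas uses. A minimal local sketch with hypothetical sample rows:

```python
import pandas as pd

# Hypothetical sample data.
users_local = pd.DataFrame({
    'user_id': [1, 2, 3],
    'sex': ['M', 'F', 'M'],
})

# Derive a boolean column: True where sex is 'M', False otherwise.
users_local['sex_bool'] = users_local['sex'] == 'M'
print(users_local)
```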
You can obtain the numbers of male and female users.
```python
print(users.groupby(users.sex).agg(count=users.count()))
```
Return value
```
  sex  count
0   F    273
1   M    670
```
To group users by occupation, obtain the 10 occupations that have the largest numbers of users and sort them in descending order by user count.
```python
df = users.groupby('occupation').agg(count=users['occupation'].count())
print(df.sort(df['count'], ascending=False)[:10])
```
Return value
```
      occupation  count
0        student    196
1          other    105
2       educator     95
3  administrator     79
4       engineer     67
5     programmer     66
6      librarian     51
7         writer     45
8      executive     32
9      scientist     31
```
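The group-count-sort pattern above can be sketched locally with pandas, where PyODPS `sort` corresponds to pandas `sort_values`. The sample data is hypothetical:

```python
import pandas as pd

# Hypothetical sample of user occupations.
occupations = pd.DataFrame({
    'occupation': ['student', 'student', 'writer', 'educator', 'student'],
})

# Count users per occupation, then sort descending and keep the top 10.
counts = (occupations.groupby('occupation')
          .size()
          .reset_index(name='count')
          .sort_values('count', ascending=False)
          .head(10))
print(counts)
```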
Alternatively, you can use the value_counts method. The number of records returned by this method is limited by the options.df.odps.sort.limit parameter. For more information, see Configurations.

```python
print(users.occupation.value_counts()[:10])
```
Return value
```
      occupation  count
0        student    196
1          other    105
2       educator     95
3  administrator     79
4       engineer     67
5     programmer     66
6      librarian     51
7         writer     45
8      executive     32
9      scientist     31
```
To display the data in a more intuitive chart, enable built-in charts in IPython.

```python
%matplotlib inline
```
Use a horizontal bar chart to visualize the data.

```python
users['occupation'].value_counts().plot(kind='barh', x='occupation', ylabel='profession')
```
The following chart is displayed:
Divide users into 30 bins by age and view the histogram of the age distribution.

```python
users.age.hist(bins=30, title="Distribution of users' ages", xlabel='age', ylabel='count of users')
```
The following chart is displayed:
Join the three tables and save the result as a new table.

```python
movies = DataFrame(o.get_table('pyodps_ml_100k_movies'))
ratings = DataFrame(o.get_table('pyodps_ml_100k_ratings'))

o.delete_table('pyodps_ml_100k_lens', if_exists=True)
lens = movies.join(ratings).join(users).persist('pyodps_ml_100k_lens')

print(lens.dtypes)
```
Return value
```
odps.Schema {
  movie_id            int64
  title               string
  release_date        string
  video_release_date  string
  imdb_url            string
  user_id             int64
  rating              int64
  unix_timestamp      int64
  age                 int64
  sex                 string
  occupation          string
  zip_code            string
}
```
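PyODPS `join` detects the shared key columns of the two tables automatically; in a local pandas sketch the keys must be named explicitly with `on=`. The tiny sample tables below are hypothetical:

```python
import pandas as pd

# Hypothetical miniature versions of the three MovieLens tables.
movies = pd.DataFrame({'movie_id': [1, 2], 'title': ['Toy Story', 'GoldenEye']})
ratings = pd.DataFrame({'movie_id': [1, 1, 2], 'user_id': [10, 11, 10], 'rating': [5, 4, 3]})
users = pd.DataFrame({'user_id': [10, 11], 'age': [24, 53]})

# Chain the two joins on their shared key columns.
lens = movies.merge(ratings, on='movie_id').merge(users, on='user_id')
print(lens)
```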
Divide users aged 0 to 80 into eight age groups.
```python
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
cut_lens = lens[lens, lens.age.cut(range(0, 81, 10), right=False, labels=labels).rename('age_group')]
```
View the first 10 distinct combinations of age group and age.

```python
print(cut_lens['age_group', 'age'].distinct()[:10])
```
Return value
```
  age_group  age
0       0-9    7
1     10-19   10
2     10-19   11
3     10-19   13
4     10-19   14
5     10-19   15
6     10-19   16
7     10-19   17
8     10-19   18
9     10-19   19
```
View the total number of ratings and the average rating of users in each age group.

```python
print(cut_lens.groupby('age_group').agg(cut_lens.rating.count().rename('total_rating'),
                                        cut_lens.rating.mean().rename('avg_rating')))
```
Return value
```
  age_group  avg_rating  total_rating
0       0-9    3.767442            43
1     10-19    3.486126          8181
2     20-29    3.467333         39535
3     30-39    3.554444         25696
4     40-49    3.591772         15021
5     50-59    3.635800          8704
6     60-69    3.648875          2623
7     70-79    3.649746           197
```
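The binning performed by `cut(range(0, 81, 10), right=False, labels=labels)` uses left-closed, right-open intervals: an age of 10 falls into 10-19, not 0-9. A pure-Python sketch of that bin lookup (an illustration of the interval logic, not the PyODPS implementation):

```python
import bisect

edges = list(range(0, 81, 10))  # [0, 10, 20, ..., 80]
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']

def age_group(age):
    # Left-closed, right-open bins [edges[i], edges[i+1]), matching right=False.
    i = bisect.bisect_right(edges, age) - 1
    if 0 <= i < len(labels):
        return labels[i]
    return None  # outside all bins

print(age_group(7))   # '0-9'
print(age_group(10))  # '10-19'
```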
DataFrame data processing
Before you perform the following steps, download the Iris dataset and save the file to your computer as sample data. A DataWorks PyODPS node is used in this example. For more information, see Create a PyODPS 2 node.
Create a test data table.
Use the table management feature of DataWorks to create a table, and then click DDL mode.
Enter the CREATE TABLE statement and submit the table. Example:
```sql
CREATE TABLE pyodps_iris (
    sepallength double COMMENT 'sepal length (cm)',
    sepalwidth  double COMMENT 'sepal width (cm)',
    petallength double COMMENT 'petal length (cm)',
    petalwidth  double COMMENT 'petal width (cm)',
    name        string COMMENT 'name'
);
```
Upload test data.
Right-click the table that you created and select Import Data from the shortcut menu.
Enter the table name, upload the downloaded dataset file, and then click Next.
Select Match by location and click Import Data.
Create a PyODPS node to store and run the code.
Write the code and click the Run icon. You can view the result in the Runtime Log section in the lower pane.
Code details:
```python
from odps.df import DataFrame, output

# Create the DataFrame object iris from the MaxCompute table.
iris = DataFrame(o.get_table('pyodps_iris'))

# Display part of the iris content.
print(iris.head(10))
print(iris.sepallength.head(5))

# Use a user-defined function to calculate the sum of two columns of iris.
print(iris.apply(lambda row: row.sepallength + row.sepalwidth,
                 axis=1, reduce=True, types='float').rename('sepaladd').head(3))

# Specify the output names and types of the function.
@output(['iris_add', 'iris_sub'], ['float', 'float'])
def handle(row):
    # Use the yield keyword to return multiple rows of results.
    yield row.sepallength + row.sepalwidth, row.sepallength - row.sepalwidth
    yield row.petallength + row.petalwidth, row.petallength - row.petalwidth

# Display the first five rows of the result. axis=1 applies the function row by row.
print(iris.apply(handle, axis=1).head(5))
```
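The generator-style handler above emits two output rows for each input row. A local sketch of that yield pattern without a MaxCompute connection (`handle_local` and `Row` are hypothetical names introduced for illustration):

```python
from collections import namedtuple

# Hypothetical stand-in for a DataFrame row.
Row = namedtuple('Row', ['sepallength', 'sepalwidth', 'petallength', 'petalwidth'])

def handle_local(row):
    # Each yield produces one output row: (add, sub) for the sepal columns,
    # then (add, sub) for the petal columns.
    yield row.sepallength + row.sepalwidth, row.sepallength - row.sepalwidth
    yield row.petallength + row.petalwidth, row.petallength - row.petalwidth

# One input row expands into two output rows.
rows = list(handle_local(Row(5.1, 3.5, 1.4, 0.2)))
print(rows)
```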