This topic describes how to create and manage a DataFrame object. It also provides information about how to use DataFrame to process data.
Create and manage a DataFrame object
The MovieLens 100K dataset is used in this example. The following tables exist: pyodps_ml_100k_movies
(movie-related data), pyodps_ml_100k_users
(user-related data), and pyodps_ml_100k_ratings
(rating-related data).
- If no MaxCompute objects are available in the runtime environment, create a MaxCompute
object.
from odps import ODPS o = ODPS('**your-access-id**', '**your-secret-access-key**', project='**your-project**', endpoint='**your-end-point**')
- Create a DataFrame object by specifying a table.
from odps.df import DataFrame users = DataFrame(o.get_table('pyodps_ml_100k_users'))
- Use the
dtypes
attribute to view the fields of the DataFrame object and the data types of the fields.users.dtypes odps.Schema { user_id int64 age int64 sex string occupation string zip_code string }
- Use the
head
method to have a quick preview on the first N data records.users.head(10) user_id age sex occupation zip_code 0 1 24 M technician 85711 1 2 53 F other 94043 2 3 23 M writer 32067 3 4 24 M technician 43537 4 5 33 F other 15213 5 6 42 M executive 98101 6 7 57 M administrator 91344 7 8 36 M administrator 05201 8 9 29 M student 01002 9 10 53 M lawyer 90703
- If you do not want to view all the fields, perform the following operations as needed:
- Specify the fields that you want to query.
users[['user_id', 'age']].head(5) user_id age 0 1 24 1 2 53 2 3 23 3 4 24 4 5 33
- Exclude the fields that you do not want to query.
users.exclude('zip_code', 'age').head(5) user_id sex occupation 0 1 M technician 1 2 F other 2 3 M writer 3 4 M technician 4 5 F other
- Exclude some fields and add new fields based on computation. For example, add the
sex_bool
attribute and set it to True ifsex
isM
. Otherwise, set it to False.users.select(users.exclude('zip_code', 'sex'), sex_bool=users.sex == 'M').head(5) user_id age occupation sex_bool 0 1 24 technician True 1 2 53 other False 2 3 23 writer True 3 4 24 technician True 4 5 33 other False
- Specify the fields that you want to query.
- Obtain the numbers of male and female users.
users.groupby(users.sex).agg(count=users.count()) sex count 0 F 273 1 M 670
- Divide users by occupation, obtain the first 10 occupations that have the largest
population, and sort the occupations in descending order of population.
Alternatively, use thedf = users.groupby('occupation').agg(count=users['occupation'].count()) df.sort(df['count'], ascending=False)[:10] occupation count 0 student 196 1 other 105 2 educator 95 3 administrator 79 4 engineer 67 5 programmer 66 6 librarian 51 7 writer 45 8 executive 32 9 scientist 31
value_counts
method. Note that the number of records returned by this method is limited by theoptions.df.odps.sort.limit
parameter. For more information, see Configurations.users.occupation.value_counts()[:10] occupation count 0 student 196 1 other 105 2 educator 95 3 administrator 79 4 engineer 67 5 programmer 66 6 librarian 51 7 writer 45 8 executive 32 9 scientist 31
- Show data in an intuitive graph.
%matplotlib inline
- Use a horizontal column chart to visualize the data.
users['occupation'].value_counts().plot(kind='barh', x='occupation', ylabel='prefession') <matplotlib.axes._subplots.AxesSubplot at 0x10653cfd0>
- Divide users into 30 groups by age and view the histogram of age distribution.
users.age.hist(bins=30, title="Distribution of users' ages", xlabel='age', ylabel='count of users') <matplotlib.axes._subplots.AxesSubplot at 0x10667a510>
- Join the three tables and save them as a new table.
movies = DataFrame(o.get_table('pyodps_ml_100k_movies')) ratings = DataFrame(o.get_table('pyodps_ml_100k_ratings')) o.delete_table('pyodps_ml_100k_lens', if_exists=True) lens = movies.join(ratings).join(users).persist('pyodps_ml_100k_lens') lens.dtypes odps.Schema { movie_id int64 title string release_date string video_release_date string imdb_url string user_id int64 rating int64 unix_timestamp int64 age int64 sex string occupation string zip_code string }
- Divide users aged 0 to 80 into eight age groups.
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79'] cut_lens = lens[lens, lens.age.cut(range(0, 81, 10), right=False, labels=labels).rename('age_group')]
- View the first 10 ages each with only one user, as well as their age groups.
cut_lens['age_group', 'age'].distinct()[:10] age_group age 0 0-9 7 1 10-19 10 2 10-19 11 3 10-19 13 4 10-19 14 5 10-19 15 6 10-19 16 7 10-19 17 8 10-19 18 9 10-19 19
- View the total rating and average rating of users in each age group.
cut_lens.groupby('age_group').agg(cut_lens.rating.count().rename('total_rating'), cut_lens.rating.mean().rename('avg_rating')) age_group avg_rating total_rating 0 0-9 3.767442 43 1 10-19 3.486126 8181 2 20-29 3.467333 39535 3 30-39 3.554444 25696 4 40-49 3.591772 15021 5 50-59 3.635800 8704 6 60-69 3.648875 2623 7 70-79 3.649746 197
Use DataFrame to process data
Before you perform the following steps, download the Iris dataset and save the file to your computer as sample data. A DataWorks PyODPS node is used in this example. For more information, see PyODPS node.
- Create a test data table.
Use the table management feature of DataWorks to create a table. Then, click DDL mode.
Execute the following statement to generate a table structure:
CREATE TABLE pyodps_iris ( sepallength double COMMENT 'sepallength(cm)', sepalwidth double COMMENT 'sepalwidth(cm)', petallength double COMMENT 'petallength(cm)', petalwidth double COMMENT 'petalwidth(cm)', name string COMMENT 'name' ) ;
- Upload test data.
Click Import data.
Enter the table name, upload the downloaded dataset, and then click Next.
Select Match by location and click Import Data.
- Create a PyODPS node to store and run code.
- Enter the code and click the Run icon. You can view the result in the Run Log section in the lower pane.
Code details:
from odps.df import DataFrame from odps.df import output iris = DataFrame(o.get_table('pyodps_iris')) # Create the DataFrame object iris from the MaxCompute table. print iris.head(10) print iris.sepallength.head(5) # Display part of the iris content. # Use a UDF to calculate the sum of two columns of iris. print iris.apply(lambda row: row.sepallength + row.sepalwidth, axis=1, reduce=True, types='float').rename('sepaladd').head(3) # Specify the name and type of the output field for the UDF. @output(['iris_add', 'iris_sub'], ['float', 'float']) def handle(row): # Use the yield keyword to return multiple rows of results. yield row.sepallength - row.sepalwidth,row.sepallength + row.sepalwidth yield row.petallength - row.petalwidth,row.petallength + row.petalwidth # Display the results of the first five rows. axis=1 indicates that the axis of the column extends horizontally. print iris.apply(handle,axis=1).head(5)