This topic describes how to create and manage a DataFrame object and how to use DataFrame to process data.

Create and manage a DataFrame object

The MovieLens 100K dataset is used in this example. The following tables exist: pyodps_ml_100k_movies (movie-related data), pyodps_ml_100k_users (user-related data), and pyodps_ml_100k_ratings (rating-related data).

  1. If the runtime environment does not provide a MaxCompute entry object, create one.
    from odps import ODPS
    o = ODPS('**your-access-id**', '**your-secret-access-key**',
             project='**your-project**', endpoint='**your-end-point**')
  2. Create a DataFrame object by specifying a table.
    from odps.df import DataFrame
    users = DataFrame(o.get_table('pyodps_ml_100k_users'))
  3. Use the dtypes attribute to view the fields of the DataFrame object and the data types of the fields.
    users.dtypes
    odps.Schema {
      user_id             int64
      age                 int64
      sex                 string
      occupation          string
      zip_code            string
    }
  4. Use the head method to preview the first N rows of data.
    users.head(10)
       user_id  age  sex     occupation  zip_code
    0        1   24    M     technician     85711
    1        2   53    F          other     94043
    2        3   23    M         writer     32067
    3        4   24    M     technician     43537
    4        5   33    F          other     15213
    5        6   42    M      executive     98101
    6        7   57    M  administrator     91344
    7        8   36    M  administrator     05201
    8        9   29    M        student     01002
    9       10   53    M         lawyer     90703
  5. If you do not want to view all the fields, perform the following operations as needed:
    • Specify the fields that you want to query.
      users[['user_id', 'age']].head(5)
         user_id  age
      0        1   24
      1        2   53
      2        3   23
      3        4   24
      4        5   33
    • Exclude the fields that you do not want to query.
      users.exclude('zip_code', 'age').head(5)
         user_id  sex  occupation
      0        1    M  technician
      1        2    F       other
      2        3    M      writer
      3        4    M  technician
      4        5    F       other
    • Exclude some fields and add new computed fields. For example, add a sex_bool field that is True if sex is M and False otherwise.
      users.select(users.exclude('zip_code', 'sex'), sex_bool=users.sex == 'M').head(5)
         user_id  age  occupation  sex_bool
      0        1   24  technician      True
      1        2   53       other     False
      2        3   23      writer      True
      3        4   24  technician      True
      4        5   33       other     False
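    The projection and derived-column operations above translate almost mechanically to plain Python. The following sketch (the two sample rows are hypothetical, not taken from the dataset) shows what exclude plus a computed sex_bool column produce:

```python
# Hypothetical sample rows mirroring the users table.
rows = [
    {'user_id': 1, 'age': 24, 'sex': 'M', 'occupation': 'technician', 'zip_code': '85711'},
    {'user_id': 2, 'age': 53, 'sex': 'F', 'occupation': 'other', 'zip_code': '94043'},
]

# Rough equivalent of users.select(users.exclude('zip_code', 'sex'), sex_bool=users.sex == 'M')
def project(row):
    out = {k: v for k, v in row.items() if k not in ('zip_code', 'sex')}
    out['sex_bool'] = row['sex'] == 'M'
    return out

projected = [project(r) for r in rows]
print(projected[0])  # {'user_id': 1, 'age': 24, 'occupation': 'technician', 'sex_bool': True}
```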
  6. Obtain the numbers of male and female users.
    users.groupby(users.sex).agg(count=users.count())
       sex  count
    0    F    273
    1    M    670
  7. Group users by occupation and list the 10 occupations with the most users, sorted in descending order of user count.
    df = users.groupby('occupation').agg(count=users['occupation'].count())
    df.sort(df['count'], ascending=False)[:10]
          occupation  count
    0        student    196
    1          other    105
    2       educator     95
    3  administrator     79
    4       engineer     67
    5     programmer     66
    6      librarian     51
    7         writer     45
    8      executive     32
    9      scientist     31
    Alternatively, use the value_counts method. Note that the number of records returned by this method is limited by the options.df.odps.sort.limit parameter. For more information, see Configurations.
    users.occupation.value_counts()[:10]
          occupation  count
    0        student    196
    1          other    105
    2       educator     95
    3  administrator     79
    4       engineer     67
    5     programmer     66
    6      librarian     51
    7         writer     45
    8      executive     32
    9      scientist     31
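    Both the groupby aggregation and value_counts above boil down to counting occurrences of a key and taking the top N. In plain Python (with a hypothetical sample of occupation values), collections.Counter computes the same result:

```python
from collections import Counter

# Hypothetical sample of occupation values.
occupations = ['student', 'other', 'student', 'educator', 'student', 'other']

counts = Counter(occupations)  # like groupby('occupation').agg(count=...)
top = counts.most_common(2)    # like sorting by count descending and taking the top N
print(top)  # [('student', 3), ('other', 2)]
```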
  8. Visualize the data. If you run the code in a Jupyter Notebook, enable inline plotting first.
    %matplotlib inline
  9. Use a horizontal column chart to visualize the data.
    users['occupation'].value_counts().plot(kind='barh', x='occupation', ylabel='profession')
    <matplotlib.axes._subplots.AxesSubplot at 0x10653cfd0>
  10. Divide users into 30 groups by age and view the histogram of age distribution.
    users.age.hist(bins=30, title="Distribution of users' ages", xlabel='age', ylabel='count of users')
    <matplotlib.axes._subplots.AxesSubplot at 0x10667a510>
  11. Join the three tables and save them as a new table.
    movies = DataFrame(o.get_table('pyodps_ml_100k_movies'))
    ratings = DataFrame(o.get_table('pyodps_ml_100k_ratings'))
    
    o.delete_table('pyodps_ml_100k_lens', if_exists=True)
    lens = movies.join(ratings).join(users).persist('pyodps_ml_100k_lens')
    
    lens.dtypes
    odps.Schema {
      movie_id                            int64
      title                               string
      release_date                        string
      video_release_date                  string
      imdb_url                            string
      user_id                             int64
      rating                              int64
      unix_timestamp                      int64
      age                                 int64
      sex                                 string
      occupation                          string
      zip_code                            string
    }
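    The chained join above matches rows on their shared key columns: movie_id between movies and ratings, then user_id between the intermediate result and users. A minimal hash-join sketch in plain Python, on hypothetical miniature tables:

```python
# Hypothetical miniature tables.
movies = [{'movie_id': 1, 'title': 'Toy Story'}]
ratings = [{'movie_id': 1, 'user_id': 7, 'rating': 4}]
users = [{'user_id': 7, 'age': 57, 'occupation': 'administrator'}]

def hash_join(left, right, key):
    # Index the right table by the join key, then probe with each left row.
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    joined = []
    for l in left:
        for r in index.get(l[key], []):
            joined.append({**l, **r})
    return joined

lens = hash_join(hash_join(movies, ratings, 'movie_id'), users, 'user_id')
print(lens[0]['title'], lens[0]['rating'], lens[0]['occupation'])  # Toy Story 4 administrator
```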
  12. Divide users aged 0 to 80 into eight age groups.
    labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
    cut_lens = lens[lens, lens.age.cut(range(0, 81, 10), right=False, labels=labels).rename('age_group')]
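    The cut call assigns each age to the half-open interval [lower, upper) that contains it, because right=False makes each bin closed on the left. A plain-Python sketch of the same binning, using bisect on the bin edges:

```python
import bisect

labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
edges = list(range(0, 81, 10))  # [0, 10, 20, ..., 80]

def age_group(age):
    # right=False means intervals are closed on the left: [0, 10), [10, 20), ...
    i = bisect.bisect_right(edges, age) - 1
    return labels[i]

print(age_group(24))  # 20-29
print(age_group(10))  # 10-19
```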
  13. Use distinct to view the first 10 distinct ages together with the age groups they fall into.
    cut_lens['age_group', 'age'].distinct()[:10]
       age_group  age
    0       0-9    7
    1     10-19   10
    2     10-19   11
    3     10-19   13
    4     10-19   14
    5     10-19   15
    6     10-19   16
    7     10-19   17
    8     10-19   18
    9     10-19   19
  14. View the total rating and average rating of users in each age group.
    cut_lens.groupby('age_group').agg(cut_lens.rating.count().rename('total_rating'), cut_lens.rating.mean().rename('avg_rating'))
      age_group  avg_rating  total_rating
    0       0-9    3.767442            43
    1     10-19    3.486126          8181
    2     20-29    3.467333         39535
    3     30-39    3.554444         25696
    4     40-49    3.591772         15021
    5     50-59    3.635800          8704
    6     60-69    3.648875          2623
    7     70-79    3.649746           197
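The per-group count and mean computed above can be sketched in plain Python by accumulating a (count, sum) pair per group. The sample (age_group, rating) pairs below are hypothetical:

```python
from collections import defaultdict

# Hypothetical (age_group, rating) pairs.
records = [('0-9', 4), ('0-9', 3), ('10-19', 5)]

acc = defaultdict(lambda: [0, 0])  # group -> [count, sum]
for group, rating in records:
    acc[group][0] += 1
    acc[group][1] += rating

stats = {g: {'total_rating': c, 'avg_rating': s / c} for g, (c, s) in acc.items()}
print(stats['0-9'])  # {'total_rating': 2, 'avg_rating': 3.5}
```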

Use DataFrame to process data

Before you perform the following steps, download the Iris dataset and save the file to your computer as sample data. A DataWorks PyODPS node is used in this example. For more information, see PyODPS node.

  1. Create a test data table.
    Use the table management feature of DataWorks to create a table. Then, click DDL mode and execute the following statement to generate the table schema:
    CREATE TABLE pyodps_iris (
        sepallength double COMMENT 'sepallength(cm)',
        sepalwidth double COMMENT 'sepalwidth(cm)',
        petallength double COMMENT 'petallength(cm)',
        petalwidth double COMMENT 'petalwidth(cm)',
        name string COMMENT 'name'
    );
  2. Upload test data.
    Click Import data. Enter the table name, upload the downloaded dataset, and then click Next. Select Match by location and click Import Data.
  3. Create a PyODPS node to store and run code.
  4. Enter the code and click the Run icon. You can view the result in the Run Log section in the lower pane. Code details:
    from odps.df import DataFrame
    from odps.df import output
    iris = DataFrame(o.get_table('pyodps_iris')) # Create the DataFrame object iris from the MaxCompute table.
    print(iris.head(10))
    print(iris.sepallength.head(5))  # Display part of the iris content.
    
    # Use a UDF to calculate the sum of two columns of iris.
    print(iris.apply(lambda row: row.sepallength + row.sepalwidth, axis=1, reduce=True, types='float').rename('sepaladd').head(3))
    
    # Specify the names and types of the output fields for the UDF.
    @output(['iris_add', 'iris_sub'], ['float', 'float'])
    def handle(row):
        # Use the yield keyword to return multiple rows of results.
        yield row.sepallength + row.sepalwidth, row.sepallength - row.sepalwidth
        yield row.petallength + row.petalwidth, row.petallength - row.petalwidth
    # Display the first five rows of the result. axis=1 indicates that the UDF is applied to each row.
    print(iris.apply(handle, axis=1).head(5))
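The final apply call can be mimicked in plain Python: a generator-style UDF yields two output rows per input row. A sketch with one hypothetical input row, assuming iris_add is the sum and iris_sub the difference of each column pair:

```python
# Hypothetical row with the four numeric iris columns.
row = {'sepallength': 5.1, 'sepalwidth': 3.5, 'petallength': 1.4, 'petalwidth': 0.2}

def handle(row):
    # Each yield becomes one output row with columns (iris_add, iris_sub).
    yield row['sepallength'] + row['sepalwidth'], row['sepallength'] - row['sepalwidth']
    yield row['petallength'] + row['petalwidth'], row['petallength'] - row['petalwidth']

result = list(handle(row))
print(result)
```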