
MaxCompute:Getting started

Last Updated:Aug 08, 2023

This topic describes how to create and manage a DataFrame object. It also provides information about how to use DataFrame to process data.

DataFrame object operations

The MovieLens 100K dataset is used in this example. The MovieLens 100K dataset contains the following tables: pyodps_ml_100k_movies (movie-related data), pyodps_ml_100k_users (user-related data), and pyodps_ml_100k_ratings (rating-related data).

  1. If no MaxCompute objects are available in the runtime environment, create a MaxCompute object.

    import os
    from odps import ODPS
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to your AccessKey ID. 
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to your AccessKey secret. 
    # We recommend that you do not directly use your AccessKey ID or AccessKey secret.
    o = ODPS(
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
        project='your-default-project',
        endpoint='your-end-point',
    )
  2. Create a DataFrame object by specifying a table.

    from odps.df import DataFrame
    users = DataFrame(o.get_table('pyodps_ml_100k_users'))
  3. You can query the dtypes property to view the fields of the DataFrame object and the data types of the fields.

    print(users.dtypes)

    Return value

    odps.Schema {
      user_id             int64
      age                 int64
      sex                 string
      occupation          string
      zip_code            string
    }
  4. Use the head method to preview the first records of the DataFrame object. Pass the number of records that you want to view as an argument.

    print(users.head(10))

    Return value

       user_id  age  sex     occupation  zip_code
    0        1   24    M     technician     85711
    1        2   53    F          other     94043
    2        3   23    M         writer     32067
    3        4   24    M     technician     43537
    4        5   33    F          other     15213
    5        6   42    M      executive     98101
    6        7   57    M  administrator     91344
    7        8   36    M  administrator     05201
    8        9   29    M        student     01002
    9       10   53    M         lawyer     90703
  5. If you do not want to view all fields, use one of the following methods:

    • Specify the fields that you want to query.

      print(users[['user_id', 'age']].head(5))

      Return value

         user_id  age
      0        1   24
      1        2   53
      2        3   23
      3        4   24
      4        5   33
    • Exclude several fields.

      print(users.exclude('zip_code', 'age').head(5))

      Return value

         user_id  sex  occupation
      0        1    M  technician
      1        2    F       other
      2        3    M      writer
      3        4    M  technician
      4        5    F       other
    • Exclude specific fields and add new fields based on computation. For example, add the sex_bool field and set sex_bool to True if sex is M. If the value of sex is not M, set sex_bool to False.

      print(users.select(users.exclude('zip_code', 'sex'), sex_bool=users.sex == 'M').head(5))

      Return value

         user_id  age  occupation  sex_bool
      0        1   24  technician      True
      1        2   53       other     False
      2        3   23      writer      True
      3        4   24  technician      True
      4        5   33       other     False
  6. Group users by sex to obtain the numbers of male and female users.

    print(users.groupby(users.sex).agg(count=users.count()))

    Return value

       sex  count
    0    F    273
    1    M    670
  7. To analyze users by occupation, sort the occupations in descending order of user count and view the 10 occupations that have the most users.

    df = users.groupby('occupation').agg(count=users['occupation'].count())
    print(df.sort(df['count'], ascending=False)[:10])

    Return value

          occupation  count
    0        student    196
    1          other    105
    2       educator     95
    3  administrator     79
    4       engineer     67
    5     programmer     66
    6      librarian     51
    7         writer     45
    8      executive     32
    9      scientist     31

    Alternatively, you can use the value_counts method. The number of records returned by this method is limited by the options.df.odps.sort.limit parameter. For more information, see Configurations.

    print(users.occupation.value_counts()[:10])

    Return value

          occupation  count
    0        student    196
    1          other    105
    2       educator     95
    3  administrator     79
    4       engineer     67
    5     programmer     66
    6      librarian     51
    7         writer     45
    8      executive     32
    9      scientist     31
  8. To show data in more intuitive charts, enable built-in plotting in IPython.

    %matplotlib inline
  9. Use a horizontal column chart to visualize data.

    users['occupation'].value_counts().plot(kind='barh', x='occupation', ylabel='profession')

    A horizontal bar chart of the user count for each occupation is displayed.
  10. Divide users into 30 groups by age and view the histogram of age distribution.

    users.age.hist(bins=30, title="Distribution of users' ages", xlabel='age', ylabel='count of users')

    A histogram of the age distribution of users is displayed.
  11. Join the three tables and save the result as a new table.

    movies = DataFrame(o.get_table('pyodps_ml_100k_movies'))
    ratings = DataFrame(o.get_table('pyodps_ml_100k_ratings'))
    
    o.delete_table('pyodps_ml_100k_lens', if_exists=True)
    lens = movies.join(ratings).join(users).persist('pyodps_ml_100k_lens')
    
    print(lens.dtypes)

    Return value

    odps.Schema {
      movie_id                            int64
      title                               string
      release_date                        string
      video_release_date                  string
      imdb_url                            string
      user_id                             int64
      rating                              int64
      unix_timestamp                      int64
      age                                 int64
      sex                                 string
      occupation                          string
      zip_code                            string
    }
  12. Divide users aged 0 to 80 into eight age groups.

    labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
    cut_lens = lens[lens, lens.age.cut(range(0, 81, 10), right=False, labels=labels).rename('age_group')]
  13. View the first 10 distinct ages and the age groups to which they belong.

    print(cut_lens['age_group', 'age'].distinct()[:10])

    Return value

       age_group  age
    0       0-9    7
    1     10-19   10
    2     10-19   11
    3     10-19   13
    4     10-19   14
    5     10-19   15
    6     10-19   16
    7     10-19   17
    8     10-19   18
    9     10-19   19
  14. View the total rating and average rating of users in each age group.

    print(cut_lens.groupby('age_group').agg(cut_lens.rating.count().rename('total_rating'), cut_lens.rating.mean().rename('avg_rating')))

    Return value

      age_group  avg_rating  total_rating
    0       0-9    3.767442            43
    1     10-19    3.486126          8181
    2     20-29    3.467333         39535
    3     30-39    3.554444         25696
    4     40-49    3.591772         15021
    5     50-59    3.635800          8704
    6     60-69    3.648875          2623
    7     70-79    3.649746           197
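Step 7 above notes that the number of records returned by value_counts is capped by the options.df.odps.sort.limit setting. If you need more records, you can raise this cap before calling value_counts. This is a minimal sketch; the value 10000 below is only an example, not a recommendation:

```python
from odps import options

# Raise the cap on the number of records returned by sorted results
# such as value_counts. The value here is an example; choose a limit
# that fits your data size.
options.df.odps.sort.limit = 10000
```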
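The bucketing that age.cut performs in step 12 can be sketched in plain Python. With right=False, each interval includes its left edge and excludes its right edge, which is why age 10 falls into the 10-19 group in the output of step 13:

```python
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']

def age_group(age):
    # Buckets are [0, 10), [10, 20), ..., [70, 80): with right=False each
    # bucket contains its left edge but not its right edge.
    for left, label in zip(range(0, 80, 10), labels):
        if left <= age < left + 10:
            return label
    return None  # age outside the 0-79 range

print(age_group(7))   # 0-9
print(age_group(10))  # 10-19
print(age_group(79))  # 70-79
```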

DataFrame data processing

Before you perform the following steps, download the Iris dataset and save the file to your computer as sample data. A DataWorks PyODPS node is used in this example. For more information, see Create a PyODPS 2 node.

  1. Create a test data table.

    Use the table management feature of DataWorks to create a table, and then click DDL mode. Enter the CREATE TABLE statement and submit the table. Example:

    CREATE TABLE pyodps_iris (
        sepallength double COMMENT 'sepal length (cm)',
        sepalwidth double COMMENT 'sepal width (cm)',
        petallength double COMMENT 'petal length (cm)',
        petalwidth double COMMENT 'petal width (cm)',
        name string COMMENT 'name'
    );
  2. Upload test data.

    Right-click the table that you created and select Import Data from the shortcut menu. Enter the table name, upload the downloaded dataset file, and then click Next. Select Match by location and click Import Data.

  3. Create a PyODPS node to store and run the code.

  4. Write the code and click the Run icon. You can view the result in the Runtime Log section in the lower pane. Code details:

    from odps.df import DataFrame, output
    
    iris = DataFrame(o.get_table('pyodps_iris')) # Create the DataFrame object iris from the MaxCompute table. 
    print(iris.head(10))
    print(iris.sepallength.head(5))  # Display part of the iris content. 
    
    # Use a user-defined function to calculate the sum of two columns of iris. 
    print(iris.apply(lambda row: row.sepallength + row.sepalwidth, axis=1, reduce=True, types='float').rename('sepaladd').head(3))
    
    # Specify the output name and type of the function. 
    @output(['iris_add', 'iris_sub'], ['float', 'float'])
    def handle(row):
        # Use the yield keyword to return multiple rows of results. 
        yield row.sepallength + row.sepalwidth, row.sepallength - row.sepalwidth
        yield row.petallength + row.petalwidth, row.petallength - row.petalwidth
    
    # Display the first five result rows. axis=1 indicates that the function is applied to each row. 
    print(iris.apply(handle, axis=1).head(5))
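The two yield statements in handle produce two result rows for every input row. The following plain-Python sketch, which uses hypothetical sample measurements rather than real table data, shows the sum and difference pairs that one input row expands into:

```python
# Hypothetical sample measurements for one iris record (not from the table).
sepallength, sepalwidth = 5.1, 3.5
petallength, petalwidth = 1.4, 0.2

# One input row expands into two output rows: one pair for the sepal
# measurements and one pair for the petal measurements.
rows = [
    (sepallength + sepalwidth, sepallength - sepalwidth),
    (petallength + petalwidth, petallength - petalwidth),
]
for row_sum, row_diff in rows:
    print(round(row_sum, 2), round(row_diff, 2))
```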