All Products
Search
Document Center

MaxCompute:Examples of using the SDK for Python: DataFrame

Last Updated:Aug 04, 2023

This topic provides examples on how to perform operations on DataFrame objects in typical scenarios by using the SDK for Python.

DataFrame

PyODPS provides a pandas-like API, PyODPS DataFrame, which can make full use of the computing power of MaxCompute. For more information, see DataFrame.

In this example, the following tables exist: pyodps_ml_100k_movies that contains movie-related data, pyodps_ml_100k_users that contains user-related data, and pyodps_ml_100k_ratings that contains rating-related data.

  1. Create an entry object of MaxCompute.

    import os
    from odps import ODPS
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to your AccessKey ID. 
    # Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to your AccessKey secret. 
    # We recommend that you do not directly use your AccessKey ID or AccessKey secret.
    o = ODPS(
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
        os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
        project='your-default-project',
        endpoint='your-end-point',
    )
  2. Call a table object and create the DataFrame object named users.

    from odps.df import DataFrame
    users = DataFrame(o.get_table('pyodps_ml_100k_users'))
  3. Perform the following operations on the DataFrame object:

    • View the fields of DataFrame and the types of these fields based on the dtypes attribute.

      users.dtypes
    • Use the head method to obtain the first N data records for a quick data preview.

      users.head(10)

      The following result is returned:

      -

      user_id

      age

      sex

      occupation

      zip_code

      0

      1

      24

      M

      technician

      85711

      1

      2

      53

      F

      other

      94043

      2

      3

      23

      M

      writer

      32067

      3

      4

      24

      M

      technician

      43537

      4

      5

      33

      F

      other

      15213

      5

      6

      42

      M

      executive

      98101

      6

      7

      57

      M

      administrator

      91344

      7

      8

      36

      M

      administrator

      05201

      8

      9

      29

      M

      student

      01002

      9

      10

      53

      M

      lawyer

      90703

    • Filter fields.

      • Filter some fields.

        users[['user_id', 'age']].head(5)

        The following result is returned:

        -

        user_id

        age

        0

        1

        24

        1

        2

        53

        2

        3

        23

        3

        4

        24

        4

        5

        33

      • Exclude some fields. Example:

        >>> users.exclude('zip_code', 'age').head(5)

        The following result is returned:

        -

        user_id

        sex

        occupation

        0

        1

        M

        technician

        1

        2

        F

        other

        2

        3

        M

        writer

        3

        4

        M

        technician

        4

        5

        F

        other

      • When you exclude some fields, you may want to obtain new columns based on computation. For example, add the sex_bool attribute and set it to True if sex is M. Otherwise, set the sex_bool attribute to False. Sample code:

        >>> users.select(users.exclude('zip_code', 'sex'), sex_bool=users.sex == 'M').head(5)

        The following result is returned:

        -

        user_id

        age

        occupation

        sex_bool

        0

        1

        24

        technician

        True

        1

        2

        53

        other

        False

        2

        3

        23

        writer

        True

        3

        4

        24

        technician

        True

        4

        5

        33

        other

        False

    • Obtain the number of users at the age of 20 to 25.

      >>> users.age.between(20, 25).count().rename('count')
      943
    • Obtain the numbers of male and female users.

      >>> users.groupby(users.sex).count()

      The following result is returned:

      -

      sex

      count

      0

      F

      273

      1

      M

      670

    • To divide users by occupation, obtain the top 10 occupations with the most population, and sort the occupations in descending order of population.

      >>> df = users.groupby('occupation').agg(count=users['occupation'].count())
      >>> df.sort(df['count'], ascending=False)[:10]

      The following result is returned:

      -

      occupation

      count

      0

      student

      196

      1

      other

      105

      2

      educator

      95

      3

      administrator

      79

      4

      engineer

      67

      5

      programmer

      66

      6

      librarian

      51

      7

      writer

      45

      8

      executive

      32

      9

      scientist

      31

      The DataFrame API provides the value_counts method to quickly achieve the same result.

      >>> users.occupation.value_counts()[:10]

      The following result is returned:

      -

      occupation

      count

      0

      student

      196

      1

      other

      105

      2

      educator

      95

      3

      administrator

      79

      4

      engineer

      67

      5

      programmer

      66

      6

      librarian

      51

      7

      writer

      45

      8

      executive

      32

      9

      scientist

      31

    • Show data in a more intuitive graph.

       %matplotlib inline
    • Visualize data on a horizontal bar chart.

      users['occupation'].value_counts().plot(kind='barh', x='occupation', ylabel='prefession')
      柱状图
    • Visualize data on a histogram. Divide users into 30 groups by age and view the histogram of age distribution. Example:

      >>> users.age.hist(bins=30, title="Distribution of users' ages", xlabel='age', ylabel='count of users')
      直方图
    • Join three tables to obtain a new table and save it.

      movies = DataFrame(o.get_table('pyodps_ml_100k_movies'))
      ratings = DataFrame(o.get_table('pyodps_ml_100k_ratings'))
      o.delete_table('pyodps_ml_100k_lens', if_exists=True)
      lens = movies.join(ratings).join(users).persist('pyodps_ml_100k_lens')
      lens.dtypes

      The following result is returned:

      odps.Schema {
        movie_id                            int64
        title                               string
        release_date                        string
        video_release_date                  string
        imdb_url                            string
        user_id                             int64
        rating                              int64
        unix_timestamp                      int64
        age                                 int64
        sex                                 string
        occupation                          string
        zip_code                            string
      }
    • Divide ages from 0 to 79 into eight age groups.

      labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
      cut_lens = lens[lens, lens.age.cut(range(0, 80, 10), right=False, labels=labels).rename('age_group')]
    • View the first 10 data records of a single age in a group.

      cut_lens['age_group', 'age'].distinct()[:10]

      The following result is returned:

      -

      age_group

      age

      0

      0-9

      7

      1

      10-19

      10

      2

      10-19

      11

      3

      10-19

      13

      4

      10-19

      14

      5

      10-19

      15

      6

      10-19

      16

      7

      10-19

      17

      8

      10-19

      18

      9

      10-19

      19

    • View the total rating and average rating of users in each age group. Example:

      cut_lens.groupby ('age_group').agg(cut_lens.rating.count().rename ('total_rating'), cut_lens.rating.mean().rename ('average_rating'))

      The following result is returned:

      -

      age_group

      average_rating

      total_rating

      0

      0-9

      3.767442

      43

      1

      10-19

      3.486126

      8181

      2

      20-29

      3.467333

      39535

      3

      30-39

      3.554444

      25696

      4

      40-49

      3.591772

      15021

      5

      50-59

      3.635800

      8704

      6

      60-69

      3.648875

      2623

      7

      70-79

      3.649746

      197