This topic describes the operations that you can perform on a DataFrame. You can sort, deduplicate, sample, and scale data. You can also process null values.

from odps.df import DataFrame
iris = DataFrame(o.get_table('pyodps_iris'))

Sort data

You can sort data only in a collection.
  • Call the sort method or sort_values method to sort data.
    >>> iris.sort('sepalwidth').head(5)
       sepallength  sepalwidth  petallength  petalwidth             name
    0          5.0         2.0          3.5         1.0  Iris-versicolor
    1          6.2         2.2          4.5         1.5  Iris-versicolor
    2          6.0         2.2          5.0         1.5   Iris-virginica
    3          6.0         2.2          4.0         1.0  Iris-versicolor
    4          5.5         2.3          4.0         1.3  Iris-versicolor
  • To sort data in descending order, set the ascending parameter to False.
    >>> iris.sort('sepalwidth', ascending=False).head(5)
       sepallength  sepalwidth  petallength  petalwidth         name
    0          5.7         4.4          1.5         0.4  Iris-setosa
    1          5.5         4.2          1.4         0.2  Iris-setosa
    2          5.2         4.1          1.5         0.1  Iris-setosa
    3          5.8         4.0          1.2         0.2  Iris-setosa
    4          5.4         3.9          1.3         0.4  Iris-setosa
  • You can also prefix a column with the negative sign (-) to sort data in descending order.
    >>> iris.sort(-iris.sepalwidth).head(5)
       sepallength  sepalwidth  petallength  petalwidth         name
    0          5.7         4.4          1.5         0.4  Iris-setosa
    1          5.5         4.2          1.4         0.2  Iris-setosa
    2          5.2         4.1          1.5         0.1  Iris-setosa
    3          5.8         4.0          1.2         0.2  Iris-setosa
    4          5.4         3.9          1.3         0.4  Iris-setosa
  • You can sort data by multiple fields at the same time.
    >>> iris.sort(['sepalwidth', 'petallength']).head(5)
       sepallength  sepalwidth  petallength  petalwidth             name
    0          5.0         2.0          3.5         1.0  Iris-versicolor
    1          6.0         2.2          4.0         1.0  Iris-versicolor
    2          6.2         2.2          4.5         1.5  Iris-versicolor
    3          6.0         2.2          5.0         1.5   Iris-virginica
    4          4.5         2.3          1.3         0.3      Iris-setosa
  • If you need to sort multiple fields in different orders, pass a list of Boolean values to the ascending parameter. The length of the list must equal the number of fields to sort.
    >>> iris.sort(['sepalwidth', 'petallength'], ascending=[True, False]).head(5)
       sepallength  sepalwidth  petallength  petalwidth             name
    0          5.0         2.0          3.5         1.0  Iris-versicolor
    1          6.0         2.2          5.0         1.5   Iris-virginica
    2          6.2         2.2          4.5         1.5  Iris-versicolor
    3          6.0         2.2          4.0         1.0  Iris-versicolor
    4          6.3         2.3          4.4         1.3  Iris-versicolor
    You can also sort multiple fields in different orders in the following way:
    >>> iris.sort(['sepalwidth', -iris.petallength]).head(5)
       sepallength  sepalwidth  petallength  petalwidth             name
    0          5.0         2.0          3.5         1.0  Iris-versicolor
    1          6.0         2.2          5.0         1.5   Iris-virginica
    2          6.2         2.2          4.5         1.5  Iris-versicolor
    3          6.0         2.2          4.0         1.0  Iris-versicolor
    4          6.3         2.3          4.4         1.3  Iris-versicolor
    Note MaxCompute requires a limit on the number of records to sort. The default limit is 10000 records. You can use the options.df.odps.sort.limit option to change this limit. If you set the option to a value greater than 10000, an out-of-memory (OOM) error may occur.
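    For example, a minimal sketch of raising this limit before sorting. The value 100000 is only an illustration:
    >>> from odps import options
    >>> options.df.odps.sort.limit = 100000  # Allow sorting up to 100,000 records. Large values may cause OOM errors.
    >>> iris.sort('sepalwidth').head(5)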

Deduplicate data

  • To deduplicate a collection, you can call the distinct method in the following ways:
    • >>> iris[['name']].distinct()
                    name
      0      Iris-setosa
      1  Iris-versicolor
      2   Iris-virginica
    • >>> iris.distinct('name')
                    name
      0      Iris-setosa
      1  Iris-versicolor
      2   Iris-virginica
    • >>> iris.distinct('name', 'sepallength').head(3)
                name  sepallength
      0  Iris-setosa          4.3
      1  Iris-setosa          4.4
      2  Iris-setosa          4.5
  • To deduplicate a sequence, you can call the unique method. However, a sequence that is deduplicated by using the unique method cannot be used in a column selection of a collection.
    >>> iris.name.unique()
                  name
    0      Iris-setosa
    1  Iris-versicolor
    2   Iris-virginica
    An error occurs when you use the following code:
    >>> iris[iris.name, iris.name.unique()]

Sample data

To sample data from a collection, call the sample method. Python on MaxCompute (PyODPS) supports the following four sampling methods:

Note Except for sampling by parts, the following sampling methods require that MaxCompute DataFrame supports XFlow. If XFlow is not supported, these sampling methods can be executed only on the pandas backend.
  • Sampling by parts
    Data is divided into parts by using this sampling method. You can select the part by number.
    >>> iris.sample(parts=10)  # Split data into 10 parts and take the part numbered as 0 by default.
    >>> iris.sample(parts=10, i=0)  # Split data into 10 parts and take the part numbered as 0.
    >>> iris.sample(parts=10, i=[2, 5])   # Split data into 10 parts and take the parts numbered as 2 and 5.
    >>> iris.sample(parts=10, columns=['name', 'sepalwidth'])  # Sample the data by the values of the name and sepalwidth columns.
  • Sampling by proportion or the number of records
    You must specify the number of records or the proportion of data that you want to sample. To enable sampling with replacement, set the replace parameter to True (see the sketch after this list).
    >>> iris.sample(n=100)  # Sample 100 records.
    >>> iris.sample(frac=0.3)  # Sample 30% of all records.
  • Sampling by weight
    You can specify the weight column and the number or proportion of records that you want to sample. To enable sampling with replacement, set the replace parameter to True.
    >>> iris.sample(n=100, weights='sepallength')
    >>> iris.sample(n=100, weights='sepalwidth', replace=True)
  • Sampling by stratification
    You can specify the label column that you want to stratify when you use this method. You can also specify the sampling proportion (the frac parameter) or the number of records (the n parameter) for each label. This sampling method does not support sampling with replacement.
    >>> iris.sample(strata='name', n={'Iris-setosa': 10, 'Iris-versicolor': 10})
    >>> iris.sample(strata='name', frac={'Iris-setosa': 0.5, 'Iris-versicolor': 0.4})
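
The following sketch shows sampling with replacement by setting the replace parameter that is described in the sampling by proportion method. The parameter values are illustrative:
>>> iris.sample(frac=0.3, replace=True)  # Sample 30% of all records with replacement.
>>> iris.sample(n=100, replace=True)  # Sample 100 records with replacement.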

Scale data

A DataFrame supports data scaling based on the maximum value, the minimum value, the mean, and the standard deviation. The following data is used in the scaling examples:
    name  id  fid
0  name1   4  5.3
1  name2   2  3.5
2  name2   3  1.5
3  name1   4  4.2
4  name1   3  2.2
5  name1   3  4.1
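
If you want to run these examples locally, the following minimal sketch builds df from a pandas DataFrame, which a PyODPS DataFrame can wrap directly. The df used in the null value examples later can be built the same way, with None for missing values:
import pandas as pd
from odps.df import DataFrame

# Build the sample data shown above on the pandas backend.
df = DataFrame(pd.DataFrame({
    'name': ['name1', 'name2', 'name2', 'name1', 'name1', 'name1'],
    'id': [4, 2, 3, 4, 3, 3],
    'fid': [5.3, 3.5, 1.5, 4.2, 2.2, 4.1],
}))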
  • You can use the min_max_scale method to normalize data.
    df.min_max_scale(columns=['fid'])
        name  id       fid
    0  name1   4  1.000000
    1  name2   2  0.526316
    2  name2   3  0.000000
    3  name1   4  0.710526
    4  name1   3  0.184211
    5  name1   3  0.684211
  • You can use the min_max_scale method with the feature_range parameter to specify the output value range. The following example shows how to keep the output values within the range of [-1, 1]:
    df.min_max_scale(columns=['fid'], feature_range=(-1, 1))
        name  id       fid
    0  name1   4  1.000000
    1  name2   2  0.052632
    2  name2   3 -1.000000
    3  name1   4  0.421053
    4  name1   3 -0.631579
    5  name1   3  0.368421
  • If you need to keep the original column, set the preserve parameter to True. The scaled data is added as a new column. By default, the new column is named by adding the _scaled suffix to the original column name. You can use the suffix parameter to change the suffix (see the sketch after this list).
    df.min_max_scale(columns=['fid'], preserve=True)
        name  id  fid  fid_scaled
    0  name1   4  5.3    1.000000
    1  name2   2  3.5    0.526316
    2  name2   3  1.5    0.000000
    3  name1   4  4.2    0.710526
    4  name1   3  2.2    0.184211
    5  name1   3  4.1    0.684211
  • You can also use the min_max_scale method with the group parameter to specify one or more group columns. The minimum and maximum values are then computed within each group to scale data.
    df.min_max_scale(columns=['fid'], group=['name'])
        name  id       fid
    0  name1   4  1.000000
    1  name1   4  0.645161
    2  name1   3  0.000000
    3  name1   3  0.612903
    4  name2   2  1.000000
    5  name2   3  0.000000
    The preceding example shows that the data in name1 and name2 is scaled based on the minimum and maximum values within each group.
  • You can use the std_scale method to standardize data based on the mean and standard deviation. You can also use the std_scale method with the preserve parameter to keep the original column, and with the group parameter to group data, as shown in the sketch after this list.
    df.std_scale(columns=['fid'])
        name  id       fid
    0  name1   4  1.436467
    1  name2   2  0.026118
    2  name2   3 -1.540938
    3  name1   4  0.574587
    4  name1   3 -0.992468
    5  name1   3  0.496234
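
The following sketch combines the suffix, preserve, and group parameters that are mentioned above. The suffix value _norm is only an illustration:
df.min_max_scale(columns=['fid'], preserve=True, suffix='_norm')  # Keep fid and add the scaled values in a column named fid_norm.
df.std_scale(columns=['fid'], preserve=True)  # Keep fid and add the standardized values in fid_scaled.
df.std_scale(columns=['fid'], group=['name'])  # Standardize fid within each name group.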

Process null values

A DataFrame allows you to delete rows that contain null values and to fill null values. The following data is used in the null value processing examples:
   id   name   f1   f2   f3   f4
0   0  name1  1.0  NaN  3.0  4.0
1   1  name1  2.0  NaN  NaN  1.0
2   2  name1  3.0  4.0  1.0  NaN
3   3  name1  NaN  1.0  2.0  3.0
4   4  name1  1.0  NaN  3.0  4.0
5   5  name1  1.0  2.0  3.0  4.0
6   6  name1  NaN  NaN  NaN  NaN
  • You can use the dropna method to delete the rows that contain null values in the specified subset of columns.
    df.dropna(subset=['f1', 'f2', 'f3', 'f4'])
       id   name   f1   f2   f3   f4
    0   5  name1  1.0  2.0  3.0  4.0
  • To keep the rows that contain at least one non-null value (that is, to delete a row only when all of the specified columns are null), set the how parameter to all.
    df.dropna(how='all', subset=['f1', 'f2', 'f3', 'f4'])
       id   name   f1   f2   f3   f4
    0   0  name1  1.0  NaN  3.0  4.0
    1   1  name1  2.0  NaN  NaN  1.0
    2   2  name1  3.0  4.0  1.0  NaN
    3   3  name1  NaN  1.0  2.0  3.0
    4   4  name1  1.0  NaN  3.0  4.0
    5   5  name1  1.0  2.0  3.0  4.0
  • You can use the thresh parameter to specify the minimum number of non-null values that a row must contain to be kept.
    df.dropna(thresh=3, subset=['f1', 'f2', 'f3', 'f4'])
       id   name   f1   f2   f3   f4
    0   0  name1  1.0  NaN  3.0  4.0
    2   2  name1  3.0  4.0  1.0  NaN
    3   3  name1  NaN  1.0  2.0  3.0
    4   4  name1  1.0  NaN  3.0  4.0
    5   5  name1  1.0  2.0  3.0  4.0
  • You can call the fillna method to replace null values with a specified constant or with the values of an existing column.
    • The following example shows how to replace null values with a constant:
      df.fillna(100, subset=['f1', 'f2', 'f3', 'f4'])
         id   name     f1     f2     f3     f4
      0   0  name1    1.0  100.0    3.0    4.0
      1   1  name1    2.0  100.0  100.0    1.0
      2   2  name1    3.0    4.0    1.0  100.0
      3   3  name1  100.0    1.0    2.0    3.0
      4   4  name1    1.0  100.0    3.0    4.0
      5   5  name1    1.0    2.0    3.0    4.0
      6   6  name1  100.0  100.0  100.0  100.0
    • The following example shows how to replace null values with the values of an existing column:
      df.fillna(df.f2, subset=['f1', 'f2', 'f3', 'f4'])
         id   name   f1   f2   f3   f4
      0   0  name1  1.0  NaN  3.0  4.0
      1   1  name1  2.0  NaN  NaN  1.0
      2   2  name1  3.0  4.0  1.0  4.0
      3   3  name1  1.0  1.0  2.0  3.0
      4   4  name1  1.0  NaN  3.0  4.0
      5   5  name1  1.0  2.0  3.0  4.0
      6   6  name1  NaN  NaN  NaN  NaN
  • A DataFrame supports backward filling and forward filling to fill null values. The following table describes the valid values of the method parameter.
    Valid value        Definition
    bfill or backfill  Backward filling
    ffill or pad       Forward filling
    Examples:
    df.fillna(method='bfill', subset=['f1', 'f2', 'f3', 'f4'])
       id   name   f1   f2   f3   f4
    0   0  name1  1.0  3.0  3.0  4.0
    1   1  name1  2.0  1.0  1.0  1.0
    2   2  name1  3.0  4.0  1.0  NaN
    3   3  name1  1.0  1.0  2.0  3.0
    4   4  name1  1.0  3.0  3.0  4.0
    5   5  name1  1.0  2.0  3.0  4.0
    6   6  name1  NaN  NaN  NaN  NaN
    df.fillna(method='ffill', subset=['f1', 'f2', 'f3', 'f4'])
       id   name   f1   f2   f3   f4
    0   0  name1  1.0  1.0  3.0  4.0
    1   1  name1  2.0  2.0  2.0  1.0
    2   2  name1  3.0  4.0  1.0  1.0
    3   3  name1  NaN  1.0  2.0  3.0
    4   4  name1  1.0  1.0  3.0  4.0
    5   5  name1  1.0  2.0  3.0  4.0
    6   6  name1  NaN  NaN  NaN  NaN

    You can use the ffill or bfill method to simplify the code in the preceding examples. df.ffill(...) is equivalent to df.fillna(method='ffill', ...), and df.bfill(...) is equivalent to df.fillna(method='bfill', ...), as shown in the following sketch.
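    The following sketch assumes that ffill and bfill accept the same subset parameter as fillna:
    df.ffill(subset=['f1', 'f2', 'f3', 'f4'])  # Equivalent to df.fillna(method='ffill', subset=['f1', 'f2', 'f3', 'f4']).
    df.bfill(subset=['f1', 'f2', 'f3', 'f4'])  # Equivalent to df.fillna(method='bfill', subset=['f1', 'f2', 'f3', 'f4']).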