This topic describes the operations you can perform on a DataFrame. You can sort, deduplicate, sample, and scale data. You can also handle null values.
The examples in this topic use the following DataFrame:

    from odps.df import DataFrame
    iris = DataFrame(o.get_table('pyodps_iris'))
Sort data
You can sort data only in a collection.
- Call the sort or sort_values method to sort data.

      >>> iris.sort('sepalwidth').head(5)
         sepallength  sepalwidth  petallength  petalwidth             name
      0          5.0         2.0          3.5         1.0  Iris-versicolor
      1          6.2         2.2          4.5         1.5  Iris-versicolor
      2          6.0         2.2          5.0         1.5   Iris-virginica
      3          6.0         2.2          4.0         1.0  Iris-versicolor
      4          5.5         2.3          4.0         1.3  Iris-versicolor
- To sort data in descending order, set the ascending parameter to False.

      >>> iris.sort('sepalwidth', ascending=False).head(5)
         sepallength  sepalwidth  petallength  petalwidth         name
      0          5.7         4.4          1.5         0.4  Iris-setosa
      1          5.5         4.2          1.4         0.2  Iris-setosa
      2          5.2         4.1          1.5         0.1  Iris-setosa
      3          5.8         4.0          1.2         0.2  Iris-setosa
      4          5.4         3.9          1.3         0.4  Iris-setosa
- You can also prefix a field with a minus sign (-) to sort data in descending order.

      >>> iris.sort(-iris.sepalwidth).head(5)
         sepallength  sepalwidth  petallength  petalwidth         name
      0          5.7         4.4          1.5         0.4  Iris-setosa
      1          5.5         4.2          1.4         0.2  Iris-setosa
      2          5.2         4.1          1.5         0.1  Iris-setosa
      3          5.8         4.0          1.2         0.2  Iris-setosa
      4          5.4         3.9          1.3         0.4  Iris-setosa
- You can sort multiple fields simultaneously in a DataFrame.
      >>> iris.sort(['sepalwidth', 'petallength']).head(5)
         sepallength  sepalwidth  petallength  petalwidth             name
      0          5.0         2.0          3.5         1.0  Iris-versicolor
      1          6.0         2.2          4.0         1.0  Iris-versicolor
      2          6.2         2.2          4.5         1.5  Iris-versicolor
      3          6.0         2.2          5.0         1.5   Iris-virginica
      4          4.5         2.3          1.3         0.3      Iris-setosa
- If you need to sort multiple fields in different orders, pass a list of BOOLEAN values to the ascending parameter. The length of the list must equal the number of sorted fields.

      >>> iris.sort(['sepalwidth', 'petallength'], ascending=[True, False]).head(5)
         sepallength  sepalwidth  petallength  petalwidth             name
      0          5.0         2.0          3.5         1.0  Iris-versicolor
      1          6.0         2.2          5.0         1.5   Iris-virginica
      2          6.2         2.2          4.5         1.5  Iris-versicolor
      3          6.0         2.2          4.0         1.0  Iris-versicolor
      4          6.3         2.3          4.4         1.3  Iris-versicolor

  You can also sort multiple fields in different orders in the following way:

      >>> iris.sort(['sepalwidth', -iris.petallength]).head(5)
         sepallength  sepalwidth  petallength  petalwidth             name
      0          5.0         2.0          3.5         1.0  Iris-versicolor
      1          6.0         2.2          5.0         1.5   Iris-virginica
      2          6.2         2.2          4.5         1.5  Iris-versicolor
      3          6.0         2.2          4.0         1.0  Iris-versicolor
      4          6.3         2.3          4.4         1.3  Iris-versicolor
Note MaxCompute limits the number of records that can be sorted. The default limit is 10000. You can use the options.df.odps.sort.limit option to modify this limit and set it to a value greater than 10000. However, doing so may cause an out-of-memory (OOM) error.
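The mixed-order sorts above behave like Python's built-in sorting with negated numeric keys. The following is a minimal pure-Python sketch of that idea; it uses made-up tuples rather than PyODPS objects, so it needs no MaxCompute connection:

```python
# Illustrative rows: (sepallength, sepalwidth, petallength).
rows = [
    (5.0, 2.0, 3.5),
    (6.0, 2.2, 4.0),
    (6.2, 2.2, 4.5),
    (6.0, 2.2, 5.0),
]

# Sort by sepalwidth ascending, then petallength descending.
# Negating a numeric key flips its order, just like sort(-iris.petallength).
result = sorted(rows, key=lambda r: (r[1], -r[2]))

print(result[0])  # (5.0, 2.0, 3.5): smallest sepalwidth first
print(result[1])  # (6.0, 2.2, 5.0): among ties, largest petallength first
```

The key-negation trick only works for numeric columns; for strings, PyODPS's ascending list (or Boolean list) is the general mechanism.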
Deduplicate data
- To deduplicate a collection, you can call the distinct method in the following ways:

      >>> iris[['name']].distinct()
                    name
      0      Iris-setosa
      1  Iris-versicolor
      2   Iris-virginica

      >>> iris.distinct('name')
                    name
      0      Iris-setosa
      1  Iris-versicolor
      2   Iris-virginica

      >>> iris.distinct('name', 'sepallength').head(3)
                name  sepallength
      0  Iris-setosa          4.3
      1  Iris-setosa          4.4
      2  Iris-setosa          4.5
- To deduplicate a sequence, you can call the unique method. However, a sequence deduplicated by the unique method cannot be selected together with other columns in a collection.

      >>> iris.name.unique()
                    name
      0      Iris-setosa
      1  Iris-versicolor
      2   Iris-virginica

  An error occurs when you use the following code:

      >>> iris[iris.name, iris.name.unique()]
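Conceptually, distinct keeps the first row for each unique combination of the selected columns. A pure-Python sketch of that order-preserving behavior (an illustration only, not PyODPS code):

```python
def distinct(rows):
    """Order-preserving deduplication, analogous to DataFrame.distinct."""
    seen = set()
    out = []
    for row in rows:
        if row not in seen:      # keep only the first occurrence
            seen.add(row)
            out.append(row)
    return out

names = ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
         'Iris-virginica', 'Iris-versicolor']
print(distinct(names))
# ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
```

For multi-column deduplication such as distinct('name', 'sepallength'), each row would be a tuple of the selected column values.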
Sample data
To sample data from a collection, call the sample method. Python on MaxCompute (PyODPS) supports the following four sampling methods.
Note To execute the following sampling methods, except for sampling by parts, MaxCompute DataFrame must support XFlow. If MaxCompute DataFrame does not support XFlow, you can execute these sampling methods only on the pandas backend.
- Sampling by parts

  This method divides the data into parts and lets you select parts by number.

      >>> iris.sample(parts=10)                                  # Split data into 10 parts and take the part numbered 0 by default.
      >>> iris.sample(parts=10, i=0)                             # Split data into 10 parts and take the part numbered 0.
      >>> iris.sample(parts=10, i=[2, 5])                        # Split data into 10 parts and take the parts numbered 2 and 5.
      >>> iris.sample(parts=10, columns=['name', 'sepalwidth'])  # Sample the data by the values of the name and sepalwidth columns.
- Sampling by proportion or by number of records

  You must specify the number of records or the proportion of data that you want to sample when you use this method. To enable sampling with replacement, set the replace parameter to True.

      >>> iris.sample(n=100)     # Sample 100 records.
      >>> iris.sample(frac=0.3)  # Sample 30% of all records.
- Sampling by weight

  You can specify the weight column, the number of records, and the proportion of records that you want to sample when you use this method. To enable sampling with replacement, set the replace parameter to True.

      >>> iris.sample(n=100, weights='sepal_length')
      >>> iris.sample(n=100, weights='sepal_width', replace=True)
- Sampling by stratification

  You can specify the label column to stratify by when you use this method. You can also specify the sampling proportion (the frac parameter) or the number of records (the n parameter) for each label. This sampling method does not support sampling with replacement.

      >>> iris.sample(strata='category', n={'Iris Setosa': 10, 'Iris Versicolour': 10})
      >>> iris.sample(strata='category', frac={'Iris Setosa': 0.5, 'Iris Versicolour': 0.4})
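The random sampling modes map naturally onto Python's standard random module. The following local sketch illustrates sampling by count, by weight, and by stratum; the rows and labels are made up and this is not PyODPS code:

```python
import random

random.seed(42)  # make the sketch reproducible

rows = [('Iris-setosa', 5.1), ('Iris-setosa', 4.9),
        ('Iris-versicolor', 7.0), ('Iris-versicolor', 6.4),
        ('Iris-virginica', 6.3), ('Iris-virginica', 5.8)]

# Sampling by number of records without replacement, like sample(n=...).
picked = random.sample(rows, 3)

# Weighted sampling with replacement, like sample(n=..., weights=..., replace=True).
weights = [r[1] for r in rows]            # use the numeric column as the weight
weighted = random.choices(rows, weights=weights, k=3)

# Stratified sampling, like sample(strata=..., n={label: count}):
# group rows by label, then draw the requested count from each group.
def stratified(rows, n_per_label):
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    out = []
    for label, count in n_per_label.items():
        out.extend(random.sample(groups[label], count))
    return out

strata = stratified(rows, {'Iris-setosa': 1, 'Iris-virginica': 2})
print(len(picked), len(weighted), len(strata))  # 3 3 3
```

Unlike this sketch, PyODPS pushes the sampling into MaxCompute (or the pandas backend), so the data never has to fit in local memory.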
Scale data
A DataFrame supports data scaling based on the maximum and minimum values, the mean, and the standard deviation. The following data is used in the scaling examples:
name id fid
0 name1 4 5.3
1 name2 2 3.5
2 name2 3 1.5
3 name1 4 4.2
4 name1 3 2.2
5 name1 3 4.1
- You can use the min_max_scale method to normalize data.

      df.min_max_scale(columns=['fid'])
          name  id       fid
      0  name1   4  1.000000
      1  name2   2  0.526316
      2  name2   3  0.000000
      3  name1   4  0.710526
      4  name1   3  0.184211
      5  name1   3  0.684211
- You can use the min_max_scale method with the feature_range parameter to specify the output value range. The following example shows how to keep the output values in the range of (-1, 1):

      df.min_max_scale(columns=['fid'], feature_range=(-1, 1))
          name  id       fid
      0  name1   4  1.000000
      1  name2   2  0.052632
      2  name2   3 -1.000000
      3  name1   4  0.421053
      4  name1   3 -0.631579
      5  name1   3  0.368421
- If you need to keep the original values, use the preserve parameter. The scaled data is added in a new column which, by default, is named by appending the _scaled suffix to the original column name. You can use the suffix parameter to change the suffix.

      df.min_max_scale(columns=['fid'], preserve=True)
          name  id  fid  fid_scaled
      0  name1   4  5.3    1.000000
      1  name2   2  3.5    0.526316
      2  name2   3  1.5    0.000000
      3  name1   4  4.2    0.710526
      4  name1   3  2.2    0.184211
      5  name1   3  4.1    0.684211
- You can also use the min_max_scale method with the group parameter to specify one or more group columns, so that the minimum and maximum values used for scaling are retrieved within each group.

      df.min_max_scale(columns=['fid'], group=['name'])
          name  id       fid
      0  name1   4  1.000000
      1  name1   4  0.645161
      2  name1   3  0.000000
      3  name1   3  0.612903
      4  name2   2  1.000000
      5  name2   3  0.000000

  This example shows that the data in both the name1 and name2 groups is scaled based on the minimum and maximum values of its own group.

- You can use the std_scale method to scale data based on the standard normal distribution. You can also use the std_scale method with the preserve parameter to keep the original column, and with the group parameter to group data.

      df.std_scale(columns=['fid'])
          name  id       fid
      0  name1   4  1.436467
      1  name2   2  0.026118
      2  name2   3 -1.540938
      3  name1   4  0.574587
      4  name1   3 -0.992468
      5  name1   3  0.496234
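The numbers in the tables above follow the standard min-max and z-score formulas. A pure-Python check against the fid column of the example data (no PyODPS required; note this sketch assumes the population standard deviation, which reproduces the std_scale output shown above):

```python
fid = [5.3, 3.5, 1.5, 4.2, 2.2, 4.1]

# min_max_scale: (x - min) / (max - min), optionally mapped into feature_range.
def min_max_scale(xs, feature_range=(0.0, 1.0)):
    lo, hi = feature_range
    mn, mx = min(xs), max(xs)
    return [(x - mn) / (mx - mn) * (hi - lo) + lo for x in xs]

# std_scale: (x - mean) / std, with the population standard deviation.
def std_scale(xs):
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std for x in xs]

print(round(min_max_scale(fid)[1], 6))           # 0.526316, as in the table
print(round(min_max_scale(fid, (-1, 1))[1], 6))  # 0.052632
print(round(std_scale(fid)[0], 6))               # 1.436467
```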
Process null values
A DataFrame supports deleting rows that contain null values and filling null values. The following data is used in the examples:
id name f1 f2 f3 f4
0 0 name1 1.0 NaN 3.0 4.0
1 1 name1 2.0 NaN NaN 1.0
2 2 name1 3.0 4.0 1.0 NaN
3 3 name1 NaN 1.0 2.0 3.0
4 4 name1 1.0 NaN 3.0 4.0
5 5 name1 1.0 2.0 3.0 4.0
6 6 name1 NaN NaN NaN NaN
- You can use the dropna method to delete the rows that contain null values in the subset columns.

      df.dropna(subset=['f1', 'f2', 'f3', 'f4'])
         id   name   f1   f2   f3   f4
      0   5  name1  1.0  2.0  3.0  4.0
- To keep the rows that contain at least one non-null value, set the how parameter to all. A row is then deleted only when all of its values in the subset columns are null.

      df.dropna(how='all', subset=['f1', 'f2', 'f3', 'f4'])
         id   name   f1   f2   f3   f4
      0   0  name1  1.0  NaN  3.0  4.0
      1   1  name1  2.0  NaN  NaN  1.0
      2   2  name1  3.0  4.0  1.0  NaN
      3   3  name1  NaN  1.0  2.0  3.0
      4   4  name1  1.0  NaN  3.0  4.0
      5   5  name1  1.0  2.0  3.0  4.0
- You can use the thresh parameter to specify the minimum number of non-null values a row must contain.

      df.dropna(thresh=3, subset=['f1', 'f2', 'f3', 'f4'])
         id   name   f1   f2   f3   f4
      0   0  name1  1.0  NaN  3.0  4.0
      2   2  name1  3.0  4.0  1.0  NaN
      3   3  name1  NaN  1.0  2.0  3.0
      4   4  name1  1.0  NaN  3.0  4.0
      5   5  name1  1.0  2.0  3.0  4.0
- You can call the fillna method to replace null values with a specified constant or with values from an existing column.

  The following example shows how to replace null values with a constant:

      df.fillna(100, subset=['f1', 'f2', 'f3', 'f4'])
         id   name     f1     f2     f3     f4
      0   0  name1    1.0  100.0    3.0    4.0
      1   1  name1    2.0  100.0  100.0    1.0
      2   2  name1    3.0    4.0    1.0  100.0
      3   3  name1  100.0    1.0    2.0    3.0
      4   4  name1    1.0  100.0    3.0    4.0
      5   5  name1    1.0    2.0    3.0    4.0
      6   6  name1  100.0  100.0  100.0  100.0

  The following example shows how to replace null values with values from an existing column:

      df.fillna(df.f2, subset=['f1', 'f2', 'f3', 'f4'])
         id   name   f1   f2   f3   f4
      0   0  name1  1.0  NaN  3.0  4.0
      1   1  name1  2.0  NaN  NaN  1.0
      2   2  name1  3.0  4.0  1.0  4.0
      3   3  name1  1.0  1.0  2.0  3.0
      4   4  name1  1.0  NaN  3.0  4.0
      5   5  name1  1.0  2.0  3.0  4.0
      6   6  name1  NaN  NaN  NaN  NaN
- A DataFrame supports backward filling and forward filling of null values. The following table defines the valid values of the method parameter.

      Valid value         Definition
      bfill or backfill   Backward filling
      ffill or pad        Forward filling

  Examples:

      df.fillna(method='bfill', subset=['f1', 'f2', 'f3', 'f4'])
         id   name   f1   f2   f3   f4
      0   0  name1  1.0  3.0  3.0  4.0
      1   1  name1  2.0  1.0  1.0  1.0
      2   2  name1  3.0  4.0  1.0  NaN
      3   3  name1  1.0  1.0  2.0  3.0
      4   4  name1  1.0  3.0  3.0  4.0
      5   5  name1  1.0  2.0  3.0  4.0
      6   6  name1  NaN  NaN  NaN  NaN

      df.fillna(method='ffill', subset=['f1', 'f2', 'f3', 'f4'])
         id   name   f1   f2   f3   f4
      0   0  name1  1.0  1.0  3.0  4.0
      1   1  name1  2.0  2.0  2.0  1.0
      2   2  name1  3.0  4.0  1.0  1.0
      3   3  name1  NaN  1.0  2.0  3.0
      4   4  name1  1.0  1.0  3.0  4.0
      5   5  name1  1.0  2.0  3.0  4.0
      6   6  name1  NaN  NaN  NaN  NaN

  You can also use the ffill or bfill method to simplify the code. The ffill method is equivalent to fillna(method='ffill'), and the bfill method is equivalent to fillna(method='bfill').
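Forward filling carries the last non-null value forward, and backward filling does the same in reverse; in the fillna examples above, the filling runs across the f1 to f4 columns within each row. A pure-Python sketch of both fills over one sequence of values (an illustration, not PyODPS code, with None standing in for NaN):

```python
def ffill(values):
    """Forward fill: replace each None with the most recent non-None value."""
    last = None
    out = []
    for v in values:
        if v is not None:
            last = v
        out.append(last)   # leading Nones stay None, as in the tables above
    return out

def bfill(values):
    """Backward fill: fill each None with the next non-None value."""
    return ffill(values[::-1])[::-1]

row = [1.0, None, None, 4.0]   # think of f1..f4 in one row of the example
print(ffill(row))  # [1.0, 1.0, 1.0, 4.0]
print(bfill(row))  # [1.0, 4.0, 4.0, 4.0]
```

Note that a row of all nulls (such as row 6 in the example data) stays null under both fills, because there is no value to carry.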