A Lorenz curve is a graph that shows how unequally a quantity is distributed across a dataset. It is most commonly used to show the distribution of income or wealth within an economy. The curve plots the cumulative percentage of resources received against the cumulative percentage of the population, which makes distribution inequality easy to see. In machine learning, a Lorenz curve can be used to evaluate the fairness of model predictions or the bias in resource allocation.
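The definition above can be sketched in a few lines of plain Python. This is a minimal illustration of the textbook construction, not the component's implementation: the function name `lorenz_points` and the sample `incomes` list are hypothetical, and the sketch assumes non-negative values with a positive total.

```python
from itertools import accumulate


def lorenz_points(values):
    """Return (population_share, resource_share) pairs for a Lorenz curve.

    Assumes non-negative values with a positive total (illustrative only).
    """
    ordered = sorted(values)  # poorest to richest
    total = sum(ordered)
    if not ordered or total <= 0:
        raise ValueError("need at least one value and a positive total")
    n = len(ordered)
    points = [(0.0, 0.0)]  # the curve always starts at the origin
    for k, running in enumerate(accumulate(ordered), start=1):
        # bottom k/n of the population holds running/total of the resource
        points.append((k / n, running / total))
    return points


incomes = [10, 20, 30, 40]  # hypothetical sample data
points = lorenz_points(incomes)
for pop, share in points:
    print(f"bottom {pop:.0%} of the population holds {share:.0%} of the total")
```

The further the points fall below the diagonal line from (0, 0) to (1, 1), the more unequal the distribution.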
Configure the component
Method 1: Configure the component on the pipeline page
On the pipeline details page in Machine Learning Designer, add the Lorenz Curve component to the pipeline and configure the parameters described in the following table.
Tab | Parameter | Description |
Fields Setting | Select Fields | Select the feature column that you want to use to plot the curve. The column should contain the data whose distribution inequality you want to analyze, such as income, wealth, or scores. |
Parameters Setting | Quantile | The number of equal-probability intervals into which the dataset is divided to plot the curve. This value controls the granularity of the curve: a larger value produces a finer-grained curve and allows for a more detailed analysis of the inequality in the data distribution. |
Tuning | Computing Cores | The number of cores used in computing. The value must be a positive integer. |
Tuning | Memory Size per Core (Unit: MB) | The memory size of each core. |
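As a rough illustration of how the Quantile setting controls granularity, the sketch below samples a textbook Lorenz curve at N + 1 evenly spaced population shares. The function name `lorenz_at_quantiles`, the sample `scores`, and the linear interpolation between observations are assumptions made for illustration. The component's internal binning rule is not described here and may differ.

```python
from itertools import accumulate


def lorenz_at_quantiles(values, n_quantiles):
    """Sample a textbook Lorenz curve at n_quantiles + 1 population shares.

    Illustrative only: uses linear interpolation between observations,
    which is an assumption and not necessarily what the component does.
    """
    ordered = sorted(values)
    total = sum(ordered)
    n = len(ordered)
    if n == 0 or total <= 0 or n_quantiles < 1:
        raise ValueError("need data, a positive total, and n_quantiles >= 1")
    # cum[k] is the resource share held by the k smallest observations
    cum = [0.0] + [s / total for s in accumulate(ordered)]
    samples = []
    for q in range(n_quantiles + 1):
        pos = q * n / n_quantiles      # fractional count of observations covered
        lo = min(int(pos), n - 1)      # last fully covered observation
        frac = pos - lo                # portion of the next observation
        samples.append(cum[lo] + frac * (cum[lo + 1] - cum[lo]))
    return samples


scores = [1, 1, 2, 4]  # hypothetical sample data
coarse = lorenz_at_quantiles(scores, 2)
fine = lorenz_at_quantiles(scores, 8)
print(len(coarse), coarse)
print(len(fine), fine)
```

Both calls describe the same curve. The larger quantile count simply returns more points along it.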
Method 2: Use PAI commands
Configure the component parameters by using Platform for AI (PAI) commands. You can use the SQL Script component to call PAI commands. For more information, see Scenario 4: Execute PAI commands within the SQL script component.
PAI -name LorenzCurve
-project algo_public
-DinputTableName=maple_test_lorenz_basic10_input
-DcolName=col0
-DoutputTableName=maple_test_lorenz_basic10_output
-DcoreNum=20
-DmemSizePerCore=110;
Parameter | Required | Default value | Description |
inputTableName | Yes | No default value | The name of the input table. |
outputTableName | Yes | No default value | The name of the output table. |
colName | No | No default value | The columns selected from the input table. You can select multiple columns and separate them with commas (,). |
N | No | 100 | The number of quantiles, that is, the number of equal-probability intervals into which the data is divided. |
inputTablePartitions | No | No default value | The partitions that are selected from the input table for training. The following formats are supported:
Note: If you specify multiple partitions, separate them with commas (,). Example: name1=value1,value2. |
lifecycle | No | 28 | The lifecycle of the output table. This value must be an integer. Unit: days. |
coreNum | No | Determined by the system | The number of cores used for computation. This parameter is used together with memSizePerCore. The value must be a positive integer. If you do not specify this parameter, the system calculates the number of instances based on the amount of input data. |
memSizePerCore | No | Determined by the system | The memory size of each core. Unit: MB. The value must be a positive integer. Recommended values: (1024,64 × 1024). |
Example
Generate the following test data:
col0:double
4
7
2
8
6
3
9
5
0
1
10
Run the following PAI command:
PAI -name LorenzCurve -project algo_public -DinputTableName=maple_test_lorenz_basic10_input -DcolName=col0 -DoutputTableName=maple_test_lorenz_basic10_output -DcoreNum=20 -DmemSizePerCore=110;
View the output as described in the following table.
quantile | col0 |
0 | 0 |
1 | 0.01818181818181818 |
2 | 0.01818181818181818 |
3 | 0.01818181818181818 |
4 | 0.01818181818181818 |
5 | 0.01818181818181818 |
6 | 0.01818181818181818 |
7 | 0.01818181818181818 |
8 | 0.01818181818181818 |
9 | 0.01818181818181818 |
10 | 0.01818181818181818 |
11 | 0.05454545454545454 |
12 | 0.05454545454545454 |
13 | 0.05454545454545454 |
14 | 0.05454545454545454 |
... | ... |
85 | 0.8181818181818182 |
86 | 0.8181818181818182 |
87 | 0.8181818181818182 |
88 | 0.8181818181818182 |
89 | 0.8181818181818182 |
90 | 1 |
91 | 1 |
92 | 1 |
93 | 1 |
94 | 1 |
95 | 1 |
96 | 1 |
97 | 1 |
98 | 1 |
99 | 1 |
100 | 1 |
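The distinct values in the col0 output column can be cross-checked by hand: the test data sums to 55, and each reported value equals a running total of the sorted data divided by 55 (for example, 1/55 ≈ 0.0182, 3/55 ≈ 0.0545, and 45/55 ≈ 0.8182). The following sketch performs that check in plain Python. It does not call PAI, and it makes no claim about how the component assigns each quantile index to a cumulative share.

```python
from itertools import accumulate

# col0 values from the test data in this example
data = [4, 7, 2, 8, 6, 3, 9, 5, 0, 1, 10]
total = sum(data)

# Share of the total held by the k smallest values, for k = 1..len(data)
cum_shares = [s / total for s in accumulate(sorted(data))]

# Distinct col0 values visible in the output table of this example
reported = [0, 0.01818181818181818, 0.05454545454545454, 0.8181818181818182, 1]

for value in reported:
    # Find how many of the smallest values produce this cumulative share
    k = next(
        i for i, share in enumerate(cum_shares, start=1)
        if abs(share - value) < 1e-12
    )
    print(f"{value} is the share of the total held by the {k} smallest values")
```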