Reject Inference (RI) is a commonly used technique in financial risk management, primarily aimed at mitigating sample selection bias and enhancing the accuracy and reliability of credit assessment models. The core idea of reject inference is to utilize information from accepted customers (those who have passed the approval process) to infer the risk characteristics of rejected customers (those who have not passed the approval process), thereby providing a more comprehensive evaluation of credit risk.
Algorithm
In credit scenarios, scorecard models of user repayment and default behavior are trained only on the data of users who were granted a loan; the data of applicants whose loan requests were rejected is not included. This biases the model's predictions, which in most cases are overly optimistic. You can use the Reject Inference algorithm to address this issue.
Based on the input training data, the Reject Inference algorithm assigns labels to data that has prediction results but lacks actual labels. The training data, also known as the accept data, contains both actual labels and prediction results. The data without actual labels is known as the rejection data. The algorithm provides the following four inference methods.
fuzzy
The fuzzy method augments the dataset by adding both a good label and a bad label to each sample in the rejection data. The sample weight of each label is calculated based on the following formulas:

$$w_{\text{good}} = p_{\text{good}} \times \frac{r}{1 - r}$$

$$w_{\text{bad}} = (1 - p_{\text{good}}) \times k \times \frac{r}{1 - r}$$

In the preceding formulas, $p_{\text{good}}$ is the probability of a good sample predicted by the scorecard component in the previous step. You can specify the $r$ and $k$ parameters:

$r$ (rejection rate): the rejection rate of all data.

$k$ (event rate increase): the ratio of the probability of bad samples in the rejection data to the probability of bad samples in the accept data.
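The following Python sketch shows one way to implement fuzzy augmentation under the preceding formulas. It is a minimal illustration, not the component's actual implementation: the pandas DataFrame layout, the p_good column name, and the label encoding (1 = good, 0 = bad) are all assumptions.

```python
import pandas as pd

def fuzzy_augment(rejects: pd.DataFrame, p_good_col: str,
                  rejection_rate: float, event_rate_increase: float) -> pd.DataFrame:
    """Duplicate each rejected sample into a weighted good copy and a weighted bad copy."""
    # Re-weighting factor r / (1 - r) that reflects the share of rejects in all data.
    reject_factor = rejection_rate / (1.0 - rejection_rate)

    good = rejects.copy()
    good["label"] = 1  # assumed encoding: 1 = good
    good["weight"] = good[p_good_col] * reject_factor

    bad = rejects.copy()
    bad["label"] = 0  # assumed encoding: 0 = bad
    bad["weight"] = (1.0 - bad[p_good_col]) * event_rate_increase * reject_factor

    # Each rejected sample now appears twice, once per label, with fuzzy weights.
    return pd.concat([good, bad], ignore_index=True)
```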
hard cut-off
The hard cut-off method requires you to set a threshold score based on the results of the scorecard model in the previous step and your risk tolerance for rejected users. The system adds the bad sample label to samples whose scores are lower than the threshold and the good sample label to samples whose scores are higher than the threshold.
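A minimal sketch of this rule, assuming the same DataFrame layout and label encoding as in the fuzzy sketch above; the score column name is illustrative:

```python
import pandas as pd

def hard_cutoff(rejects: pd.DataFrame, score_col: str, cutoff: float) -> pd.DataFrame:
    """Label rejected samples by comparing their scores against a threshold."""
    labeled = rejects.copy()
    # Assumed encoding: 1 = good (score above the cutoff), 0 = bad (score below it).
    labeled["label"] = (labeled[score_col] > cutoff).astype(int)
    return labeled
```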
parcelling
The parcelling method buckets the accept data based on the prediction results of the scorecard model in the previous step and calculates the default rate of each bucket. The system then buckets the rejection data in the same way, uses the default rate of each bucket as a sampling rate, and randomly selects samples within each bucket at that rate. The selected samples are labeled as bad samples, and the remaining samples are labeled as good samples.
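The sketch below illustrates the parcelling procedure, again under assumed column names and label encodings. Deriving the bucket edges from quantiles of the accept-data scores is one plausible bucketing choice, not necessarily the component's; the event_rate_increase factor scales the accept-data default rate before sampling.

```python
import numpy as np
import pandas as pd

def parcelling(accepts: pd.DataFrame, rejects: pd.DataFrame, score_col: str,
               label_col: str, n_buckets: int = 25,
               event_rate_increase: float = 1.0, seed: int = 0) -> pd.DataFrame:
    """Label rejected samples by sampling at each score bucket's default rate."""
    rng = np.random.default_rng(seed)

    # Bucket edges come from quantiles of the accept-data scores (assumed scheme).
    _, edges = pd.qcut(accepts[score_col], q=n_buckets, retbins=True, duplicates="drop")
    accept_bucket = pd.cut(accepts[score_col], bins=edges, include_lowest=True)

    # Default (bad) rate per bucket on the accept data, scaled by event_rate_increase.
    bad_rate = accepts.groupby(accept_bucket, observed=True)[label_col].agg(
        lambda s: (s == 0).mean())  # assumed encoding: 0 = bad
    bad_rate = (bad_rate * event_rate_increase).clip(upper=1.0)

    # Bucket the rejection data with the same edges, then randomly label samples
    # in each bucket as bad at that bucket's default rate.
    labeled = rejects.copy()
    reject_bucket = pd.cut(labeled[score_col], bins=edges, include_lowest=True)
    draws = rng.random(len(labeled))
    labeled["label"] = (draws >= reject_bucket.map(bad_rate).astype(float)).astype(int)
    return labeled
```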
two stage
The two stage method requires the prediction results of the scorecard model in the previous step (GoodBadScore), as well as the acceptance probability of each sample output by the model prediction component in the previous step (AcceptRejectScore). The method corrects the scorecard model's predictions on the unlabeled samples by fitting the linear relationship between AcceptRejectScore and GoodBadScore, and then labels the samples by using the parcelling method.
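Continuing the same assumptions, the following sketch fits the linear relationship between the two scores with numpy.polyfit (one plausible choice of fitting method), corrects the scorecard predictions on the rejection data, and then reuses the parcelling function from the previous sketch:

```python
import numpy as np
import pandas as pd

def two_stage(accepts: pd.DataFrame, rejects: pd.DataFrame, good_bad_col: str,
              accept_reject_col: str, label_col: str, **parcelling_kwargs) -> pd.DataFrame:
    """Correct scorecard predictions via a linear fit, then label by parcelling."""
    # Stage 1: fit GoodBadScore as a linear function of AcceptRejectScore
    # on the accept data.
    slope, intercept = np.polyfit(accepts[accept_reject_col].to_numpy(),
                                  accepts[good_bad_col].to_numpy(), deg=1)

    # Correct the scorecard predictions on the unlabeled rejection data.
    corrected = rejects.copy()
    corrected[good_bad_col] = slope * corrected[accept_reject_col] + intercept

    # Stage 2: label the corrected rejection data with the parcelling method
    # (the parcelling function is defined in the previous sketch).
    return parcelling(accepts, corrected, score_col=good_bad_col,
                      label_col=label_col, **parcelling_kwargs)
```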
Inputs and outputs
Input ports
accept data: the output of the Read Table or Scorecard Prediction component.
rejection data: the output of the Read Table, Scorecard Prediction, or Linear Regression Prediction component.
Output port
The output type is a MaxCompute table, with downstream components including Scorecard Training and Binning.
Configure the component
| Tab | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Field Setting | good/bad score column | Yes | The prediction result column of the scorecard component. In most cases, this is the prediction_score column that the scorecard component outputs. The accept data is labeled based on whether a sample is good or bad. | No default value |
| | actual label column | Yes | The name of the actual label column in the accept data. | No default value |
| | weight column | No | The name of the weight column. | No default value |
| | accept rate score column | No | The acceptance probability of the predicted samples. In most cases, this is an output of the scorecard component. The data is labeled based on whether a sample is accepted or rejected. This parameter is required if you set inference method to two stage. | No default value |
| Parameter Setting | inference method | No | The inference method. Valid values: fuzzy, hard cut-off, parcelling, and two stage. | fuzzy |
| | rejection rate | Yes | The probability that a sample is rejected. | 0.3 |
| | buckets number | No | The number of buckets for training. This parameter is required if you set inference method to parcelling or two stage. | 25 |
| | cutoff score | No | The threshold score. The system adds the good sample label to samples whose scores are higher than the threshold and the bad sample label to samples whose scores are lower than the threshold. This parameter is required if you set inference method to hard cut-off. | No default value |
| | event rate increase | No | The ratio of the probability of bad samples in the rejection data to the probability of bad samples in the accept data. How this scaling factor is applied differs based on the selected inference method. This parameter is required if you set inference method to fuzzy, parcelling, or two stage. | 1.0 |
| | seed | No | The seed that is used when the system randomly assigns labels. This parameter is required if you set inference method to parcelling. | 0 |
| | score range method | No | The method that is used to determine the score range of each bucket. This parameter is required if you set inference method to parcelling or two stage. | augmentation |
| | Score Conversion | Yes | If you select Score Conversion, you must set the scaledValue, odds, and pdo parameters. For more information, see Scorecard Training. | false |
| | scaledValue | No | The scaled score that corresponds to the specified odds. For more information, see Scorecard Training. | No default value |
| | odds | No | The good-to-bad odds at the scaled score. For more information, see Scorecard Training. | No default value |
| | pdo | No | The number of points that doubles the odds. For more information, see Scorecard Training. | No default value |
| Execution Tuning | Choose Running Mode | Yes | The type of resources that are used to run the job. | MaxCompute |
| | Number of Workers | No | The number of nodes on which the job runs. The value must be a positive integer. Valid values: [1,9999]. | No default value |
| | Memory per Worker (MB) | No | The memory size of each worker node. Unit: MB. Valid values: [1024,65536]. | No default value |