Covariance - Platform For AI - Alibaba Cloud Documentation Center

Covariance is a statistical measure of the linear relationship between two random variables. It quantifies how the two variables vary together, and is calculated as the expected value of the product of their deviations from their respective means. Covariance is a fundamental concept in probability theory and statistics. In machine learning, it is commonly used for tasks such as feature selection and data preprocessing.

Algorithm description

Definition

Covariance is defined as the expected value of the product of the deviations of two random variables from their respective expected values. Formula:

$\operatorname{cov}(X, Y) = E[(X - \mu)(Y - \nu)]$

X and Y are two random variables.
μ and ν are the expected values of X and Y, respectively.
E is the expectation operator.
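
As a rough illustration of the definition, the following Python sketch computes the covariance of two equal-length samples by replacing the expectation E with an arithmetic mean (division by n). The function name `covariance` and the sample data are assumptions made for this example. This topic does not state whether the Covariance component divides by n or by n - 1, so the values in its output table may differ from this sketch by that factor.

```python
def covariance(x, y):
    """Mean of the product of deviations, i.e. E[(X - mu)(Y - nu)] over a sample."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if not x:
        raise ValueError("x and y must not be empty")
    n = len(x)
    mu = sum(x) / n  # sample mean of X
    nu = sum(y) / n  # sample mean of Y
    return sum((xi - mu) * (yi - nu) for xi, yi in zip(x, y)) / n


# Illustrative data: y is exactly 2 * x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
cov_xy = covariance(x, y)
print(cov_xy)  # 2.5
```

Because covariance is symmetric, `covariance(x, y)` and `covariance(y, x)` return the same value, and `covariance(x, x)` is the variance of x.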

Properties

Positive covariance: Indicates that the two variables have a positive correlation, meaning that when one variable increases, the other variable also tends to increase.
Negative covariance: Indicates that the two variables have a negative correlation, meaning that when one variable increases, the other variable tends to decrease.
Zero covariance: Indicates that the two variables have no linear relationship. The variables can still be related in a nonlinear way.
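
The three cases above can be sketched with small hand-made datasets. The helper `covariance` and all data values are assumptions chosen for illustration (the helper divides by n), not output of the Covariance component.

```python
def covariance(x, y):
    """Mean of the product of deviations from the sample means (divides by n)."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mu = sum(x) / n
    nu = sum(y) / n
    return sum((xi - mu) * (yi - nu) for xi, yi in zip(x, y)) / n


x = [-2.0, -1.0, 0.0, 1.0, 2.0]

y_increasing = [2 * xi + 1 for xi in x]  # rises as x rises
y_decreasing = [-xi for xi in x]         # falls as x rises
y_parabola = [xi ** 2 for xi in x]       # symmetric around x = 0

cov_pos = covariance(x, y_increasing)
cov_neg = covariance(x, y_decreasing)
cov_zero = covariance(x, y_parabola)

print(cov_pos)   # 4.0
print(cov_neg)   # -2.0
print(cov_zero)  # 0.0
```

In the last case, y is fully determined by x, yet the positive and negative products of deviations cancel out.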

Configure the component

Method 1: Configure the component on the pipeline page

Add a Covariance component on the pipeline page and configure the following parameters:

Category	Parameter	Description
Fields Setting	Input Columns	The input columns. You can select only BIGINT- or DOUBLE-type columns.
Tuning	Cores	The number of cores used for the computation. If you do not specify this parameter, the system automatically allocates the cores.
Tuning	Memory Size	The memory size of each core. Unit: MB. If you do not specify this parameter, the system automatically allocates the memory.

Method 2: Use PAI commands

Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL script.

PAI -name cov
    -project algo_public
    -DinputTableName=maple_test_cov_basic12x10_input
    -DoutputTableName=maple_test_cov_basic12x10_output
    -DcoreNum=6
    -DmemSizePerCore=110;

Parameter	Required	Default value	Description
inputTableName	Yes	None	The name of the input table.
inputTablePartitions	No	All partitions of the input table	The partitions that are selected from the input table for training. Supported formats: partition_name=value (single partition); name1=value1/name2=value2 (multi-level partitions). Note: If you specify multiple partitions, separate them with commas (,). For example, name1=value1,value2.
outputTableName	Yes	None	The name of the output table.
selectedColNames	No	All columns	The columns selected from the input table.
lifecycle	No	None	The lifecycle of the output table.
coreNum	No	Determined by the system	The number of cores used in computing. The value must be a positive integer. Valid values: 1 to 9999.
memSizePerCore	No	Determined by the system	The memory size of each core. Valid values: 1 to 65536. Unit: MB.
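
If you generate PAI commands programmatically, a small helper can check the documented value ranges before a command is submitted. The following Python sketch is one possible approach. The function `build_cov_command` is hypothetical and is not part of any PAI SDK. It only assembles the command text in the layout shown above, covers the two required parameters plus coreNum and memSizePerCore, and omits the other optional parameters.

```python
def build_cov_command(input_table, output_table, core_num=None, mem_size_per_core=None):
    """Assemble a 'PAI -name cov' command string (hypothetical helper, for illustration)."""
    if not input_table or not output_table:
        raise ValueError("inputTableName and outputTableName are required")

    lines = [
        "PAI -name cov",
        "    -project algo_public",
        f"    -DinputTableName={input_table}",
        f"    -DoutputTableName={output_table}",
    ]

    # Optional tuning parameters: left out when None so the system decides.
    if core_num is not None:
        if not isinstance(core_num, int) or not 1 <= core_num <= 9999:
            raise ValueError("coreNum must be an integer from 1 to 9999")
        lines.append(f"    -DcoreNum={core_num}")

    if mem_size_per_core is not None:
        if not isinstance(mem_size_per_core, int) or not 1 <= mem_size_per_core <= 65536:
            raise ValueError("memSizePerCore must be an integer from 1 to 65536 (MB)")
        lines.append(f"    -DmemSizePerCore={mem_size_per_core}")

    return "\n".join(lines) + ";"


command = build_cov_command(
    "maple_test_cov_basic12x10_input",
    "maple_test_cov_basic12x10_output",
    core_num=6,
    mem_size_per_core=110,
)
print(command)
```

With these arguments, the helper reproduces the sample command shown earlier in this section. The resulting text would still have to be run through a component such as SQL Script.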