The covariance algorithm is a statistical method that measures the linear relationship between two random variables. It evaluates how the variables vary together by calculating the expected value of the product of their deviations from their respective means. Covariance is a basic quantity in probability theory and statistics, and is widely used in machine learning for tasks such as feature selection and data preprocessing.
Algorithm description
Definition
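Covariance is defined as the expected value of the product of the deviations of two random variables from their expected values. Formula:
$$\operatorname{cov}(X, Y) = E\left[(X - \mu)(Y - \nu)\right]$$
In this formula: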
X and Y are two random variables.
μ and ν are the expected values of X and Y, respectively.
E is the expectation operator.
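The definition above can be sketched in a few lines of plain Python. This is an illustration only, not the component's implementation: the helper name `covariance` is hypothetical, the expectation E is approximated by the arithmetic mean over the observed values (division by n), and the source does not state whether the component divides by n or by n - 1.

```python
def covariance(xs, ys):
    """Covariance of two equal-length sequences, with E taken as the mean over the data."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    mu = sum(xs) / n   # estimate of E[X]
    nu = sum(ys) / n   # estimate of E[Y]
    # Mean of the products of the deviations: E[(X - mu)(Y - nu)]
    return sum((x - mu) * (y - nu) for x, y in zip(xs, ys)) / n


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(covariance(x, y))  # 2.5
```

The covariance of a variable with itself, `covariance(x, x)`, is its variance under the same divide-by-n convention.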
Properties
Positive covariance: Indicates that the two variables have a positive correlation, meaning that when one variable increases, the other variable also tends to increase.
Negative covariance: Indicates that the two variables have a negative correlation, meaning that when one variable increases, the other variable tends to decrease.
Zero covariance: Indicates that the two variables have no linear relationship. The variables can still be related in a nonlinear way.
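The three cases can be checked numerically. The sketch below uses a hypothetical helper `covariance` (division by n, which is an assumption rather than the component's documented behavior) on small made-up data sets. The last data set has zero covariance even though y is fully determined by x, which shows that zero covariance rules out only a linear relationship.

```python
def covariance(xs, ys):
    """Mean of the products of deviations from the means (divide-by-n convention)."""
    n = len(xs)
    mu = sum(xs) / n
    nu = sum(ys) / n
    return sum((x - mu) * (y - nu) for x, y in zip(xs, ys)) / n


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
increasing = [2 * x + 1 for x in xs]  # y rises with x
decreasing = [-3 * x for x in xs]     # y falls as x rises
parabola = [x * x for x in xs]        # y depends on x, but not linearly

print(covariance(xs, increasing))  # 4.0  (positive)
print(covariance(xs, decreasing))  # -6.0 (negative)
print(covariance(xs, parabola))    # 0.0  (zero, despite the dependence)
```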
Configure the component
Method 1: Configure the component on the pipeline page
Add a Covariance component on the pipeline page and configure the following parameters:
Category | Parameter | Description |
Fields Setting | Input Columns | The input columns. You can select only BIGINT- or DOUBLE-type columns. |
Tuning | Cores | The number of cores used in computing. If you do not specify this parameter, the system automatically allocates the number of cores. |
Tuning | Memory Size | The memory size of each core. If you do not specify this parameter, the system automatically allocates the memory size. Unit: MB. |
Method 2: Use PAI commands
Configure the component parameters by using PAI commands. You can use the SQL Script component to call PAI commands. For more information, see SQL Script.
PAI -name cov
-project algo_public
-DinputTableName=maple_test_cov_basic12x10_input
-DoutputTableName=maple_test_cov_basic12x10_output
-DcoreNum=6
-DmemSizePerCore=110;
Parameter | Required | Default value | Description |
inputTableName | Yes | None | The name of the input table. |
inputTablePartitions | No | All partitions of the input table | The partitions that are selected from the input table for the computation. The following formats are supported:
Note: If you specify multiple partitions, separate them with commas (,). For example, name1=value1,value2. |
outputTableName | Yes | None | The name of the output table. |
selectedColNames | No | All columns | The columns selected from the input table. |
lifecycle | No | None | The lifecycle of the output table. |
coreNum | No | Determined by the system | The number of cores used in computing. The value must be a positive integer. Valid values: 1 to 9999. |
memSizePerCore | No | Determined by the system | The memory size of each core. Valid values: 1 to 65536. Unit: MB. |
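When the command is generated from a script, the value ranges in the table can be checked before the command is submitted. The following is a minimal sketch, assuming a hypothetical helper named `build_cov_command` that is not part of PAI. It only assembles the command text from the parameters shown above and does not submit anything.

```python
def build_cov_command(input_table, output_table, core_num=None, mem_size_per_core=None):
    """Assemble a PAI cov command string after checking the documented value ranges."""
    if not input_table or not output_table:
        raise ValueError("inputTableName and outputTableName are required")
    if core_num is not None and not (isinstance(core_num, int) and 1 <= core_num <= 9999):
        raise ValueError("coreNum must be an integer from 1 to 9999")
    if mem_size_per_core is not None and not (
        isinstance(mem_size_per_core, int) and 1 <= mem_size_per_core <= 65536
    ):
        raise ValueError("memSizePerCore must be an integer from 1 to 65536 (MB)")

    parts = [
        "PAI -name cov",
        "-project algo_public",
        f"-DinputTableName={input_table}",
        f"-DoutputTableName={output_table}",
    ]
    # Optional parameters are left out so that the system chooses the values.
    if core_num is not None:
        parts.append(f"-DcoreNum={core_num}")
    if mem_size_per_core is not None:
        parts.append(f"-DmemSizePerCore={mem_size_per_core}")
    return "\n".join(parts) + ";"


print(build_cov_command(
    "maple_test_cov_basic12x10_input",
    "maple_test_cov_basic12x10_output",
    core_num=6,
    mem_size_per_core=110,
))
```

The printed text matches the sample command in this section and can be pasted into an SQL Script component.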