Vector Assembler combines multiple columns into a single vector column. The selected columns can be numeric columns or vector columns, and the component concatenates their values into one vector for each row. It is typically used as a feature-preparation step so that downstream components, such as classification or clustering components, can read all features from one column.
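The following plain-Python sketch illustrates the idea of assembling columns into one vector. It is not the component's implementation: the helper `assemble_row` is hypothetical, and the sketch assumes that a row is a dictionary and that a vector column holds a list of numbers.

```python
from typing import Dict, List, Sequence, Union

# Assumption for this sketch: a cell is either a number or a list of numbers.
Value = Union[int, float, Sequence[float]]


def assemble_row(row: Dict[str, Value], selected_cols: List[str]) -> List[float]:
    """Concatenate the selected columns of one row into a single flat vector."""
    vec: List[float] = []
    for col in selected_cols:
        value = row[col]
        if isinstance(value, (int, float)):
            # A numeric column contributes exactly one entry.
            vec.append(float(value))
        else:
            # A vector column contributes all of its entries.
            vec.extend(float(x) for x in value)
    return vec


row = {"col1": 7, "col2": 1.5, "col3": [2.0, 3.0]}
assembled = assemble_row(row, ["col2", "col3"])
print(assembled)
```

In this sketch, `col1` is not selected, so it does not appear in the assembled vector.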
Limits
The supported compute engines are MaxCompute and Realtime Compute for Apache Flink.
Configure the component in Machine Learning Designer
Input ports
Input port (from left to right) | Data type | Recommended upstream component | Required |
data | Structured data stored in MaxCompute or Object Storage Service (OSS) | None | Yes |
Component parameters
Tab | Parameter | Description |
Field Setting | selectedCols | The names of the columns that you want to combine into the output vector. You can select numeric columns or vector columns. |
Field Setting | reservedCols | The names of the columns that you want to retain in the output. |
Parameter Setting | outputCol | The name of the vector column that is generated. |
Parameter Setting | handleInvalidMethod | The policy that is used to handle exceptions. Default value: ERROR. Valid values: ERROR: throws an exception. SKIP: skips the exception and returns NULL. |
Parameter Setting | numThreads | The number of threads used by the component. Default value: 1. |
Execution Tuning | Number of Workers | The number of workers. This parameter must be used together with the Memory per worker, unit MB parameter. The value must be a positive integer in the range [1,9999]. |
Execution Tuning | Memory per worker, unit MB | The memory size of each worker. Valid values: 1024 to 65536. Unit: MB. |
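The two handleInvalidMethod policies can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the component's code: the helper `assemble_with_policy` is hypothetical, and the sketch treats a missing value (`None`) as the invalid case, which the documentation above does not specify.

```python
from typing import Dict, List, Optional


def assemble_with_policy(
    row: Dict[str, Optional[float]],
    selected_cols: List[str],
    handle_invalid: str = "ERROR",
) -> Optional[List[float]]:
    """Assemble numeric columns into a vector, applying an ERROR or SKIP policy."""
    if handle_invalid not in ("ERROR", "SKIP"):
        raise ValueError(f"unsupported policy: {handle_invalid!r}")
    vec: List[float] = []
    for col in selected_cols:
        value = row.get(col)
        if value is None:
            # Assumption: a missing value is what counts as invalid here.
            if handle_invalid == "ERROR":
                raise ValueError(f"invalid value in column {col!r}")
            return None  # SKIP: no exception, the result is NULL
        vec.append(float(value))
    return vec


good = assemble_with_policy({"col2": 1.0, "col3": 2.0}, ["col2", "col3"])
skipped = assemble_with_policy({"col2": 1.0, "col3": None}, ["col2", "col3"], "SKIP")
print(good, skipped)
```

With ERROR, the same row that contains the missing value raises an exception instead of returning NULL.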
Output ports
Output port (from left to right) | Storage location | Recommended downstream component | Model type |
data | N/A | None | None |
Example
You can copy the following code to the code editor of the PyAlink Script component. This allows the PyAlink Script component to function like the Vector Assembler component.
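from pyalink.alink import *

def main(sources, sinks, parameter):
    # Read the table connected to the first input port.
    data = sources[0]
    # Columns to combine into the output vector.
    selectedColNames = ["col2", "col3"]
    assembler = VectorAssemblerBatchOp()\
        .setSelectedCols(selectedColNames)\
        .setOutputCol("vec")
    result = assembler.linkFrom(data)
    # Send the result to the first output port and run the job.
    result.link(sinks[0])
    BatchOperator.execute()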