
Platform for AI: Component reference: Overview of all components

Last Updated: Mar 12, 2024

This topic describes the components supported by Machine Learning Designer of Platform for AI (PAI).

The components are grouped by type. Each entry lists the component name followed by a description of the component.

Data source and destination

Read File Data

This component is used to read objects or directories from Object Storage Service (OSS) buckets.

Read CSV File

This component allows you to read CSV files from OSS, HTTP, and Hadoop Distributed File System (HDFS) data sources.

Read MaxCompute Table

This component reads data from MaxCompute tables. By default, the component reads the table data of the current project.

Write Table

This component allows you to write upstream data to MaxCompute.

Data preprocessing

Random Sampling

This component randomly samples data based on a specified proportion or a specified number of samples. The samples are independent of each other.

Weighted Sampling

This component samples data with probability proportional to the values of a weight column.

Filtering and Mapping

This component filters data by using filter condition expressions and allows you to rename the columns in the output.

Stratified Sampling

This component stratifies the input data based on the values of a stratification column and implements random data sampling for each stratum.
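
The per-stratum logic can be sketched with pandas. This is a minimal illustration of stratified sampling on a toy table, not the component's implementation; the column names and sampling fraction are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "label": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "x":     [1, 2, 3, 4, 5, 6, 7, 8],
})

# Draw 50% of the rows independently within each stratum
# defined by the hypothetical "label" stratification column.
sampled = df.groupby("label").sample(frac=0.5, random_state=42)
print(sampled)
```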

JOIN

This component merges two tables by associating the columns in the tables and determines the output columns. This component works like the JOIN statement of SQL.

Merge Columns

This component merges two tables by column. The two tables must have the same number of rows. If one of the two tables has partitions, the partitioned table must connect to the second input port.

Merge Rows (UNION)

This component merges two tables by row. If this component is used, the numbers and data types of the output fields selected from the left and right tables must be the same. This component integrates the features of UNION and UNION ALL.
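
For intuition, the UNION and UNION ALL semantics can be approximated in pandas. This sketch is illustrative only and is not how the component is implemented.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "val": ["a", "b"]})
right = pd.DataFrame({"id": [2, 3], "val": ["b", "c"]})

# UNION ALL: stack the rows of both tables, keeping duplicates.
union_all = pd.concat([left, right], ignore_index=True)

# UNION: the same, but with duplicate rows removed.
union = union_all.drop_duplicates().reset_index(drop=True)
print(union)
```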

Data Type Conversion

This component converts features of any data type to features of the STRING, DOUBLE, or INT data type. This component also allows you to replace missing values if exceptions occur during data type conversion.

Add ID Column

This component allows you to append an ID column as the first column of a data table.

Split

This component randomly splits data to generate datasets for training and testing.

Missing Data Imputation

This component fills in missing values in input data. It can be configured in a visualized manner or by running PAI commands.

Normalization

This component allows you to normalize dense or sparse data.

Standardization

This component standardizes input data. It can be configured in a visualized manner or by running PAI commands.

KV2Table

This component allows you to convert a table in the key-value format into a common table.

Table2KV

This component allows you to convert a common table into a table in the key-value format in a visualized manner or by running PAI commands.

Feature engineering

Feature Importance Filtering

This component provides the filtering feature for components including Linear Model Feature Importance, GBDT Feature Importance, and Random Forest Feature Importance. This component can be used to filter the top N features.

Principal Component Analysis (PCA)

This component uses a multivariate statistical method to explore the internal structures of multiple variables and how they correlate to each other based on a few principal components.
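
As a refresher on the method itself rather than the component's API, PCA can be computed by eigendecomposing the covariance matrix of centered data. The matrix sizes and the number of retained components below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # 100 samples, 5 features

# Center the data, then eigendecompose its covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Keep the top-2 principal components and project the data onto them.
top = np.argsort(eigvals)[::-1][:2]
X_reduced = Xc @ eigvecs[:, top]
print(X_reduced.shape)                  # (100, 2)
```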

Feature Scaling

This component allows you to scale numeric data in the dense or sparse format by using common scaling functions.

Feature Discretization

This component discretizes continuous features based on a specific rule.

Feature Anomaly Smoothing

This component can smooth anomalous features in input data to a specific interval. Both sparse and dense data are supported.

Singular-value Decomposition (SVD)

This component is used to decompose matrices in linear algebra. It is a generalization of the diagonalization of normal matrices in matrix analysis.
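
For reference, the decomposition itself is available in NumPy. The sketch below verifies A = U diag(s) Vt on a toy matrix and is independent of the PAI component.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A from its factors to verify the decomposition.
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True
```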

Anomaly Detection

This component detects exceptions in data with continuous or enumerated features.

Linear Model Feature Importance

This component calculates the feature importance for a linear model, such as linear regression or logistic regression for binary classification. Both sparse and dense data are supported.

Discrete Feature Analysis

This component is used to collect statistics on the distribution of discrete features.

Random Forest Feature Importance Evaluation

This component calculates feature importance based on a random forest model.

Feature Selection (Filter Method)

This component selects the top N features from all feature data in the sparse or dense format by using a filter based on the feature selection method that you specify.

Feature Encoding

This component can encode nonlinear features to linear features based on gradient boosting decision tree (GBDT) algorithms.

One Hot Encoding

This component converts data to key-value pairs in the sparse format.
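
The key-value output format can be illustrated in plain Python. The sketch below builds a feature-index vocabulary and emits each row as sparse "index:1" pairs; the exact vocabulary construction and output layout of the component may differ.

```python
# Toy categorical rows; real input would come from a table.
rows = [{"city": "beijing", "gender": "f"},
        {"city": "shanghai", "gender": "m"}]

# Assign a feature index to every distinct column=value combination.
vocab = {}
for row in rows:
    for col, val in row.items():
        vocab.setdefault(col + "=" + val, len(vocab))

# Emit each row as sparse key-value pairs.
for row in rows:
    pairs = [f"{vocab[col + '=' + val]}:1" for col, val in row.items()]
    print(",".join(pairs))
# 0:1,1:1
# 2:1,3:1
```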

Statistical analytics

Data Pivoting

This component allows you to view the distributions of feature values, feature columns, and label columns. This facilitates follow-up data analysis.

Covariance

This component is used to measure the joint variability of two variables.

Empirical Probability Density Chart

This component uses empirical distribution and kernel density estimation functions.

Whole Table Statistics

This component collects statistics about data in a table or only selected columns.

Chi-square Goodness of Fit Test

This component is used in scenarios that involve categorical variables. It determines the difference between the observed frequency and the expected frequency for each class of a single multiclass categorical variable. The null hypothesis assumes that the observed and expected frequencies are the same.
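
The statistic itself can be reproduced with SciPy. This sketch is illustrative and independent of the component's input format; it tests observed frequencies against uniform expected frequencies.

```python
from scipy import stats

observed = [18, 22, 20, 40]   # observed frequency per class
expected = [25, 25, 25, 25]   # expected frequency under the null hypothesis

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p)  # reject the null hypothesis if p is below your significance level
```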

Box Plot

A box plot shows the distribution of a set of data. It reveals the distribution features of raw data and can be used to compare the distribution features of multiple sets of data.

Scatter Plot

In regression analysis, a scatter chart shows the distribution of data points in a Cartesian coordinate system.

Correlation Coefficient Matrix

A correlation coefficient indicates the correlation between columns in a matrix. The coefficient is in the range of [-1, 1]. For each pair of columns, the count used in the calculation is the number of rows in which both columns contain non-zero elements.

Two Sample T Test

This component checks whether the means of two samples are significantly different based on statistical principles.

One Sample T Test

This component is used to determine whether a significant difference exists between the overall mean of a variable and a specific value. The sample on which you want to perform a T test must follow a normal distribution.

Normality Test

This component determines whether a population follows a normal distribution based on observations. A normality test is a special goodness-of-fit hypothesis test in statistics.

Lorenz Curve

This component can be used to show the income distribution of a country or region.

Percentile

This component is used to calculate the percentile of data in the columns of a data table.

Pearson Coefficient

This component is a linear correlation coefficient that measures the linear correlation between two variables.

Histogram

A histogram, also known as a mass distribution profile, is a statistical chart that consists of a series of vertical bars or line segments of different heights to show data distribution.

Machine learning

Prediction

This component uses the training model and prediction data as input and generates prediction results as output.

XGBoost Train

XGBoost is an extension of the gradient boosting algorithm. XGBoost provides better usability and robustness and has been widely used in machine learning production systems and machine learning competitions. XGBoost can be used for classification and regression.

XGBoost Predict

XGBoost is an extension of the gradient boosting algorithm that provides better usability and robustness and has been widely used in machine learning production systems and competitions. This component uses a trained XGBoost model to generate predictions for classification and regression tasks.

Linear SVM

This component is a machine learning model based on statistical learning theory. It improves the generalization capability of the learner by minimizing structural risk, which accounts for both the empirical risk and the confidence interval.

Binary Logistic Regression

This component is a binary classification algorithm and supports sparse and dense data.

GBDT Binary Classification

This component trains a gradient boosting decision tree (GBDT) model for binary classification based on a threshold that you set. Samples whose label values are greater than the threshold are treated as positive examples, and the remaining samples are treated as negative examples.

PS-SMART Binary Classification Training

A parameter server (PS) is used to process a large number of offline and online training tasks. SMART stands for scalable multiple additive regression tree. PS-SMART is an iterative algorithm that is implemented by using a PS-based gradient boosting decision tree (GBDT).

PS Logistic Regression for Binary Classification

This component is a classic binary classification algorithm and is widely used in advertising and search scenarios.

PS-SMART Multiclass Classification

A parameter server (PS) is used to process a large number of offline and online training tasks. SMART stands for scalable multiple additive regression tree. PS-SMART is an iterative algorithm that is implemented by using a PS-based gradient boosting decision tree (GBDT).

KNN

This component selects the K-nearest records from a row in the prediction table for classification. The most common class of the K-nearest records is used as the class of the row.
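
The classification rule described above can be sketched in a few lines of NumPy. This is illustrative only; the distance metric and options of the component may differ.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x as the most common label among its k nearest training rows."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance, an assumption
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])
print(knn_predict(X_train, y_train, np.array([5.0, 5.5])))  # "b"
```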

Logistic Regression for Multiclass Classification

This component is used for multiclass classification, and supports both the sparse and dense data formats.

Random Forest

This component is a classifier that consists of multiple decision trees. The classification result is determined by the mode of output classes of individual trees.

Naive Bayes

This component is a probabilistic classification algorithm based on Bayes' theorem with naive independence assumptions between features.

K-means Clustering

This component randomly selects K objects as the initial centroids of each cluster, computes the distance between each remaining object and the centroids, assigns each object to the nearest cluster, and then recalculates the centroid of each cluster. This process is repeated until the centroids converge.
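
The iteration described above can be sketched directly in NumPy. This is a minimal illustration of the algorithm, not the component's implementation; a production version would also handle empty clusters and convergence checks.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick k rows as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(50, 2)), rng.normal(size=(50, 2)) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)
```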

DBSCAN

This component is used to create clustering models.

GMM Training

This component trains a Gaussian mixture model (GMM) for clustering.

DBSCAN Prediction

This component is used to predict the clusters to which new points may belong based on DBSCAN models.

GMM Prediction

This component is used to perform clustering prediction based on trained Gaussian mixture models.

GBDT Regression

This component is an iterative decision tree algorithm that is suitable for linear and nonlinear regression scenarios.

Linear Regression

This component is used to analyze the linear relationship between a dependent variable and multiple independent variables.

PS-SMART Regression

This component is used to process a large number of offline and online training jobs. SMART is short for scalable multiple additive regression tree. PS-SMART is an iterative algorithm that is implemented by using a PS-based gradient boosting decision tree (GBDT).

PS Linear Regression

This component is used to analyze the linear relationship between a dependent variable and multiple independent variables. A parameter server is used to process a large number of offline and online training jobs.

Binary Classification Evaluation

This component calculates metrics such as the area under the curve (AUC), the Kolmogorov–Smirnov (KS) statistic, and the F1 score, and generates KS curves, precision-recall (P-R) curves, ROC curves, lift charts, and gain charts.
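
For reference, the headline metrics can be reproduced with scikit-learn. The sketch below computes AUC, the KS statistic, and the F1 score at an assumed 0.5 threshold on toy scores.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2])

auc = roc_auc_score(y_true, y_score)
fpr, tpr, _ = roc_curve(y_true, y_score)
ks = np.max(tpr - fpr)                  # KS statistic: max gap between TPR and FPR
f1 = f1_score(y_true, y_score >= 0.5)   # F1 at an assumed 0.5 threshold
print(auc, ks, f1)
```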

Regression Model Evaluation

This component evaluates the quality of regression models by comparing prediction results with the original values. Evaluation metrics and histograms of residuals are generated.

Clustering Model Evaluation

This component is used to evaluate clustering models and generate evaluation metrics based on raw data and clustering results.

Confusion Matrix

This component is suitable for supervised learning and corresponds to the matching matrix in unsupervised learning.

Multiclass Classification Evaluation

This component evaluates the quality of multiclass classification models by comparing the prediction results of the models with the original values. Evaluation metrics such as accuracy, kappa coefficient, and F1 score are generated.

Deep learning

PyTorch (Phased out soon)

This component can be found in the deep learning component list in Machine Learning Designer. You can use this component with the Read File Data component. The PyTorch component can read data only from OSS.

MXNet (Phased out soon)

This component is a deep learning framework that supports imperative and symbolic programming. You can run the component on CPU or GPU clusters.

Enable deep learning

PAI supports deep learning frameworks and provides GPU-accelerated clusters. You can use deep learning algorithms based on these frameworks and hardware resources.

Time series

x13_arima

This component is a seasonal autoregressive integrated moving average (ARIMA) algorithm based on the open source X-13ARIMA-SEATS algorithm.

x13_auto_arima

This component uses an automatic model selection procedure for ARIMA models. The procedure is based on the revised program of Gomez and Maravall (1998), as implemented in TRAMO (1996) and later versions.

Prophet

This component forecasts time series data for each row of MTable data by using the Prophet algorithm and provides the prediction result of the next time period.

MTable Assembler

This component aggregates columns in a table to create an MTable based on the value specified by groupCols.

MTable Expander

This component expands an MTable into a table.

Recommendation

FM algorithms

The Factorization Machine (FM) algorithm-based components are nonlinear models that incorporate interactions among features. The algorithm is suitable for product promotion scenarios such as e-commerce, advertising, and live streaming.

ALS Training

This component is a model-based recommendation algorithm. It uses sparse matrix factorization to factorize the user-item matrix and predicts the values of missing entries to obtain a basic training model.

Swing Train

This component is an item recall algorithm. You can use this component to measure the similarity of items based on user-item-user principles.

Swing Recommendation

This component is used to predict upstream batch data. You can use this component to perform offline prediction based on the model and prediction data generated by the Swing Train component.

Collaborative Filtering (etrec)

This component is a collaborative filtering algorithm based on items. It uses two input columns and provides the top N items with the highest similarity as the output.

Vector Recall Evaluation

This component calculates the hit rate of recalls. A higher value indicates a higher precision of recalls that are performed by using the vectors generated during model training.

Anomaly detection

Local Outlier Factor (LOF) Outlier Detection

This component identifies samples as outliers based on the Local Outlier Factor (LOF) algorithm.

iForest Outlier

This component detects anomalies by using a sub-sampling algorithm. The algorithm has low computational complexity, can identify anomalous points in datasets, and is widely used in anomaly detection scenarios.

One-Class SVM Outlier Detection

This component is an unsupervised machine learning algorithm that is different from traditional SVM algorithms. You can use this component to detect outliers by learning a decision boundary.

Natural Language Processing

Text Summarization Prediction

This component is used to extract key information from lengthy and repetitive texts. For example, headlines are the results of text summarization. You can use this component to call a specified pre-trained model to generate headlines for news.

Text Classification Prediction (MaxCompute) (Phased out soon)

This component loads a trained model, makes predictions based on input data, and generates prediction results.

Text Match Prediction (MaxCompute) (Phased out soon)

This component loads a trained model, makes predictions based on input data, and generates prediction results.

Sequence Labeling Prediction (MaxCompute) (Phased out soon)

This component loads a trained model, makes predictions based on input data, and generates prediction results.

Machine Reading Comprehension Predict

This component allows you to make batch predictions by using the models trained by the machine reading comprehension training component.

BERT embedding (MaxCompute) (Phased out soon)

This component uses the original text as the input and provides a vector sequence after feature extraction by the system.

text marking predict (MaxCompute) (Phased out soon)

This component extracts labels from the input text. This facilitates semantic analysis and accurate modeling.

text classification training (MaxCompute) (Phased out soon)

This component is integrated with BERT-based text classification models, traditional deep text classification models (such as TextCNN), and the DGCNN model of PAI.

text match training (MaxCompute) (Phased out soon)

This component checks whether two input sentences match.

sequence labeling training (MaxCompute) (Phased out soon)

This component performs multi-class classification on each token in an input sequence. The component uses the sequence labeling method described in a Google paper to classify tokens in the input sequence. You can use this component to perform tasks such as tokenization, part-of-speech tagging, and named entity recognition.

Text Summarization

This component is used to extract key information from lengthy and repetitive text. For example, headlines are the results of text summarization. You can use this component to train models that generate headlines, which summarize the main points of news.

Machine Reading Comprehension Training

This component allows you to train machine reading comprehension (MRC) models to read and comprehend given text passages and answer relevant questions.

Split Word

This component splits words in specific columns based on Alibaba Word Segmenter (AliWS). The words obtained after splitting are separated by spaces.

Convert Row, Column, and Value to KV Pair

This component converts a triple table (row,col,value) to a key-value table (row,[col_id:value]).
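
The conversion can be illustrated with pandas. This sketch groups (row, col, value) triples into one "col:value" string per row; it is not the component's implementation, and the column names are assumptions.

```python
import pandas as pd

triples = pd.DataFrame({
    "row":   [1, 1, 2],
    "col":   [101, 102, 101],
    "value": [0.5, 1.0, 2.0],
})

# Collapse the triples into one "col:value,..." string per row key.
kv = (triples
      .assign(pair=lambda d: d["col"].astype(str) + ":" + d["value"].astype(str))
      .groupby("row")["pair"]
      .agg(",".join)
      .reset_index(name="kv"))
print(kv)
```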

String Similarity

This component calculates string similarity, a basic operation in machine learning that is typically used in information retrieval, natural language processing, and bioinformatics.

String Similarity - top N

This component calculates the string similarity and obtains the top N data records that best match the mapping table.

Deprecated Word Filter

This component is a preprocessing method in text analysis. This component is used to filter noise, such as "of", "is", or "oops", in word tokenization results.

ngram-count

This component performs a step in language model training. It generates N-grams from words and counts the number of occurrences of each N-gram across all corpora.
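
The counting step is straightforward to sketch in plain Python; this illustration is independent of the component.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat the cat".split()
print(ngram_counts(tokens, 2))
# Counter({('the', 'cat'): 2, ('cat', 'sat'): 1, ...})
```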

Text Summarization

This component can automatically generate abstracts. An abstract is a simple and coherent short text that accurately reflects the main ideas of a document. This component allows computers to extract an abstract from a document.

Keyword Extraction

Keyword extraction is an important technology in natural language processing. This component extracts keywords from a document.

Sentence Splitting

This component is used to split text in a document by punctuation. This component processes text before text summarization. It splits the text into rows. Each row contains only one sentence.

Semantic Vector Distance

This component calculates extension words or sentences for specified words or sentences based on semantic vectors, such as the word vectors calculated by the Word2Vec component. The extensions are the set of vectors that are closest to a given vector. For example, you can use this component to generate a list of the words that are most similar to a given word.

Doc2Vec

You can use the Doc2Vec component to map articles to vectors. The input is a vocabulary. The output is a document vector table, a word vector table, or a vocabulary.

Conditional Random Field

A conditional random field (CRF) is a conditional probability distribution model of a group of output random variables based on a group of input random variables. This model presumes that the output random variables constitute a Markov random field (MRF).

Document Similarity

This component calculates the similarity between articles or between sentences based on string similarity.

PMI

This component counts the co-occurrence of all words in several documents and calculates the pointwise mutual information (PMI).

Conditional Random Field Prediction

This component is an algorithm component provided by Machine Learning Designer based on the online prediction model Linear Conditional Random Field (LinearCRF). This component processes sequence labeling tasks.

Word Splitting (Generate Models)

This component is developed based on Alibaba Word Segmenter (AliWS). The component generates a word segmentation model based on parameters and custom dictionaries.

Word Frequency Statistics

This component counts the total number of words in strings and the number of times that each word appears in the strings. The strings can be manually entered or read from a specified file.

TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a commonly used weighting technique for information retrieval and text mining. TF-IDF is used by search engines as a tool in scoring and ranking the relevance of a document for a given search query.
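
As a worked illustration of the weighting scheme, the sketch below computes one common TF-IDF variant; the component may use a different formula or normalization.

```python
import math
from collections import Counter

docs = [["machine", "learning", "platform"],
        ["machine", "translation"],
        ["search", "platform"]]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

for i, doc in enumerate(docs):
    tf = Counter(doc)
    tfidf = {t: (tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in tf}
    print(i, tfidf)
```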

PLDA

In PAI, you can set the Topics parameter for the PLDA component to extract topics for each document.

Word2Vec

The Word2Vec component uses a neural network to map words to vectors in the K-dimensional space based on extensive training. The component supports operations on the vectors to show the semantics of the vectors. The input is a word column or a vocabulary, and the output is a vector table and a vocabulary.

Network analysis

Tree Depth

This component generates the depth of each node in a tree and the tree ID.

k-Core

This component identifies the subgraph with the specified coreness. The largest coreness is considered to be the coreness of a graph.

Single-source Shortest Path

This component uses the Dijkstra algorithm to generate the shortest paths between a given node and all other nodes.
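
For reference, Dijkstra's algorithm on a small weighted graph can be sketched with a binary heap. This is illustrative only and says nothing about how the component executes on graph tables.

```python
import heapq

def dijkstra(graph, source):
    """Shortest distances from source; graph maps node -> [(neighbor, weight)]."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # skip stale heap entries
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
print(dijkstra(graph, "a"))               # {'a': 0, 'b': 1, 'c': 3}
```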

PageRank

This component is an algorithm that calculates and sorts the rankings of web pages based on their link sources.
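
The ranking can be sketched as power iteration on a row-normalized adjacency matrix. The damping factor of 0.85 is the conventional default and, like the toy graph, an assumption for illustration.

```python
import numpy as np

def pagerank(adj, damping=0.85, n_iter=100):
    """Power iteration; adj[i, j] = 1 if page i links to page j."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    out_deg[out_deg == 0] = 1             # avoid division by zero for sink nodes
    transition = adj / out_deg
    rank = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        rank = (1 - damping) / n + damping * (transition.T @ rank)
    return rank

adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]], dtype=float)
print(pagerank(adj))
```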

Label Propagation Clustering

This component is a semi-supervised machine learning algorithm. The labels of a node (community) depend on those of the neighboring nodes. The degree of dependence is determined by the similarity between nodes. Data becomes stable through iterative propagation updates.

Label Propagation Classification

This component is a semi-supervised classification algorithm. It uses the label information of labeled nodes to predict the label information for unlabeled nodes.

Modularity

This component is a metric that is used to evaluate the structure of communities in a network. It is designed to measure the strength of a network divided into communities. Values greater than 0.3 indicate a strong community structure.

Maximum Connected Subgraph

This component generates maximum connected subgraphs. In Undirected Graph G, Vertex A is connected to Vertex B if a path exists between the two vertices. Undirected Graph G contains several subgraphs. Each vertex is connected to other vertices in the same subgraph. Vertices in different subgraphs are not connected. The subgraphs in Undirected Graph G are called maximum connected subgraphs.

Vertex Clustering Coefficient

This component calculates the peripheral density of a vertex in Undirected Graph G. The density of a star network is 0, and the density of a fully meshed network is 1.

Edge Clustering Coefficient

This component calculates the edge density in Undirected Graph G.

Counting Triangle

This component generates all triangles in Undirected Graph G.

Finance

Data Conversion Module

This component performs normalization, discretization, indexation, or weight of evidence (WOE) conversion on data.

Scorecard Training

This component is a common modeling tool that is used in the field of credit risk assessment. This component performs binning to implement the discretization of variables and uses linear models, such as linear and logistic regression models, to train a model. The model training process includes feature selection and score transformation.

Scorecard Prediction

This component uses the model that is generated by the Scorecard Training component to predict scores.

Binning

This component is used for feature discretization. Feature discretization is a process of converting continuous data into multiple discrete intervals. This component supports equal frequency binning, equal width binning, and automated binning.

Population Stability Index

This component is used to identify a shift in two samples of a population. You can use this component to measure the stability of samples.
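
One common formulation of the index is shown below; the bin counts and the stability thresholds quoted in the comment are conventions, not part of the component's documentation.

```python
import numpy as np

def psi(expected, actual, eps=1e-6):
    """PSI = sum((a_i - e_i) * ln(a_i / e_i)) over matching bins of two samples."""
    e = np.asarray(expected, dtype=float) + eps
    a = np.asarray(actual, dtype=float) + eps
    e, a = e / e.sum(), a / a.sum()
    return float(np.sum((a - e) * np.log(a / e)))

# Binned score counts of a baseline sample and a new sample.
baseline = [200, 300, 300, 200]
current  = [150, 250, 350, 250]
print(psi(baseline, current))  # rule of thumb: < 0.1 stable, > 0.25 significant shift
```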

Visual algorithms

data to tfrecord (Phased out soon)

This component can convert labeled data to TFRecord files for training image models.

image classification (Phased out soon)

This component can be used to train a model on TFRecord files and obtain an image classification model for inference.

image classification (torch)

If your business involves image classification, you can use the image classification (torch) component to build image classification models for inference.

video classification

This component can be used to train a model on raw video data and obtain a video classification model for inference.

object detection (Phased out soon)

This component can be used to train object detection models that detect entities that have risks in images.

object detection (easycv)

This component can be used to train object detection models that detect entities that have risks in images.

image self-supervise learning

This component can be used to train a model on unlabeled images and obtain a model that extracts image features.

image metric learning (raw)

This component can be used to build metric learning models for inference.

image segmentation (Phased out soon)

This component can be used to train semantic image segmentation models for inference.

end to end ocr (Phased out soon)

This component can be used to train models that detect and recognize text at a random angle or in a random shape.

pose detection

This component can be used to build pose models for inference. This component is ideal for scenarios that involve human body detection.

image prediction (Phased out soon)

This component can be used to perform offline inference on the output model of an image training component and make predictions based on the input data.

video prediction (Phased out soon)

This component can be used to perform offline inference on the output model of a video training component.

ocr (Phased out soon)

This component can be used to perform offline inference on an optical character recognition (OCR) model based on the OCR algorithms developed by the PAI team and large-scale data in Alibaba Cloud.

model quantize

This component provides mainstream model quantization algorithms for you to compress and accelerate models. This way, high-performance inference can be implemented.

model prune

This component prunes models based on Taylor first-order pruning (TaylorFO), a mainstream pruning algorithm. You can use the component to compress models for high training and inference performance.

Audio algorithm (Phased out soon)

Create Dataset for EasyASR Models

This component converts audio data in the WAV format and text data into TFRecord files. You can then use the TFRecord files as pre-processed data to train or evaluate Automatic Speech Recognition (ASR) and speech classification models.

ASR Model Training (EasyASR)

This component uses TFRecord files as input to train a speech recognition model.

Audio Classification Model Training (EasyASR)

This component uses TFRecord files as input to train a speech classification model.

EasyASR Offline Inference (MaxCompute)

This component can use a SavedModel model to make predictions in speech recognition or classification.

EasyASR Offline Inference (DLC)

This component can use a SavedModel model to make predictions in speech recognition or classification based on the DLC compute engine.

Tools

OfflineModel components

OfflineModel is a data format used in MaxCompute. Models that are generated by traditional machine learning algorithms based on the PAICommand framework are stored in OfflineModel format in MaxCompute projects. These components can be used to obtain offline models and use the offline models to run offline prediction jobs.

Model Export

This component can be used to export a model that is trained in MaxCompute to a specified OSS path.

Custom scripts

SQL Script

This component allows you to write custom SQL statements in the SQL script editor. You can submit the statements to MaxCompute for execution.

Python script

This component allows you to install custom dependencies and run custom Python functions.

PyAlink Script

This component allows you to call all Alink algorithms, such as classification, regression, and recommendation algorithms. You can also use this component together with other algorithm components of Machine Learning Designer to create pipelines and verify their effects.

Time Window SQL

This component allows you to use the multi-date loop execution feature to execute multiple day-level SQL tasks within a certain period.

Beta components

Lasso Regression Training

This component provides a shrinkage estimation algorithm for regression.

Lasso Regression Prediction

This component supports both sparse and dense data. You can use this component to estimate values of numeric variables, such as loan limits and temperatures.

Ridge Regression Prediction

This component can be used to estimate values of numeric variables, such as housing prices, sales volumes, and temperatures.

Ridge Regression Training

This component provides the most common regularization method used to deal with ill-posed problems.