Designer component reference - Platform For AI - Alibaba Cloud Documentation Center

Recommended algorithm components

Recommended components cover general-purpose algorithms (data reading, SQL scripts, Python scripts), LLM and LVM data processing, and model training and inference. DLC-based components are preferred because they support heterogeneous resources and custom environments for greater flexibility.

Component type			Component	Description
Custom component			Custom Component	Creates a custom component in AI Asset Management for use in Designer alongside official components.
Source/Target			Read OSS Data	Reads files or folders from a specified path in an Object Storage Service (OSS) bucket.
			Read CSV file	Reads CSV files from OSS, HTTP, or HDFS.
			Read Table	Reads data from a MaxCompute table in the current project.
			Write to Table	Writes upstream data to a MaxCompute table.
Custom script			SQL Script	Executes custom SQL statements in MaxCompute.
Custom script			Python Script	Installs dependency packages and runs custom Python functions.
Tools			Register Dataset	Registers a dataset to AI Asset Management.
			Register Model	Registers a model to AI Asset Management.
			Update EAS Service (Beta)	Calls `eascmd` to update a specified Elastic Algorithm Service (EAS) service. The service must be in a `running` state. Each update creates a new service version.
LLM data processing	Data conversion		Export MaxCompute Table to OSS	Exports a MaxCompute table to OSS.
	Data conversion		Import OSS Data to MaxCompute Table	Imports data from OSS to a MaxCompute table.
	LLM data processing (DLC)		LLM-MD5 Deduplication (DLC)	Calculates the MD5 hash of text content and removes duplicate entries.
			LLM-Text Normalization (DLC)	Performs Unicode normalization on text and converts Traditional Chinese characters to Simplified Chinese.
			LLM-Special Content Removal (DLC)	Removes URLs and strips HTML formatting to extract plain text.
			LLM-Special Character Ratio Filter (DLC)	Filters samples based on the ratio of special characters to total text length.
			LLM-Copyright Information Removal (DLC)	Removes copyright information from text, such as comments in code file headers.
			LLM-Count Filter (DLC)	Filters samples based on the ratio of numeric and alphabetic characters to total text length.
			LLM-Length Filter (DLC)	Filters samples based on total text length, average line length, and maximum line length.
			LLM-Text Quality Scoring and Language Identification - FastText (DLC)	Identifies the language of a text, calculates a quality score, and filters samples based on the specified language and score range.
			LLM-Sensitive Word Filter (DLC)	Filters samples that contain words from a specified sensitive word dictionary.
			LLM-Sensitive Information Masking (DLC)	Masks sensitive information, such as email addresses, phone numbers, and ID numbers.
			LLM-Document Similarity Deduplication (DLC)	Deduplicates documents by calculating SimHash similarity scores.
			LLM-N-Gram Repetition Ratio Filter (DLC)	Filters samples based on the character-level or word-level N-gram repetition ratio.
			LLM-Expand LaTeX Macro Definition (DLC)	Expands parameter-less macros inline in TeX-formatted data by replacing macro names with their values.
			LLM-Remove LaTeX Bibliography (DLC)	Removes the bibliography section from a LaTeX document.
			LLM-Remove LaTeX Comment Lines (DLC)	Removes comment lines and inline comments from LaTeX source text.
			LLM-Remove LaTeX Document Header (DLC)	Removes all content that precedes the first section declaration in TeX-formatted data, keeping the section title and everything after it.
	LLM data processing (MaxCompute)		LLM-MD5 Deduplication (MaxCompute)	Calculates the MD5 hash of text content and removes duplicate entries.
			LLM-Text Normalization (MaxCompute)	Performs Unicode normalization and converts Traditional Chinese characters to Simplified Chinese.
			LLM-Special Content Removal (MaxCompute)	Removes content such as navigation, author information, URLs, and HTML formatting.
			LLM-Special Characters Ratio Filter (MaxCompute)	Filters samples based on the ratio of special characters to total text length.
			LLM-Copyright Information Removal (MaxCompute)	Removes copyright information from text, such as comments in code file headers.
			LLM-Count Filter (MaxCompute)	Filters samples based on the count of letters, numbers, and delimiters.
			LLM-Length Filter (MaxCompute)	Filters samples based on total text length, average line length, and maximum line length.
			LLM-Text Quality Scoring and Language Identification (MaxCompute)	Identifies the language of a text, calculates a quality score, and filters samples based on the specified language and score range.
			LLM-Sensitive Word Filter (MaxCompute)	Filters samples that contain words from a specified sensitive word dictionary.
			LLM-Sensitive Information Masking (MaxCompute)	Masks sensitive information, such as email addresses, phone numbers, and ID numbers.
			LLM-N-Gram Repetition Ratio Filter (MaxCompute)	Filters samples based on the character-level or word-level N-gram repetition ratio.
			LLM-Expand LaTeX Macro Definition (MaxCompute)	Inlines parameter-less macro definitions in TeX-formatted data.
			LLM-Remove LaTeX Bibliography (MaxCompute)	Removes the bibliography section from a LaTeX document.
			LLM-Remove LaTeX Comment Lines (MaxCompute)	Removes comment lines and inline comments from LaTeX source text.
			LLM-Remove LaTeX Document Header (MaxCompute)	Removes all content preceding the first section declaration in a LaTeX document.
	LVM data processing (DLC)	Video preprocessing operators	LVM-Text Area Filter (DLC)	Filters video data based on the quantity of text present in the frames.
			LVM-Motion Filter (DLC)	Filters video data based on a specified range of motion speed.
			LVM-Aesthetics Filter (DLC)	Filters video data that falls below a specified aesthetic quality score.
			LVM-Aspect Ratio Filter (DLC)	Filters video data based on a specified range of aspect ratios.
			LVM-Duration Filter (DLC)	Filters video data based on a specified range of durations.
			LVM-Video-Text Similarity Filter (DLC)	Filters video data based on the semantic similarity score between the video and its associated text.
			LVM-Compliance Filter (DLC)	Filters video data based on its Not Safe For Work (NSFW) score.
			LVM-Resolution Filter (DLC)	Filters video data based on a specified range of resolutions.
			LVM-Watermark Filter (DLC)	Filters video data that contains watermarks.
			LVM-Tag Filter (DLC)	Filters video data that does not match a specified set of tags.
			LVM-Tag Generation (DLC)	Generates descriptive tags for video frames.
			LVM-Frame Text Generation (DLC)	Generates descriptive text for video frames.
			LVM-Video Text Generation (DLC)	Generates descriptive text for entire videos.
		Image preprocessing operators	LVM-Image Aesthetics Filter (DLC)	Filters image data that falls below a specified aesthetic quality score.
			LVM-Image Aspect Ratio Filter (DLC)	Filters image data based on a specified range of aspect ratios.
			LVM-Image Face Ratio Filter (DLC)	Filters image data based on the ratio of face area to total image area.
			LVM-Image Compliance Filter (DLC)	Filters image data based on its Not Safe For Work (NSFW) score.
			LVM-Image Resolution Filter (DLC)	Filters image data based on a specified range of resolutions.
			LVM-Image Size Filter (DLC)	Filters image data based on a specified range of file sizes.
			LVM-Image-Text Matching Filter (DLC)	Filters image-text pairs based on their matching score.
			LVM-Image-Text Similarity Filter (DLC)	Filters image-text pairs based on their semantic similarity score.
			LVM-Image Watermark Filter (DLC)	Filters image data that contains watermarks.
			LVM-Image Captioning (DLC)	Generates natural language descriptions for images.
LLM training and inference			BERT Model Offline Inference	Performs offline inference using a pre-trained BERT classification model to classify text in an input table.
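
As a rough illustration of what two of the LLM data processing rows above describe (MD5 deduplication and the special character ratio filter), the following Python sketch applies both steps to a small in-memory list. It is not the implementation used by the DLC or MaxCompute components. The function names, the definition of a "special character" (any character that is neither alphanumeric nor whitespace), and the `max_ratio` threshold are all assumptions made for this example.

```python
import hashlib


def md5_dedup(texts):
    """Keep the first occurrence of each text, comparing by MD5 digest."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def special_char_ratio(text):
    """Share of characters that are neither alphanumeric nor whitespace.

    Assumption: this definition of "special character" is illustrative only.
    """
    if not text:
        return 0.0
    special = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return special / len(text)


def filter_by_special_ratio(texts, max_ratio=0.25):
    """Drop samples whose ratio exceeds max_ratio (hypothetical threshold)."""
    return [t for t in texts if special_char_ratio(t) <= max_ratio]


samples = ["hello world", "hello world", "@@@###!!!", "PAI Designer"]
cleaned = filter_by_special_ratio(md5_dedup(samples))
print(cleaned)  # ['hello world', 'PAI Designer']
```

Running deduplication before filtering means each distinct sample is scored only once, which matters when the filter is the more expensive step.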

Traditional algorithm components

Important

These legacy components are no longer actively maintained. Stability and SLAs are not guaranteed. Replace them with recommended algorithm components in production environments.

Component type	Component	Description
Data preprocessing	Random sampling	Performs random, independent sampling on input data based on a specified ratio or count.
	Weighted sampling	Generates a sample from the input data using a weighted selection method.
	Filter and Map	Filters data rows based on a SQL expression and renames output columns.
	Stratified sampling	Divides data into groups based on a specified column and performs random sampling within each group.
	JOIN	Combines two tables based on a join key, similar to a SQL `JOIN` statement.
	Merge Columns	Merges columns from two tables. Both tables must have the same number of rows.
	Merge rows (UNION)	Appends rows from two tables. Both tables must have the same number and type of selected columns.
	Type Transformation	Converts data types of specified columns to String, Double, or Integer. Fills missing values on conversion failure.
	Add ID Column	Adds a sequential numeric ID column as the first column of the table.
	Split	Randomly splits a dataset into two subsets, typically for creating training and testing sets.
	Fill Missing Values	Fills missing values in specified columns using a selected method, such as mean, median, mode, or a custom value.
	Normalization	Rescales numerical features to a common range, such as [0, 1]. Supports both dense and sparse data formats.
	Standardization	Rescales features to have a mean of 0 and a standard deviation of 1 (z-score normalization).
	KV to Table	Converts a table from a sparse Key-Value (KV) format to a dense table format.
	Table to KV	Converts a dense table to a sparse Key-Value (KV) format.
Feature engineering	Feature importance filtering	Filters for the top N features based on importance scores generated by other components.
	Principal Component Analysis	Reduces dataset dimensionality by transforming features into linearly uncorrelated principal components.
	Feature scaling	Applies min-max, log, or z-score scaling transformations to numerical features.
	Feature discretization	Converts continuous numerical features into discrete categorical features (bins).
	Feature Anomaly Smoothing	Clamps anomalous feature values to a specified range. Supports both sparse and dense data formats.
	Singular Value Decomposition	Performs Singular Value Decomposition (SVD) on a matrix.
	Anomaly Detection	Detects outliers in data containing both continuous and categorical features.
	Linear model feature importance	Calculates feature importance scores using a linear regression or logistic regression model.
	Discrete feature analysis	Analyzes the statistical distribution of discrete features.
	Random forest feature importance	Calculates feature importance scores using a trained Random Forest model.
	Filter-based Feature Selection	Selects a subset of features using filter methods such as Chi-squared, Gini index, or Information Gain.
	Feature Encoding	Encodes non-linear features into linear features using a Gradient Boosting Decision Tree (GBDT) model.
	One-Hot Encoding	Converts categorical features into a binary vector representation. The output is in a sparse Key-Value (KV) format.
Statistical analysis	Data View	Provides a visual summary of data distribution and statistics for selected columns.
	Covariance	Calculates the covariance between two random variables to measure how they change together.
	Empirical Probability Density Plot	Generates a probability density plot using either empirical distribution or kernel density estimation.
	Full Table Statistics	Calculates descriptive statistics for all columns or a subset of columns in a table.
	Chi-Square Goodness-of-Fit Test	Tests whether observed frequencies match theoretical frequencies across categories of a multinomial variable. The null hypothesis assumes no difference between observed and expected values.
	Box plot	A statistical graph that displays dataset dispersion and distribution characteristics. Also used to compare distributions across multiple datasets.
	Scatter chart	Plots data points on a Cartesian coordinate system for regression analysis.
	Correlation Matrix	Calculates pairwise correlation coefficients between the columns of a matrix. Each coefficient lies in the range [-1, 1]. For each column pair, the element count is the number of rows in which both columns are non-empty.
	Two-Sample T-Test	Tests whether two sample means differ significantly.
	One-Sample T-Test	Tests whether there is a significant difference between the population mean of a variable and a specified value. The tested sample must follow a normal distribution.
	Normality test	Tests whether a population follows a normal distribution using a goodness-of-fit hypothesis test.
	Lorenz curve	Visually displays the income distribution of a country or region.
	Percentile	Calculates percentiles for data in a table column.
	Pearson coefficient	A linear correlation coefficient that reflects the degree of linear correlation between two variables.
	Histogram	Displays data distribution as a bar or line chart with varying heights.
Machine learning	Prediction	Takes a trained model and prediction data as inputs, and outputs prediction results.
	XGBoost Training	An extended boosting algorithm that supports classification and regression. Widely used in ML systems and competitions for its ease of use and robustness.
	XGBoost Prediction	Performs prediction using a trained XGBoost model. Supports classification and regression.
	Linear Support Vector Machine	A statistical learning method that improves generalization by minimizing structural risk.
	Logistic regression for binary classification	A binary classification algorithm that supports both sparse and dense data formats.
	GBDT for Binary Classification	Classifies samples as positive or negative based on a feature threshold.
	PS-SMART for Binary Classification	SMART is an iterative GBDT-based algorithm implemented on Parameter Server (PS) for large-scale offline and online training.
	PS-based Logistic Regression for Binary Classification	A classic binary classification algorithm widely used in advertising and search scenarios.
	PS-SMART for Multiclass Classification	SMART is an iterative GBDT-based algorithm implemented on Parameter Server (PS) for large-scale multiclass classification.
	K-Nearest Neighbors	Classifies each row by selecting the K nearest records from the training data and assigning the most frequent class.
	Logistic regression for multiclass classification	Logistic regression is classically a binary classification algorithm. The PAI implementation also supports multiclass classification and accepts both sparse and dense data.
	Random Forests	A classifier that includes multiple decision trees. Its classification result is determined by the mode of the classes output by individual trees.
	Naive Bayes	A probabilistic classification algorithm based on Bayes' theorem with an independence assumption.
	K-Means Clustering	Randomly selects K initial cluster centers, assigns remaining objects to the nearest cluster, and iteratively recalculates centers.
	DBSCAN	Builds clustering models using the DBSCAN algorithm.
	Gaussian Mixture Model training	Builds clustering models using Gaussian Mixture Model (GMM) training.
	DBSCAN prediction	Predicts cluster assignments for new data points based on a trained DBSCAN model.
	GMM prediction	Performs clustering prediction based on a trained Gaussian mixture model.
	GBDT regression	An iterative decision tree algorithm suitable for linear and non-linear regression scenarios.
	Linear regression	A model that analyzes the linear relationship between a dependent variable and multiple independent variables.
	PS-SMART regression	SMART is an iterative GBDT-based algorithm implemented on Parameter Server (PS). This component handles large-scale offline and online regression training.
	PS linear regression	Performs linear regression on PS for large-scale training.
	Binary classification evaluation	Calculates metrics such as AUC, KS, and F1-score, and outputs KS curves, PR curves, ROC curves, LIFT charts, and Gain charts.
	Regression Model Evaluation	Evaluates the quality of regression algorithm models based on prediction results and raw results, and outputs evaluation metrics and a residual histogram.
	Clustering model evaluation	Evaluates the quality of clustering models based on raw data and clustering results, and outputs evaluation metrics.
	Confusion Matrix	A table for visualizing the results of supervised classification. Its counterpart in unsupervised learning is the matching matrix.
	Multiclass classification evaluation	Evaluates the quality of multiclass classification algorithm models based on the prediction results and raw results of classification models, and outputs evaluation metrics such as Accuracy, Kappa, and F1-Score.
Deep learning	Deep learning frameworks and activation instructions	PAI supports deep learning frameworks. Use these frameworks and hardware resources to run deep learning algorithms.
Time series	x13_arima	A seasonal adjustment algorithm based on X-13ARIMA-SEATS.
	x13_auto_arima	Automatic ARIMA model selection based on the Gomez and Maravall (1998) procedure as implemented in TRAMO.
	Prophet	Performs Prophet time series prediction on each row of MTable data and provides prediction results for the next time period.
	MTable assembler	Aggregates a table into an MTable based on grouping columns.
	MTable expander	Expands an MTable into a table.
Recommendation methods	FM Algorithm	A non-linear model that considers feature interactions. Suitable for recommendation in e-commerce, advertising, and live streaming.
	ALS Matrix Factorization	Performs matrix factorization on a sparse matrix using ALS to estimate missing values and produce a training model.
	Swing Training	Measures item similarity based on the User-Item-User principle for item recall.
	Swing prediction	Performs offline batch prediction based on a trained Swing model and prediction data.
	Collaborative filtering (etrec)	etrec is an item-based collaborative filtering algorithm. The input consists of two columns, and the output is the top N most similar items.
	Vector retrieval evaluation	Evaluates recall quality by calculating hit rates. Higher hit rates indicate more accurate recall from trained vectors.
Anomaly detection	Local Outlier Factor Anomaly Detection	Determines whether a sample is an anomaly based on its Local Outlier Factor (LOF) value.
	IForest Anomaly Detection	An anomaly detection algorithm that uses sub-sampling to reduce computational complexity.
	One-Class SVM Anomaly Detection	An unsupervised anomaly detection algorithm. Unlike a traditional (supervised) SVM, it learns a boundary around the training data and treats samples outside that boundary as anomalies.
Natural Language Processing	Text Summarization Prediction	Extracts or summarizes key information from text. Calls a pre-trained model to generate news headlines from news text.
	Machine Reading Comprehension Prediction	Performs offline prediction with the generated machine reading comprehension training model.
	Text Summarization Training	Trains a text summarization model that generates news headlines from articles.
	Machine Reading Comprehension Training	Trains a machine reading comprehension model that can quickly understand and answer questions based on a given document.
	Split Word	Based on the AliWS (Alibaba Word Segmenter) lexical analysis system, this component performs tokenization on the content of a specified column. The resulting tokens are separated by spaces.
	Trituple to KV	Converts a triple table (row, col, value) to a key-value (KV) table (row, [col_id:value]).
	String similarity	Calculates the similarity between strings, a basic operation used mainly in information retrieval, natural language processing, and bioinformatics.
	String Similarity-Top N	Calculates string similarity and filters out the top N most similar data.
	Stop Word Filter	A pre-processing method in text analysis used to filter noise (such as "the", "is", or "a") from tokenization results.
	ngram-count	A step in language model training. It generates n-grams based on words and counts the occurrences of each n-gram across the entire corpus.
	Text summarization	Automatically extracts a short, coherent summary from the original document that comprehensively and accurately reflects its central idea.
	Keyword extraction	Extracts the words in a text that are most relevant to the meaning of the document.
	Sentence splitting	Splits a piece of text into sentences based on punctuation. This component is mainly used for pre-processing before text summarization, converting a paragraph into a one-sentence-per-line format.
	Semantic Vector Distance	Finds extension words or sentences for a given input by retrieving the nearest vectors among semantic embeddings (for example, Word2Vec), and returns the most similar items.
	Doc2Vec	Maps documents to vectors using the Doc2Vec algorithm. Input: vocabulary. Output: document vectors, word vectors, or vocabulary.
	Conditional random field	Trains a conditional random field (CRF), a model of the conditional probability distribution of a set of output random variables given a set of input random variables. It assumes that the output random variables form a Markov random field.
	Document similarity	Builds on string similarity to calculate the similarity between pairs of documents or sentences based on words.
	PMI	This algorithm counts the co-occurrence of all words in several documents and calculates the pointwise mutual information (PMI) between each pair.
	Conditional random field prediction	An algorithm component based on the linearCRF online prediction model, mainly used for sequence labeling problems.
	Split Word (Generate Model)	Based on the AliWS (Alibaba Word Segmenter) lexical analysis system, this component generates a tokenization model based on parameters and a custom dictionary.
	Word Count	Counts the total number of words and the frequency of each word in input strings, which can be entered manually or read from a file.
	TF-IDF	A common weighting technique for information retrieval and text mining. It is often used in search engines as a measure or rating of the relevance between a document and a user query.
	PLDA	Sets the topic parameter to extract different topics from each document.
	Word2Vec	Maps words to K-dimensional vectors using a neural network. Supports vector arithmetic corresponding to word semantics. Input: word column or vocabulary. Output: word vectors and vocabulary.
Network analysis	Tree depth	Outputs the depth and tree ID of each node.
	k-Core	Finds closely connected subgraph structures in a graph that meet a specified coreness. The largest core number among all nodes is called the core number of the graph.
	Single-Source Shortest Path	Uses the Dijkstra algorithm. Given a starting point, it outputs the shortest path from that point to all other nodes.
	PageRank	Originated from web search ranking. It uses the link structure of web pages to calculate the rank of each page.
	Label propagation clustering	A graph-based semi-supervised method where a node's label depends on its neighbors' labels, weighted by node similarity, and stabilized through iterative propagation.
	Label propagation classification	A semi-supervised classification algorithm that uses the label information of labeled nodes to predict the labels of unlabeled nodes.
	Modularity	A metric for evaluating community network structures. It assesses the tightness of communities within a network structure. A value above 0.3 usually indicates a clear community structure.
	Maximal Connected Subgraph	Finds all maximal connected subgraphs (connected components) of an undirected graph. Within each subgraph, every vertex can be reached from every other vertex, and no vertex is connected to a vertex in another subgraph.
	Vertex clustering coefficient	In an undirected graph G, this component calculates the density around each node. The density of a star network is 0, and the density of a fully connected network is 1.
	Edge clustering coefficient	In an undirected graph G, this algorithm calculates the density around each edge.
	Count Triangles	In an undirected graph G, this component outputs all triangles.
Finance	Data Transformation Module	Performs normalization, discretization, indexing, or Weight of Evidence (WOE) transformation on data.
	Scorecard training	A credit risk modeling tool that discretizes variables by binning, then applies logistic or linear regression. Includes feature selection and score transformation.
	Scorecard Prediction	Scores raw data based on the model results produced by the Scorecard Training component.
	Binning	Performs feature discretization by segmenting continuous data into multiple discrete intervals. The Binning component supports equal frequency binning, equal width binning, and automatic binning.
	Population Stability Index (PSI)	An important indicator for measuring the shift caused by sample changes. It is commonly used to measure the stability of samples.
Visual algorithms	Image Classification Training (torch)	Builds an image classification model for inference.
	Video Classification Training	Trains a video classification model for inference.
	Image Detection Training (easycv)	Builds an object detection model to detect and frame high-risk entities in images.
	Image Self-Supervised Training	Trains directly on raw, unlabeled images to obtain a model for image feature extraction.
	Image Metric Learning Training (raw)	Builds a metric learning model for model inference.
	Image Keypoint Training	Builds a keypoint model for inference in scenarios that involve human keypoint detection.
	Model Quantization	Compresses and accelerates models using mainstream quantization algorithms for high-performance inference.
	Model Pruning	Compresses and accelerates models using the AGP (taylorfo) pruning algorithm for high-performance inference.
Tools	Offline Model (OfflineModel) related components	A MaxCompute data structure for storing models generated by traditional ML algorithms. Use Offline Model components to retrieve models for offline prediction.
Tools	General-Purpose Model Export	Exports a MaxCompute-trained model to a specified OSS path.
Custom scripts	PyAlink Script	Calls Alink algorithms for classification, regression, and recommendation. Integrates with other Designer components to build and validate workflows.
Custom scripts	Time Window SQL Script	Adds a multi-date loop execution feature to the standard SQL Script component. It is used for the parallel execution of daily SQL tasks within a specific time period.
Beta components	Lasso Regression Training	Trains a Lasso (Least Absolute Shrinkage and Selection Operator) regression model, a shrinkage estimation method.
	Lasso regression prediction	Predicts numeric variables, such as loan amounts and temperatures. Supports both sparse and dense data formats.
	Ridge regression prediction	Predicts numeric variables, including housing prices, sales volumes, and humidity.
	Ridge regression training	The most commonly used regularization method for regression analysis of ill-posed problems.
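
The Normalization and Standardization rows under Data preprocessing describe two different rescalings that are easy to confuse. The Python sketch below shows the arithmetic behind each on a single numeric column. It is an illustration only, not the code behind the PAI components. The function names are hypothetical, and the use of the population standard deviation (dividing by n rather than n - 1) is an assumption of this example.

```python
import math


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-score).

    Assumption: uses the population standard deviation (divides by n).
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


feature = [2.0, 4.0, 6.0, 8.0]
normalized = min_max_normalize(feature)
standardized = standardize(feature)
print(normalized)
print(standardized)
```

Min-max normalization bounds the output to [0, 1] but is sensitive to extreme values, whereas standardization leaves the output unbounded and centers it on the mean.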