Recommended algorithm components
Recommended components cover general-purpose algorithms (data reading, SQL scripts, Python scripts), LLM and LVM data processing, and model training and inference. DLC-based components are preferred because they support heterogeneous resources and custom environments for greater flexibility.
|
Component type |
Component |
Description |
||
|
Custom component |
Creates a custom component in AI Asset Management for use in Designer alongside official components. |
|||
|
Source/Target |
Reads files or folders from a specified path in an Object Storage Service (OSS) bucket. |
|||
|
Reads CSV files from OSS, HTTP, or HDFS. |
||||
|
Reads data from a MaxCompute table in the current project. |
||||
|
Writes upstream data to a MaxCompute table. |
||||
|
Custom script |
Executes custom SQL statements in MaxCompute. |
|||
|
Installs dependency packages and runs custom Python functions. |
||||
|
Tools |
Register Dataset |
Registers a dataset to AI Asset Management. |
||
|
Register Model |
Registers a model to AI Asset Management. |
|||
|
Update EAS Service (Beta) |
Calls |
|||
|
LLM data processing |
Data conversion |
Exports a MaxCompute table to OSS. |
||
|
Imports data from OSS to a MaxCompute table. |
||||
|
LLM data processing (DLC) |
Calculates the MD5 hash of text content and removes duplicate entries. |
|||
|
Performs Unicode normalization on text and converts Traditional Chinese characters to Simplified Chinese. |
||||
|
Removes URLs and strips HTML formatting to extract plain text. |
||||
|
Filters samples based on the ratio of special characters to total text length. |
||||
|
Removes copyright information from text, such as comments in code file headers. |
||||
|
Filters samples based on the ratio of numeric and alphabetic characters to total text length. |
||||
|
Filters samples based on total text length, average line length, and maximum line length. |
||||
|
LLM-Text Quality Scoring and Language Identification - FastText (DLC) |
Identifies the language of a text, calculates a quality score, and filters samples based on the specified language and score range. |
|||
|
Filters samples that contain words from a specified sensitive word dictionary. |
||||
|
Masks sensitive information, such as email addresses, phone numbers, and ID numbers. |
||||
|
Deduplicates documents by calculating SimHash similarity scores. |
||||
|
Filters samples based on the character-level or word-level N-gram repetition ratio. |
||||
|
Expands parameter-less macros inline in TeX-formatted data by replacing macro names with their values. |
||||
|
Removes the bibliography section from a LaTeX document. |
||||
|
Removes comment lines and inline comments from LaTeX source text. |
||||
|
Removes all content preceding the first section declaration in TeX-formatted data, retaining the chapter title and subsequent content. |
||||
|
LLM data processing (MaxCompute) |
Calculates the MD5 hash of text content and removes duplicate entries. |
|||
|
Performs Unicode normalization and converts Traditional Chinese characters to Simplified Chinese. |
||||
|
Removes content such as navigation, author information, URLs, and HTML formatting. |
||||
|
Filters samples based on the ratio of special characters to total text length. |
||||
|
Removes copyright information from text, such as comments in code file headers. |
||||
|
Filters samples based on the count of letters, numbers, and delimiters. |
||||
|
Filters samples based on total text length, average line length, and maximum line length. |
||||
|
LLM-Text Quality Scoring and Language Identification (MaxCompute) |
Identifies the language of a text, calculates a quality score, and filters samples based on the specified language and score range. |
|||
|
Filters samples that contain words from a specified sensitive word dictionary. |
||||
|
Masks sensitive information, such as email addresses, phone numbers, and ID numbers. |
||||
|
Filters samples based on the character-level or word-level N-gram repetition ratio. |
||||
|
Inlines parameter-less macro definitions in TeX-formatted data. |
||||
|
Removes the bibliography section from a LaTeX document. |
||||
|
Removes comment lines and inline comments from LaTeX source text. |
||||
|
Removes all content preceding the first section declaration in a LaTeX document. |
||||
|
LVM data processing (DLC) |
Video preprocessing operators |
Filters video data based on the quantity of text present in the frames. |
||
|
Filters video data based on a specified range of motion speed. |
||||
|
Filters video data that falls below a specified aesthetic quality score. |
||||
|
Filters video data based on a specified range of aspect ratios. |
||||
|
Filters video data based on a specified range of durations. |
||||
|
Filters video data based on the semantic similarity score between the video and its associated text. |
||||
|
Filters video data based on its Not Safe For Work (NSFW) score. |
||||
|
Filters video data based on a specified range of resolutions. |
||||
|
Filters video data that contains watermarks. |
||||
|
Filters video data that does not match a specified set of tags. |
||||
|
Generates descriptive tags for video frames. |
||||
|
Generates descriptive text for video frames. |
||||
|
Generates descriptive text for entire videos. |
||||
|
Image preprocessing operators |
Filters image data that falls below a specified aesthetic quality score. |
|||
|
Filters image data based on a specified range of aspect ratios. |
||||
|
Filters image data based on the ratio of face area to total image area. |
||||
|
Filters image data based on its Not Safe For Work (NSFW) score. |
||||
|
Filters image data based on a specified range of resolutions. |
||||
|
Filters image data based on a specified range of file sizes. |
||||
|
Filters image-text pairs based on their matching score. |
||||
|
Filters image-text pairs based on their semantic similarity score. |
||||
|
Filters image data that contains watermarks. |
||||
|
Generates natural language descriptions for images. |
||||
|
LLM training and inference |
Performs offline inference using a pre-trained BERT classification model to classify text in an input table. |
|||
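Two of the LLM data processing rows above (MD5-based deduplication and special-character ratio filtering) can be illustrated with a minimal Python sketch. This is not the components' implementation: the function names, the definition of a "special" character (anything that is neither alphanumeric nor whitespace), and the 0.25 threshold are assumptions chosen for illustration only.

```python
import hashlib


def md5_dedup(texts):
    """Keep the first occurrence of each text, keyed by the MD5 of its UTF-8 bytes."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def special_char_ratio(text):
    """Ratio of special characters to total text length.

    Assumption: a character is 'special' if it is neither alphanumeric
    nor whitespace. The real component may define this differently.
    """
    if not text:
        return 0.0
    special = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return special / len(text)


def filter_by_special_char_ratio(texts, max_ratio=0.25):
    """Drop samples whose special-character ratio exceeds max_ratio (hypothetical default)."""
    return [t for t in texts if special_char_ratio(t) <= max_ratio]


samples = ["hello world", "hello world", "@@@###!!!", "PAI Designer"]
cleaned = filter_by_special_char_ratio(md5_dedup(samples))
print(cleaned)  # ['hello world', 'PAI Designer']
```

Exact-hash deduplication only catches byte-identical text, which is why the table also lists a separate SimHash-based component for near-duplicate documents.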
Traditional algorithm components
These legacy components are no longer actively maintained. Stability and SLAs are not guaranteed. Replace them with recommended algorithm components in production environments.
|
Component type |
Component |
Description |
|
Data preprocessing |
Performs random, independent sampling on input data based on a specified ratio or count. |
|
|
Generates a sample from the input data using a weighted selection method. |
||
|
Filters data rows based on a SQL expression and renames output columns. |
||
|
Divides data into groups based on a specified column and performs random sampling within each group. |
||
|
Combines two tables based on a join key, similar to a SQL |
||
|
Merges columns from two tables. Both tables must have the same number of rows. |
||
|
Appends rows from two tables. Both tables must have the same number and type of selected columns. |
||
|
Converts data types of specified columns to String, Double, or Integer. Fills missing values on conversion failure. |
||
|
Adds a sequential numeric ID column as the first column of the table. |
||
|
Randomly splits a dataset into two subsets, typically for creating training and testing sets. |
||
|
Fills missing values in specified columns using a selected method, such as mean, median, mode, or a custom value. |
||
|
Rescales numerical features to a common range, such as [0, 1]. Supports both dense and sparse data formats. |
||
|
Rescales features to have a mean of 0 and a standard deviation of 1 (z-score normalization). |
||
|
Converts a table from a sparse Key-Value (KV) format to a dense table format. |
||
|
Converts a dense table to a sparse Key-Value (KV) format. |
||
|
Feature engineering |
Filters for the top N features based on importance scores generated by other components. |
|
|
Reduces dataset dimensionality by transforming features into linearly uncorrelated principal components. |
||
|
Applies min-max, log, or z-score scaling transformations to numerical features. |
||
|
Converts continuous numerical features into discrete categorical features (bins). |
||
|
Clamps anomalous feature values to a specified range. Supports both sparse and dense data formats. |
||
|
Performs Singular Value Decomposition (SVD) on a matrix. |
||
|
Detects outliers in data containing both continuous and categorical features. |
||
|
Calculates feature importance scores using a linear regression or logistic regression model. |
||
|
Analyzes the statistical distribution of discrete features. |
||
|
Calculates feature importance scores using a trained Random Forest model. |
||
|
Selects a subset of features using filter methods such as Chi-squared, Gini index, or Information Gain. |
||
|
Encodes non-linear features into linear features using a Gradient Boosting Decision Tree (GBDT) model. |
||
|
Converts categorical features into a binary vector representation. The output is in a sparse Key-Value (KV) format. |
||
|
Statistical analysis |
Provides a visual summary of data distribution and statistics for selected columns. |
|
|
Calculates the covariance between two random variables to measure how they change together. |
||
|
Generates a probability density plot using either empirical distribution or kernel density estimation. |
||
|
Calculates descriptive statistics for all columns or a subset of columns in a table. |
||
|
Tests whether observed frequencies match theoretical frequencies across categories of a multinomial variable. The null hypothesis assumes no difference between observed and expected values. |
||
|
A statistical graph that displays dataset dispersion and distribution characteristics. Also used to compare distributions across multiple datasets. |
||
|
Plots data points on a Cartesian coordinate system for regression analysis. |
||
|
Calculates pairwise correlation coefficients between columns in a matrix. Values fall in the range [-1, 1]. Counts are based on non-empty elements shared between each column pair. |
||
|
Tests whether two sample means differ significantly. |
||
|
Tests whether there is a significant difference between the population mean of a variable and a specified value. The tested sample must follow a normal distribution. |
||
|
Tests whether a population follows a normal distribution using a goodness-of-fit hypothesis test. |
||
|
Visually displays the income distribution of a country or region. |
||
|
Calculates percentiles for data in a table column. |
||
|
A linear correlation coefficient that reflects the degree of linear correlation between two variables. |
||
|
Displays data distribution as a bar or line chart with varying heights. |
||
|
Machine learning |
Takes a trained model and prediction data as inputs, and outputs prediction results. |
|
|
An extended boosting algorithm that supports classification and regression. Widely used in ML systems and competitions for its ease of use and robustness. |
||
|
Performs prediction using a trained XGBoost model. Supports classification and regression. |
||
|
A statistical learning method that improves generalization by minimizing structural risk. |
||
|
A binary classification algorithm that supports both sparse and dense data formats. |
||
|
Classifies samples as positive or negative based on a feature threshold. |
||
|
SMART is an iterative GBDT-based algorithm implemented on Parameter Server (PS) for large-scale offline and online training. |
||
|
A classic binary classification algorithm widely used in advertising and search scenarios. |
||
|
SMART is an iterative GBDT-based algorithm implemented on Parameter Server (PS) for large-scale multiclass classification. |
||
|
Classifies each row by selecting the K nearest records from the training data and assigning the most frequent class. |
||
|
A binary classification algorithm. PAI logistic regression supports multiclass classification, sparse data, and dense data. |
||
|
A classifier that includes multiple decision trees. Its classification result is determined by the mode of the classes output by individual trees. |
||
|
A probabilistic classification algorithm based on Bayes' theorem with an independence assumption. |
||
|
Randomly selects K initial cluster centers, assigns remaining objects to the nearest cluster, and iteratively recalculates centers. |
||
|
Builds clustering models using the DBSCAN algorithm. |
||
|
Builds clustering models using Gaussian Mixture Model (GMM) training. |
||
|
Predicts cluster assignments for new data points based on a trained DBSCAN model. |
||
|
Performs clustering prediction based on a trained Gaussian mixture model. |
||
|
An iterative decision tree algorithm suitable for linear and non-linear regression scenarios. |
||
|
A model that analyzes the linear relationship between a dependent variable and multiple independent variables. |
||
|
This component is designed to handle large-scale offline and online training tasks. SMART is an iterative algorithm that is based on GBDT and implemented on PS. |
||
|
Performs linear regression on PS for large-scale training. |
||
|
Calculates metrics such as AUC, KS, and F1-score, and outputs KS curves, PR curves, ROC curves, LIFT charts, and Gain charts. |
||
|
Evaluates the quality of regression algorithm models based on prediction results and raw results, and outputs evaluation metrics and a residual histogram. |
||
|
Evaluates the quality of clustering models based on raw data and clustering results, and outputs evaluation metrics. |
||
|
Applies to supervised learning. Its counterpart in unsupervised learning is the matching matrix. |
||
|
Evaluates the quality of multiclass classification algorithm models based on the prediction results and raw results of classification models, and outputs evaluation metrics such as Accuracy, Kappa, and F1-Score. |
||
|
Deep learning |
PAI supports deep learning frameworks. Use these frameworks and hardware resources to run deep learning algorithms. |
|
|
Time series |
A seasonal adjustment algorithm based on X-13ARIMA-SEATS. |
|
|
Automatically selects an ARIMA model based on the Gomez and Maravall (1998) program implemented in TRAMO. |
||
|
Performs Prophet time series prediction on each row of MTable data and provides prediction results for the next time period. |
||
|
Aggregates a table into an MTable based on grouping columns. |
||
|
Expands an MTable into a table. |
||
|
Recommendation methods |
A non-linear model that considers feature interactions. Suitable for recommendation in e-commerce, advertising, and live streaming. |
|
|
Performs matrix factorization on a sparse matrix using ALS to estimate missing values and produce a training model. |
||
|
Measures item similarity based on the User-Item-User principle for item recall. |
||
|
A batch processing prediction component for Swing. Use this component to perform offline prediction based on a Swing training model and prediction data. |
||
|
etrec is an item-based collaborative filtering algorithm. The input consists of two columns, and the output is the top N most similar items. |
||
|
Evaluates recall quality by calculating hit rates. Higher hit rates indicate more accurate recall from trained vectors. |
||
|
Anomaly detection |
Determines whether a sample is an anomaly based on its Local Outlier Factor (LOF) value. |
|
|
An anomaly detection algorithm that uses sub-sampling to reduce computational complexity. |
||
|
An unsupervised anomaly detection algorithm that learns a boundary to identify anomalies, unlike traditional SVM. |
||
|
Natural Language Processing |
Extracts or summarizes key information from text. Calls a pre-trained model to generate news headlines from news text. |
|
|
Performs offline prediction with the generated machine reading comprehension training model. |
||
|
Trains a text summarization model that generates news headlines from articles. |
||
|
Trains a machine reading comprehension model that can quickly understand and answer questions based on a given document. |
||
|
Based on the AliWS (Alibaba Word Segmenter) lexical analysis system, this component performs tokenization on the content of a specified column. The resulting tokens are separated by spaces. |
||
|
Converts a triple table (row, col, value) to a key-value (KV) table (row, [col_id:value]). |
||
|
A basic operation in machine learning, mainly used in information retrieval, natural language processing, and bioinformatics. |
||
|
Calculates string similarity and filters out the top N most similar data. |
||
|
A pre-processing method in text analysis used to filter noise (such as "the", "is", or "a") from tokenization results. |
||
|
A step in language model training. It generates n-grams based on words and counts the occurrences of each n-gram across the entire corpus. |
||
|
Automatically extracts summary content from the original document. A summary is a short, coherent text that comprehensively and accurately reflects the central idea of the document. |
||
|
An important technique in natural language processing. It extracts words from a text that are highly relevant to the meaning of the document. |
||
|
Splits a piece of text into sentences based on punctuation. This component is mainly used for pre-processing before text summarization, converting a paragraph into a one-sentence-per-line format. |
||
|
Calculates extension words or sentences by finding the nearest vectors from semantic embeddings (e.g., Word2Vec). Returns the most similar items for a given input. |
||
|
Maps documents to vectors using the Doc2Vec algorithm. Input: vocabulary. Output: document vectors, word vectors, or vocabulary. |
||
|
A conditional random field (CRF) is a probabilistic distribution model of a set of output random variables given a set of input random variables. Its characteristic is the assumption that the output random variables form a Markov random field. |
||
|
Builds on string similarity to calculate the similarity between pairs of documents or sentences based on words. |
||
|
This algorithm counts the co-occurrence of all words in several documents and calculates the pointwise mutual information (PMI) between each pair. |
||
|
An algorithm component based on the linearCRF online prediction model, mainly used for sequence labeling problems. |
||
|
Based on the AliWS (Alibaba Word Segmenter) lexical analysis system, this component generates a tokenization model based on parameters and a custom dictionary. |
||
|
Counts the total number of words and the frequency of each word in input strings, which are entered manually or read from a file. |
||
|
A common weighting technique for information retrieval and text mining. It is often used in search engines as a measure or rating of the relevance between a document and a user query. |
||
|
Sets the topic parameter to extract different topics from each document. |
||
|
Maps words to K-dimensional vectors using a neural network. Supports vector arithmetic corresponding to word semantics. Input: word column or vocabulary. Output: word vectors and vocabulary. |
||
|
Network analysis |
Outputs the depth and tree ID of each node. |
|
|
Finds closely connected subgraph structures in a graph that meet a specified coreness. The maximum core number of a node is called the core number of the graph. |
||
|
Uses the Dijkstra algorithm. Given a starting point, it outputs the shortest path from that point to all other nodes. |
||
|
Originated from web search ranking. It uses the link structure of web pages to calculate the rank of each page. |
||
|
A graph-based semi-supervised method where a node's label depends on its neighbors' labels, weighted by node similarity, and stabilized through iterative propagation. |
||
|
A semi-supervised classification algorithm that uses the label information of labeled nodes to predict the labels of unlabeled nodes. |
||
|
A metric for evaluating community network structures. It assesses the tightness of communities within a network structure. A value above 0.3 usually indicates a clear community structure. |
||
|
Finds all maximal connected subgraphs in an undirected graph, that is, subsets where all vertices are connected to each other but not to vertices in other subsets. |
||
|
In an undirected graph G, this component calculates the density around each node. The density of a star network is 0, and the density of a fully connected network is 1. |
||
|
In an undirected graph G, this algorithm calculates the density around each edge. |
||
|
In an undirected graph G, this component outputs all triangles. |
||
|
Finance |
Use this component to perform normalization, discretization, indexing, or Weight of Evidence (WOE) transformation on data. |
|
|
A credit risk modeling tool that discretizes variables by binning, then applies logistic or linear regression. Includes feature selection and score transformation. |
||
|
Scores raw data based on the model results produced by the Scorecard Training component. |
||
|
Performs feature discretization by segmenting continuous data into multiple discrete intervals. The Binning component supports equal frequency binning, equal width binning, and automatic binning. |
||
|
An important indicator for measuring the shift caused by sample changes. It is commonly used to measure the stability of samples. |
||
|
Visual algorithms |
Builds an image classification model for inference. |
|
|
Trains a video classification model for inference. |
||
|
Builds an object detection model to detect and frame high-risk entities in images. |
||
|
Trains directly on raw, unlabeled images to obtain a model for image feature extraction. |
||
|
Builds a metric learning model for model inference. |
||
|
If your business scenario involves human-related keypoint detection, use the Image Keypoint Training component to build a keypoint model for model inference. |
||
|
Compresses and accelerates models using mainstream quantization algorithms for high-performance inference. |
||
|
Compresses and accelerates models using the AGP (taylorfo) pruning algorithm for high-performance inference. |
||
|
Tools |
A MaxCompute data structure for storing models generated by traditional ML algorithms. Use Offline Model components to retrieve models for offline prediction. |
|
|
Exports a MaxCompute-trained model to a specified OSS path. |
||
|
Custom scripts |
Calls Alink algorithms for classification, regression, and recommendation. Integrates with other Designer components to build and validate workflows. |
|
|
Adds a multi-date loop execution feature to the standard SQL Script component. It is used for the parallel execution of daily SQL tasks within a specific time period. |
||
|
Beta components |
A compression estimation algorithm. |
|
|
Supports both sparse and dense data formats. Use this component to predict numeric variables, such as loan amounts and temperatures. |
||
|
Predicts numeric variables, including housing prices, sales volumes, and humidity. |
||
|
The most commonly used regularization method for regression analysis of ill-posed problems. |
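To make a few of the traditional component descriptions in the table above concrete, the following Python sketch shows min-max rescaling to [0, 1], z-score standardization, and a stability index of the kind used to measure sample shift. It is an illustration under stated assumptions, not the components' code: the function names are hypothetical, the standardization uses the population standard deviation, and the stability index takes pre-binned proportions and uses the common formula sum((actual - expected) * ln(actual / expected)).

```python
import math
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to rescale, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def population_stability_index(expected, actual, eps=1e-6):
    """PSI over pre-binned proportions.

    expected, actual: per-bin proportions of the baseline and the new sample.
    eps guards against log(0) for empty bins (an assumption of this sketch).
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


print(min_max_normalize([2, 4, 6]))  # [0.0, 0.5, 1.0]
print(standardize([1, 2, 3]))
print(population_stability_index([0.5, 0.5], [0.6, 0.4]))
```

Identical distributions give an index of 0, and the value grows as the new sample drifts away from the baseline.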