AutoML (automated machine learning) refers to the automated end-to-end process of applying machine learning in real and practical scenarios.
A typical machine learning model includes the four following steps:
From data reading, pre-processing, optimization, and result prediction, each step is controlled and performed manually. AutoML focuses on two main aspects: data collection and prediction. Any other intermediate steps can be easily automated. In addition, AutoML provides models that have been optimized and ready for prediction.
Currently, AutoML mainly falls into three categories: 1. AutoML for automated parameter tuning (a relatively basic type) 2. AutoML for non-deep learning, for example, AutoSKlearn. This type is mainly applied in data pre-processing, automated feature analysis, automated feature detection, automated feature selection, and automated model selection. 3. AutoML for deep learning/neural networks, including NAS and ENAS as well as Auto-Keras for frameworks.
From the application perspective, the demand for machine learning systems has soared over the past few years. ML has been adopted in a wide range of applications. However, although it is proven that machine learning can provide better support for some enterprises, many enterprises are still struggling to implement ML model deployment.
Theoretically, one goal of AI is to replace a portion of manpower. Specifically, a large part of the AI design work can also be implemented by using proper algorithms. Take parameter tuning for example: Algorithms like Bayes, NAS, and evolutionary programming can be used in the parameter tuning process to replace manpower by allowing more computing power.
To deploy AI models, an enterprise first needs to have a team of experienced data scientists, who expect high salaries. Even if an enterprise does have an excellent team, usually more experience rather than AI knowledge is needed to decide which model best fits the enterprise. The success of machine learning in a variety of applications leads to an increasingly higher demand for machine learning systems, which are supposed to be easy-to-use even for non-experts. AutoML tends to automate as many steps as possible in ML pipelines and retain good model performance with minimum manpower.
AutoML has three major advantages:
AutoML has a history of many years. Since last year, many excellent AutoML frameworks have emerged. This article only briefly describes several common frameworks. Subsequent articles will give more information about the use and performance of these frameworks.
Since SKLearn is usually a choice for getting started, let's first talk about AutoSKLearn.
Auto-SKLearn is an automated machine learning software package built on scikit-learn. Auto-SKLearn frees a machine learning user from algorithm selection and hyper-parameter tuning. It includes feature engineering methods such as One-Hot, digital feature standardization, and PCA. The model uses SKLearn estimators to process classification and regression problems.
Auto-SKLearn creates a pipeline and uses Bayes search to optimize that channel. In the ML framework, two components are added for hyperparameter tuning by means of Bayesian reasoning: Meta learning is used to initialize optimizers using Bayes and evaluate the auto collection construction of the configuration during the optimization process.
Auto-SKLearn performs well in medium and small datasets, but it cannot produce modern deep learning systems with the most advanced performance in large datasets.
The following example shows how to use Auto-SKLearn to fit a simple regression model.
import sklearn.model_selection import sklearn.datasets import sklearn.metrics import autosklearn.regression def main(): X, y = sklearn.datasets.load_boston(return_X_y=True) feature_types = (['numerical'] * 3) + ['categorical'] + (['numerical'] * 9) X_train, X_test, y_train, y_test = \ sklearn.model_selection.train_test_split(X, y, random_state=1) automl = autosklearn.regression.AutoSklearnRegressor( time_left_for_this_task=120, per_run_time_limit=30, tmp_folder='/tmp/autosklearn_regression_example_tmp', output_folder='/tmp/autosklearn_regression_example_out', ) automl.fit(X_train, y_train, dataset_name='boston', feat_type=feature_types) print(automl.show_models()) predictions = automl.predict(X_test) print("R2 score:", sklearn.metrics.r2_score(y_test, predictions)) if __name__ == '__main__': main()
MLBox is a powerful Automated Machine Learning python library. According to the official document, it provides the following features:
MLBox main package contains 3 sub-packages:
TPOT is a tree-based pipeline optimization tool that uses genetic algorithms to optimize machine learning pipelines. TPOT is built on top of scikit-learn and uses its own regressor and classifier methods. TPOT explore thousands of possible pipelines and finds the one that best fit the data.
H2O is an open source and distributed in-memory machine learning platform developed by H2O.ai. H2O supports both R and Python. It supports the most widely used statistical and machine learning algorithms including gradient boosted machines, generalized linear models, deep learning and more.
H2O includes an Automated Machine Learning module and uses its own algorithms to create pipelines. It uses exhaustive search for feature engineering methods and model hyper-parameters to optimize pipelines.
H2O automates some complex data science and machine learning tasks, such as feature engineering, model validation, model adjustment, model selection, and model deployment. In addition, it provides automatic visualization and machine learning interpretation (MLI).
Auto-Keras is an open source software library for automated machine learning (AutoML) developed by DATA Lab. Built on top of the deep learning framework Keras, Auto-Keras provides functions to automatically search for architecture and hyper-parameters of deep learning models.
Auto-Keras follows the classic Scikit-Learn API design and therefore is easy to use. The current version provides the function to automatically search for hyper-parameters during deep learning.
In Auto-Keras, the trend is to simplify ML by using automatic Neural Architecture Search (NAS) algorithms. NAS basically uses a set of algorithms that automatically adjust models to replace deep learning engineers/practitioners.
Resource link: https://github.com/keras-team/autokeras
Now let's look at a killer library - TransmogrifAI released by Salesforce in 2018.
Einstein, a flagship ML platform of Salesforce, is also powered by TransmogrifAI. TransmogrifAI is an end-to-end AutoML library for structured data written in Scala that runs on top of Apache Spark. It automates feature analysis, feature selection, feature validation, model selection and more. TransmogrifAI is especially useful in the following scenarios:
The essence of AutoML is to automate repetitive tasks such as pipeline creation and hyper-parameter tuning so that data scientists can spend more time on business problems on hand in practical scenarios. AutoML also allows everyone instead a small group of people to use the machine learning technology. Data scientists can accelerate ML development by using AutoML to implement really efficient machine learning.
Whether AutoML will be a success depends on its usage and progress in the machine learning field. Obviously, AutoML is an important part of machine learning in the future.
Alibaba Clouder - November 29, 2018
Alibaba Tech - July 25, 2019
Alibaba Clouder - February 5, 2018
Alibaba Clouder - September 2, 2019
Apache Flink Community China - August 2, 2019
Alibaba Container Service - July 29, 2019
An end-to-end platform that provides various machine learning algorithms to meet your data mining and analysis requirements.Learn More
An online computing service that offers elastic and secure virtual cloud servers to cater all your cloud hosting needs.Learn More
Powerful parallel computing capabilities based on GPU technology.Learn More
More Posts by Alibaba Clouder