Machine Learning Algorithms for ML Beginners

This tutorial introduces ML beginners to commonly used machine learning algorithms such as graph algorithms, linear regression, logistic regression, decision trees, and random forests.

The previous blog post, Introduction to Machine Learning, presented the Machine Learning concept. Now, let’s discuss representative methods used in the technology.

Regression Algorithms

In most Machine Learning courses, regression algorithms are the first to be introduced for two reasons:

  1. Regression algorithms are relatively simple, and discussing them allows you to slide easily from statistics on to Machine Learning.
  2. Regression algorithms provide the foundation for some of the powerful algorithms to be discussed later.

There are two regression algorithm subtypes: linear regression and logistic regression.

We have already touched on linear regression via the House Price Problem. Let’s figure out how to find the straight line that best fits all the data.

The "least squares" method is used to solve this problem. The idea is to find the line that most closely follows the observed data points in the data set. Not all of the data points lie exactly along a straight line, so the best-fit line is the one with the least deviation from those points. To minimize the deviation, find the straight line that minimizes the sum of the squares of the distances between the points and the line.
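
As a rough illustration of the least squares idea, the following Python sketch fits a line to a handful of points using the closed-form solution for a single feature. The data values and the function names `fit_line` and `squared_error` are hypothetical, and the sketch measures the vertical distance between each point and the line, which is the usual convention for ordinary least squares.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution obtained by setting the derivatives of the
    # squared-error sum to zero.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def squared_error(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: house area and price in arbitrary units.
areas = [50, 70, 90, 110, 130]
prices = [152, 208, 271, 329, 390]

slope, intercept = fit_line(areas, prices)
best = squared_error(areas, prices, slope, intercept)
print(slope, intercept, best)
```

Any other line, for example one with a slightly different slope, gives a larger value of `squared_error` on the same data.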

The least squares method converts an optimization problem into the problem of finding the extreme value of a function. In mathematics, we can find the extreme values of a function by looking for the points where the derivative is 0. However, this approach is not always suitable for computers, because a computer may be unable to solve the resulting equations directly, or the computation may be too intensive.

In computer science, there is a specialized discipline called numerical computation. It strives to increase accuracy and efficiency when computers run various computations. For example, the famous gradient descent method and Newton's method are classic numerical computation algorithms. They are also well-suited to finding extreme values of functions. The gradient descent method is one of the simplest and most effective methods for solving regression problems. The neural networks and recommendation algorithms to be discussed later contain linear regression components, and the gradient descent method is also applied in other algorithms.
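
The following Python sketch shows one way gradient descent can fit a straight line by repeatedly stepping against the gradient of the mean squared error. The data, the learning rate, the step count, and the function name `gradient_descent` are assumptions chosen for illustration, not values from the original post.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by minimizing the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient, that is, downhill on the error surface.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data that lies exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(w, b)
```

If the learning rate is too large, the updates overshoot and the error grows instead of shrinking, so in practice this value usually needs tuning.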

Logistic regression algorithms are similar to linear regression algorithms. However, the types of problems addressed by linear regression are different from those addressed by logistic regression. Linear regression deals with numerical problems with final predictions being numbers, such as a house price.

Logistic regression, on the other hand, is used for classification. Its predictions are discrete classes: for example, logistic regression is often used to determine whether an email is spam or whether a user will click on an advertisement.

In practice, logistic regression simply adds a Sigmoid function to the results computed by linear regression. This function converts the numerical result into a probability between 0 and 1 (images of the Sigmoid function are not very intuitive, but you need to understand that larger values are closer to 1 and smaller values are closer to 0). Next, we make a prediction based on this probability. For example, if the probability is greater than 0.5, we can classify an email as spam or judge a tumor to be malignant.
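
A minimal Python sketch of this step is shown below. The weights, bias, and feature values are hypothetical and only illustrate how a linear score is passed through the Sigmoid function and then compared with the 0.5 threshold.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_spam(features, weights, bias, threshold=0.5):
    """Return (probability, is_spam) for one email."""
    # Linear regression part: a weighted sum of the features.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    # Logistic part: squash the score into a probability.
    probability = sigmoid(z)
    return probability, probability > threshold


# Hypothetical weights and features, chosen only for illustration.
weights = [1.5, -2.0]
bias = -0.5
prob, is_spam = predict_spam([3.0, 0.5], weights, bias)
print(prob, is_spam)
```

A score of 0 maps to a probability of exactly 0.5, large positive scores approach 1, and large negative scores approach 0.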

For better understanding, refer to the logistic regression algorithm shown in this blog.

Related Blogs

Alibaba Cloud Machine Learning Platform for AI: Financial Risk Control Experiment with Graph Algorithms

In this article, we will be evaluating credit scores for risk control through graph algorithm components in Alibaba Cloud's Machine Learning Platform for AI.

Graph algorithms are typically applied to relationship-based business scenarios. Unlike algorithms that work on structured data, graph algorithms organize data into relationship graphs with nodes connected to each other by edges. Alibaba Cloud Machine Learning Platform for AI (PAI) provides several graph algorithm components, including K-Core, maximum connected subgraph, and label propagation classification.

This section uses graph algorithm components in the Alibaba Cloud Machine Learning Platform for AI to create an experiment as follows:

[Figure: Machine Learning Platform for AI]

The figure above shows the relationships among a group of people. The arrows in the figure represent the relationships between these people, for example, coworkers or relatives. Enoch is a trusted customer and Evan is a fraudulent customer. Graph algorithms are used to calculate the credit score of other people in order to learn the probability of a person being a fraudulent customer. The results can be used by corresponding institutions for risk control.
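
To give a feel for how scores can spread through such a graph, here is a small Python sketch of a neighbor-averaging scheme in the spirit of label propagation. It is not the PAI component itself. The graph, the people named Alice and Bob, the function name `propagate_scores`, and the choice to treat relationships as undirected are all assumptions made for illustration. Only Enoch (trusted, score 0) and Evan (fraudulent, score 1) come from the example above.

```python
def propagate_scores(edges, known, iterations=100):
    """Estimate a fraud score in [0, 1] for every node in an undirected graph."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Known customers keep their label; everyone else starts undecided at 0.5.
    scores = {node: known.get(node, 0.5) for node in neighbors}
    for _ in range(iterations):
        updated = dict(scores)
        for node, linked in neighbors.items():
            if node not in known:
                # An unlabeled person takes the average score of their contacts.
                updated[node] = sum(scores[n] for n in linked) / len(linked)
        scores = updated
    return scores


# Hypothetical relationship chain: Enoch - Alice - Bob - Evan.
edges = [("Enoch", "Alice"), ("Alice", "Bob"), ("Bob", "Evan")]
known = {"Enoch": 0.0, "Evan": 1.0}

scores = propagate_scores(edges, known)
print(scores)
```

In this toy graph, the person closer to Evan ends up with a higher score than the person closer to Enoch, which matches the intuition that risk spreads along relationships.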

Machine Learning Algorithms and Scikit-Learn

This tutorial will focus on the more practical side of the machine learning pipeline. First we will take a look at the Scikit-learn library, which is one of the most popular libraries in machine learning, and then we will give a practical demonstration of machine learning algorithms.

What Is Scikit-Learn?

Scikit-learn is currently one of the most popular machine learning libraries in the world. It is easy to use and features several powerful algorithms. It was originally created by David Cournapeau as a 2007 Google Summer of Code project and was publicly released in 2010, when a team of developers from a research institute in France took the project to new heights. Scikit-learn is mainly written in Python, but some of the core algorithms are written in Cython, a superset of Python that compiles to C. This design helps to ensure the overall good performance of Scikit-learn.

How Can I Use Machine Learning Algorithms on Alibaba Cloud?
So now let's see how machine learning works on Alibaba Cloud in connection with Scikit-Learn.

The type of machine learning model that you will develop here is a regression algorithm, where prediction is based on continuous numerical values. At the same time, you will integrate other algorithms already found on Alibaba Cloud. Each algorithm will have an evaluation value. Towards the end of the tutorial you will find a brief description of each algorithm used.

As the first step in this tutorial, obtain some data that's present in the cloud already. For this, I'm using the farmers data.

Alibaba Cloud Machine Learning Platform for AI: Using Regression Algorithm to Predict Agriculture Loan Issuing

This article will illustrate how to build a loan issuing prediction algorithm using linear regression on the Alibaba Cloud Machine Learning Platform for Artificial Intelligence using real data.

Issuing agriculture loans is a typical data mining case. Lenders use an empirical model built from statistics of past years (including a borrower's yearly income, types of planted crops, loan history, and other factors) to predict that borrower's repayment ability.

This document is based on agriculture loan scenarios and shows you how to use a linear regression algorithm to handle loan issuing business.

Linear regression is a widely used statistical analysis method for determining the quantitative relationship between two or more variables. This article predicts whether to issue requested loan amounts to users in the prediction set by analyzing the issuing history of agriculture loans. We will be performing all our data analysis on the Alibaba Cloud Machine Learning platform.

Data Exploration Procedure

1. Data Source Preparation

Input data is divided into two parts:

Loan training set: Over 200 loan records are used to train the regression model. This training set includes features such as "farmsize" and "rainfall". "claimvalue" is the recovered loan amount.
Loan prediction set: This prediction set includes a total of 71 loan applicants this year. "claimvalue" is a farmer's requested loan amount.

Predict which of the 71 applicants will receive loans based on the existing 200+ historical records.

2. Data Pre-Processing

Map string fields to numbers according to their meaning. For example, for the "region" field, map "north", "middle", and "south" to 0, 1, and 2 respectively, then convert the field to the double type by using the type conversion component, as shown in the following diagram. You can perform model training after the data is pre-processed.
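
Outside the platform, the same mapping can be sketched in a few lines of Python. The sample rows, their values, and the names `REGION_CODES` and `encode_region` are hypothetical. Only the field names and the north/middle/south to 0/1/2 mapping come from the description above, and Python's `float` stands in for the double type.

```python
REGION_CODES = {"north": 0, "middle": 1, "south": 2}


def encode_region(rows):
    """Return copies of the rows with "region" replaced by its numeric code."""
    encoded = []
    for row in rows:
        new_row = dict(row)
        # float() plays the role of the conversion to the double type.
        new_row["region"] = float(REGION_CODES[row["region"]])
        encoded.append(new_row)
    return encoded


# Hypothetical rows; only the field names come from the text above.
loans = [
    {"farmsize": 12.0, "rainfall": 30.0, "region": "north"},
    {"farmsize": 8.5, "rainfall": 42.0, "region": "south"},
]
encoded_loans = encode_region(loans)
print(encoded_loans)
```

The function copies each row, so the original string values remain available if you need to inspect them later.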

Related Courses

Machine Learning Algorithm Primer Series 3- Logistic Regression Model

This course is the third class in the Alibaba Cloud Machine Learning Algorithm QuickStart series. It introduces the basic concepts and theory of the logistic regression model and the data preprocessing techniques it uses, implements a logistic regression model with PAI to resolve a business problem, and prepares you for subsequent machine learning courses.

Machine Learning Algorithm Primer Series 4- Decision Tree and Random Forest

This course is the fourth class in the Alibaba Cloud Machine Learning Algorithm QuickStart series. It introduces the main decision tree algorithms and the pruning principle, applies feature engineering, implements a random forest model with PAI to explain model comparison and selection, and prepares you for subsequent machine learning courses.

Related Market Products

Machine Learning Algorithm Primer 2-Naive Bayes Classifier

How to use Alibaba Cloud advanced machine learning platform for AI (PAI) to quickly apply the linear regression model in machine learning to properly solve business-related prediction problems.

Machine Learning Algorithm Primer - Linear Regression

How to use Alibaba Cloud advanced machine learning platform for AI (PAI) to quickly apply the linear regression model in machine learning to properly solve business-related prediction problems.

Related Documentation

Deep learning - Machine Learning Platform for AI

Introduction to deep learning

Alibaba Cloud Machine Learning Platform for AI supports multiple deep learning frameworks and provides powerful GPU clusters that contain both M40 and P100 GPU nodes. You can use these frameworks and hardware resources to train your deep learning algorithms.

Supported frameworks currently include TensorFlow (versions 1.0, 1.1, and 1.2), MXNet 0.9.5, and Caffe RC3. TensorFlow and MXNet support Python. Caffe supports custom net files.

Before using deep learning frameworks, you must upload your data to Alibaba Cloud Object Storage Service (OSS). The algorithms can read data from specified OSS directories when running. Note that machine learning GPU clusters are currently only available in the China (Shanghai) region. If your algorithms only read OSS data from the China (Shanghai) region, no traffic fees are incurred.

Upload data to OSS

Before using deep learning to process your data, you must upload your data to OSS. First, create OSS buckets. Since deep learning GPU clusters are only available in the China (Shanghai) region, we recommend that you choose this region when creating OSS buckets. Your data is then transmitted over the Alibaba Cloud classic network. No traffic fees are incurred when you run your algorithms. After the OSS buckets have been created, you can log on to the OSS console to create folders and upload your data.

Graph analysis - Machine Learning Platform for AI

The network analysis column provides analytic algorithms which are based on the Graph data structure. The following figure shows an example of the analysis process developed with the network analysis component of the platform.

The running parameters need to be set for the algorithm components in the network analysis column.
The parameters are described as follows:

  1. Process count: The workerNum parameter specifies the number of nodes for concurrent job execution. The concurrency level and framework communication costs increase with the value of this parameter.
  2. Work memory: The workerMem parameter specifies the maximum memory size that a single worker can use. The default value is 4096 MB. The OutOfMemory exception is thrown if memory usage of a single process exceeds the maximum.


Function overview

The KCore of a graph is the subgraph that is left after all nodes whose degrees are less than or equal to K are removed. If a node is included in the KCore but is removed from the (K+1)Core, the coreness of this node is K. Therefore, the coreness of a node whose degree is 1 must be 0. The maximum node coreness is the graph coreness.
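
The definition above can be sketched in Python as follows. This is an illustrative implementation, not the PAI component. It assumes that the removal is repeated until no node with degree less than or equal to K remains, and it follows the wording above, whereas many references remove only nodes whose degree is strictly less than K. The example graph and the function names `k_core` and `coreness` are hypothetical.

```python
def k_core(edges, k):
    """Return the nodes left after repeatedly removing nodes with degree <= k."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    remaining = set(neighbors)
    changed = True
    while changed:
        changed = False
        for node in list(remaining):
            # Count only neighbors that have not been removed yet.
            degree = len(neighbors[node] & remaining)
            if degree <= k:
                remaining.discard(node)
                changed = True
    return remaining


def coreness(edges):
    """Map each node to the largest k for which it is still in the k-core."""
    result = {}
    k = 0
    current = k_core(edges, k)
    while current:
        for node in current:
            result[node] = k
        k += 1
        current = k_core(edges, k)
    return result


# Hypothetical graph: four fully connected nodes A-D plus E attached to A.
edges = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("B", "D"), ("C", "D"),
    ("A", "E"),
]
node_coreness = coreness(edges)
graph_coreness = max(node_coreness.values())
print(node_coreness, graph_coreness)
```

Here E has degree 1 and therefore coreness 0, in line with the statement above, while the four tightly connected nodes share the highest coreness, which is also the graph coreness.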

Related Products

Machine Learning Platform for AI

Machine Learning Platform for AI provides end-to-end machine learning services, including data processing, feature engineering, model training, model prediction, and model evaluation. Machine Learning Platform for AI combines all of these services to make AI more accessible than ever.


DataWorks

DataWorks is a Big Data platform product launched by Alibaba Cloud. It provides one-stop Big Data development, data permission management, offline job scheduling, and other features.

DataWorks works straight out of the box, so you do not need to worry about setting up complex underlying clusters or about Operations & Management.


Alibaba Clouder
