The data in this topic is fictitious and is only used for experimental purposes.
This topic uses middle school students’ data and machine learning algorithms to identify the key factors that affect students’ academic performance. The factors include parents’ occupations, parents’ education levels, and whether Internet access is available at home.
This example uses a dataset that contains information about student family backgrounds and students’ behavior at school. This experiment uses the logistic regression algorithm to create an offline model and an academic performance assessment report, and uses this model to predict the students’ final grades. This experiment also creates an online prediction API, which allows you to apply the trained model to your online business.
The dataset consists of 25 feature columns and 1 target column. The detailed fields are as follows.
| Field | Meaning | Type | Description |
|---|---|---|---|
| sex | Gender | string | F indicates female, and M indicates male. |
| address | Home address | string | U indicates urban, and R indicates rural. |
| famsize | Family size | string | LE3 indicates three or fewer members, and GT3 indicates more than three members. |
| pstatus | Living with parents or not | string | T indicates living with parents, and A indicates not living with parents. |
| medu | Mother’s education level | string | The value ranges from 0 to 4. |
| fedu | Father’s education level | string | The value ranges from 0 to 4. |
| mjob | Mother’s job | string | It includes education-related, health-related, and service industries. |
| fjob | Father’s job | string | It includes education-related, health-related, and service industries. |
| guardian | The student’s guardian | string | Valid values: mother, father, and other. |
| traveltime | The travel time from home to school | double | Unit: minutes. |
| studytime | The study time per week | double | Unit: hours. |
| failures | Failed exams | double | The number of failed exams. |
| schoolsup | Specifies whether additional learning aid is available | string | Valid values: yes and no. |
| fumsup | Specifies whether tutoring is available | string | Valid values: yes and no. |
| paid | Specifies whether tutoring related to examination subjects is available | string | Valid values: yes and no. |
| activities | Specifies whether extracurricular activity classes are available | string | Valid values: yes and no. |
| higher | Specifies whether the student has interest in higher education | string | Valid values: yes and no. |
| internet | Specifies whether Internet is available at home | string | Valid values: yes and no. |
| famrel | Family relationship | double | The value ranges from 1 to 5, from bad to good family relationship. |
| freetime | Free time | double | The value ranges from 1 to 5, from little to much free time. |
| goout | Frequency of going out with friends | double | The value ranges from 1 to 5, from rarely to frequently going out with friends. |
| dalc | Daily drinking | double | The value ranges from 1 to 5, from little to much drinking on a daily basis. |
| walc | Weekly drinking | double | The value ranges from 1 to 5, from little to much drinking on a weekly basis. |
| health | Health status | double | The value ranges from 1 to 5, from bad to good health. |
| absences | Absences | double | Value range: 0 to 93. |
| g3 | Final exam | double | 20-point system. |
The following is a screenshot of the data.
The following figure shows the experiment process.
The data flows through the experiment from top to bottom, for preprocessing, splitting, training, prediction, and evaluation in sequence.
The SQL script is provided as follows.
select (case sex when 'F' then 1 else 0 end) as sex,
(case address when 'U' then 1 else 0 end) as address,
(case famsize when 'LE3' then 1 else 0 end) as famsize,
(case Pstatus when 'T' then 1 else 0 end) as Pstatus,
(case Mjob when 'teacher' then 1 else 0 end) as Mjob,
(case Fjob when 'teacher' then 1 else 0 end) as Fjob,
(case guardian when 'mother' then 0 when 'father' then 1 else 2 end) as guardian,
(case schoolsup when 'yes' then 1 else 0 end) as schoolsup,
(case fumsup when 'yes' then 1 else 0 end) as fumsup,
(case paid when 'yes' then 1 else 0 end) as paid,
(case activities when 'yes' then 1 else 0 end) as activities,
(case higher when 'yes' then 1 else 0 end) as higher,
(case internet when 'yes' then 1 else 0 end) as internet,
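(case when G3>14 then 1 else 0 end) as finalScore
-- Assumption: ${t1} is the placeholder for the first input table of the SQL Script component.
from ${t1};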
Use the SQL Script component to convert text data into structured numeric data.
- For example, if a field takes the value yes or no, you can use 1 to represent yes and 0 to represent no.
- For some multi-value text fields, the data can be abstracted based on the scenario. For example, for the field “Mjob”, 1 can indicate a teacher and 0 can indicate a non-teacher. After abstraction, this feature indicates whether the job is related to education.
- The target column is binarized so that 1 indicates a final score of more than 14 points (G3>14 in the script), and 0 indicates any other score. The goal is to train a model that can predict this label.
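The same encoding rules can be sketched outside of PAI in plain Python, which may help when checking the SQL mapping on a few rows by hand. This is a minimal sketch, not the platform's implementation: the helper name `encode_row`, the lookup tables, and the sample record are all invented for illustration, and the target threshold follows the `G3>14` condition of the SQL script.

```python
# Value that maps to 1 for each binary text field (everything else maps to 0).
# These pairs mirror the CASE expressions of the SQL script.
BINARY_POSITIVE = {
    "sex": "F",
    "address": "U",
    "famsize": "LE3",
    "pstatus": "T",
    "mjob": "teacher",
    "fjob": "teacher",
    "schoolsup": "yes",
    "fumsup": "yes",
    "paid": "yes",
    "activities": "yes",
    "higher": "yes",
    "internet": "yes",
}
GUARDIAN_CODES = {"mother": 0, "father": 1}  # any other guardian maps to 2


def encode_row(row):
    """Return the numeric encoding of one student record (hypothetical helper)."""
    encoded = {}
    for field, positive in BINARY_POSITIVE.items():
        encoded[field] = 1 if row[field] == positive else 0
    encoded["guardian"] = GUARDIAN_CODES.get(row["guardian"], 2)
    # Binarize the target: 1 if the final grade is above 14, otherwise 0.
    encoded["finalScore"] = 1 if float(row["g3"]) > 14 else 0
    return encoded


# Invented sample record, for illustration only.
sample = {
    "sex": "F", "address": "U", "famsize": "GT3", "pstatus": "T",
    "mjob": "teacher", "fjob": "services", "guardian": "other",
    "schoolsup": "no", "fumsup": "yes", "paid": "no",
    "activities": "yes", "higher": "yes", "internet": "yes", "g3": 15,
}
encoded_sample = encode_row(sample)
print(encoded_sample)
```

Keeping the rules in lookup tables rather than in a chain of conditions makes it easy to compare them line by line with the SQL script.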
The Normalization component removes the units of the fields (makes them dimensionless) and scales every field to the range from 0 to 1. This prevents fields with large value ranges from dominating fields with small value ranges. The result is shown in the following figure.
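To illustrate the idea, the following Python sketch rescales one column to the range from 0 to 1. It assumes that the component applies min-max scaling, which the text does not state explicitly; the function name and the sample values are invented.

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly so the minimum becomes 0 and the maximum 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column carries no information; map it to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Invented sample of the "absences" column, whose documented range is 0 to 93.
absences = [0, 4, 10, 93]
scaled_absences = min_max_normalize(absences)
print(scaled_absences)
```

After scaling, a field such as absences (0 to 93) and a field such as famrel (1 to 5) contribute on the same scale, so their model weights can be compared.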
The dataset is split in a ratio of 8:2, in which 80% is used for model training and 20% is used for prediction.
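A split like this can be sketched in a few lines of Python. The shuffling and the fixed seed are assumptions made for reproducibility of the sketch; how the PAI component selects rows is not described here.

```python
import random


def split_dataset(rows, train_ratio=0.8, seed=42):
    """Shuffle the rows and return (training rows, held-out rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for 100 student records.
rows = list(range(100))
train_rows, test_rows = split_dataset(rows)
print(len(train_rows), len(test_rows))  # 80 20
```

Holding out rows that the model never sees during training is what makes the accuracy reported later a meaningful estimate.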
Use the Logistic Regression component to train and create an offline model. For more information about the algorithm, see Wiki.
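For readers who want to see what such a component does in principle, here is a minimal logistic regression trained by batch gradient descent in plain Python. It is a teaching sketch under stated assumptions, not the PAI implementation: the learning rate, the number of epochs, and the toy data are invented, and no regularization is applied.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by minimizing the mean log loss with gradient descent."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias) - yi
            for j, v in enumerate(xi):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict(weights, bias, x):
    """Return 1 if the predicted probability is at least 0.5, otherwise 0."""
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5 else 0


# Invented toy data: one normalized feature, label 1 for high values.
X = [[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic_regression(X, y)
predictions = [predict(weights, bias, x) for x in X]
print(weights, bias, predictions)
```

The learned weight is positive because larger feature values go with label 1, which is the same reading that is applied to the model coefficients later in this topic.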
You can use the Confusion Matrix component to view the accuracy of the prediction made by your model. As shown in the following figure, the prediction accuracy of this experiment is 82.911%.
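The accuracy shown by the component is the share of correct predictions, which can be derived from the four cells of a confusion matrix. The sketch below uses invented labels and predictions, so its result is unrelated to the 82.911% reported for this experiment.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels 0 and 1."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(counts):
    """Share of predictions that match the actual label."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Invented labels for eight held-out students.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_matrix(actual, predicted)
print(cm, accuracy(cm))  # 3 tp, 1 fp, 3 tn, 1 fn, accuracy 0.75
```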
According to the characteristics of the logistic regression algorithm, some valuable information can be mined through the model coefficients. Right-click the Logistic Regression for Binary Classification component and choose Show Model. The results are shown in the following figure.
In a logistic regression model, the larger the absolute value of a weight, the greater the impact of the feature on the result. The weights are comparable with each other because all fields have been normalized to the same range. A positive weight indicates a positive correlation with the result 1 (high score in the final exam), and a negative weight indicates a negative correlation. Several features with large weights are analyzed in the following table.
| Field | Meaning | Weight | Analysis |
|---|---|---|---|
| mjob | Mother’s job | -0.7998341777833717 | The mother being a teacher is disadvantageous for the child to get a high score. |
| fjob | Father’s job | 1.422595764037065 | The father being a teacher is advantageous for the child to get a high score. |
| internet | Specifies whether Internet is available at home | 1.070938672974736 | Internet at home will not only have no negative impact on the score, but will also promote the child’s study. |
| medu | Mother’s education level | 2.196219307541352 | The mother’s education level has the greatest impact on the child. The higher the mother’s education level, the higher the child’s scores. |
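One way to read these weights is to rank them by absolute value and convert each one to an odds ratio with `exp(weight)`: in logistic regression, raising a feature by one unit multiplies the odds of the result 1 by that factor. The sketch below uses only the four weights listed above; the variable names are invented, and with normalized features one unit corresponds to the full range of the field.

```python
import math

# Weights copied from the table above.
model_weights = {
    "mjob": -0.7998341777833717,
    "fjob": 1.422595764037065,
    "internet": 1.070938672974736,
    "medu": 2.196219307541352,
}

# Rank features by the magnitude of their weight, strongest first.
ranked = sorted(model_weights.items(), key=lambda item: abs(item[1]), reverse=True)

for name, weight in ranked:
    odds_ratio = math.exp(weight)  # above 1 raises the odds of a high score, below 1 lowers them
    print(f"{name:10s} weight={weight:+.3f} odds_ratio={odds_ratio:.2f}")
```

The ranking puts medu first and mjob last among the four features, and mjob is the only one with an odds ratio below 1.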
Due to the small dataset in this experiment, the preceding analysis results are not necessarily accurate and are for reference only.
After the offline model is created, deploy it as an online service and call its RESTful API to make online predictions.
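The general shape of such a call can be sketched with the Python standard library. Everything service-specific here is a placeholder and not taken from PAI: the endpoint URL, the token, the `Authorization` header, and the JSON layout of the request body are hypothetical and must be replaced with the values shown for your deployed service. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

# Hypothetical values: replace them with the endpoint and token of your own service.
ENDPOINT = "https://example.com/api/predict/student_score"
TOKEN = "YOUR_TOKEN"


def build_request(features):
    """Build a POST request that carries one encoded student record as JSON."""
    body = json.dumps([features]).encode("utf-8")  # assumed layout: a list of records
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Authorization": TOKEN, "Content-Type": "application/json"},
        method="POST",
    )


# Invented, already normalized feature values.
request = build_request({"medu": 1.0, "fjob": 1, "internet": 1})
print(request.get_method(), request.full_url)

# To send the request once a real endpoint is configured:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The features sent to the service need the same encoding and normalization as the training data, otherwise the model weights do not apply to them.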