[Online prediction] Predict middle school students' final grades

Last Updated: Jun 01, 2020

The data in this topic is fictitious and is only used for experimental purposes.

Background

This topic uses middle school students’ data and data mining algorithms to identify the key factors that affect students’ academic performance. The factors include parents’ occupations, parents’ education levels, and whether Internet access is available at home.
This example uses a dataset that contains information about student family backgrounds and students’ behavior at school. This experiment uses the logistic regression algorithm to create an offline model and an academic performance assessment report, and uses this model to predict the students’ final grades. This experiment also creates an online prediction API, which allows you to apply the trained model to your online business.

Dataset

The dataset consists of 25 feature columns and 1 target column. The detailed fields are as follows.

| Field | Definition | Type | Description |
| --- | --- | --- | --- |
| sex | Gender | string | F indicates female, and M indicates male. |
| address | Home address | string | U indicates urban, and R indicates rural. |
| famsize | Family size | string | LE3 indicates three or fewer members, and GT3 indicates more than three members. |
| pstatus | Living with parents or not | string | T indicates living with parents, and A indicates not living with parents. |
| medu | Mother’s education level | string | The value ranges from 0 to 4. |
| fedu | Father’s education level | string | The value ranges from 0 to 4. |
| mjob | Mother’s job | string | It includes education-related, health-related, and service industries. |
| fjob | Father’s job | string | It includes education-related, health-related, and service industries. |
| guardian | The student’s guardian | string | Valid values: mother, father, and other. |
| traveltime | The travel time from home to school | double | Unit: minutes. |
| studytime | The study time per week | double | Unit: hours. |
| failures | Failed exams | double | The number of failed exams. |
| schoolsup | Specifies whether additional learning aid is available | string | Valid values: yes and no. |
| fumsup | Specifies whether tutoring is available | string | Valid values: yes and no. |
| paid | Specifies whether tutoring related to examination subjects is available | string | Valid values: yes and no. |
| activities | Specifies whether extracurricular activity classes are available | string | Valid values: yes and no. |
| higher | Specifies whether the student has interest in higher education | string | Valid values: yes and no. |
| internet | Specifies whether Internet is available at home | string | Valid values: yes and no. |
| famrel | Family relationship | double | The value ranges from 1 (bad) to 5 (good). |
| freetime | Free time | double | The value ranges from 1 (little) to 5 (much). |
| goout | Frequency of going out with friends | double | The value ranges from 1 (rarely) to 5 (frequently). |
| dalc | Daily drinking | double | The value ranges from 1 (little) to 5 (much) drinking on a daily basis. |
| walc | Weekly drinking | double | The value ranges from 1 (little) to 5 (much) drinking on a weekly basis. |
| health | Health status | double | The value ranges from 1 (bad) to 5 (good). |
| absences | Absences | double | Value range: 0 to 93. |
| g3 | Final exam | double | 20-point system. |

The following is a screenshot of the data.

Offline training

The following figure shows the experiment process.

The data flows through the experiment from top to bottom, for preprocessing, splitting, training, prediction, and evaluation in sequence.

1. Data preprocessing

The SQL script is provided as follows.

  select (case sex when 'F' then 1 else 0 end) as sex,
         (case address when 'U' then 1 else 0 end) as address,
         (case famsize when 'LE3' then 1 else 0 end) as famsize,
         (case Pstatus when 'T' then 1 else 0 end) as Pstatus,
         Medu,
         Fedu,
         (case Mjob when 'teacher' then 1 else 0 end) as Mjob,
         (case Fjob when 'teacher' then 1 else 0 end) as Fjob,
         (case guardian when 'mother' then 0 when 'father' then 1 else 2 end) as guardian,
         traveltime,
         studytime,
         failures,
         (case schoolsup when 'yes' then 1 else 0 end) as schoolsup,
         (case fumsup when 'yes' then 1 else 0 end) as fumsup,
         (case paid when 'yes' then 1 else 0 end) as paid,
         (case activities when 'yes' then 1 else 0 end) as activities,
         (case higher when 'yes' then 1 else 0 end) as higher,
         (case internet when 'yes' then 1 else 0 end) as internet,
         famrel,
         freetime,
         goout,
         Dalc,
         Walc,
         health,
         absences,
         (case when G3>14 then 1 else 0 end) as finalScore
  from ${t1};

Structure text data by using the SQL Script component.

  • For example, a string field may contain the value yes or no. You can use 1 to represent yes and 0 to represent no, as the script does.
  • For some multi-value text fields, the data can be abstracted based on the scenario. For example, for the field “Mjob”, 1 can indicate a teacher and 0 can indicate a non-teacher. After abstraction, this feature indicates whether the job is related to education.
  • The target column is binarized: 1 indicates a final score of more than 14 points (G3>14 in the script), and 0 indicates any other score. The goal is to train a model that can predict this label.
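The encoding rules in the SQL script can also be sketched outside the platform. The following Python snippet is only an illustration: the function name encode_row and the sample record are made up, and it covers just a few of the columns.

```python
def encode_row(row):
    """Encode a raw student record with the same rules as the SQL script."""
    def flag(field, positive):
        # 1 when the field equals the positive value, otherwise 0
        return 1 if row[field] == positive else 0

    # mother -> 0, father -> 1, anything else -> 2 (as in the SQL script)
    guardian_codes = {"mother": 0, "father": 1}
    return {
        "sex": flag("sex", "F"),
        "mjob": flag("mjob", "teacher"),
        "guardian": guardian_codes.get(row["guardian"], 2),
        "internet": flag("internet", "yes"),
        # Target label: 1 when the final score is above 14
        "finalScore": 1 if row["g3"] > 14 else 0,
    }


# Hypothetical sample record, for illustration only
sample = {"sex": "F", "mjob": "health", "guardian": "other",
          "internet": "yes", "g3": 15}
encoded = encode_row(sample)
print(encoded)
```

Keeping the rules in one small function makes it easier to apply the same encoding to records that are later sent to the online prediction API.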

2. Normalization

The Normalization component removes the units (dimensions) of the fields and scales every field to the range 0 to 1. This prevents fields with large value ranges, such as absences, from dominating fields with small ranges. The result is shown in the following figure.
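As a rough illustration of what scaling a field to the range 0 to 1 means, the sketch below applies min-max scaling to one column. It assumes the min-max formula (x - min) / (max - min); the source does not state which formula the component uses, and the sample values are made up.

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical absences values, within the 0 to 93 range of the dataset
absences = [0, 4, 10, 93]
scaled = min_max_normalize(absences)
print(scaled)
```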

3. Splitting

The dataset is split in a ratio of 8:2, in which 80% is used for model training and 20% is used for prediction.
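An 8:2 split can be sketched as a shuffle followed by a cut at 80% of the rows. The function name split_dataset and the fixed seed are assumptions for this illustration; the Split component may choose rows differently.

```python
import random


def split_dataset(rows, train_ratio=0.8, seed=42):
    """Shuffle the rows and split them into training and holdout parts."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 student records
train, holdout = split_dataset(rows)
print(len(train), len(holdout))
```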

4. Logistic regression

Use the Logistic Regression component to train and create an offline model. For more information about the algorithm, see Wiki.
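To make the training step more concrete, here is a minimal sketch of binary logistic regression trained with batch gradient descent. It is not the implementation used by the component; the learning rate, the epoch count, and the toy data are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(features[0])
    n = len(features)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            # error = predicted probability minus true label
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 when the predicted probability is at least 0.5."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


# Toy data (made up): one normalized feature, label 1 for larger values
X = [[0.0], [0.1], [0.2], [0.3], [0.7], [0.8], [0.9], [1.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
weights, bias = train_logistic_regression(X, y)
print(predict(weights, bias, [0.1]), predict(weights, bias, [0.9]))
```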

5. Result analysis and evaluation

You can use the Confusion Matrix component to view the accuracy of the prediction made by your model. As shown in the following figure, the prediction accuracy of this experiment is 82.911%.
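Accuracy is derived from the confusion matrix as the share of correct predictions, (TP + TN) / total. The sketch below shows the calculation on made-up labels; the numbers are unrelated to the 82.911% reported for this experiment.

```python
def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(counts):
    """Share of predictions that match the actual label."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Hypothetical labels, for illustration only
actual = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0]
cm = confusion_matrix(actual, predicted)
print(cm, accuracy(cm))
```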

According to the characteristics of the logistic regression algorithm, some valuable information can be mined through the model coefficients. Right-click the Logistic Regression for Binary Classification component and choose Show Model. The results are shown in the following figure.

In a logistic regression model, the greater the absolute value of a weight, the greater the impact of the feature on the result. A positive weight indicates a positive correlation with the result 1 (a high score in the final exam), and a negative weight indicates a negative correlation. Several features with large weights are analyzed in the following table.

| Field | Definition | Weight | Analysis |
| --- | --- | --- | --- |
| mjob | Mother’s job | -0.7998341777833717 | The mother being a teacher is disadvantageous for the child to get a high score. |
| fjob | Father’s job | 1.422595764037065 | The father being a teacher is advantageous for the child to get a high score. |
| internet | Specifies whether Internet is available at home | 1.070938672974736 | Internet at home will not only have no negative impact on the score, but will also promote the child’s study. |
| medu | Mother’s education level | 2.196219307541352 | The mother’s education level has the greatest impact on the child. The higher the mother’s education level, the higher the child’s scores. |

Due to the small dataset in this experiment, the preceding analysis results are not necessarily accurate and are for reference only.
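One common way to read logistic regression weights is through odds ratios: with the other features held fixed, raising a normalized feature from 0 to 1 multiplies the odds of the result 1 by exp(weight). The sketch below applies this to the four weights reported above; the helper odds_ratio is introduced here for illustration only.

```python
import math

# Weights copied from the model output reported in this topic
weights = {
    "mjob": -0.7998341777833717,
    "fjob": 1.422595764037065,
    "internet": 1.070938672974736,
    "medu": 2.196219307541352,
}


def odds_ratio(weight):
    """Factor by which the odds of result 1 change when the feature goes from 0 to 1."""
    return math.exp(weight)


# Rank features by the absolute value of their weight, largest first
ranked = sorted(weights, key=lambda name: abs(weights[name]), reverse=True)
for name in ranked:
    print(f"{name}: weight={weights[name]:+.3f}, odds ratio={odds_ratio(weights[name]):.2f}")
```

An odds ratio above 1 corresponds to a positive weight, and a value below 1 corresponds to a negative weight.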

Online prediction deployment

After the offline model has been created, deploy the model online and call the RESTful API to make online predictions.
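A call to a deployed prediction service could look roughly like the sketch below. The endpoint URL, the token placeholder, and the JSON body layout are all hypothetical; the real values and request format come from the service you deploy, so check them in the console before use.

```python
import json
import urllib.request


def build_prediction_request(endpoint, token, features):
    """Build a POST request carrying one encoded student record as JSON."""
    # The {"inputs": [...]} layout is an assumption, not the documented format
    body = json.dumps({"inputs": [features]}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Authorization": token, "Content-Type": "application/json"},
        method="POST",
    )


request = build_prediction_request(
    "https://example.com/api/predict/student_grades",  # hypothetical endpoint
    "YOUR_TOKEN",                                      # placeholder token
    {"medu": 1.0, "internet": 1.0},                    # normalized features
)

# To send the request against a real service, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The features sent online should be encoded and normalized in the same way as the training data, otherwise the model weights do not apply.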

References

You can log on to Alibaba Cloud Machine Learning Platform for AI (PAI) to experience this product and go to Yunqi Community to discuss with us.