This topic describes how to use logistic regression to generate a performance prediction model. You can use this model to predict the performance of students in an examination based on the family background of the students and their behavior at school. You can also obtain the key factors that affect the performance of students in examinations.

Background information

After you obtain the performance prediction model that is described in this topic, you can import your data to a MaxCompute table to perform offline prediction.

Dataset

The dataset that is used in this topic contains 25 feature fields and one goal field. The following table describes the fields in the dataset.
Field Type Description
sex STRING The gender of the student. Valid values: F and M. F indicates female. M indicates male.
address STRING The type of area where the student lives. Valid values: U and R. U indicates that the student lives in the urban area. R indicates that the student lives in the rural area.
famsize STRING The number of family members. Valid values: LE3 and GT3. LE3 indicates that the number of family members is less than or equal to three. GT3 indicates that the number of family members is greater than three.
pstatus STRING Indicates whether the student lives with parents. Valid values: T and A. T indicates that the student lives with parents. A indicates that the student does not live with parents.
medu STRING The education level of the mother. Valid values: 0 to 4. A greater value indicates that the mother is better educated.
fedu STRING The education level of the father. Valid values: 0 to 4. A greater value indicates that the father is better educated.
mjob STRING The job of the mother. For example, the mother may work in the education, health, or services industry.
fjob STRING The job of the father. For example, the father may work in the education, health, or services industry.
guardian STRING The guardian of the student. Valid values: mother, father, and other.
traveltime DOUBLE The travel time from home to school, in minutes.
studytime DOUBLE The study time per week, in hours.
failures DOUBLE The number of failed examinations.
schoolsup STRING Indicates whether the student receives additional training in study. Valid values: yes and no.
fumsup STRING Indicates whether the student has a tutor. Valid values: yes and no.
paid STRING Indicates whether the student receives additional training for passing the examination. Valid values: yes and no.
activities STRING Indicates whether the student receives extracurricular training courses. Valid values: yes and no.
higher STRING Indicates whether the student pursues higher education. Valid values: yes and no.
internet STRING Indicates whether the Internet is available for the student at home. Valid values: yes and no.
famrel DOUBLE The family relationship of the student. Valid values: 1 to 5. A greater value indicates a better family relationship.
freetime DOUBLE The free time available for the student. Valid values: 1 to 5. A greater value indicates a greater amount of free time.
goout DOUBLE Indicates how often the student hangs out with friends. Valid values: 1 to 5. A greater value indicates that the student hangs out with friends more often.
dalc DOUBLE Indicates how much the student drinks per day. Valid values: 1 to 5. A greater value indicates that the student drinks more.
walc DOUBLE Indicates how much the student drinks per week. Valid values: 1 to 5. A greater value indicates that the student drinks more.
health DOUBLE The health status of the student. Valid values: 1 to 5. A greater value indicates that the student has a better health status.
absences DOUBLE The number of times that the student was absent from school. Valid values: 0 to 93.
g3 DOUBLE The performance in the final examination. The performance is scored at a maximum of 20 points.
The following figure shows the sample data that is used in the experiment.
Figure: Sample data of the experiment
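The value ranges in the field table can be turned into a quick sanity check before data is imported. The following is a minimal sketch, not part of the template: the helper name validate_record is hypothetical, only a subset of the fields is covered, and the lower bound of 0 for g3 is an assumption because the table states only the maximum of 20 points.

```python
from typing import Any, Dict, List

# Allowed values for some STRING fields, copied from the field table.
CATEGORICAL_VALUES = {
    "sex": {"F", "M"},
    "address": {"U", "R"},
    "famsize": {"LE3", "GT3"},
    "pstatus": {"T", "A"},
    "medu": {"0", "1", "2", "3", "4"},
    "fedu": {"0", "1", "2", "3", "4"},
    "guardian": {"mother", "father", "other"},
    "schoolsup": {"yes", "no"},
    "internet": {"yes", "no"},
}

# Inclusive ranges for some DOUBLE fields.
# Assumption: g3 starts at 0; the table only gives the maximum (20).
NUMERIC_RANGES = {
    "famrel": (1, 5),
    "freetime": (1, 5),
    "goout": (1, 5),
    "dalc": (1, 5),
    "walc": (1, 5),
    "health": (1, 5),
    "absences": (0, 93),
    "g3": (0, 20),
}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty if none).

    Fields that are not listed in the two dictionaries above are ignored.
    """
    problems = []
    for field, value in record.items():
        if field in CATEGORICAL_VALUES:
            if value not in CATEGORICAL_VALUES[field]:
                problems.append(f"{field}: unexpected value {value!r}")
        elif field in NUMERIC_RANGES:
            low, high = NUMERIC_RANGES[field]
            if not (low <= float(value) <= high):
                problems.append(f"{field}: {value!r} is outside {low} to {high}")
    return problems


# Made-up records for illustration only.
good = {"sex": "F", "medu": "4", "famrel": 3.0, "g3": 19.0}
bad = {"sex": "X", "g3": 25.0}
print(validate_record(good))
print(validate_record(bad))
```

A check of this kind catches mistyped category codes or out-of-range scores before they reach the preprocessing step.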

Procedure

  1. Go to the Machine Learning Studio console.
    1. Log on to the PAI console.
    2. In the left-side navigation pane, choose Model Training > Studio-Modeling Visualization.
    3. On the PAI Visualization Modeling page, find the project in which you want to create an experiment and click Machine Learning in the Operation column.
  2. Create an experiment.
    1. In the left-side navigation pane, click Home.
    2. In the Templates section, click Create below [Online Prediction] Student Examination Performance Prediction.
    3. In the New Experiment dialog box, specify the experiment parameters. You can use the default values of the parameters.
      Parameter Description
      Name The name of the experiment. Default value: [Online Prediction] Student Examination Performance Prediction.
      Project The name of the project to which the experiment belongs. You cannot change the value of this parameter.
      Description The description of the experiment. Default value: This experiment is an example of online prediction service. You can use this experiment to predict the final exam scores of middle school students based on their campus activities and analyze the key factors that affect the scores.
      Save To The directory for storing the experiment. Default value: My Experiments.
    4. Click OK.
    5. Optional: Wait about 10 seconds. Then, click Experiments in the left-side navigation pane.
    6. Optional: Click Exam Performance Prediction_XX under My Experiments. The canvas of the experiment appears.
      My Experiments is the directory for storing the experiment that you created and Exam Performance Prediction_XX is the name of the experiment. In the experiment name, _XX is the ID that the system automatically creates for the experiment.
    7. View the components of the experiment on the canvas, as shown in the following figure. The system automatically creates the experiment based on the preset template.
      Figure: Exam performance prediction
      Area No. Description
      1 The component in this area preprocesses the source data. The SQL Script-1 component structures text data that is read from the dataset.
      • The component converts yes in the source data to 1 and no to 0.
      • The component reduces other text fields to binary indicators that suit the service scenario. For example, the component converts the value teacher of the mjob field to 1 and all other values to 0. After the conversion, the mjob field indicates whether the mother works in the education industry.
      • For the goal field g3, the component converts values greater than 18 to 1 and other values to 0.
      2 The Normalization-1 component scales the values of all fields to the range from 0 to 1. This prevents fields with large value ranges from dominating fields with small value ranges.
      3 The Split-1 component divides the source dataset into a training dataset and a prediction dataset at an 8:2 ratio.
      4 The Logistic Regression for Binary Classification-1 component generates an offline model.
      5 The Confusion Matrix-1 component evaluates the accuracy of the model.
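The preprocessing, normalization, and split steps described in areas 1 to 3 can be sketched outside the console as follows. This is an illustration under stated assumptions, not the code that the template runs: the function names are hypothetical, the mapping of yes to 1 is assumed, min-max scaling is assumed as the normalization method, a random shuffle is assumed for the 8:2 split, and the ten input rows are made up.

```python
import random


def encode(row):
    """Turn one raw record into numeric features plus a 0/1 label."""
    return {
        # Assumption: yes maps to 1 and no maps to 0.
        "internet": 1.0 if row["internet"] == "yes" else 0.0,
        # teacher becomes 1, every other job becomes 0.
        "mjob": 1.0 if row["mjob"] == "teacher" else 0.0,
        "medu": float(row["medu"]),
        # Goal field: scores greater than 18 become 1, others 0.
        "label": 1.0 if row["g3"] > 18 else 0.0,
    }


def min_max_normalize(rows, columns):
    """Rescale each listed column to [0, 1]; a constant column maps to 0."""
    scaled = [dict(r) for r in rows]
    for col in columns:
        low = min(r[col] for r in rows)
        high = max(r[col] for r in rows)
        span = high - low
        for r in scaled:
            r[col] = (r[col] - low) / span if span else 0.0
    return scaled


def split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into training and prediction sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten made-up rows, for illustration only.
raw_rows = [
    {
        "internet": "yes" if i % 2 else "no",
        "mjob": "teacher" if i % 3 == 0 else "other",
        "medu": str(i % 5),
        "g3": float(10 + i),
    }
    for i in range(10)
]

encoded = [encode(r) for r in raw_rows]
normalized = min_max_normalize(encoded, ["internet", "mjob", "medu"])
train_rows, test_rows = split(normalized)
print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the split reproducible between runs, which makes later accuracy figures comparable.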
  3. Run the experiment and view the result.
    1. In the top toolbar of the canvas, click Run.
    2. After the experiment is run, right-click Logistic Regression for Binary Classification-1 on the canvas and choose Model Option > Show Model. In the dialog box that appears, you can view the weight of each factor that affects the examination performance.
      Figure: Factors that affect examination performance
      A greater absolute weight value indicates a greater impact of the factor on the examination performance. A positive weight value indicates a positive correlation between the factor and good examination performance. A negative weight value indicates a negative correlation between the factor and good examination performance. The following table provides a brief analysis of the heavily weighted factors.
      Field Field description Weight Analysis
      mjob The job of the mother. -0.5756277338892716 A mother who works as a teacher is negatively correlated with good examination performance of the student.
      fjob The job of the father. 1.114492913509562 A father who works as a teacher is positively correlated with good examination performance of the student.
      internet Indicates whether the Internet is available for the student at home. 1.121226474778686 Internet access at home favors the examination performance of the student.
      medu The education level of the mother. 1.275664610095503 A better educated mother favors the examination performance of the student.
      Note: The experiment is based on a small dataset, so the analysis result may be inaccurate and is for reference only.
    3. After the experiment is run, right-click Confusion Matrix-1 on the canvas and select View Evaluation Report.
    4. In the Confusion Matrix dialog box, click the Statistics tab. On the Statistics tab, you can find that the prediction accuracy of the model is greater than 80%.
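To make the weights and the accuracy figure more concrete, the following sketch shows how a binary logistic regression model turns weighted features into a probability, and how accuracy is derived from a confusion matrix. It is an illustration under assumptions, not the component's implementation: the function names are hypothetical, the bias of 0.0 is assumed because the source does not list an intercept, only the four weights from the table are used, and the label lists are made up.

```python
import math

# Weights copied from the analysis table above.
WEIGHTS = {
    "mjob": -0.5756277338892716,
    "fjob": 1.114492913509562,
    "internet": 1.121226474778686,
    "medu": 1.275664610095503,
}
BIAS = 0.0  # Assumption: the source does not state the intercept.


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class; missing features count as 0."""
    z = bias + sum(w * features.get(name, 0.0) for name, w in weights.items())
    return sigmoid(z)


def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(cm):
    """Share of correct predictions: (TP + TN) / total."""
    total = sum(cm.values())
    return (cm["tp"] + cm["tn"]) / total if total else 0.0


# A positive weight raises the predicted probability when the feature is set.
p_without = predict_proba(WEIGHTS, BIAS, {"medu": 0.5, "internet": 0.0})
p_with = predict_proba(WEIGHTS, BIAS, {"medu": 0.5, "internet": 1.0})
print(p_without < p_with)

# Made-up labels: four of the five predictions are correct.
actual = [1, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0]
cm = confusion_matrix(actual, predicted)
print(cm, accuracy(cm))
```

Because the features are scaled to the range from 0 to 1 before training, the weights are roughly comparable with one another, which is what makes ranking factors by weight meaningful.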