This topic describes how to use population census data to build a statistical model. You can use the model to analyze the impact of academic degree on income based on attributes such as age, job type, and education level.

Datasets

The experiment described in this topic uses an open source dataset from the Machine Learning Repository of the University of California, Irvine (UCI). For more information, see Adult Data Set. The dataset is the population census data of a region and contains 32,561 data records in total. The following table describes the fields in the dataset.

| Field | Description | Data type |
| --- | --- | --- |
| age | The age of the person. | DOUBLE |
| workclass | The job type of the person. | STRING |
| fnlwgt | The ID of the person. | STRING |
| education | The education level of the person. | STRING |
| education_num | The years of education that the person receives. | DOUBLE |
| maritial_status | The marital status of the person. | STRING |
| occupation | The job of the person. | STRING |
| relationship | The family relationship of the person. | STRING |
| capital_gain | The capital gain of the person. | STRING |
| capital_loss | The capital loss of the person. | STRING |
| hours_per_week | The weekly working hours of the person. | DOUBLE |
| native_country | The nationality of the person. | STRING |
| income | The income of the person. | STRING |
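
To work with the same fields outside the console, the schema above can be sketched in Python as follows. This is a minimal illustration that assumes the data is available as comma-separated text in the column order of the table. The `parse_rows` helper and the sample record are hypothetical and are not part of the platform.

```python
import csv
import io

# Column names and types, in the order listed in the field table.
# DOUBLE maps to float and STRING maps to str.
SCHEMA = [
    ("age", float),
    ("workclass", str),
    ("fnlwgt", str),
    ("education", str),
    ("education_num", float),
    ("maritial_status", str),
    ("occupation", str),
    ("relationship", str),
    ("capital_gain", str),
    ("capital_loss", str),
    ("hours_per_week", float),
    ("native_country", str),
    ("income", str),
]


def parse_rows(text):
    """Parse CSV text into dicts, casting each field to its schema type."""
    rows = []
    for raw in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not raw:
            continue  # skip blank lines
        if len(raw) != len(SCHEMA):
            raise ValueError(f"expected {len(SCHEMA)} fields, got {len(raw)}")
        rows.append(
            {name: cast(value.strip()) for (name, cast), value in zip(SCHEMA, raw)}
        )
    return rows


# Hypothetical sample record with made-up values, for illustration only.
SAMPLE = (
    "41, Private, 100001, Doctorate, 16, Married, Prof-specialty, "
    "Husband, 0, 0, 45, United-States, >50K\n"
)

records = parse_rows(SAMPLE)
print(records[0]["age"], records[0]["education"], records[0]["income"])
```

Keeping the schema in one list makes it easy to spot a mismatch between the documented field order and the actual file layout, because the parser rejects rows with an unexpected number of fields.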

Procedure

  1. Go to the Machine Learning Designer page.
    1. Log on to the Machine Learning Platform for AI console.
    2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
    3. In the left-side navigation pane, choose Model Training > Visualized Modeling (Designer) to go to the Machine Learning Designer page.
  2. In the upper-right corner of the Visualized Modeling (Designer) page, click Go to Studio (Old Version).
  3. Create an experiment.
    1. In the Templates section, click Create below Population Census.
    2. In the New Experiment dialog box, configure the parameters described in the following table. You can use the default settings.

      | Parameter | Description |
      | --- | --- |
      | Name | The name of the experiment. Default value: Population Census. |
      | Project | The name of the project to which the experiment belongs. You cannot change the value of this parameter. |
      | Description | The description of the experiment. Default value: Use machine learning algorithms to achieve population census and analyze the correlation between the income and education level. |
      | Save To | Select My Experiments. |

    3. Click OK.
    4. Optional: Wait about 10 seconds and click Experiments in the left-side navigation pane.
    5. Optional: Click Population Census_XX under My Experiments.
      My Experiments is the directory that stores the experiments that you create, and Population Census_XX is the name of the experiment. In the experiment name, _XX is the ID that the system automatically generates for the experiment.
    6. View the components of the experiment on the canvas, as shown in the following figure. The system automatically creates the experiment based on the template.
      Figure: Experiment on population census
      The sections of the experiment are described as follows:
      The Data source-Population statistics-1 component reads the dataset from MaxCompute.
      The Whole Table Statistics-1, Data Pivoting-1, and Histogram (Multiple Columns)-1 components generate statistical results. You can use these results to determine whether the data follows a Poisson distribution or a Gaussian distribution and whether the data is continuous or discrete. Machine Learning Studio can also visualize the analysis results. After the experiment is run, right-click Histogram (Multiple Columns)-1 on the canvas and select View Analytics Report to view the distribution of the input data, as shown in the following figure.
      Figure: Histogram
      The components displayed in this section analyze the impact of academic degree on income.
      1. Data preprocessing

        The SQL Script-1 component converts the values of the income field to 0 or 1. 0 indicates an annual income of less than or equal to USD 50,000. 1 indicates an annual income of more than USD 50,000.

      2. Filtering and mapping

        The Filtering and Mapping components divide data into three groups based on the following academic degrees: doctor's degree, master's degree, and bachelor's degree. The Filtering and Mapping components support SQL statements. You can set filter criteria as needed. For example, click Filter-PHD on the canvas. In the right-side Fields Setting pane, set the Filter Criteria parameter to education='Doctorate' to filter the persons with a doctor's degree.

      3. Statistical results

        The Percentile components calculate the income proportions of persons with each academic degree.
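
The three stages above (preprocessing, filtering, and statistics) can be sketched in plain Python as follows. This is a minimal illustration and not the logic of the components themselves. It assumes that the income field uses the labels `>50K` and `<=50K` and that master's and bachelor's degrees are stored as `Masters` and `Bachelors`. Only `Doctorate` is confirmed by the filter example above. The helper names and sample rows are hypothetical.

```python
def binarize_income(label):
    """Map an income label to 1 (more than USD 50,000) or 0 (otherwise).

    Assumes the raw labels are '>50K' and '<=50K'.
    """
    return 1 if label.strip() == ">50K" else 0


def share_above_50k(rows, education):
    """Return the share of rows with the given education whose income flag is 1."""
    flags = [
        binarize_income(row["income"])
        for row in rows
        if row["education"] == education  # same idea as education='Doctorate'
    ]
    if not flags:
        return None  # no rows matched the filter
    return sum(flags) / len(flags)


# Hypothetical rows with made-up values, for illustration only.
rows = [
    {"education": "Doctorate", "income": ">50K"},
    {"education": "Doctorate", "income": ">50K"},
    {"education": "Doctorate", "income": "<=50K"},
    {"education": "Doctorate", "income": ">50K"},
    {"education": "Masters", "income": "<=50K"},
    {"education": "Masters", "income": ">50K"},
]

for degree in ("Doctorate", "Masters", "Bachelors"):
    print(degree, share_above_50k(rows, degree))
```

Because the income flag is 0 or 1, the mean of the flags within a group equals the proportion of that group that earns more than USD 50,000.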

  4. Run the experiment and view the results.
    1. In the upper-left corner of the canvas, click Run.
    2. After the experiment is run, right-click Percentile-1 on the canvas and select View Analytics Report.
    3. In the Percentile dialog box, click the line chart icon in the upper-right corner to view the line chart of income distribution for persons with a doctor's degree.
      Figure: Income proportion of persons with a doctor's degree
      As shown in the preceding figure, about 25% of persons with a doctor's degree earn an annual income of less than or equal to USD 50,000. These persons are represented by the points with a value of 0 in the line chart.
      Note: You can drag the slider below the line chart to view the entire income distribution for the persons with a doctor's degree.
    4. Repeat the preceding steps to view the income distributions of persons with a master's degree or a bachelor's degree. The following table shows the aggregate results.

      | Academic degree | Proportion of persons with an annual income of more than USD 50,000 |
      | --- | --- |
      | Doctor's degree | 75% |
      | Master's degree | 56% |
      | Bachelor's degree | 42% |
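
The way a proportion is read from a percentile line chart of a 0/1 field can be sketched as follows. This is a minimal illustration that assumes a nearest-rank percentile definition, which may differ from the method that the Percentile component uses. The function names and the sample flags are hypothetical, and the sample is constructed only to mirror the 75% figure for a doctor's degree.

```python
def nearest_rank_percentile(sorted_values, p):
    """Return the p-th percentile (1 <= p <= 100) using the nearest-rank method."""
    n = len(sorted_values)
    rank = max(1, (p * n + 99) // 100)  # integer ceiling of p * n / 100
    return sorted_values[rank - 1]


def share_of_zeros_from_curve(flags):
    """Return the highest whole percentile, as a fraction, at which the curve is 0."""
    ordered = sorted(flags)
    last_zero = 0
    for p in range(1, 101):
        if nearest_rank_percentile(ordered, p) == 0:
            last_zero = p
    return last_zero / 100


# Hypothetical flags: 25 zeros and 75 ones, for illustration only.
flags = [0] * 25 + [1] * 75

share_low = share_of_zeros_from_curve(flags)
print("share at or below USD 50,000:", share_low)
print("share above USD 50,000:", 1 - share_low)
```

The point at which the curve switches from 0 to 1 marks the share of persons with an income flag of 0. The proportions in the preceding table are the complement of that share.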