This topic describes how to use population census data to build a statistical model. You can use the model to analyze the impact of academic degrees on income, based on attributes such as age, job type, and education level.

Dataset

The experiment described in this topic uses an open source dataset from the UCI Machine Learning Repository. For more information, see Adult Data Set. The dataset contains population census data of a region in the United States, with 32,561 data records in total. The following table describes the fields in the dataset.
Field | Description | Data type
age | The age of the person. | DOUBLE
workclass | The job type of the person. | STRING
fnlwgt | The sampling weight of the record, which is the estimated number of people that the record represents. | STRING
education | The education level of the person. | STRING
education_num | The number of years of education that the person received. | DOUBLE
maritial_status | The marital status of the person. | STRING
occupation | The job of the person. | STRING
relationship | The family relationship of the person. | STRING
race | The race of the person. | STRING
sex | The gender of the person. | STRING
capital_gain | The capital gain of the person. | STRING
capital_loss | The capital loss of the person. | STRING
hours_per_week | The weekly working hours of the person. | DOUBLE
native_country | The country of origin of the person. | STRING
income | The annual income of the person, relative to a threshold of USD 50,000. | STRING
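
To make the schema concrete, the following minimal Python sketch parses rows laid out like the raw comma-separated file into records keyed by the field names in the table. It assumes that the raw file separates values with a comma followed by a space and has no header row. The two sample rows are illustrative and are not taken from the dataset. The experiment itself reads the data from MaxCompute, so this sketch is only for inspecting the data offline.

```python
import csv
import io

# Field names and types as listed in the table (DOUBLE -> float, STRING -> str).
SCHEMA = [
    ("age", float),
    ("workclass", str),
    ("fnlwgt", str),
    ("education", str),
    ("education_num", float),
    ("maritial_status", str),  # spelled as in the field list
    ("occupation", str),
    ("relationship", str),
    ("race", str),
    ("sex", str),
    ("capital_gain", str),
    ("capital_loss", str),
    ("hours_per_week", float),
    ("native_country", str),
    ("income", str),
]

# Illustrative rows only; the layout is an assumption about the raw file.
SAMPLE = """\
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
52, Private, 123456, Doctorate, 16, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, >50K
"""


def parse_rows(text):
    """Parse comma-separated rows into dictionaries typed according to SCHEMA."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    records = []
    for raw in reader:
        if not raw:
            continue  # skip blank lines
        if len(raw) != len(SCHEMA):
            raise ValueError(f"expected {len(SCHEMA)} fields, got {len(raw)}")
        records.append(
            {name: cast(value.strip()) for (name, cast), value in zip(SCHEMA, raw)}
        )
    return records


records = parse_rows(SAMPLE)
```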

Procedure

  1. Go to the Machine Learning Studio console.
    1. Log on to the PAI console.
    2. In the left-side navigation pane, choose Model Training > Studio-Modeling Visualization.
    3. On the PAI Visualization Modeling page, find the project in which you want to create an experiment and click Machine Learning in the Operation column.
  2. Create an experiment.
    1. In the left-side navigation pane, click Home.
    2. In the Templates section, click Create below Population Census.
    3. In the New Experiment dialog box, set the experiment parameters. You can use the default values of the parameters.
      Parameter | Description
      Name | The name of the experiment. Default value: Population Census.
      Project | The project in which you want to create the experiment. You cannot change the value of this parameter.
      Description | The description of the experiment. Default value: Use machine learning algorithms to achieve population census and analyze the correlation between the income and education level.
      Save To | The directory for storing the experiment. Default value: My Experiments.
    4. Click OK.
    5. Optional: Wait about 10 seconds. Then, click Experiments in the left-side navigation pane.
    6. Optional: Click Population Census_XX under My Experiments.
      My Experiments is the directory for storing the experiment that you created and Population Census_XX is the name of the experiment. In the experiment name, _XX is the ID that the system automatically creates for the experiment.
    7. View the components of the experiment on the canvas. The system automatically creates the experiment based on the preset template.
      Area No. | Description
      1 | The Data source-Population statistics component reads the dataset from MaxCompute.
      2 | The Whole Table Statistics-1, Data Pivoting-1, and Histogram (Multiple Columns)-1 components generate statistical results. Then, you can determine whether the data follows a Poisson distribution or a Gaussian distribution and whether the data is continuous or discrete. Machine Learning Studio can visualize data analysis results. After the experiment is run, right-click Histogram (Multiple Columns)-1 on the canvas and select View Analytics Report to view the distribution of the input data.
      3 | The components in this area analyze the impact of academic degrees on income.
      1. Data preprocessing

        The SQL Script-1 component converts the values of the income field to 0 or 1. 0 indicates an annual income of less than or equal to USD 50,000. 1 indicates an annual income of more than USD 50,000.

      2. Filtering and mapping

        The Filtering and Mapping components divide data into three groups based on the following academic degrees: doctor's degree, master's degree, and bachelor's degree. The Filtering and Mapping components support SQL statements. You can set filter criteria as needed. For example, click Filter-PHD on the canvas. In the right-side Fields Setting pane, set the Filter Criteria parameter to education='Doctorate' to keep only the persons with doctor's degrees.

      3. Statistical results

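        The Percentile components calculate the percentiles of the converted income values (0 or 1) for each academic degree. You can read the proportion of persons in each income group from the percentiles.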
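
The three steps in this area can be sketched in plain Python as follows. This is a minimal illustration, not the implementation of the components. It assumes that the raw income values are the strings "<=50K" and ">50K" and that the education values for the other two groups are "Masters" and "Bachelors" (only education='Doctorate' is given in this topic). The nearest-rank percentile definition is also an assumption and may differ from the one that the Percentile component uses. The records are made up for the example.

```python
# Assumed raw income labels; this topic states only the 0/1 encoding.
LABELS = {"<=50K": 0, ">50K": 1}


def income_to_label(income):
    """Step 1: convert an income value to 0 (<= USD 50,000) or 1 (> USD 50,000)."""
    return LABELS[income.strip()]


def filter_by_education(records, education):
    """Step 2: keep only the records with the given education value."""
    return [r for r in records if r["education"] == education]


def percentiles(values):
    """Step 3: nearest-rank percentiles 0..100 of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    result = []
    for p in range(101):
        # Integer form of ceil(p * n / 100), clamped so that percentile 0 is the minimum.
        rank = max(1, (p * n + 99) // 100)
        result.append(ordered[rank - 1])
    return result


# Made-up records for illustration only.
records = [
    {"education": "Doctorate", "income": ">50K"},
    {"education": "Doctorate", "income": ">50K"},
    {"education": "Doctorate", "income": ">50K"},
    {"education": "Doctorate", "income": "<=50K"},
    {"education": "Masters", "income": ">50K"},
    {"education": "Masters", "income": "<=50K"},
    {"education": "Bachelors", "income": "<=50K"},
]

phd_labels = [
    income_to_label(r["income"]) for r in filter_by_education(records, "Doctorate")
]
phd_percentiles = percentiles(phd_labels)
```

Because the converted column contains only 0 and 1, the percentile at which the values switch from 0 to 1 marks the share of the group that earns USD 50,000 or less.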

  3. Run the experiment and view the result.
    1. In the top toolbar of the canvas, click Run.
    2. After the experiment is run, right-click Percentile-1 on the canvas and select View Analytics Report.
    3. In the Percentile dialog box, click the line chart icon in the upper-right corner to view the line chart of income distribution for persons with doctor's degrees.
      As shown in the line chart, about 25% of persons with doctor's degrees earn an annual income of less than or equal to USD 50,000. These persons are represented by the points with the value of 0 in the line chart.
      Note: You can drag the slider below the line chart to view the entire income distribution for the persons with doctor's degrees.
    4. Repeat the preceding steps to view the income distributions of persons with master's degrees and bachelor's degrees. The following table shows the aggregate results.
      Academic degree | Proportion of persons with an annual income of more than USD 50,000
      Doctor's degree | 75%
      Master's degree | 56%
      Bachelor's degree | 42%
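
The proportions in the table are the complement of what the line chart shows: the share of points with the value of 0 is the share of persons who earn USD 50,000 or less. The following Python sketch shows this reading, assuming that the chart plots one 0/1 value for each percentile from 0 to 100. The function name is hypothetical, and the curve is synthetic, shaped only to match the 25% figure reported for doctor's degrees.

```python
def share_above_threshold(percentile_values):
    """Return the percentage of 1s implied by a 0/1 percentile curve.

    percentile_values[p] is the value at percentile p, for p = 0..100.
    """
    if len(percentile_values) != 101:
        raise ValueError("expected one value for each percentile from 0 to 100")
    # Percentiles 1..100 that are still 0 give the share at or below the threshold.
    zeros = sum(1 for value in percentile_values[1:] if value == 0)
    return 100 - zeros


# Synthetic curve: value 0 up to percentile 25, value 1 from percentile 26.
doctorate_curve = [0] * 26 + [1] * 75
doctorate_share = share_above_threshold(doctorate_curve)
```

With this curve, the result is 75, which corresponds to the 75% listed for doctor's degrees.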