By Jessie Angelica, Solution Architect Intern

A Practical Walkthrough with PAI Designer and DataWorks for Creating a Prediction Model

DataWorks is a platform for building big data warehouses. Its features include Data Integration, DataStudio, Data Map, Data Quality, and DataService Studio. DataWorks supports the following compute engines: MaxCompute, E-MapReduce, Hologres, AnalyticDB (ADB) for PostgreSQL, and AnalyticDB for MySQL. DataWorks can synchronize batch or real-time data between different data sources and lets you schedule millions of tasks to streamline data processing.

Machine Learning Designer lets you create a pipeline from a template or build one manually. You only need to drag components onto the canvas, set their parameters based on your needs, and connect the components to form a pipeline. Then, run the pipeline to train and fine-tune the model. After the pipeline runs, you can view the log and schedule the pipeline as a periodic task so that the model it generates is updated automatically. You can view the analysis reports on the visualized dashboard and deploy the model to Elastic Algorithm Service (EAS).

In the healthcare field, Alibaba Cloud can help predict the risk of heart disease using DataWorks and PAI Designer. The data is collected and cleaned with DataWorks, and PAI Designer is used to build a predictive model. We fine-tune the model for accuracy and set it to run automatically with DataWorks. The trained model is then connected to healthcare systems for real-time risk assessments. We also monitor the model and integrate its predictions into patient care plans. The whole system aims to catch potential heart issues early, offering personalized healthcare and improving continuously as more data and feedback arrive.

PART 1 Create Data Source

1). In the Alibaba Cloud console, go to DataWorks and click Create Workspace.
2). Set your Workspace Name and Display Name, and select No for Isolate Development and Production Environment. Then, click Commit.
3). Click Associate Now with MaxCompute.
4). Click Data Source in the navigation pane, click Add Data Source, then choose MaxCompute.
5). Set the resource as shown in the following picture, and select Alibaba Cloud RAM Sub-Account. Click Test Connectivity, then click Complete Create.

[Figure: MaxCompute data source settings]

6). Click Workspace and configure it.

7). Select Data Source in DataStudio Modules. Click Data Source in the navigation pane and click Associate.

8). Go back to the Workspaces page. After a while, the status changes to Normal, which means the workspace was created successfully.

PART 2 Import Data

Click Download to download the data to a local file. It will be used later.
(https://github.com/jessieangelica/heartdisease_data.git)
1). Go to DataStudio and click Create Workflow.
2). Set a name and click Create.

3). Create a table and set the table name to "heart_data".

4). Select DDL mode. Copy the following content into it and click Generate Table Schema.

[Figure: DDL statement for the heart_data table]

5). Set the display name and click Commit to Production Environment.

6). Click Import Data.

7). Select the "heart_data" table and click Next. Then, upload the "clevecp.txt" file you just downloaded and choose By Location. Click Import Data.
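
To make the By Location option more concrete, here is a minimal Python sketch of positional mapping. It assumes that By Location means the n-th field of each line in the file is written to the n-th column of the table. The column names, the comma separator, and the sample line are hypothetical; the real column order comes from the DDL used in step 4 and the real layout from "clevecp.txt".

```python
# Hypothetical column order; the real one is defined by the table's DDL.
COLUMNS = ["age", "sex", "cp", "trestbps", "chol"]


def map_by_location(line, columns=COLUMNS, sep=","):
    """Pair the i-th value of a delimited line with the i-th table column."""
    values = [value.strip() for value in line.split(sep)]
    if len(values) != len(columns):
        # A positional import only makes sense when the counts match.
        raise ValueError(f"expected {len(columns)} values, got {len(values)}")
    return dict(zip(columns, values))


# Made-up sample line, used only to illustrate the mapping.
row = map_by_location("63, male, angina, 145, 233")
print(row)
```

Because the match is purely positional, the field order in the file has to agree with the column order of the table; a header name in the file plays no role.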

PART 3 Implementation Using Designer

1). Go to Machine Learning Platform for AI and click Designer.
2). Click Create Pipeline. Set a name and click OK.
3). After Designer opens, drag the Read Table node from the left onto the canvas on the right. Set Table Name on the Table Selection tab.

4). Drag an SQL Script node onto the canvas and connect it to specify the direction of data flow. Click the node and copy the following content into it. Click Run. After the execution succeeds, you can view the data for each node.

[Figure: SQL script for the SQL Script node]

5). Create a Type Conversion node. Select all fields and click Confirm to convert all field types to the double type. Click Run the current node.
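
As a rough illustration of what this step does to a single record, the Python sketch below casts every value to a float, which is Python's double-precision type. It assumes every field already holds a numeric value or a numeric string at this point in the pipeline; the field names and values are made up.

```python
def to_double(record):
    """Return a copy of the record with every value cast to float."""
    return {name: float(value) for name, value in record.items()}


# Made-up record with mixed string and integer values.
converted = to_double({"age": "63", "chol": "233", "ifhealth": 1})
print(converted)
```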

6). Create a Normalization node. Select all fields and click Confirm. Click Run the current node.
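
For readers who want to see the arithmetic, here is a small Python sketch of min-max scaling, which rescales each column to the range 0 to 1 with (x - min) / (max - min). That the node uses this particular formula is an assumption on our part, and the rows below are made-up values, not rows from the dataset.

```python
def min_max_normalize(rows, columns):
    """Rescale each named column to [0, 1] using (x - min) / (max - min)."""
    normalized = [dict(row) for row in rows]  # copy so the input is untouched
    for col in columns:
        values = [row[col] for row in rows]
        lo, hi = min(values), max(values)
        span = hi - lo
        for row in normalized:
            # A constant column has no spread; map it to 0.0 to avoid
            # dividing by zero.
            row[col] = (row[col] - lo) / span if span else 0.0
    return normalized


# Made-up rows for illustration only.
rows = [
    {"age": 40.0, "chol": 200.0},
    {"age": 50.0, "chol": 250.0},
    {"age": 60.0, "chol": 300.0},
]
scaled = min_max_normalize(rows, ["age", "chol"])
print(scaled)
```

Scaling matters here because fields such as age and cholesterol live on very different ranges, and a model trained by gradient-based methods tends to behave better when the inputs are comparable in size.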

7). Create a Split node to divide the data into a training set and a test set, and set Split Ratio to 0.7. Click Run the current node.
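
The effect of a 0.7 split ratio can be sketched in Python as follows. This is only an illustration under the assumption that the split is random and that the ratio is the share of rows sent to the training set; the function name and the seed parameter are hypothetical.

```python
import random


def split_rows(rows, ratio=0.7, seed=0):
    """Shuffle the rows and return (training rows, test rows)."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder rows stand in for the table.
train, test = split_rows(list(range(10)), ratio=0.7)
print(len(train), len(test))
```

With ten rows, seven go to training and three are held back for testing, so the model is evaluated on rows it has not seen.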

8). Create a Logistic Regression node for binary classification. Then, create a Prediction node to generate predictions with the trained model. For both nodes, click Feature Columns and select the 13 fields other than ifhealth, then click Reserved Output Column and select only ifhealth. Click Run the current node.
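
To show what the two nodes do conceptually, here is a minimal pure-Python sketch of binary logistic regression trained with batch gradient descent, followed by a prediction. It is not the algorithm implementation used by Designer; the function names, learning rate, epoch count, and the two-feature toy data are all assumptions chosen to keep the example short. In the real pipeline there are 13 feature columns and the label is ifhealth.

```python
import math


def sigmoid(z):
    """Map a real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_rows = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y  # predicted probability minus label
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Step against the average gradient.
        weights = [w - lr * g / n_rows for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_rows
    return weights, bias


def predict(weights, bias, x, threshold=0.5):
    """Return (predicted label, probability of label 1) for one row."""
    prob = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return (1 if prob >= threshold else 0), prob


# Toy, made-up data: two normalized features per row, label 0 or 1.
X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7]]
y = [0, 0, 1, 1]

weights, bias = train_logistic_regression(X, y)
label, prob = predict(weights, bias, [0.85, 0.8])
print(label, round(prob, 3))
```

Keeping the true label next to each prediction, which is the role ifhealth plays as the reserved output column, makes it possible to compare predicted and actual values row by row.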
