Machine Learning Designer integrates the Notebook component, which works seamlessly with Data Science Workshop (DSW) instances. You can use the Notebook component to write, debug, and run code in pipelines without disrupting the context or status of the pipelines.
Background information
Notebooks are widely used in open source communities, and many data and artificial intelligence (AI) developers regard them as a powerful tool for writing and debugging code. Machine Learning Designer integrates the Notebook component, which works seamlessly with DSW instances, so that you can write, debug, and run code in pipelines without disrupting their context or status.
The Notebook component of Machine Learning Designer provides the following advantages over DSW:
Improves development efficiency: You can directly open the Notebook editor in a pipeline and develop and debug code on a containerized instance.
Preserves context: The Notebook component can automatically load the output data and the status of upstream components in a pipeline when an instance is started. This way, you can use the data for further analysis and development and pass the results to downstream components.
Debugs the overall pipeline: The Notebook component is used as a part of a pipeline. You can seamlessly switch between different components to optimize the entire data processing and model training process.
Scenarios
Development
You can start a DSW instance in the Notebook component and modify a Notebook file on the DSW instance to develop and debug code. You can also obtain the configurations of custom parameters from Machine Learning Designer.
If you use a pay-as-you-go resource group, you are charged based on the running duration of a DSW instance. For more information, see Billing of DSW.
Running
If you run the Notebook component or a pipeline in the canvas of Machine Learning Designer, or if you use DataWorks to schedule a pipeline on a regular basis, the system starts a Deep Learning Containers (DLC) job that converts the Notebook file by using Jupyter nbconvert and then runs it.
If you use a pay-as-you-go resource group, you are charged based on the running duration of a DLC job. For more information, see Billing of DLC.
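To give a sense of the mechanism, the following local sketch performs a comparable conversion and execution by invoking the jupyter nbconvert tool from Python. This is only an illustration: the file name main.ipynb is taken from the default Notebook file path described later in this topic, and the exact command that the DLC job runs is not documented here.

import subprocess

# Execute the notebook in place, similar in spirit to how the DLC job
# runs the converted Notebook file. The file name is illustrative.
subprocess.run(
    ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", "main.ipynb"],
    check=True,
)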
Component description
The Notebook component has four input ports and four output ports. All input ports can be used to receive Object Storage Service (OSS) data or MaxCompute table data. Among the output ports, ports 1 and 2 are used for passing data to OSS, and ports 3 and 4 are used for passing data to MaxCompute tables.
Before you start a DSW instance by using the Notebook component, you must install the pai-notebook-utils package to obtain information about the input ports, output ports, and custom parameters of the Notebook node.
Install and use the pai-notebook-utils package
Install pai-notebook-utils
pip install --upgrade https://pai-sdk.oss-cn-shanghai.aliyuncs.com/pai_notebook_utils/dist/pai_notebook_utils-0.0.1-py2.py3-none-any.whl

Use pai-notebook-utils
The pai-notebook-utils package provides the get_inputs(), get_outputs(), and get_custom_params() functions. The get_inputs() function is used to obtain the configurations of input ports, the get_outputs() function is used to obtain the configurations of output ports, and the get_custom_params() function is used to obtain the configurations of custom parameters.

from pai_notebook.utils.notebook_utils import NotebookUtils

node_inputs = NotebookUtils.get_inputs()
node_outputs = NotebookUtils.get_outputs()
custom_params = NotebookUtils.get_custom_params()
Input ports
The Notebook component has four input ports that can be used to receive OSS data or MaxCompute table data. You can use the get_inputs() function provided by the pai-notebook-utils package to obtain the information of all input ports. The return value is an array, and each item in the array contains fields described in the following table.
| Field | Description |
| --- | --- |
| name | The name of the input port. Valid values: input1, input2, input3, and input4. |
| type | The type of the port, such as DataSet or Model. |
| location_type | The storage type. Valid values: MaxComputeTable and OSS. |
| value | The configuration information of the port, stored in the MAP format. |
If the input is a MaxCompute table, the value of the location_type field is MaxComputeTable, and the value field consists of the project, table, and endpoint fields, which specify the MaxCompute project to which the table belongs, the name of the MaxCompute table, and the endpoint of MaxCompute.

from pai_notebook.utils.notebook_utils import NotebookUtils

node_inputs = NotebookUtils.get_inputs()
for node_input in node_inputs:
    if node_input.location_type == "MaxComputeTable":
        input_name = node_input.name
        table_name = node_input.value["table"]
        project = node_input.value["project"]
        endpoint = node_input.value["endpoint"]
        print(f"input_name: {input_name}, project: {project}, table_name: {table_name}, endpoint: {endpoint}")
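With the project, table name, and endpoint in hand, you can read the input table, for example, by using the PyODPS library. The following is a minimal sketch that continues from the preceding code. It assumes that the odps package is installed and that credentials are available; the environment variable names are illustrative.

import os
from odps import ODPS

# Credentials and variable names are illustrative; use whatever mechanism
# your environment provides.
o = ODPS(
    os.environ["ALIBABA_CLOUD_ACCESS_KEY_ID"],
    os.environ["ALIBABA_CLOUD_ACCESS_KEY_SECRET"],
    project=project,    # obtained from the input port above
    endpoint=endpoint,  # obtained from the input port above
)
# Print the first few records of the input table.
for record in o.read_table(table_name, limit=5):
    print(record)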
If the input is OSS, the OSS path is mounted to a local path, and you can access the OSS object by using the local path or the OSS SDK. In this case, the value of the location_type field is OSS, and the value field consists of the key, bucket, endpoint, and mountPath fields, which specify the OSS object key, the OSS bucket, the endpoint of OSS, and the local mount path.

Important: If the OSS path is changed and the new path is not mounted to the DSW instance, the mountPath field cannot be obtained. In this case, you must restart the DSW instance.
from pai_notebook.utils.notebook_utils import NotebookUtils

node_inputs = NotebookUtils.get_inputs()
for node_input in node_inputs:
    if node_input.location_type == "OSS":
        input_name = node_input.name
        key = node_input.value["key"]
        bucket = node_input.value["bucket"]
        endpoint = node_input.value["endpoint"]
        mount_path = node_input.value["mountPath"]
        print(f"input_name: {input_name}, bucket: {bucket}, key: {key}, endpoint: {endpoint}, mount path: {mount_path}")
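After you obtain mount_path, you can work with the input data by using standard file operations. The following sketch assumes that the mounted path contains a CSV file named data.csv and that pandas is installed; both the file name and the use of pandas are illustrative.

import os

import pandas as pd
from pai_notebook.utils.notebook_utils import NotebookUtils

for node_input in NotebookUtils.get_inputs():
    if node_input.location_type == "OSS":
        # The OSS object is accessible through the local mount path.
        csv_path = os.path.join(node_input.value["mountPath"], "data.csv")
        if os.path.exists(csv_path):
            df = pd.read_csv(csv_path)
            print(df.head())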
Output ports
The Notebook component has four output ports. Ports 1 and 2 are used to pass data to OSS, and ports 3 and 4 are used to pass data to a MaxCompute table. You can use the get_outputs() function provided by the pai-notebook-utils package to obtain the information of all output ports. The return value is an array, and each item in the array contains the fields described in the following table.
| Field | Description |
| --- | --- |
| name | The name of the output port. Valid values: output1, output2, output3, and output4. |
| location_type | The storage type. Valid values: MaxComputeTable and OSS. |
| value | The configuration information of the port, stored in the MAP format. |
If the output is a MaxCompute table, the current workspace must be associated with a MaxCompute compute engine. If the output is OSS, data is passed to the specified OSS path. If the Job Output Path parameter is not configured, an output path is automatically generated based on the global default path. The format of the fields in the value field for an output port is the same as that for an input port.
import json

from pai_notebook.utils.notebook_utils import NotebookUtils

node_outputs = NotebookUtils.get_outputs()
for node_output in node_outputs:
    output_name = node_output.name
    if node_output.location_type == "MaxComputeTable":
        table_name = node_output.value["table"]
        project = node_output.value["project"]
        endpoint = node_output.value["endpoint"]
        print(f"output_name: {output_name}, project: {project}, table_name: {table_name}, endpoint: {endpoint}")
    elif node_output.location_type == "OSS":
        key = node_output.value["key"]
        bucket = node_output.value["bucket"]
        endpoint = node_output.value["endpoint"]
        mount_path = node_output.value["mountPath"]
        print(f"output_name: {output_name}, bucket: {bucket}, key: {key}, endpoint: {endpoint}, mount path: {mount_path}")
    else:
        print(json.dumps(node_output.value, indent=4))
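Because OSS output ports expose a mountPath field, one way to pass data to a downstream component is to write a file under the port's local mount path. The following is a minimal sketch; the file name results.csv and the DataFrame contents are illustrative, and it assumes that pandas is installed.

import os

import pandas as pd
from pai_notebook.utils.notebook_utils import NotebookUtils

# Find the first OSS-backed output port (output1 or output2).
oss_outputs = [o for o in NotebookUtils.get_outputs() if o.location_type == "OSS"]
if oss_outputs:
    mount_path = oss_outputs[0].value["mountPath"]
    # Illustrative result data; replace with your actual output.
    df = pd.DataFrame({"id": [1, 2], "score": [0.9, 0.8]})
    df.to_csv(os.path.join(mount_path, "results.csv"), index=False)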
Custom parameters

You can configure custom parameters in the ${globalParamName} format to reference global variables that are configured in a pipeline. You can obtain the custom parameters configured in the Notebook component by using the get_custom_params() function. The return value is in the MAP format, and the keys and values are strings.
from pai_notebook.utils.notebook_utils import NotebookUtils
custom_params = NotebookUtils.get_custom_params()
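# Note: the keys and values are strings, so cast numeric parameters before
# you use them. The parameter name "learning_rate" is a hypothetical example:
# learning_rate = float(custom_params.get("learning_rate", "0.001"))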
print(custom_params)

Notebook configuration
On the details page of a pipeline in Machine Learning Designer, add the Notebook component to the pipeline. Then, configure the parameters described in the following table on the right side of the page.
| Category | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Notebook Config | DSW Instance | No | The status and start operation of the DSW instance that is associated with the Notebook node. Before you start a DSW instance, make sure that you configured the Pipeline Data Path parameter to specify a pipeline data path when you created the pipeline. | Not Running |
| Notebook Config | Notebook File | Yes | The Notebook file that is automatically generated after you configure the Pipeline Data Path parameter. | Pipeline data path/notebook/${pipeline ID}/${node ID}/main.ipynb |
| Notebook Config | Job Output Path | No | The path to which the output of the job is written. If you leave this parameter empty, the pipeline data path is used. | Pipeline data path |
| Notebook Config | Custom Parameters | No | The parameters that are referenced by the Notebook file. The parameters are in the key-value pair format and can be shared by a pipeline and a DSW instance, which facilitates parameter modification. You can reference global variables in a pipeline of Machine Learning Designer. | None |
| Notebook Config | Init Command | No | The command that is used to initialize the runtime environment before the Notebook file is executed. For example, you can run a pip install command to install the packages that the Notebook file requires. | None |
| Notebook Config | Automatic shutdown time | No | The period of time after which the DSW instance is automatically shut down. This prevents the instance from continuing to run if you forget to shut it down after debugging is complete. | 1 hour |
| Run Config | Select Resource Group: Public Resource Group | No | You must configure the Node Type (CPU or GPU), VPC Settings, Security Group, vSwitch, Internet Access Gateway, and Node Image parameters. | Default ECS instance type: ecs.c6.large |
| Run Config | Select Resource Group: Dedicated Resource Group | No | You must configure the number of vCPUs, the memory size, the shared memory size, the number of GPUs, and the number of instances that you want to use. | None |
| Run Config | Node Image | No | You can select an official image or a custom image, or enter the address of a public image. | Official |