Machine Learning Designer integrates the Notebook component, which works seamlessly with Data Science Workshop (DSW) instances. You can use the Notebook component to write, debug, and run code in pipelines without disrupting the context or status of the pipelines.
Background information
Notebooks are widely used in open source communities, and many data and artificial intelligence (AI) developers regard them as a powerful tool for writing and debugging code. Machine Learning Designer integrates the Notebook component, which works seamlessly with DSW instances, so that you can write, debug, and run code in pipelines without disrupting their context or status.
The Notebook component of Machine Learning Designer provides the following advantages over DSW:
Improves development efficiency: You can directly open the Notebook editor in a pipeline and develop and debug code on a containerized instance.
Preserves context: The Notebook component can automatically load the output data and the status of upstream components in a pipeline when an instance is started. This way, you can use the data for further analysis and development and pass the results to downstream components.
Debugs the overall pipeline: The Notebook component is used as a part of a pipeline. You can seamlessly switch between different components to optimize the entire data processing and model training process.
Scenarios
Development
You can start a DSW instance in the Notebook component and modify a Notebook file on the DSW instance to develop and debug code. You can also obtain the configurations of custom parameters from Machine Learning Designer.
If you use a pay-as-you-go resource group, you are charged based on the running duration of a DSW instance. For more information, see Billing of DSW.
Running
If you run the Notebook component or a pipeline in the canvas of Machine Learning Designer, or if you use DataWorks to schedule a pipeline on a regular basis, the system starts a Deep Learning Containers (DLC) job that converts the Notebook file by using Jupyter nbconvert and then runs it.
If you use a pay-as-you-go resource group, you are charged based on the running duration of a DLC job. For more information, see Billing of DLC.
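To give a sense of the mechanism, the following local sketch performs a comparable conversion and execution by invoking the jupyter nbconvert tool from Python. This is only an illustration: the file name main.ipynb is taken from the default Notebook file path described later in this topic, and the exact command that the DLC job runs is not documented here.

import subprocess

# Execute the notebook in place, similar in spirit to how the DLC job
# runs the converted Notebook file. The file name is illustrative.
subprocess.run(
    ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", "main.ipynb"],
    check=True,
)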
Component description
The Notebook component has four input ports and four output ports. All input ports can be used to receive Object Storage Service (OSS) data or MaxCompute table data. Among the output ports, ports 1 and 2 are used for passing data to OSS, and ports 3 and 4 are used for passing data to MaxCompute tables.
Before you start a DSW instance by using the Notebook component, you must install the pai-notebook-utils package to obtain information about the input ports, output ports, and custom parameters of the Notebook node.
Install and use the pai-notebook-utils package
Install pai-notebook-utils
pip install --upgrade https://pai-sdk.oss-cn-shanghai.aliyuncs.com/pai_notebook_utils/dist/pai_notebook_utils-0.0.1-py2.py3-none-any.whl

Use pai-notebook-utils
The pai-notebook-utils package provides the get_inputs(), get_outputs(), and get_custom_params() functions. The get_inputs() function is used to obtain the configurations of input ports, the get_outputs() function is used to obtain the configurations of output ports, and the get_custom_params() function is used to obtain the configurations of custom parameters.

from pai_notebook.utils.notebook_utils import NotebookUtils

node_inputs = NotebookUtils.get_inputs()
node_outputs = NotebookUtils.get_outputs()
custom_params = NotebookUtils.get_custom_params()
Input ports
The Notebook component has four input ports that can be used to receive OSS data or MaxCompute table data. You can use the get_inputs() function provided by the pai-notebook-utils package to obtain the information of all input ports. The return value is an array, and each item in the array contains fields described in the following table.
| Field | Description |
| --- | --- |
| name | The name of the input port. Valid values: input1, input2, input3, and input4. |
| type | The type of the port, such as DataSet or Model. |
| location_type | The storage type. Valid values: MaxComputeTable and OSS. |
| value | The configuration information of the port, stored in the MAP format. |
If the input is a MaxCompute table, the value of the location_type field is MaxComputeTable, and the value field consists of the project, table, and endpoint fields, which specify the MaxCompute project to which the table belongs, the name of the MaxCompute table, and the endpoint of MaxCompute.

from pai_notebook.utils.notebook_utils import NotebookUtils

node_inputs = NotebookUtils.get_inputs()
for node_input in node_inputs:
    if node_input.location_type == "MaxComputeTable":
        input_name = node_input.name
        table_name = node_input.value["table"]
        project = node_input.value["project"]
        endpoint = node_input.value["endpoint"]
        print(f"input_name: {input_name}, project: {project}, table_name: {table_name}, endpoint: {endpoint}")
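With the project, table name, and endpoint in hand, you can read the input table, for example, by using the PyODPS library. The following is a minimal sketch that continues from the preceding code. It assumes that the odps package is installed and that credentials are available; the environment variable names are illustrative.

import os
from odps import ODPS

# Credentials and variable names are illustrative; use whatever mechanism
# your environment provides.
o = ODPS(
    os.environ["ALIBABA_CLOUD_ACCESS_KEY_ID"],
    os.environ["ALIBABA_CLOUD_ACCESS_KEY_SECRET"],
    project=project,    # obtained from the input port above
    endpoint=endpoint,  # obtained from the input port above
)
# Print the first few records of the input table.
for record in o.read_table(table_name, limit=5):
    print(record)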
If the input is OSS, the OSS path is mounted to a local path, and you can access the OSS object by using the local path or the OSS SDK. In this case, the value of the location_type field is OSS, and the value field consists of the key, bucket, endpoint, and mountPath fields, which specify the OSS object key, the OSS bucket, the endpoint of OSS, and the local mount path.

Important: If the OSS path is changed and the new path is not mounted to the DSW instance, the mountPath field cannot be obtained. In this case, you must restart the DSW instance.
from pai_notebook.utils.notebook_utils import NotebookUtils

node_inputs = NotebookUtils.get_inputs()
for node_input in node_inputs:
    if node_input.location_type == "OSS":
        input_name = node_input.name
        key = node_input.value["key"]
        bucket = node_input.value["bucket"]
        endpoint = node_input.value["endpoint"]
        mount_path = node_input.value["mountPath"]
        print(f"input_name: {input_name}, bucket: {bucket}, key: {key}, endpoint: {endpoint}, mount path: {mount_path}")
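After you obtain mount_path, you can work with the input data by using standard file operations. The following sketch assumes that the mounted path contains a CSV file named data.csv and that pandas is installed; both the file name and the use of pandas are illustrative.

import os

import pandas as pd
from pai_notebook.utils.notebook_utils import NotebookUtils

for node_input in NotebookUtils.get_inputs():
    if node_input.location_type == "OSS":
        # The OSS object is accessible through the local mount path.
        csv_path = os.path.join(node_input.value["mountPath"], "data.csv")
        if os.path.exists(csv_path):
            df = pd.read_csv(csv_path)
            print(df.head())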
Output ports
The Notebook component has four output ports. Ports 1 and 2 are used to pass data to OSS, and ports 3 and 4 are used to pass data to a MaxCompute table. You can use the get_outputs() function provided by the pai-notebook-utils package to obtain the information of all output ports. The return value is an array, and each item in the array contains the fields described in the following table.
| Field | Description |
| --- | --- |
| name | The name of the output port. Valid values: output1, output2, output3, and output4. |
| location_type | The storage type. Valid values: MaxComputeTable and OSS. |
| value | The configuration information of the port, stored in the MAP format. |
If the output is a MaxCompute table, the current workspace must be associated with a MaxCompute compute engine. If the output is OSS, data is passed to the specified OSS path. If the Job Output Path parameter is not configured, an output path is automatically generated based on the global default path. The format of the fields in the value field for an output port is the same as that for an input port.
import json

from pai_notebook.utils.notebook_utils import NotebookUtils

node_outputs = NotebookUtils.get_outputs()
for node_output in node_outputs:
    output_name = node_output.name
    if node_output.location_type == "MaxComputeTable":
        table_name = node_output.value["table"]
        project = node_output.value["project"]
        endpoint = node_output.value["endpoint"]
        print(f"output_name: {output_name}, project: {project}, table_name: {table_name}, endpoint: {endpoint}")
    elif node_output.location_type == "OSS":
        key = node_output.value["key"]
        bucket = node_output.value["bucket"]
        endpoint = node_output.value["endpoint"]
        mount_path = node_output.value["mountPath"]
        print(f"output_name: {output_name}, bucket: {bucket}, key: {key}, endpoint: {endpoint}, mount path: {mount_path}")
    else:
        print(json.dumps(node_output.value, indent=4))
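Because OSS output ports expose a mountPath field, one way to pass data to a downstream component is to write a file under the port's local mount path. The following is a minimal sketch; the file name results.csv and the DataFrame contents are illustrative, and it assumes that pandas is installed.

import os

import pandas as pd
from pai_notebook.utils.notebook_utils import NotebookUtils

# Find the first OSS-backed output port (output1 or output2).
oss_outputs = [o for o in NotebookUtils.get_outputs() if o.location_type == "OSS"]
if oss_outputs:
    mount_path = oss_outputs[0].value["mountPath"]
    # Illustrative result data; replace with your actual output.
    df = pd.DataFrame({"id": [1, 2], "score": [0.9, 0.8]})
    df.to_csv(os.path.join(mount_path, "results.csv"), index=False)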
Custom parameters

You can configure custom parameters in the ${globalParamName} format to reference global variables that are configured in a pipeline. You can obtain the custom parameters configured in the Notebook component by using the get_custom_params() function. The return value is in the MAP format, and the keys and values are strings.
from pai_notebook.utils.notebook_utils import NotebookUtils
custom_params = NotebookUtils.get_custom_params()
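# Note: the keys and values are strings, so cast numeric parameters before
# you use them. The parameter name "learning_rate" is a hypothetical example:
# learning_rate = float(custom_params.get("learning_rate", "0.001"))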
print(custom_params)

Notebook configuration
On the details page of a pipeline in Machine Learning Designer, add the Notebook component to the pipeline. Then, configure the parameters described in the following table on the right side of the page.
| Category | Parameter | Required | Description | Default value |
| --- | --- | --- | --- | --- |
| Notebook Config | DSW Instance | No | The status and start operation of the DSW instance that is associated with the Notebook node. Before you start a DSW instance, make sure that you configured the Pipeline Data Path parameter to specify a pipeline data path when you created the pipeline. | Not Running |
| Notebook Config | Notebook File | Yes | The Notebook file that is automatically generated after you configure the Pipeline Data Path parameter. | Pipeline data path/notebook/${pipeline ID}/${node ID}/main.ipynb |
| Notebook Config | Job Output Path | No | The path to which the output of the job is written. If you leave this parameter empty, the pipeline data path is used. | Pipeline data path |
| Notebook Config | Custom Parameters | No | The parameters that are referenced by the Notebook file. The parameters are in the key-value pair format and can be shared by a pipeline and a DSW instance, which facilitates parameter modification. You can reference global variables in a pipeline of Machine Learning Designer. | None |
| Notebook Config | Init Command | No | The command that is used to initialize the runtime environment before the Notebook file is executed. For example, you can run a pip install command to install the packages that the Notebook file requires. | None |
| Notebook Config | Automatic shutdown time | No | The period of time after which the DSW instance is automatically shut down. This prevents the instance from continuing to run if you forget to shut it down after debugging is complete. | 1 hour |
| Run Config | Select Resource Group: Public Resource Group | No | You must configure the Node Type (CPU or GPU), VPC Settings, Security Group, vSwitch, Internet Access Gateway, and Node Image parameters. | Default ECS instance type: ecs.c6.large |
| Run Config | Select Resource Group: Dedicated Resource Group | No | You must configure the number of vCPUs, the memory size, the shared memory size, the number of GPUs, and the number of instances that you want to use. | None |
| Run Config | Node Image | No | You can select an official image or a custom image, or enter the address of a public image. | Official |