Platform for AI: Parameters of PAI-TensorFlow tasks

Last Updated: Feb 29, 2024

Platform for AI (PAI) provides the PAI-TensorFlow deep learning computing framework to support training based on multiple models. This topic describes the command parameters and I/O parameters that are used to run PAI-TensorFlow tasks.

Warning

GPU-accelerated servers will be phased out. You can submit TensorFlow tasks that run on CPU servers. If you want to use GPU-accelerated instances for model training, go to Deep Learning Containers (DLC) to submit jobs. For more information, see Submit training jobs.

Commands and parameters

To initiate a PAI-TensorFlow task, you can run PAI commands on the MaxCompute client, on an SQL node in the DataWorks console, or on the Machine Learning Designer page in the PAI console. You can also use the TensorFlow components provided by Machine Learning Designer. This section describes the PAI commands and their parameters.

# Set the parameters to actual values. 
pai -name tensorflow1120_ext
    -project algo_public
    -Dscript='oss://<bucket_name>.<oss_host>.aliyuncs.com/*.tar.gz'
    -DentryFile='entry_file.py'
    -Dbuckets='oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>'
    -Dtables='odps://prj_name/tables/table_name'
    -Doutputs='odps://prj_name/tables/table_name'
    -DcheckpointDir='oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>'
    -Dcluster="{\"ps\":{\"count\":1},\"worker\":{\"count\":2,\"gpu\":100}}"
    -Darn="acs:ram::******:role/aliyunodpspaidefaultrole"
    -DossHost="oss-cn-beijing-internal.aliyuncs.com"
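
The values of parameters such as tables, outputs, buckets, and checkpointDir are passed to your entry script as command-line flags. The following minimal entry_file.py sketch shows one way to receive them by using the TensorFlow 1.x flags API. The flag names mirror the -D parameters described below, but they are an assumption; verify them against your PAI-TensorFlow version.

# entry_file.py: a minimal sketch of a PAI-TensorFlow entry script.
# Assumption: PAI passes the -Dtables, -Doutputs, -Dbuckets, and
# -DcheckpointDir values to the script as same-named flags.
import tensorflow as tf

tf.app.flags.DEFINE_string("tables", "", "Input MaxCompute table paths.")
tf.app.flags.DEFINE_string("outputs", "", "Output MaxCompute table paths.")
tf.app.flags.DEFINE_string("buckets", "", "Input OSS bucket paths.")
tf.app.flags.DEFINE_string("checkpointDir", "", "OSS checkpoint directory.")
FLAGS = tf.app.flags.FLAGS

def main(_):
    print("Reading from:", FLAGS.tables)
    print("Writing to:", FLAGS.outputs)
    # ... build and train the model here ...

if __name__ == "__main__":
    tf.app.run()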

The following list describes each parameter, its example value, its default value, and whether it is required.

script

Description: The script of the TensorFlow algorithm that is used to run the PAI-TensorFlow task. You can specify the script in the file:///path/to/file format, which is an absolute path, or in the project_name/resources/resource_name format. The script is a TensorFlow model file written in Python and can be one of the following types:

  • An on-premises file.

  • An on-premises TAR package that is compressed by using gzip and has the .tar.gz file name extension.

  • A Python file.

If the file is stored in Object Storage Service (OSS), specify it in the oss://<bucket_name>.<oss_host>.aliyuncs.com/*.tar.gz or oss://<bucket_name>.<oss_host>.aliyuncs.com/*.py format.

Example: oss://demo-yuze.oss-cn-beijing-internal.aliyuncs.com/deepfm/deepfm.tar.gz
Default value: N/A
Required: Yes

entryFile

Description: The entry script. This parameter is required if the script that you specify for the script parameter is a TAR package. If the script is a single file, you do not need to set this parameter.
Example: main.py
Default value: N/A
Required: Yes, if the script parameter specifies a TAR package.

buckets

Description: The input OSS buckets. Separate multiple buckets with commas (,). Each bucket name must end with a forward slash (/).
Example: oss://<bucket_name>.<oss_host>.aliyuncs.com/
Default value: N/A
Required: No

tables

Description: The input table. Separate multiple tables with commas (,).
Example: odps://<prj_name>/tables/<table_name>
Default value: N/A
Required: No

outputs

Description: The output table. Separate multiple tables with commas (,).
Example: odps://<prj_name>/tables/<table_name>
Default value: N/A
Required: No

gpuRequired

Description: Specifies the number of GPUs that the servers running the training script specified by the script parameter require. A value of 100 indicates one GPU, and a value of 200 indicates two GPUs. Set this parameter to 0 if you do not need GPUs. This parameter takes effect only for standalone training. For information about multi-server training, see the cluster parameter. This feature is available only for TensorFlow1120.
Example: 100
Default value: 100
Required: No

checkpointDir

Description: The TensorFlow checkpoint directory.
Example: oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>
Default value: N/A
Required: No

cluster

Description: The information about the distributed servers on which you want to run the PAI-TensorFlow task. For more information, see the next table in this topic.
Example: {\"ps\":{\"count\":1},\"worker\":{\"count\":2,\"gpu\":100}}
Default value: N/A
Required: No

enableDynamicCluster

Description: Specifies whether to enable the failover feature for single worker nodes. If you set this parameter to true, a worker node restarts after a failure occurs on the node, which prevents the PAI-TensorFlow task from failing because of a single worker node failure. Valid values: true and false.
Default value: false
Required: No

jobName

Description: The name of the experiment. Specify a descriptive name instead of a generic value such as test, so that you can search the historical data of the experiment to analyze its performance.
Example: jk_wdl_online_job
Default value: N/A
Required: Yes

maxHungTimeBeforeGCInSeconds

Description: The maximum duration, in seconds, for which a GPU can remain suspended before it is automatically reclaimed. A value of 0 disables the automatic reclamation feature. This parameter is newly added.
Example: 3600
Default value: 3600
Required: No

You can run a PAI-TensorFlow task in distributed mode by using the cluster parameter to specify the number of parameter servers (PSs) and workers. The value of the cluster parameter must be in the JSON format, and quotation marks in the value must be escaped. Example:

{
  "ps": {
    "count": 2
  },
  "worker": {
    "count": 4
  }
}
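
If you want the task to run on CPU clusters only, set the gpu key under worker to 0, as the warning at the beginning of this topic recommends. The following variant is a sketch that also requests six CPU cores for each node by using the cpu key; the keys that are nested under ps and worker are described below.

{
  "ps": {
    "count": 2,
    "cpu": 600
  },
  "worker": {
    "count": 4,
    "gpu": 0,
    "cpu": 600
  }
}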

In both examples, the JSON value consists of two keys: ps and worker. The following list describes the parameters that are nested under each key, their default values, and whether they are required.

count

Description: The number of PSs or workers.
Default value: N/A
Required: Yes

gpu

Description: The number of GPUs for PSs or workers. A value of 100 indicates one GPU. If you set the gpu parameter under worker to 0, CPU clusters are scheduled for the PAI-TensorFlow task and no GPU resources are consumed.
Default value: 0 for ps and 100 for worker.
Required: No

cpu

Description: The number of CPU cores for PSs or workers. A value of 100 indicates one CPU core.
Default value: 600
Required: No

memory

Description: The memory size for PSs or workers, in MB. A value of 100 indicates 100 MB of memory.
Default value: 30000
Required: No
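
Inside the entry script, each distributed process typically discovers its role and its peers from flags that the runtime passes in. The following sketch uses the standard TensorFlow 1.x distributed API and assumes the conventional --job_name, --task_index, --ps_hosts, and --worker_hosts flags; the exact flag names that PAI-TensorFlow passes may differ by version.

# A minimal TensorFlow 1.x distributed skeleton for a ps/worker cluster.
# Assumption: the runtime provides --job_name, --task_index, --ps_hosts,
# and --worker_hosts; these flag names are illustrative.
import tensorflow as tf

tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'.")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of this task in its job.")
tf.app.flags.DEFINE_string("ps_hosts", "", "Comma-separated PS host:port pairs.")
tf.app.flags.DEFINE_string("worker_hosts", "", "Comma-separated worker host:port pairs.")
FLAGS = tf.app.flags.FLAGS

def main(_):
    cluster = tf.train.ClusterSpec({
        "ps": FLAGS.ps_hosts.split(","),
        "worker": FLAGS.worker_hosts.split(","),
    })
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)
    if FLAGS.job_name == "ps":
        server.join()  # PS processes only host variables and wait.
        return
    # Workers place variables on PSs and computation on themselves.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        pass  # ... build the model and run the training loop ...

if __name__ == "__main__":
    tf.app.run()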

I/O parameters

The following list describes the I/O parameters that are used to run PAI-TensorFlow tasks.

tables

The path of the table from which you want to read data.

outputs

The path of the table to which you want to write data. Separate multiple paths with commas (,).

  • For a non-partitioned table, specify the path in the odps://<prj_name>/tables/<table_name> format.

  • For a partitioned table, specify the path in the odps://<prj_name>/tables/<table_name>/<pt_key1=v1> format.

  • For a multi-level partitioned table, specify the path in the odps://<prj_name>/tables/<table_name>/<pt_key1=v1>/<pt_key2=v2> format.

buckets

The OSS bucket that stores the objects to be read by the algorithm.

I/O operations on MaxCompute data are different from those on OSS objects. To read OSS objects, you must configure the role_arn and host parameters.

To obtain the value of the role_arn parameter, perform the following operations: Log on to the PAI console and go to the Dependent Services page. In the Designer section, find OSS and click View authorization in the Actions column. For more information, see Grant the permissions that are required to use Machine Learning Designer.

checkpointDir

The OSS path to which TensorFlow checkpoint data is written.
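
PAI-TensorFlow extends TensorFlow 1.x with operators for reading from and writing to MaxCompute tables. The following sketch assumes the tf.TableRecordReader interface that appears in PAI-TensorFlow examples; the exact API depends on your PAI-TensorFlow version, so treat the sketch as illustrative rather than definitive.

# A sketch that reads rows from the table passed through -Dtables.
# Assumptions: tf.TableRecordReader is available (PAI-specific), and the
# table has two columns, a float feature and an integer label.
import tensorflow as tf

tf.app.flags.DEFINE_string("tables", "", "Input MaxCompute table paths.")
FLAGS = tf.app.flags.FLAGS

def main(_):
    # FLAGS.tables can hold several comma-separated table paths.
    queue = tf.train.string_input_producer(FLAGS.tables.split(","), num_epochs=1)
    reader = tf.TableRecordReader()  # PAI-specific reader (assumed)
    _, value = reader.read(queue)
    feature, label = tf.decode_csv(value, record_defaults=[[1.0], [0]])

    # MonitoredTrainingSession starts the queue runners automatically and
    # stops the loop when the input queue is exhausted.
    with tf.train.MonitoredTrainingSession() as sess:
        while not sess.should_stop():
            print(sess.run([feature, label]))

if __name__ == "__main__":
    tf.app.run()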