Platform for AI: Parameters of PAI-TensorFlow tasks

Last Updated: Feb 29, 2024

Platform for AI (PAI) provides the PAI-TensorFlow deep learning computing framework to support training based on multiple models. This topic describes the command parameters and I/O parameters that are used to run PAI-TensorFlow tasks.

Warning

GPU-accelerated servers will be phased out. You can submit TensorFlow tasks that run on CPU servers. If you want to use GPU-accelerated instances for model training, go to Deep Learning Containers (DLC) to submit jobs. For more information, see Submit training jobs.

Commands and parameters

To initiate a PAI-TensorFlow task, you can run PAI commands on the MaxCompute client, on an SQL node in the DataWorks console, or on the Machine Learning Designer page in the PAI console. You can also use the TensorFlow components provided by Machine Learning Designer. This section describes the PAI commands and their parameters.

# Set the parameters to actual values. 
pai -name tensorflow1120_ext
    -project algo_public
    -Dscript='oss://<bucket_name>.<oss_host>.aliyuncs.com/*.tar.gz'
    -DentryFile='entry_file.py'
    -Dbuckets='oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>'
    -Dtables='odps://prj_name/tables/table_name'
    -Doutputs='odps://prj_name/tables/table_name'
    -DcheckpointDir='oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>'
    -Dcluster="{\"ps\":{\"count\":1},\"worker\":{\"count\":2,\"gpu\":100}}"
    -Darn="acs:ram::******:role/aliyunodpspaidefaultrole"
    -DossHost="oss-cn-beijing-internal.aliyuncs.com"
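
The values of parameters such as tables, outputs, buckets, and checkpointDir are passed to your entry script as command-line flags. The following minimal entry_file.py sketch shows one way to receive them by using the TensorFlow 1.x flags API. The flag names mirror the -D parameters described below, but they are an assumption; verify them against your PAI-TensorFlow version.

# entry_file.py: a minimal sketch of a PAI-TensorFlow entry script.
# Assumption: PAI passes the -Dtables, -Doutputs, -Dbuckets, and
# -DcheckpointDir values to the script as same-named flags.
import tensorflow as tf

tf.app.flags.DEFINE_string("tables", "", "Input MaxCompute table paths.")
tf.app.flags.DEFINE_string("outputs", "", "Output MaxCompute table paths.")
tf.app.flags.DEFINE_string("buckets", "", "Input OSS bucket paths.")
tf.app.flags.DEFINE_string("checkpointDir", "", "OSS checkpoint directory.")
FLAGS = tf.app.flags.FLAGS

def main(_):
    print("Reading from:", FLAGS.tables)
    print("Writing to:", FLAGS.outputs)
    # ... build and train the model here ...

if __name__ == "__main__":
    tf.app.run()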

The following list describes each parameter, its example value, its default value, and whether it is required.

script

Description: The script of the TensorFlow algorithm that is used to run the PAI-TensorFlow task. You can specify the script in the file:///path/to/file format, which is an absolute path, or in the project_name/resources/resource_name format. The script is a TensorFlow model file written in Python and can be one of the following types:

  • An on-premises file.

  • An on-premises TAR package that is compressed by using gzip and has the .tar.gz file name extension.

  • A Python file.

If the file is stored in Object Storage Service (OSS), specify it in the oss://<bucket_name>.<oss_host>.aliyuncs.com/*.tar.gz or oss://<bucket_name>.<oss_host>.aliyuncs.com/*.py format.

Example: oss://demo-yuze.oss-cn-beijing-internal.aliyuncs.com/deepfm/deepfm.tar.gz
Default value: N/A
Required: Yes

entryFile

Description: The entry script. This parameter is required if the script that you specify for the script parameter is a TAR package. If the script is a single file, you do not need to set this parameter.
Example: main.py
Default value: N/A
Required: Yes, if the script parameter specifies a TAR package.

buckets

Description: The input OSS buckets. Separate multiple buckets with commas (,). Each bucket name must end with a forward slash (/).
Example: oss://<bucket_name>.<oss_host>.aliyuncs.com/
Default value: N/A
Required: No

tables

Description: The input table. Separate multiple tables with commas (,).
Example: odps://<prj_name>/tables/<table_name>
Default value: N/A
Required: No

outputs

Description: The output table. Separate multiple tables with commas (,).
Example: odps://<prj_name>/tables/<table_name>
Default value: N/A
Required: No

gpuRequired

Description: Specifies the number of GPUs that the servers running the training script specified by the script parameter require. A value of 100 indicates one GPU, and a value of 200 indicates two GPUs. Set this parameter to 0 if you do not need GPUs. This parameter takes effect only for standalone training. For information about multi-server training, see the cluster parameter. This feature is available only for TensorFlow1120.
Example: 100
Default value: 100
Required: No

checkpointDir

Description: The TensorFlow checkpoint directory.
Example: oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>
Default value: N/A
Required: No

cluster

Description: The information about the distributed servers on which you want to run the PAI-TensorFlow task. For more information, see the next table in this topic.
Example: {\"ps\":{\"count\":1},\"worker\":{\"count\":2,\"gpu\":100}}
Default value: N/A
Required: No

enableDynamicCluster

Description: Specifies whether to enable the failover feature for single worker nodes. If you set this parameter to true, a worker node restarts after a failure occurs on the node, which prevents the PAI-TensorFlow task from failing because of a single worker node failure. Valid values: true and false.
Default value: false
Required: No

jobName

Description: The name of the experiment. Specify a descriptive name instead of a generic value such as test, so that you can search the historical data of the experiment to analyze its performance.
Example: jk_wdl_online_job
Default value: N/A
Required: Yes

maxHungTimeBeforeGCInSeconds

Description: The maximum duration, in seconds, for which a GPU can remain suspended before it is automatically reclaimed. A value of 0 disables the automatic reclamation feature. This parameter is newly added.
Example: 3600
Default value: 3600
Required: No

You can run a PAI-TensorFlow task in distributed mode by using the cluster parameter to specify the number of parameter servers (PSs) and workers. The value of the cluster parameter must be in the JSON format, and quotation marks in the value must be escaped. Example:

{
  "ps": {
    "count": 2
  },
  "worker": {
    "count": 4
  }
}
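
If you want the task to run on CPU clusters only, set the gpu key under worker to 0, as the warning at the beginning of this topic recommends. The following variant is a sketch that also requests six CPU cores for each node by using the cpu key; the keys that are nested under ps and worker are described below.

{
  "ps": {
    "count": 2,
    "cpu": 600
  },
  "worker": {
    "count": 4,
    "gpu": 0,
    "cpu": 600
  }
}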

In both examples, the JSON value consists of two keys: ps and worker. The following list describes the parameters that are nested under each key, their default values, and whether they are required.

count

Description: The number of PSs or workers.
Default value: N/A
Required: Yes

gpu

Description: The number of GPUs for PSs or workers. A value of 100 indicates one GPU. If you set the gpu parameter under worker to 0, CPU clusters are scheduled for the PAI-TensorFlow task and no GPU resources are consumed.
Default value: 0 for ps and 100 for worker.
Required: No

cpu

Description: The number of CPU cores for PSs or workers. A value of 100 indicates one CPU core.
Default value: 600
Required: No

memory

Description: The memory size for PSs or workers, in MB. A value of 100 indicates 100 MB of memory.
Default value: 30000
Required: No
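
Inside the entry script, each distributed process typically discovers its role and its peers from flags that the runtime passes in. The following sketch uses the standard TensorFlow 1.x distributed API and assumes the conventional --job_name, --task_index, --ps_hosts, and --worker_hosts flags; the exact flag names that PAI-TensorFlow passes may differ by version.

# A minimal TensorFlow 1.x distributed skeleton for a ps/worker cluster.
# Assumption: the runtime provides --job_name, --task_index, --ps_hosts,
# and --worker_hosts; these flag names are illustrative.
import tensorflow as tf

tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'.")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of this task in its job.")
tf.app.flags.DEFINE_string("ps_hosts", "", "Comma-separated PS host:port pairs.")
tf.app.flags.DEFINE_string("worker_hosts", "", "Comma-separated worker host:port pairs.")
FLAGS = tf.app.flags.FLAGS

def main(_):
    cluster = tf.train.ClusterSpec({
        "ps": FLAGS.ps_hosts.split(","),
        "worker": FLAGS.worker_hosts.split(","),
    })
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)
    if FLAGS.job_name == "ps":
        server.join()  # PS processes only host variables and wait.
        return
    # Workers place variables on PSs and computation on themselves.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        pass  # ... build the model and run the training loop ...

if __name__ == "__main__":
    tf.app.run()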

I/O parameters

The following list describes the I/O parameters that are used to run PAI-TensorFlow tasks.

tables

The path of the table from which you want to read data.

outputs

The path of the table to which you want to write data. Separate multiple paths with commas (,).

  • For a non-partitioned table, specify the path in the odps://<prj_name>/tables/<table_name> format.

  • For a partitioned table, specify the path in the odps://<prj_name>/tables/<table_name>/<pt_key1=v1> format.

  • For a multi-level partitioned table, specify the path in the odps://<prj_name>/tables/<table_name>/<pt_key1=v1>/<pt_key2=v2> format.

buckets

The OSS bucket that stores the objects to be read by the algorithm.

I/O operations on MaxCompute data are different from those on OSS objects. To read OSS objects, you must configure the role_arn and host parameters.

To obtain the value of the role_arn parameter, perform the following operations: Log on to the PAI console and go to the Dependent Services page. In the Designer section, find OSS and click View authorization in the Actions column. For more information, see Grant the permissions that are required to use Machine Learning Designer.

checkpointDir

The OSS path to which TensorFlow checkpoint data is written.
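
PAI-TensorFlow extends TensorFlow 1.x with operators for reading from and writing to MaxCompute tables. The following sketch assumes the tf.TableRecordReader interface that appears in PAI-TensorFlow examples; the exact API depends on your PAI-TensorFlow version, so treat the sketch as illustrative rather than definitive.

# A sketch that reads rows from the table passed through -Dtables.
# Assumptions: tf.TableRecordReader is available (PAI-specific), and the
# table has two columns, a float feature and an integer label.
import tensorflow as tf

tf.app.flags.DEFINE_string("tables", "", "Input MaxCompute table paths.")
FLAGS = tf.app.flags.FLAGS

def main(_):
    # FLAGS.tables can hold several comma-separated table paths.
    queue = tf.train.string_input_producer(FLAGS.tables.split(","), num_epochs=1)
    reader = tf.TableRecordReader()  # PAI-specific reader (assumed)
    _, value = reader.read(queue)
    feature, label = tf.decode_csv(value, record_defaults=[[1.0], [0]])

    # MonitoredTrainingSession starts the queue runners automatically and
    # stops the loop when the input queue is exhausted.
    with tf.train.MonitoredTrainingSession() as sess:
        while not sess.should_stop():
            print(sess.run([feature, label]))

if __name__ == "__main__":
    tf.app.run()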