PAI-TensorFlow is a deep learning computing framework that supports training based on multiple models. This topic describes the command parameters and I/O parameters that are used to run PAI-TensorFlow tasks.

Commands and parameters

To start a PAI-TensorFlow task, you can run PAI commands on the MaxCompute client or an SQL node in the DataWorks console. This section describes the PAI commands and parameters.
# You can configure parameters as needed.
pai -name tensorflow1120_ext
    -project algo_public
    -Dscript='oss://<bucket_name>.<oss_host>.aliyuncs.com/*.tar.gz'
    -DentryFile='entry_file.py'
    -Dbuckets='oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>'
    -Dtables='odps://prj_name/tables/table_name'
    -Doutputs='odps://prj_name/tables/table_name'
    -DcheckpointDir='oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>'
    -Dcluster="{\"ps\":{\"count\":1},\"worker\":{\"count\":2,\"gpu\":100}}"
    -Darn="acs:ram::******:role/aliyunodpspaidefaultrole"
    -DossHost="oss-cn-beijing-internal.aliyuncs.com"
The following table describes the parameters.
script
  Description: The TensorFlow algorithm script that is used to run the task. The script can be referenced in the format of file:///path/to/file (an absolute path) or project_name/resources/resource_name. The script is a TensorFlow model file written in Python and can be one of the following types:
    • A local file.
    • A local TAR package that is compressed by using gzip, with the file name extension .tar.gz.
    • A Python file.
  When the script is stored in OSS, it is in the format of oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>/*.tar.gz or oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>/*.py.
  Example: oss://demo-yuze.oss-cn-beijing-internal.aliyuncs.com/deepfm/deepfm.tar.gz
  Default value: No default value
  Required: Yes

entryFile
  Description: The entry script. This parameter is required if the script parameter specifies a TAR package.
  Example: main.py
  Default value: No default value
  Required: Yes. If the script parameter specifies a single script file, this parameter is not required.

buckets
  Description: The input bucket. Separate multiple buckets with commas (,). Each bucket must end with a forward slash (/).
  Example: oss://<bucket_name>.<oss_host>.aliyuncs.com/
  Default value: No default value
  Required: No

tables
  Description: The input table. Separate multiple tables with commas (,).
  Example: odps://<prj_name>/tables/<table_name>
  Default value: No default value
  Required: No

outputs
  Description: The output table. Separate multiple tables with commas (,).
  Example: odps://<prj_name>/tables/<table_name>
  Default value: No default value
  Required: No

gpuRequired
  Description: Specifies the number of GPUs required by the server that runs the training script specified by the script parameter. The value 100 indicates that one GPU is required, and the value 200 indicates that two GPUs are required. This parameter applies only to standalone training; for multi-server training, see the cluster parameter. To run the task without GPUs, set this parameter to 0. Only TensorFlow1120 supports the value 0.
  Note: After you submit a training task by running a PAI command, the training script specified in the command runs on a server that is equipped with GPUs.
  Example: 100
  Default value: 100
  Required: No

checkpointDir
  Description: The TensorFlow checkpoint directory.
  Example: oss://<bucket_name>.<oss_host>.aliyuncs.com/<path>
  Default value: No default value
  Required: No

cluster
  Description: The configuration for distributed running. For more information, see the following table.
  Example: {\"ps\":{\"count\":1},\"worker\":{\"count\":2,\"gpu\":100}}
  Default value: No default value
  Required: No

enableDynamicCluster
  Description: Specifies whether to enable the failover feature for single worker nodes. If you set this parameter to true, a worker node restarts after a failure occurs, and the failure does not affect job execution. Valid values: true and false.
  Default value: false
  Required: No

jobName
  Description: The job name. Specify an informative name instead of a generic value such as test, so that you can search for the historical data of the job to analyze experiment performance.
  Example: jk_wdl_online_job
  Default value: No default value
  Required: Yes

maxHungTimeBeforeGCInSeconds
  Description: The maximum duration for which a GPU can hang before it is automatically reclaimed. If you set this parameter to 0, this feature is disabled.
  Example: 3600
  Default value: 3600
  Required: No
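The values of -Dtables, -Doutputs, -Dbuckets, and -DcheckpointDir are delivered to the entry script at run time. The following minimal Python sketch shows one way an entry script can receive them; it assumes that the values arrive as same-named command-line flags, which you should verify against your PAI-TensorFlow version.

# A minimal entry script sketch. Assumption: the -Dtables, -Doutputs,
# -Dbuckets, and -DcheckpointDir values are passed to the script as
# same-named command-line flags.
import tensorflow as tf

tf.app.flags.DEFINE_string("tables", "", "Input tables passed by -Dtables")
tf.app.flags.DEFINE_string("outputs", "", "Output tables passed by -Doutputs")
tf.app.flags.DEFINE_string("buckets", "", "Input OSS buckets passed by -Dbuckets")
tf.app.flags.DEFINE_string("checkpointDir", "", "Checkpoint directory passed by -DcheckpointDir")
FLAGS = tf.app.flags.FLAGS

def main(_):
    # Comma-separated lists arrive as a single string.
    input_tables = FLAGS.tables.split(",")
    print("input tables:", input_tables)
    print("checkpoint directory:", FLAGS.checkpointDir)

if __name__ == "__main__":
    tf.app.run()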
A PAI-TensorFlow task that runs in distributed mode supports the cluster parameter. You can use the cluster parameter to specify the numbers of parameter servers (PSs) and workers. The value of cluster is in the JSON format, and the quotation marks in the value must be escaped when you pass it in a PAI command. Example:
{
  "ps": {
    "count": 2
  },
  "worker": {
    "count": 4
  }
}
The JSON data contains two keys: ps and worker. Each key contains the parameters that are described in the following table.
count
  Description: The number of PSs or workers.
  Default value: No default value
  Required: Yes

gpu
  Description: The number of GPUs that each PS or worker requests. The value 100 indicates one GPU. If you set the gpu parameter of workers to 0, the system automatically changes the value to 100 to ensure scheduling stability.
  Default value: 0 for ps and 100 for worker
  Required: No

cpu
  Description: The number of CPUs that each PS or worker requests. The value 100 indicates one CPU.
  Default value: 600
  Required: No

memory
  Description: The memory size that each PS or worker requests. The value 100 indicates 100 MB.
  Default value: 30000
  Required: No
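For reference, the following Python sketch shows how an entry script can bootstrap distributed training from such a cluster configuration. It assumes that PAI-TensorFlow passes the standard TensorFlow distributed flags (job_name, task_index, ps_hosts, and worker_hosts) to the entry script; treat the flag names as an assumption to verify in your environment.

# A distributed entry point sketch for TensorFlow 1.x. Assumption: the
# --job_name, --task_index, --ps_hosts, and --worker_hosts flags are
# populated according to the ps and worker counts in -Dcluster.
import tensorflow as tf

tf.app.flags.DEFINE_string("job_name", "", "Role of this process: ps or worker")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of this task within its role")
tf.app.flags.DEFINE_string("ps_hosts", "", "Comma-separated PS host:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "", "Comma-separated worker host:port pairs")
FLAGS = tf.app.flags.FLAGS

def main(_):
    cluster = tf.train.ClusterSpec({
        "ps": FLAGS.ps_hosts.split(","),
        "worker": FLAGS.worker_hosts.split(","),
    })
    server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)
    if FLAGS.job_name == "ps":
        server.join()  # PSs only host variables and never return.
        return
    # Workers place variables on PSs and computation on themselves.
    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.assign_add(global_step, 1)  # placeholder for a real train op
    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(FLAGS.task_index == 0),
            hooks=[tf.train.StopAtStepHook(last_step=100)]) as sess:
        while not sess.should_stop():
            sess.run(train_op)

if __name__ == "__main__":
    tf.app.run()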

I/O parameters

The following table describes I/O parameters.
tables
  Description: The path of the table to be read.

outputs
  Description: The path of the table to which data is written. If multiple paths are required, separate them with commas (,).
    • Format for a non-partitioned table: odps://<prj_name>/tables/<table_name>
    • Format for a partitioned table: odps://<prj_name>/tables/<table_name>/<pt_key1=v1>
    • Format for a multi-level partitioned table: odps://<prj_name>/tables/<table_name>/<pt_key1=v1>/<pt_key2=v2>

buckets
  Description: The OSS bucket to be read. I/O operations on MaxCompute data are different from those on OSS data. To read OSS data, you must configure role_arn and host.
  To obtain role_arn: Log on to the PAI console and click Studio-Modeling Visualization in the left-side navigation pane. On the PAI Visualization Modeling page, click Machine Learning in the Operation column to go to the Algorithm Platform of Machine Learning Platform for AI. On the Algorithm Platform, click Settings in the left-side navigation pane. In the OSS Authorization section of the General page, select Authorize Machine Learning Platform for AI to access my OSS resources and click Show to obtain role_arn.

checkpointDir
  Description: The OSS bucket to which data is written.
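To illustrate how these I/O parameters are consumed in code, the following Python sketch reads a MaxCompute table and writes checkpoints to the directory passed through checkpointDir. It assumes the table reader extension that PAI-TensorFlow adds to TensorFlow 1.x (tf.TableRecordReader), a two-column table of floats, and flag-based delivery of the parameter values; treat all three as assumptions to check against your environment.

# Sketch: read a MaxCompute table via PAI-TensorFlow's table reader and
# checkpoint to OSS. Assumptions: tf.TableRecordReader is available, the
# table has two float columns, and -Dtables/-DcheckpointDir arrive as flags.
import tensorflow as tf

tf.app.flags.DEFINE_string("tables", "", "Input table passed by -Dtables")
tf.app.flags.DEFINE_string("checkpointDir", "", "OSS directory passed by -DcheckpointDir")
FLAGS = tf.app.flags.FLAGS

def main(_):
    # Queue the input table; the reader returns one record per read call.
    file_queue = tf.train.string_input_producer([FLAGS.tables], num_epochs=1)
    reader = tf.TableRecordReader()
    _, value = reader.read(file_queue)
    feature, label = tf.decode_csv(value, record_defaults=[[0.0], [0.0]])

    global_step = tf.train.get_or_create_global_step()
    train_op = tf.assign_add(global_step, 1)  # placeholder for a real train op

    # MonitoredTrainingSession periodically saves checkpoints to checkpointDir
    # and stops when the one-epoch input queue is exhausted.
    with tf.train.MonitoredTrainingSession(
            checkpoint_dir=FLAGS.checkpointDir) as sess:
        while not sess.should_stop():
            sess.run([train_op, feature, label])

if __name__ == "__main__":
    tf.app.run()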