This topic describes how to use Machine Learning Platform for AI (PAI)-TensorFlow to implement the distributed DeepFM algorithm.

Prerequisites

  • Object Storage Service (OSS) is activated, and a bucket is created. For more information, see Activate OSS and Create buckets.
    Notice When you create a bucket, do not enable versioning. Otherwise, an object cannot be overwritten by another object that has the same name.
  • PAI is authorized to access your OSS bucket. For more information, see Grant permissions.

Background information

The deep part of the DeepFM algorithm is the same as that of the Wide&Deep algorithm. However, DeepFM uses a Factorization Machine (FM) instead of logistic regression (LR) in the wide part. This eliminates the need for manual feature engineering.

The training data source is pai_online_project.dwd_avazu_ctr_deepmodel_train, and the test data source is pai_online_project.dwd_avazu_ctr_deepmodel_test. Both are public data sources that you can use directly.

Procedure

  1. Download the model file.
  2. Modify model configuration code.
    1. Set the embedding_dim, hash_bucket, and default_value parameters for each feature.
      self.fields_config_dict['hour'] = {'field_name': 'field1', 'embedding_dim': self.embedding_dim, 'hash_bucket': 50, 'default_value': '0'}
      self.fields_config_dict['c1'] = {'field_name': 'field2', 'embedding_dim': self.embedding_dim, 'hash_bucket': 10, 'default_value': '0'}
      In a DeepFM model, you must set the embedding_dim parameter to the same value for all features. This limit does not apply to a Wide&Deep model. We recommend that you set the hash_bucket parameter to a larger value for high-cardinality features such as user_id and itemid, and to a smaller value for features that have fewer distinct values.
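      To see why high-cardinality features need a larger hash_bucket, note that each raw value is hashed into one of hash_bucket ids before the embedding lookup, so a small bucket count forces unrelated values to share an embedding row. The following is a minimal pure-Python sketch of this mapping (the hash function is illustrative; TensorFlow's internal hashing differs):

```python
import zlib

def hash_bucket_id(value, hash_bucket):
    """Map a raw feature value to a bucket id in [0, hash_bucket)."""
    return zlib.crc32(str(value).encode("utf-8")) % hash_bucket

# A feature with many distinct ids, such as user_id, collides heavily
# when hash_bucket is small, so give it more buckets.
small = {hash_bucket_id("user_%d" % i, 10) for i in range(1000)}
large = {hash_bucket_id("user_%d" % i, 100000) for i in range(1000)}
# len(small) is at most 10; len(large) stays close to 1000.
```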
    2. Configure the model type. We recommend a DeepFM or Wide&Deep model; set the model parameter to deepfm or wdl, respectively.
      tf.app.flags.DEFINE_string("model", 'deepfm', "model {'wdl', 'deepfm'}")
    3. Set distributed parameters.
      tf.app.flags.DEFINE_string("job_name", "", "job name")
      tf.app.flags.DEFINE_integer("task_index", None, "Worker or server index")
      tf.app.flags.DEFINE_string("ps_hosts", "", "ps hosts")
      tf.app.flags.DEFINE_string("worker_hosts", "", "worker hosts")
      When you submit a training task, you need to configure only the cluster parameter. The system automatically generates the preceding distributed parameters. For more information about the cluster parameter, see Step 4.
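      As a hypothetical illustration of what the system generates, each node receives a job_name and task_index together with comma-separated host lists derived from the counts in the cluster parameter (the host addresses below are placeholders, not real PAI values):

```python
import json

# Cluster spec with the same node counts as the -Dcluster example in Step 4.
cluster = json.loads('{"ps": {"count": 2}, "worker": {"count": 8}}')

# Placeholder addresses; the platform fills in the actual node endpoints.
ps_hosts = ",".join("ps%d.example:2222" % i for i in range(cluster["ps"]["count"]))
worker_hosts = ",".join("worker%d.example:2222" % i for i in range(cluster["worker"]["count"]))

# Flags as they would be passed to the first worker node.
flags = {"job_name": "worker", "task_index": 0,
         "ps_hosts": ps_hosts, "worker_hosts": worker_hosts}
```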
    4. Configure input data.
        def _parse_batch_for_tabledataset(self, *args):
          label = tf.reshape(args[0], [-1])
          fields = [tf.reshape(v, [-1]) for v in args[1:]]
          return dict(zip(self.feas_name, fields)), label
      
        def train_input_fn_from_odps(self, data_path, epoch=10, batch_size=1024, slice_id=0, slice_count=1):
          with tf.device('/cpu:0'):
            dataset = tf.data.TableRecordDataset([data_path], record_defaults=self.record_defaults,
                               slice_count=slice_count, slice_id=slice_id)
            dataset = dataset.batch(batch_size).repeat(epoch)
            dataset = dataset.map(self._parse_batch_for_tabledataset, num_parallel_calls=8).prefetch(100)
            return dataset
      
        def val_input_fn_from_odps(self, data_path, epoch=1, batch_size=1024, slice_id=0, slice_count=1):
          with tf.device('/cpu:0'):
            dataset = tf.data.TableRecordDataset([data_path], record_defaults=self.record_defaults,
                               slice_count=slice_count, slice_id=slice_id)
            dataset = dataset.batch(batch_size).repeat(epoch)
            dataset = dataset.map(self._parse_batch_for_tabledataset, num_parallel_calls=8).prefetch(100)
            return dataset
      If feature transformation is required, we recommend that you use MaxCompute to transform features outside the model, which reduces training overhead. If you must transform features inside the model, do so in the _parse_batch_for_tabledataset function.
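      The structure of _parse_batch_for_tabledataset can be mirrored in plain Python to show where such an in-model transformation would go; the lowercase step below is a hypothetical example, not part of the original script:

```python
feas_name = ["hour", "c1"]  # feature names, analogous to self.feas_name

def parse_batch(label, *fields):
    """Mirror of _parse_batch_for_tabledataset: flatten each column,
    optionally transform it, then pair it with its feature name."""
    # Hypothetical per-field transformation; the real function uses tf ops.
    transformed = [[str(v).lower() for v in col] for col in fields]
    return dict(zip(feas_name, transformed)), label

features, label = parse_batch([1, 0], ["10AM", "11AM"], ["A", "B"])
```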
  3. Upload the modified model file to OSS.
  4. Optional: Submit a training task.
    Note If the model file has not been trained, you must perform this step. If the model file has already been trained, skip this step and go to the next step.
    Run one of the following commands based on the type of the configured model to submit the training task:
    • DeepFM
      pai -name tensorflow1120_ext
      -project algo_public
      -Dbuckets='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/'
      -Darn=''
      -Dscript='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/demo/deepfm_pai_ctr.py'
      -Dtables='odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_train,odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_test'
      -DuserDefinedParameters="--task_type='train' --model='deepfm' --checkpoint_dir='oss://bucket_name/path/' --output_dir='oss://bucket_name/path/'"
      -Dcluster='{\"ps\":{\"count\":2,\"cpu\":1200,\"memory\":10000},\"worker\":{\"count\":8,\"cpu\":1200,\"gpu\":100,\"memory\":30000}}';
      The project parameter and the pai_online_project variable in the tables parameter specify the name of the project that stores the input data. The bucket_name and region variables in the buckets and script parameters specify the name of your OSS bucket and the region in which it resides, respectively. The arn parameter specifies the Alibaba Cloud Resource Name (ARN) of the Resource Access Management (RAM) role of your Alibaba Cloud account. Set these parameters and variables as required. For more information about the ARN, see RAM role overview.
    • Wide&Deep
      pai -name tensorflow1120_ext
      -project algo_public
      -Dbuckets='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/'
      -Darn=''
      -Dscript='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/demo/deepfm_pai_ctr.py'
      -Dtables='odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_train,odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_test'
      -DuserDefinedParameters="--task_type='train' --model='wdl' --checkpoint_dir='oss://bucket_name/path/' --output_dir='oss://bucket_name/path/'"
      -Dcluster='{\"ps\":{\"count\":2,\"cpu\":1200,\"memory\":10000},\"worker\":{\"count\":8,\"cpu\":1200,\"gpu\":100,\"memory\":30000}}';
      The project parameter and the pai_online_project variable in the tables parameter specify the name of the project that stores the input data. The bucket_name and region variables in the buckets and script parameters specify the name of your OSS bucket and the region in which it resides, respectively. The arn parameter specifies the ARN of the RAM role of your Alibaba Cloud account. Set these parameters and variables as required. For more information about the ARN, see RAM role overview.

    The training task is a distributed task that runs on the parameter server (PS)-worker architecture. Therefore, you must set the cluster parameter. In the preceding code, the cluster parameter specifies two PS nodes and eight worker nodes. Each PS node is configured with 12 CPU cores and 10 GB of memory. Each worker node is configured with one GPU, 12 CPU cores, and 30 GB of memory. For more information about the parameters in the preceding code, see Task parameters of PAI-TensorFlow.
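    The escaped quotes in -Dcluster are only shell quoting of a plain JSON object. A short script can build and check the same spec; the unit conventions below (cpu in 1/100 cores, gpu in 1/100 GPUs, memory in MB) follow the resource figures described above:

```python
import json

# Same resources as above: 12 cores per node (cpu=1200), 1 GPU per
# worker (gpu=100), and 10 GB / 30 GB of memory expressed in MB.
cluster = {
    "ps": {"count": 2, "cpu": 1200, "memory": 10000},
    "worker": {"count": 8, "cpu": 1200, "gpu": 100, "memory": 30000},
}
cluster_json = json.dumps(cluster, separators=(",", ":"))
# cluster_json is the value to place inside -Dcluster='...'.
```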

  5. Submit an offline inference task.
    pai -name tensorflow1120_ext
    -project algo_public
    -Dbuckets='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/'
    -Darn=''
    -Dscript='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/demo/deepfm_pai_ctr.py'
    -Dtables='odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_train,odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_test'
    -DuserDefinedParameters="--task_type='predict' --model='deepfm' --checkpoint_dir='oss://bucket_name/path/' --output_dir='oss://bucket_name/path/'"
    -Dcluster='{\"worker\":{\"count\":8,\"cpu\":1200,\"gpu\":100,\"memory\":30000}}'
    -Doutputs='odps://project_name/tables/output_table_name';
    The project parameter and the pai_online_project variable in the tables parameter specify the name of the project that stores the input data. The project_name variable in the outputs parameter specifies the name of the project that stores the output data. The bucket_name and region variables in the buckets and script parameters specify the name of your OSS bucket and the region in which it resides, respectively. The arn parameter specifies the ARN of the RAM role of your Alibaba Cloud account. Set these parameters and variables as required. For more information about the ARN, see RAM role overview.
    Before you submit an offline inference task, you must create an output table. The result of each inference task is written to this table and overwrites the result of the previous inference task. Run the following command to create the output table:
    drop table project_name.output_table_name;
    create table project_name.output_table_name
    (
       probabilities STRING
       ,logits STRING
    )STORED AS ALIORC;
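    Because both output columns are of the STRING type, downstream consumers must parse them back to numbers. A minimal sketch, assuming each row stores one decimal value per column (the actual serialization depends on how the script formats its output):

```python
def parse_prediction(probabilities, logits):
    # Hypothetical row format: one decimal value per STRING column.
    return float(probabilities), float(logits)

prob, logit = parse_prediction("0.8731", "1.9265")
```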
  6. Export the trained model file.
    pai -name tensorflow1120_ext
    -project algo_public
    -Dbuckets='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/'
    -Darn=''
    -Dscript='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/demo/deepfm_pai_ctr.py'
    -Dtables='odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_train,odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_test'
    -DuserDefinedParameters="--task_type='savemodel' --model='deepfm' --checkpoint_dir='oss://bucket_name/path/' --output_dir='oss://bucket_name/path/'"
    -Dcluster='{\"ps\":{\"count\":2,\"cpu\":1200,\"memory\":10000},\"worker\":{\"count\":8,\"cpu\":1200,\"gpu\":100,\"memory\":30000}}';
    The project parameter and the pai_online_project variable in the tables parameter specify the name of the project that stores the input data. The bucket_name and region variables in the buckets and script parameters specify the name of your OSS bucket and the region in which it resides, respectively. The arn parameter specifies the ARN of the RAM role of your Alibaba Cloud account. Set these parameters and variables as required. For more information about the ARN, see RAM role overview. The system uses a worker node to export the model file.