This topic describes how to use Machine Learning Platform for AI (PAI)-TensorFlow to implement the distributed DeepFM algorithm.

Prerequisites

  • Object Storage Service (OSS) is activated, and a bucket is created. For more information, see Activate OSS and Create buckets.
    Notice When you create a bucket, do not enable versioning. Otherwise, an object cannot be overwritten by another object that has the same name.
  • PAI is authorized to access your OSS bucket. For more information, see Grant permissions.

Background information

The deep part of the DeepFM algorithm is the same as that of the Wide&Deep algorithm. However, DeepFM uses a Factorization Machine (FM) instead of logistic regression (LR) in the wide part. This eliminates the need for manual feature engineering.

The training data source is pai_online_project.dwd_avazu_ctr_deepmodel_train, and the test data source is pai_online_project.dwd_avazu_ctr_deepmodel_test. Both are public data sources that you can use directly.

Procedure

  1. Download the model file.
  2. Modify model configuration code.
    1. Set the embedding_dim, hash_bucket, and default_value parameters for each feature.
      self.fields_config_dict['hour'] = {'field_name': 'field1', 'embedding_dim': self.embedding_dim, 'hash_bucket': 50, 'default_value': '0'}
      self.fields_config_dict['c1'] = {'field_name': 'field2', 'embedding_dim': self.embedding_dim, 'hash_bucket': 10, 'default_value': '0'}
      In a DeepFM model, you must set the embedding_dim parameter to the same value for all features. This limit does not apply to a Wide&Deep model. We recommend that you set the hash_bucket parameter to a larger value for high-cardinality features such as user_id and itemid, and to a smaller value for features that have fewer distinct values.
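      To see why high-cardinality features need a larger hash_bucket, note that each raw value is hashed into one of hash_bucket ids before the embedding lookup, so a small bucket count forces unrelated values to share an embedding row. The following is a minimal pure-Python sketch of this mapping (the hash function is illustrative; TensorFlow's internal hashing differs):

```python
import zlib

def hash_bucket_id(value, hash_bucket):
    """Map a raw feature value to a bucket id in [0, hash_bucket)."""
    return zlib.crc32(str(value).encode("utf-8")) % hash_bucket

# A feature with many distinct ids, such as user_id, collides heavily
# when hash_bucket is small, so give it more buckets.
small = {hash_bucket_id("user_%d" % i, 10) for i in range(1000)}
large = {hash_bucket_id("user_%d" % i, 100000) for i in range(1000)}
# len(small) is at most 10; len(large) stays close to 1000.
```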
    2. Configure the model type. We recommend a DeepFM or Wide&Deep model; set the model parameter to deepfm or wdl, respectively.
      tf.app.flags.DEFINE_string("model", 'deepfm', "model {'wdl', 'deepfm'}")
    3. Set distributed parameters.
      tf.app.flags.DEFINE_string("job_name", "", "job name")
      tf.app.flags.DEFINE_integer("task_index", None, "Worker or server index")
      tf.app.flags.DEFINE_string("ps_hosts", "", "ps hosts")
      tf.app.flags.DEFINE_string("worker_hosts", "", "worker hosts")
      When you submit a training task, you need to configure only the cluster parameter. The system automatically generates the preceding distributed parameters. For more information about the cluster parameter, see Step 4.
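      As a hypothetical illustration of what the system generates, each node receives a job_name and task_index together with comma-separated host lists derived from the counts in the cluster parameter (the host addresses below are placeholders, not real PAI values):

```python
import json

# Cluster spec with the same node counts as the -Dcluster example in Step 4.
cluster = json.loads('{"ps": {"count": 2}, "worker": {"count": 8}}')

# Placeholder addresses; the platform fills in the actual node endpoints.
ps_hosts = ",".join("ps%d.example:2222" % i for i in range(cluster["ps"]["count"]))
worker_hosts = ",".join("worker%d.example:2222" % i for i in range(cluster["worker"]["count"]))

# Flags as they would be passed to the first worker node.
flags = {"job_name": "worker", "task_index": 0,
         "ps_hosts": ps_hosts, "worker_hosts": worker_hosts}
```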
    4. Configure input data.
        def _parse_batch_for_tabledataset(self, *args):
          label = tf.reshape(args[0], [-1])
          fields = [tf.reshape(v, [-1]) for v in args[1:]]
          return dict(zip(self.feas_name, fields)), label
      
        def train_input_fn_from_odps(self, data_path, epoch=10, batch_size=1024, slice_id=0, slice_count=1):
          with tf.device('/cpu:0'):
            dataset = tf.data.TableRecordDataset([data_path], record_defaults=self.record_defaults,
                               slice_count=slice_count, slice_id=slice_id)
            dataset = dataset.batch(batch_size).repeat(epoch)
            dataset = dataset.map(self._parse_batch_for_tabledataset, num_parallel_calls=8).prefetch(100)
            return dataset
      
        def val_input_fn_from_odps(self, data_path, epoch=1, batch_size=1024, slice_id=0, slice_count=1):
          with tf.device('/cpu:0'):
            dataset = tf.data.TableRecordDataset([data_path], record_defaults=self.record_defaults,
                               slice_count=slice_count, slice_id=slice_id)
            dataset = dataset.batch(batch_size).repeat(epoch)
            dataset = dataset.map(self._parse_batch_for_tabledataset, num_parallel_calls=8).prefetch(100)
            return dataset
      If feature transformation is required, we recommend that you use MaxCompute to transform features outside the model, which reduces training overhead. If you must transform features inside the model, do so in the _parse_batch_for_tabledataset function.
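      The structure of _parse_batch_for_tabledataset can be mirrored in plain Python to show where such an in-model transformation would go; the lowercase step below is a hypothetical example, not part of the original script:

```python
feas_name = ["hour", "c1"]  # feature names, analogous to self.feas_name

def parse_batch(label, *fields):
    """Mirror of _parse_batch_for_tabledataset: flatten each column,
    optionally transform it, then pair it with its feature name."""
    # Hypothetical per-field transformation; the real function uses tf ops.
    transformed = [[str(v).lower() for v in col] for col in fields]
    return dict(zip(feas_name, transformed)), label

features, label = parse_batch([1, 0], ["10AM", "11AM"], ["A", "B"])
```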
  3. Upload the modified model file to OSS.
  4. Optional: Submit a training task.
    Note If the model file has not been trained, you must perform this step. If the model file has already been trained, skip this step and go to the next step.
    Run one of the following commands based on the type of the configured model to submit the training task:
    • DeepFM
      pai -name tensorflow1120_ext
      -project algo_public
      -Dbuckets='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/'
      -Darn=''
      -Dscript='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/demo/deepfm_pai_ctr.py'
      -Dtables='odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_train,odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_test'
      -DuserDefinedParameters="--task_type='train' --model='deepfm' --checkpoint_dir='oss://bucket_name/path/' --output_dir='oss://bucket_name/path/'"
      -Dcluster='{\"ps\":{\"count\":2,\"cpu\":1200,\"memory\":10000},\"worker\":{\"count\":8,\"cpu\":1200,\"gpu\":100,\"memory\":30000}}';
      The project parameter and the pai_online_project variable in the tables parameter specify the name of the project that stores the input data. The bucket_name and region variables in the buckets and script parameters specify the name of your OSS bucket and the region in which it resides, respectively. The arn parameter specifies the Alibaba Cloud Resource Name (ARN) of the Resource Access Management (RAM) role of your Alibaba Cloud account. Set these parameters and variables as required. For more information about the ARN, see RAM role overview.
    • Wide&Deep
      pai -name tensorflow1120_ext
      -project algo_public
      -Dbuckets='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/'
      -Darn=''
      -Dscript='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/demo/deepfm_pai_ctr.py'
      -Dtables='odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_train,odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_test'
      -DuserDefinedParameters="--task_type='train' --model='wdl' --checkpoint_dir='oss://bucket_name/path/' --output_dir='oss://bucket_name/path/'"
      -Dcluster='{\"ps\":{\"count\":2,\"cpu\":1200,\"memory\":10000},\"worker\":{\"count\":8,\"cpu\":1200,\"gpu\":100,\"memory\":30000}}';
      The project parameter and the pai_online_project variable in the tables parameter specify the name of the project that stores the input data. The bucket_name and region variables in the buckets and script parameters specify the name of your OSS bucket and the region in which it resides, respectively. The arn parameter specifies the ARN of the RAM role of your Alibaba Cloud account. Set these parameters and variables as required. For more information about the ARN, see RAM role overview.

    The training task is a distributed task that runs on the parameter server (PS)-worker architecture. Therefore, you must set the cluster parameter. In the preceding code, the cluster parameter specifies two PS nodes and eight worker nodes. Each PS node is configured with 12 CPU cores and 10 GB of memory. Each worker node is configured with one GPU, 12 CPU cores, and 30 GB of memory. For more information about the parameters in the preceding code, see Task parameters of PAI-TensorFlow.
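    The escaped quotes in -Dcluster are only shell quoting of a plain JSON object. A short script can build and check the same spec; the unit conventions below (cpu in 1/100 cores, gpu in 1/100 GPUs, memory in MB) follow the resource figures described above:

```python
import json

# Same resources as above: 12 cores per node (cpu=1200), 1 GPU per
# worker (gpu=100), and 10 GB / 30 GB of memory expressed in MB.
cluster = {
    "ps": {"count": 2, "cpu": 1200, "memory": 10000},
    "worker": {"count": 8, "cpu": 1200, "gpu": 100, "memory": 30000},
}
cluster_json = json.dumps(cluster, separators=(",", ":"))
# cluster_json is the value to place inside -Dcluster='...'.
```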

  5. Submit an offline inference task.
    pai -name tensorflow1120_ext
    -project algo_public
    -Dbuckets='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/'
    -Darn=''
    -Dscript='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/demo/deepfm_pai_ctr.py'
    -Dtables='odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_train,odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_test'
    -DuserDefinedParameters="--task_type='predict' --model='deepfm' --checkpoint_dir='oss://bucket_name/path/' --output_dir='oss://bucket_name/path/'"
    -Dcluster='{\"worker\":{\"count\":8,\"cpu\":1200,\"gpu\":100,\"memory\":30000}}'
    -Doutputs='odps://project_name/tables/output_table_name';
    The project parameter and the pai_online_project variable in the tables parameter specify the name of the project that stores the input data. The project_name variable in the outputs parameter specifies the name of the project that stores the output data. The bucket_name and region variables in the buckets and script parameters specify the name of your OSS bucket and the region in which it resides, respectively. The arn parameter specifies the ARN of the RAM role of your Alibaba Cloud account. Set these parameters and variables as required. For more information about the ARN, see RAM role overview.
    Before you submit an offline inference task, you must create an output table. The result of each inference task is written to this table and overwrites the result of the previous inference task. Run the following command to create the output table:
    drop table project_name.output_table_name;
    create table project_name.output_table_name
    (
       probabilities STRING
       ,logits STRING
    )STORED AS ALIORC;
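    Because both output columns are of the STRING type, downstream consumers must parse them back to numbers. A minimal sketch, assuming each row stores one decimal value per column (the actual serialization depends on how the script formats its output):

```python
def parse_prediction(probabilities, logits):
    # Hypothetical row format: one decimal value per STRING column.
    return float(probabilities), float(logits)

prob, logit = parse_prediction("0.8731", "1.9265")
```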
  6. Export the trained model file.
    pai -name tensorflow1120_ext
    -project algo_public
    -Dbuckets='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/'
    -Darn=''
    -Dscript='oss://bucket_name.oss-cn-region-internal.aliyuncs.com/demo/deepfm_pai_ctr.py'
    -Dtables='odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_train,odps://pai_online_project/tables/dwd_avazu_ctr_predict_deepmodel_test'
    -DuserDefinedParameters="--task_type='savemodel' --model='deepfm' --checkpoint_dir='oss://bucket_name/path/' --output_dir='oss://bucket_name/path/'"
    -Dcluster='{\"ps\":{\"count\":2,\"cpu\":1200,\"memory\":10000},\"worker\":{\"count\":8,\"cpu\":1200,\"gpu\":100,\"memory\":30000}}';
    The project parameter and the pai_online_project variable in the tables parameter specify the name of the project that stores the input data. The bucket_name and region variables in the buckets and script parameters specify the name of your OSS bucket and the region in which it resides, respectively. The arn parameter specifies the ARN of the RAM role of your Alibaba Cloud account. Set these parameters and variables as required. For more information about the ARN, see RAM role overview. The system uses a worker node to export the model file.