This topic provides an example of how to use the EasyRec algorithm library to train models and build a routine pipeline.

Prerequisites

  • A Data Science cluster is created, and Kubeflow is selected from the optional services when you create the cluster. For more information, see Create a cluster.
  • PuTTY and SSH Secure File Transfer Client are installed on your on-premises machine.
  • The dsdemo code is downloaded. If you have created a Data Science cluster, you can join the DingTalk group numbered 32497587 to obtain the dsdemo code.

Procedure

  1. Step 1: Make preparations
  2. Step 2: Submit tasks
  3. (Optional) Step 3: Create an image for Hive CLI, Spark CLI, ds-controller, Hue, a notebook server, or HTTPd
  4. Step 4: Build a pipeline
  5. Step 5: Upload the ***_mlpipeline.tar.gz file
  6. Step 6: Create and run an experiment
  7. (Optional) Step 7: View the status of the pipeline
  8. Step 8: Perform model prediction

Step 1: Make preparations

  1. Optional: Install SDKs.
    1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
    2. Run the following command on the master node to install the seldon_core and kfp SDKs:
      pip3.7 install seldon_core kfp configobj
      Note If the seldon_core and kfp SDKs have been installed, skip this step.
  2. Log on to the Container Registry console. Activate Container Registry Personal Edition and create a namespace.
    For more information about how to create a namespace, see Manage namespaces.
    Note Container Registry Enterprise Edition provides higher security than Container Registry Personal Edition. Therefore, we recommend that you use Container Registry Enterprise Edition. If you use Container Registry Enterprise Edition, you must set Default Repository Type to Public when you create a namespace.
  3. Modify the address of the registry and the name of the experiment namespace in the config file. Then, log on to your registry.
    1. Run the following command to access the ml_on_ds directory:
      cd /root/dsdemo/ml_on_ds
    2. View the registry address and the other parameters in the config file. The following shows the sample content of the config file. A Python sketch for verifying the modified values is provided at the end of this step.
      # !!! Extremely Important !!!
      # !!! You must use A NEW EXP different from others !!!
      EXP=exp1
      
      #!!! ACR, Make sure NAMESPACE is public !!!
      REGISTRY=registry-vpc.cn-beijing.aliyuncs.com
      NAMESPACE=dsexperiment
      
      #k8s namespace, must be the same as the username when you are using a sub-account.
      KUBERNETES_NAMESPACE=default
      
      #kubernetes dashboard host, header1's public ip or inner ip.
      KUBERNETES_DASHBOARD_HOST=39.104.**.**:32699
      
      #PREFIX, could be a magic code.
      PREFIX=prefix
      
      #sc
      NFSPATH=/mnt/disk1/k8s_pv/default_storage_class/
      #NFSPATH=/mnt/disk1/nfs/ifs/kubernetes/
      
      # region
      REGIONID=cn-default
      
      # emr-datascience clusterid
      CLUSTERID="C-DEFAULT"
      
      #HDFSADDR, train/test dirs should exist under $HDFSADDR, for example:
      #user
      #└── easy_rec
      #    ├── 20210917
      #    │   ├── test
      #    │   │   ├── test0.csv
      #    │   │   └── _SUCCESS
      #    │   └── train
      #    │       ├── train0.csv
      #    │       └── _SUCCESS
      #    └── 20210918
      #        ├── test
      #        │   ├── test0.csv
      #        │   └── _SUCCESS
      #        └── train
      #            ├── train0.csv
      #            └── _SUCCESS
      HDFSADDR=hdfs://192.168.**.**:9000/user/easy_rec/metric_learning_i2i
      MODELDIR=hdfs://192.168.**.**:9000/user/easy_rec/metric_learning_i2i
      
      REGEX="*.csv"
      SUCCESSFILE="train/_SUCCESS,test/_SUCCESS,hdfs://192.168.**.**:9000/flag"
      
      EVALRESULTFILE=experiment/eval_result.txt
      
      # for allinone.sh development based on supposed DATE & WHEN & PREDATE
      DATE=20220405
      WHEN=20220405190001
      PREDATE=20220404
      
      # for daytoday.sh & multidays training, use HDFSADDR, MODELDIR
      START_DATE=20220627
      END_DATE=20220628
      
      #HIVEINPUT
      DATABASE=testdb
      TRAIN_TABLE_NAME=tb_train
      EVAL_TABLE_NAME=tb_eval
      PREDICT_TABLE_NAME=tb_predict
      PREDICT_OUTPUT_TABLE_NAME=tb_predict_out
      PARTITION_NAME=ds
      
      #DSSM: inference user/item model with user & item features
      #metric_learning_i2i: inference model with item features
      #SEP: user & item feature files use the same separator
      USERFEATURE=taobao_user_feature_data.csv
      ITEMFEATURE=taobao_item_feature_data.csv
      SEP=","
      
      # faiss_mysql: mysql as user_embedding storage, faiss as item embedding index.
      # holo_holo: holo as user & item embedding storage, along with indexing.
      VEC_ENGINE=faiss_mysql
      MYSQL_HOST=mysql.bitnami
      MYSQL_PORT=3306
      MYSQL_USER=root
      MYSQL_PASSWORD=emr-datascience
      
      #wait before the pod finishes after easyrec's python process ends.
      #example: 30s 10m 1h
      WAITBEFOREFINISHED=10s
      
      #tf train
      #PS_NUMBER takes effect only on training.
      #WORKER_NUMBER takes effect on training and prediction.
      TRAINING_REPOSITORY=tf-easyrec-training
      TRAINING_VERSION=latest
      PS_NUMBER=2
      WORKER_NUMBER=3
      SELECTED_COLS=""
      EDIT_CONFIG_JSON=""
      #tf export
      ASSET_FILES=""
      
      CHECKPOINT_DIR=
      
      #pytorch train
      PYTORCH_TRAINING_REPOSITORY=pytorch-training
      PYTORCH_TRAINING_VERSION=latest
      PYTORCH_MASTER_NUMBER=1
      PYTORCH_WORKER_NUMBER=3
      
      #jax train
      JAX_TRAINING_REPOSITORY=jax-training
      JAX_TRAINING_VERSION=latest
      JAX_MASTER_NUMBER=2
      JAX_WORKER_NUMBER=3
      
      #easyrec customize action
      CUSTOMIZE_ACTION=easy_rec.python.tools.modify_config_test
      USERDEFINEPARAMETERS="--template_config_path hdfs://192.168.**.**:9000/user/easy_rec/rec_sln_test_dbmtl_v3281_template.config --output_config_path hdfs://192.168.**.**:9000/user/easy_rec/output.config"
      
      #hivecli
      HIVE_REPOSITORY=ds_hivecli
      HIVE_VERSION=latest
      
      #ds-controller
      DSCONTROLLER_REPOSITORY=ds_controller
      DSCONTROLLER_VERSION=latest
      
      #notebook
      NOTEBOOK_REPOSITORY=ds_notebook
      NOTEBOOK_VERSION=latest
      
      #hue
      HUE_REPOSITORY=ds_hue
      HUE_VERSION=latest
      
      #httpd
      HTTPD_REPOSITORY=ds_httpd
      HTTPD_VERSION=latest
      
      #postgis
      POSTGIS_REPOSITORY=ds_postgis
      POSTGIS_VERSION=latest
      
      #customize
      CUSTOMIZE_REPOSITORY=ds_customize
      CUSTOMIZE_VERSION=latest
      
      #faissserver
      FAISSSERVER_REPOSITORY=ds_faissserver
      FAISSSERVER_VERSION=latest
      
      #vscode
      VSCODE_REPOSITORY=ds_vscode
      VSCODE_VERSION=latest
      
      # ak/sk for cluster resize, only resizing TASK nodes is supported now!!!
      EMR_AKID=AAAAAAAA
      EMR_AKSECRET=BBBBBBBB
      HOSTGROUPTYPE=TASK
      INSTANCETYPE=ecs.g6.8xlarge
      NODECOUNT=1
      SYSDISKCAPACITY=120
      SYSDISKTYPE=CLOUD_SSD
      DISKCAPACITY=480
      DISKTYPE=CLOUD_SSD
      DISKCOUNT=4
      
      # pvc_name, usually does not need to be changed, because EXP ensures that two or more experiments do not conflict.
      PVC_NAME="easyrec-volume"
      
      SAVEDMODELS_RESERVE_DAYS=3
      
      HIVEDB="jdbc:hive2://192.168.**.**:10000/zqkd"
      
      #eval threshold
      THRESHOLD=0.3
      
      #eval result key, 'auc' 'auc_ctcvr' 'recall@5'
      EVALRESULTKEY="recall@1"
      
      #eval hit rate for vector recall
      ITEM_EMB_TABLE_NAME=item_emb_table
      GT_TABLE_NAME=gt_table
      EMBEDDING_DIM=32
      RECALL_TYPE="u2i"
      TOPK=100
      NUM_INTERESTS=1
      KNN_METRIC=0
      
      # sms or dingding
      ALERT_TYPE=sms
      
      # sms alert
      SMS_AKID=AAAAAAAA
      SMS_AKSECRET=BBBBBBBB
      SMS_TEMPLATEDCODE=SMS_220default
      SMS_PHONENUMBERS="186212XXXXX,186211YYYYY"
      SMS_SIGNATURE="mysignature"
      
      # dingtalk alert
      ACCESS_TOKEN=AAAAAAAA
      
      EAS_AKID=AAAAAAAA
      EAS_AKSECRET=BBBBBBBB
      EAS_ENDPOINT=pai-eas.cn-beijing.aliyuncs.com
      EAS_SERVICENAME=datascience_eastest
      
      # ak/sk for access oss
      OSS_AKID=AAAAAAAA
      OSS_AKSECRET=BBBBBBBB
      OSS_ENDPOINT=oss-cn-huhehaote-internal.aliyuncs.com
      OSS_BUCKETNAME=emrtest-huhehaote
      # !!! Do not change !!!
      OSS_OBJECTNAME=%%EXP%%_faissserver/item_embedding.faiss.svm
      
      # ak/sk for access holo
      HOLO_AKID=AAAAAAAA
      HOLO_AKSECRET=BBBBBBBB
      HOLO_ENDPOINT=hgprecn-cn-default-cn-beijing-vpc.hologres.aliyuncs.com
      
      #tensorboard
      TENSORBOARDPORT=6006
      
      #nni port
      NNIPORT=38080
      
      #jupyter password
      JUPYTER_PASSWORD=emr-datascience
      
      #enable_overwrite
      ENABLE_OVERWRITE=true
      
      #For some users who are running pyspark & machine learning jobs in jupyter notebook.
      #ports for mapping when the notebook is enabled; multiple users on the same node will conflict.
      HOSTNETWORK=true
      MAPPING_JUPYTER_NOTEBOOK_PORT=16200
      MAPPING_NNI_PORT=16201
      MAPPING_TENSORBOARD_PORT=16202
      MAPPING_VSCODE_PORT=16203
    3. Run the following command to log on to your registry for easy image pushing:
      docker login --username=<Username> <your_REGISTRY>-registry.cn-beijing.cr.aliyuncs.com
      Note You need to enable anonymous access and set the Default Repository Type parameter to Public for easy image pulling. <Username> indicates the access credential that you configured in the Container Registry console. For more information about how to configure an access credential, see Configure an access credential. <your_REGISTRY> indicates the registry address that you obtained in the preceding step.
  4. Configure a Network Address Translation (NAT) gateway to access Container Registry resources. For more information, see Create and manage Internet NAT gateways.
  5. Prepare test data.
    Important You can write test data to the Hadoop Distributed File System (HDFS) service of your Data Science cluster or to a self-managed HDFS based on your business requirements. You must make sure that the network connection is normal during the write operation.
    sh allinone.sh

    Select ppd) Prepare data as prompted.
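
After you modify the config file, you can optionally sanity-check the values that later steps depend on. The following is a minimal sketch that uses the configobj package installed earlier in this step; the file path and key names are taken from the sample config shown above, so adjust them if your copy differs. Note that allinone.sh also provides a built-in vc) verifyconfigfile option.

  # Minimal sketch: load the config file with configobj (installed in substep 1)
  # and print the values that later steps depend on. The path and key names are
  # taken from the sample config above; adjust them if your copy differs.
  from configobj import ConfigObj

  config = ConfigObj("/root/dsdemo/ml_on_ds/config")

  for key in ("EXP", "REGISTRY", "NAMESPACE", "KUBERNETES_NAMESPACE",
              "HDFSADDR", "MODELDIR", "PVC_NAME"):
      print(key, "=", config.get(key, "<missing>"))

  # Later steps require a new, unique EXP and a public ACR namespace.
  assert config.get("EXP"), "EXP must be set to a new, unique experiment name"
  assert config.get("NAMESPACE"), "NAMESPACE must match your ACR namespace"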

Step 2: Submit tasks

Important In the config file, replace the repository address with the address of your Container Registry repository, and modify the name of the experiment namespace and the image versions. Also make sure that the train and test data referenced by HDFSADDR exists in HDFS; a sketch for checking this is provided at the end of this step.
  1. Run the allinone.sh file.
    sh allinone.sh -d
    The following output is returned:
    loading ./config
    
    You are now working on k8snamespace: default
    
    *** Welcome to DataScience ***
    0)        Exit                                                                 k8s: default
    ppd)      Prepare data           ppk)      Prepare DS/K8S config   cacr)     checking ACR
    1|build)  build training image   bnt)      build notebook image    buildall) build all images(slow)
    dck)      deletecheckpoint       ser)      showevalresult
    apc)      applyprecheck          dpc)      deleteprecheck
    2)        applytraining          3)        deletetraining
    4)        applyeval              5)        deleteeval
    4d)       applyevaldist          5d)       deleteevaldist
    4hr)      applyevalhitrate       5hr)      deleteevalhitrate
    6)        applyexport            7)        deleteexport
    8)        applyserving           9)        deleteserving
    10)       applypredict           11)       deletepredict
    12)       applyfeatureselection  13)       deletefeatureselection
    14)       applycustomizeaction   15)       deletecustomizeaction
    16)       applypytorchtraining   17)       deletepytorchtraining
    mt)       multidaystraining      dmt)      deletemultidaystraining
    me)       multidayseval          dme)      deletemultidayseval
    cnt)      createnotebook         dnt)      deletenotebook          snt)      shownotebooklink
    cft)      createsftp             dft)      deletesftp              sft)      showsftplink
    che)      createhue              dhe)      deletehue               she)      showhuelink
    chd)      createhttpd            dhd)      deletehttpd             shd)      showhttpdlink
    cvs)      createvscode           dvs)      deletevscode            svs)      showvscodelink
    a)        kubectl get tfjobs     b)        kubectl get sdep        c)        kubectl get pytorchjobs
    mp|mpl)   compile mlpipeline     bp|bpl)   compile bdpipeline      bu)       bdupload
    tb)       tensorboard            vc)       verifyconfigfile        spl)      showpaireclink
    tp)       kubectl top pods       tn)       kubectl top nodes       util)     show nodes utils
    logs)     show pod logs          setnl)    set k8s node label
    e|clean)  make clean             cleanall) make cleanall           sml)      showmilvuslink
    sall)     show KubeFlow/Grafana/K8SOverview/Spark/HDFS/Yarn/EMR link
    99)       kubectl get pods       99l)      kubectl get pods along with log url
    >
  2. Enter an option and press Enter to run it. Repeat this for each option that you want to run.
    You can use TensorBoard to view the AUC curve during training:
    1. Run the following command to access the ml_on_ds directory:
      cd /root/dsdemo/ml_on_ds
    2. Run the following command to run TensorBoard:
      sh run_tensorboard.sh

      Select tb to display the TensorBoard information at a checkpoint in the current experiment. Alternatively, run the sh run_tensorboard.sh 20211209 command to view the TensorBoard information at a checkpoint in the training task that was run on December 9, 2021.

      Note
      • By default, the model directory that is specified by the TODAY_MODELDIR parameter in the config file is used. You can also run a command such as sh run_tensorboard.sh hdfs://192.168.**.**:9000/user/easy_rec/20210923/ to specify a date-specific model directory.
      • You can modify parameters in the run_tensorboard.sh script based on your business requirements.
    3. Open your browser, enter http://<yourPublicIPAddress>:6006 in the address bar, and then press Enter to view the AUC curve on the page that appears.
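
Before you apply a training task from the menu, you can confirm that the train and test data and the _SUCCESS flags referenced by HDFSADDR and SUCCESSFILE exist. The following is a minimal sketch that shells out to the hdfs CLI; the address is the masked example from the sample config, so replace it with your own value.

  # Minimal sketch: check that the train/test data and _SUCCESS flags exist in
  # HDFS before you apply a training task. Replace HDFSADDR with the value from
  # your config file; the address below is the masked example from this topic.
  import subprocess

  HDFSADDR = "hdfs://192.168.**.**:9000/user/easy_rec/metric_learning_i2i"

  def hdfs_exists(path):
      # `hdfs dfs -test -e` returns exit code 0 when the path exists.
      result = subprocess.run(["hdfs", "dfs", "-test", "-e", path],
                              stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
      return result.returncode == 0

  for sub in ("train/_SUCCESS", "test/_SUCCESS"):
      path = HDFSADDR + "/" + sub
      print(path, "OK" if hdfs_exists(path) else "MISSING")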

(Optional) Step 3: Create an image for Hive CLI, Spark CLI, ds-controller, Hue, a notebook server, or HTTPd

Note
  • The purpose of creating an image for Hive CLI or Spark CLI is to submit Hive or Spark tasks for big data processing and generate the training data. If you have prepared the required data, you can skip this step. For Spark tasks, the Spark service in the Data Science cluster is used. For Hive tasks, a separate Hadoop or Hive service is required.
  • An image that is created for ds-controller is used for dynamic scaling.
  • Hive CLI
    Go to the directory where Hive CLI is installed and make an image.
    cd hivecli && make
  • Spark CLI
    Go to the directory where Spark CLI is installed and make an image.
    cd sparkcli && make
  • dscontroller
    Go to the directory where ds-controller is installed and make an image.
    cd dscontroller && make
  • Hue
    Go to the directory where Hue is deployed and make an image.
    cd hue && make
  • notebook
    Go to the directory where a notebook server is deployed and make an image.
    cd notebook && make
  • httpd
    Go to the directory where HTTP Daemon (HTTPd) is installed and make an image.
    cd httpd && make
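
Each make target above builds a Docker image and pushes it to your Container Registry repository, presumably tagged with the REGISTRY, NAMESPACE, *_REPOSITORY, and *_VERSION values from the config file. If you want to script this yourself, the following is a rough sketch that uses the Docker SDK for Python (pip3.7 install docker), which is not part of dsdemo; the build directory and tag values are assumptions based on the sample config.

  # Rough sketch using the Docker SDK for Python (pip3.7 install docker). This is
  # not part of dsdemo; it only illustrates how an image tag can be assembled from
  # the REGISTRY/NAMESPACE/REPOSITORY:VERSION values in the config file. The build
  # directory is an assumption; use the directory of the component that you build.
  import docker

  REGISTRY = "registry-vpc.cn-beijing.aliyuncs.com"  # config: REGISTRY
  NAMESPACE = "dsexperiment"                         # config: NAMESPACE
  REPOSITORY = "ds_hivecli"                          # config: HIVE_REPOSITORY
  VERSION = "latest"                                 # config: HIVE_VERSION

  tag = "{}/{}/{}:{}".format(REGISTRY, NAMESPACE, REPOSITORY, VERSION)

  client = docker.from_env()
  # Build from the hivecli directory, then push the image to ACR.
  image, build_logs = client.images.build(path="hivecli", tag=tag)
  for line in client.images.push("{}/{}/{}".format(REGISTRY, NAMESPACE, REPOSITORY),
                                 tag=VERSION, stream=True, decode=True):
      print(line)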

Step 4: Build a pipeline

For more information about the pipeline code, see the mlpipeline.py file. An illustrative sketch of a minimal pipeline is provided at the end of this step.

  1. Run the following command to access the ml_on_ds directory:
    cd /root/dsdemo/ml_on_ds
  2. Run the following command to build a pipeline:
    make mpl
    Note You can also run the sh allinone.sh command and select mpl to build a pipeline.

    After the pipeline is built, a file named ***_mlpipeline.tar.gz is generated. Use SSH Secure File Transfer Client to download the ***_mlpipeline.tar.gz file to your on-premises machine.
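
The package produced by make mpl is a pipeline definition compiled with the Kubeflow Pipelines (kfp) SDK. The following is a minimal, illustrative sketch of such a definition; it is not the actual mlpipeline.py, and the pipeline name, image, and commands are placeholders.

  # Minimal sketch of a kfp (v1 SDK) pipeline compiled into a .tar.gz package.
  # This is NOT the actual mlpipeline.py; the image and commands are placeholders.
  import kfp
  from kfp import dsl

  @dsl.pipeline(name="easyrec-demo-pipeline",
                description="Illustrative training/evaluation pipeline.")
  def easyrec_pipeline():
      train = dsl.ContainerOp(
          name="train",
          image="registry-vpc.cn-beijing.aliyuncs.com/dsexperiment/tf-easyrec-training:latest",
          command=["sh", "-c"],
          arguments=["echo run the training step here"],
      )
      evaluate = dsl.ContainerOp(
          name="evaluate",
          image="registry-vpc.cn-beijing.aliyuncs.com/dsexperiment/tf-easyrec-training:latest",
          command=["sh", "-c"],
          arguments=["echo run the evaluation step here"],
      )
      evaluate.after(train)  # run evaluation only after training finishes

  if __name__ == "__main__":
      # Produces demo_mlpipeline.tar.gz, analogous to the ***_mlpipeline.tar.gz
      # file that is generated by `make mpl`.
      kfp.compiler.Compiler().compile(easyrec_pipeline, "demo_mlpipeline.tar.gz")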

Step 5: Upload the ***_mlpipeline.tar.gz file

  1. In the Instance Info section of the Cluster Overview page, view the public IP address of the master node.
  2. Open your browser, enter http://<yourPublicIPAddress>:31380 in the address bar, and then press Enter.
    Note Replace <yourPublicIPAddress> with the public IP address that you obtained in the preceding step.
    The homepage of Kubeflow appears. Use the default anonymous namespace.
  3. In the left-side navigation pane, click Pipelines.
  4. In the upper-right corner of the Pipelines page, click Upload pipeline.
  5. On the Upload Pipeline or Pipeline Version page, configure the Pipeline Name and Pipeline Description parameters, select Upload a file, and then select the ***_mlpipeline.tar.gz file.
  6. Click Create.
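
As an alternative to the web UI, the package can also be uploaded from the master node with the kfp SDK that was installed in Step 1. The following is a minimal sketch; the host address, file name, and pipeline name are placeholders.

  # Minimal sketch: upload the compiled package with the kfp (v1) SDK instead of
  # the web UI. The host, file name, and pipeline name are placeholders.
  import kfp

  client = kfp.Client(host="http://<yourPublicIPAddress>:31380/pipeline")
  pipeline = client.upload_pipeline(
      pipeline_package_path="demo_mlpipeline.tar.gz",
      pipeline_name="easyrec-demo-pipeline",
  )
  print(pipeline.id, pipeline.name)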

Step 6: Create and run an experiment

  1. In the left-side navigation pane, click Experiments.
  2. In the upper-right corner of the page that appears, click Create experiment.
  3. On the New experiment page, specify Experiment name.
  4. Click Next.
  5. On the Start a run page, configure parameters.
    1. Select the ***_mlpipeline.tar.gz file that you downloaded to your on-premises machine in Step 4: Build a pipeline.
    2. Select Recurring for Run Type.
  6. Click Start.
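
The experiment and the recurring run can also be created with the kfp SDK. The following is a minimal sketch that assumes the package uploaded in the previous step; the names, host address, and cron expression are placeholders.

  # Minimal sketch: create an experiment and a recurring run with the kfp (v1)
  # SDK, mirroring the Recurring run type selected in the UI. The names, host,
  # and cron expression are placeholders.
  import kfp

  client = kfp.Client(host="http://<yourPublicIPAddress>:31380/pipeline")

  experiment = client.create_experiment(name="easyrec-demo-experiment")
  job = client.create_recurring_run(
      experiment_id=experiment.id,
      job_name="easyrec-demo-daily",
      cron_expression="0 0 2 * * ?",   # every day at 02:00
      pipeline_package_path="demo_mlpipeline.tar.gz",
  )
  print("created recurring run:", job.id)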

(Optional) Step 7: View the status of the pipeline

You can view the status of the pipeline on the Experiments page.

Step 8: Perform model prediction

  • (Recommended) Use the HTTP request method

    All development languages are supported for this method. The predict_rest.sh file contains the prediction code. A Python sketch that sends the same request is provided after this list.

    Run the following command to perform model prediction:
    Important In the command, default indicates the default namespace. easyrec-tfserving is the default name that is used to deploy the Serving service. You can modify configurations based on your business requirements.
    #!/bin/sh
    curl -X POST http://127.0.0.1:31380/seldon/default/easyrec-tfserving/api/v1.0/predictions -H 'Content-Type: application/json' -d '
    { 
    "jsonData": { 
        "inputs": {
            "app_category":["10","10"],
            "app_domain":["1005","1005"],
            "app_id":["0","0"],
            "banner_pos":["85f751fd","4bf5bbe2"],
            "c1":["c4e18dd6","6b560cc1"],
            "c14":["50e219e0","28905ebd"],
            "c15":["0e8e4642","ecad2386"],
            "c16":["b408d42a","7801e8d9"],
            "c17":["09481d60","07d7df22"],
            "c18":["a99f214a","a99f214a"],
            "c19":["5deb445a","447d4613"],
            "c20":["f4fffcd0","cdf6ea96"],
            "c21":["1","1"],
            "device_conn_type":["0","0"],
            "device_id":["2098","2373"],
            "device_ip":["32","32"],
            "device_model":["5","5"],
            "device_type":["238","272"],
            "hour":["0","3"],
            "site_category":["56","5"],
            "site_domain":["0","0"],
            "site_id":["5","3"]
        }
    }
    }'
    The following output is returned:
    {"jsonData":{"outputs":{"logits":[-7.20718098,-4.15874624],"probs":[0.000740694755,0.0153866885]}},"meta":{}}
  • Use Seldon Core
    Run the following command to perform model prediction over the REST protocol:
    python3.7 predict_rest.py
    The following output is returned:
    Response:
    {'jsonData': {'outputs': {'logits': [-2.66068792, 0.691401482], 'probs': [0.0653333142, 0.66627866]}}, 'meta': {}}
    Note For information about the prediction code, see the predict_rest.py file.
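
The same request can also be sent from Python with the requests library. The following is a minimal sketch that mirrors the curl example above; it is not the actual predict_rest.py, and the host, namespace (default), and deployment name (easyrec-tfserving) follow the defaults used in this topic.

  # Minimal sketch: send the same prediction request as the curl example with the
  # requests library. This is NOT the actual predict_rest.py; host, namespace, and
  # deployment name follow the defaults used in this topic.
  import requests

  URL = "http://127.0.0.1:31380/seldon/default/easyrec-tfserving/api/v1.0/predictions"

  payload = {
      "jsonData": {
          "inputs": {
              "app_category": ["10", "10"],
              "app_domain": ["1005", "1005"],
              "app_id": ["0", "0"],
              # ... remaining features, identical to the curl example above ...
              "site_id": ["5", "3"],
          }
      }
  }

  response = requests.post(URL, json=payload, timeout=10)
  response.raise_for_status()
  print(response.json())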

Feedback

If you have any questions when you use a Data Science cluster, contact technical support for further assistance. You can also join the DingTalk group numbered 32497587 for feedback or communication.