This topic provides an example of how to use the EasyRec algorithm library to train models and build a routine pipeline.

Prerequisites

  • A Data Science cluster is created, and Kubeflow is selected from the optional services when you create the cluster. For more information, see Create a cluster.
  • PuTTY and SSH Secure File Transfer Client are installed on your on-premises machine.
  • The dsdemo code is downloaded. If you have created a Data Science cluster, you can join the DingTalk group numbered 32497587 to obtain the dsdemo code.

Procedure

  1. Step 1: Make preparations
  2. Step 2: Submit tasks
  3. (Optional) Step 3: Create an image for Hive CLI, Spark CLI, ds-controller, Hue, a notebook server, or HTTPd
  4. Step 4: Build a pipeline
  5. Step 5: Upload the ***_mlpipeline.tar.gz file
  6. Step 6: Create and run an experiment
  7. (Optional) Step 7: View the status of the pipeline
  8. Step 8: Perform model prediction

Step 1: Make preparations

  1. Optional: Install SDKs.
    1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
    2. Run the following command on the master node to install the seldon_core and kfp SDKs:
      pip3.7 install seldon_core kfp configobj
      Note If the seldon_core and kfp SDKs have been installed, skip this step.
  2. Log on to the Container Registry console. Activate Container Registry Personal Edition and create a namespace.
    For more information about how to create a namespace, see Manage namespaces.
    Note Container Registry Enterprise Edition provides higher security than Container Registry Personal Edition. Therefore, we recommend that you use Container Registry Enterprise Edition. If you use Container Registry Enterprise Edition, you must set Default Repository Type to Public when you create a namespace.
  3. Modify the address of the registry and the name of the experiment namespace in the config file. Then, log on to your registry.
    1. Run the following command to access the ml_on_ds directory:
      cd /root/dsdemo/ml_on_ds
    2. View the registry address and the other parameters in the config file. The following shows the sample content of the config file. A Python sketch for verifying the modified values is provided at the end of this step.
      # !!! Extremely Important !!!
      # !!! You must use A NEW EXP different from others !!!
      EXP=exp1
      
      #!!! ACR, Make sure NAMESPACE is public !!!
      REGISTRY=registry-vpc.cn-beijing.aliyuncs.com
      NAMESPACE=dsexperiment
      
      #k8s namespace, must be the same as the username when you are using a sub-account.
      KUBERNETES_NAMESPACE=default
      
      #kubernetes dashboard host, header1's public ip or inner ip.
      KUBERNETES_DASHBOARD_HOST=39.104.**.**:32699
      
      #PREFIX, could be a magic code.
      PREFIX=prefix
      
      #sc
      NFSPATH=/mnt/disk1/k8s_pv/default_storage_class/
      #NFSPATH=/mnt/disk1/nfs/ifs/kubernetes/
      
      # region
      REGIONID=cn-default
      
      # emr-datascience clusterid
      CLUSTERID="C-DEFAULT"
      
      #HDFSADDR, train/test dirs should exist under $HDFSADDR, for example:
      #user
      #└── easy_rec
      #    ├── 20210917
      #    │   ├── test
      #    │   │   ├── test0.csv
      #    │   │   └── _SUCCESS
      #    │   └── train
      #    │       ├── train0.csv
      #    │       └── _SUCCESS
      #    └── 20210918
      #        ├── test
      #        │   ├── test0.csv
      #        │   └── _SUCCESS
      #        └── train
      #            ├── train0.csv
      #            └── _SUCCESS
      HDFSADDR=hdfs://192.168.**.**:9000/user/easy_rec/metric_learning_i2i
      MODELDIR=hdfs://192.168.**.**:9000/user/easy_rec/metric_learning_i2i
      
      REGEX="*.csv"
      SUCCESSFILE="train/_SUCCESS,test/_SUCCESS,hdfs://192.168.**.**:9000/flag"
      
      EVALRESULTFILE=experiment/eval_result.txt
      
      # for allinone.sh development based on supposed DATE & WHEN & PREDATE
      DATE=20220405
      WHEN=20220405190001
      PREDATE=20220404
      
      # for daytoday.sh & multidays training, use HDFSADDR, MODELDIR
      START_DATE=20220627
      END_DATE=20220628
      
      #HIVEINPUT
      DATABASE=testdb
      TRAIN_TABLE_NAME=tb_train
      EVAL_TABLE_NAME=tb_eval
      PREDICT_TABLE_NAME=tb_predict
      PREDICT_OUTPUT_TABLE_NAME=tb_predict_out
      PARTITION_NAME=ds
      
      #DSSM: inference user/item model with user & item features
      #metric_learning_i2i: inference model with item features
      #SEP: user & item feature files use the same separator
      USERFEATURE=taobao_user_feature_data.csv
      ITEMFEATURE=taobao_item_feature_data.csv
      SEP=","
      
      # faiss_mysql: mysql as user_embedding storage, faiss as item embedding index.
      # holo_holo: holo as user & item embedding storage, along with indexing.
      VEC_ENGINE=faiss_mysql
      MYSQL_HOST=mysql.bitnami
      MYSQL_PORT=3306
      MYSQL_USER=root
      MYSQL_PASSWORD=emr-datascience
      
      #wait before the pod finishes after easyrec's python process ends.
      #example: 30s 10m 1h
      WAITBEFOREFINISHED=10s
      
      #tf train
      #PS_NUMBER takes effect only on training.
      #WORKER_NUMBER takes effect on training and prediction.
      TRAINING_REPOSITORY=tf-easyrec-training
      TRAINING_VERSION=latest
      PS_NUMBER=2
      WORKER_NUMBER=3
      SELECTED_COLS=""
      EDIT_CONFIG_JSON=""
      #tf export
      ASSET_FILES=""
      
      CHECKPOINT_DIR=
      
      #pytorch train
      PYTORCH_TRAINING_REPOSITORY=pytorch-training
      PYTORCH_TRAINING_VERSION=latest
      PYTORCH_MASTER_NUMBER=1
      PYTORCH_WORKER_NUMBER=3
      
      #jax train
      JAX_TRAINING_REPOSITORY=jax-training
      JAX_TRAINING_VERSION=latest
      JAX_MASTER_NUMBER=2
      JAX_WORKER_NUMBER=3
      
      #easyrec customize action
      CUSTOMIZE_ACTION=easy_rec.python.tools.modify_config_test
      USERDEFINEPARAMETERS="--template_config_path hdfs://192.168.**.**:9000/user/easy_rec/rec_sln_test_dbmtl_v3281_template.config --output_config_path hdfs://192.168.**.**:9000/user/easy_rec/output.config"
      
      #hivecli
      HIVE_REPOSITORY=ds_hivecli
      HIVE_VERSION=latest
      
      #ds-controller
      DSCONTROLLER_REPOSITORY=ds_controller
      DSCONTROLLER_VERSION=latest
      
      #notebook
      NOTEBOOK_REPOSITORY=ds_notebook
      NOTEBOOK_VERSION=latest
      
      #hue
      HUE_REPOSITORY=ds_hue
      HUE_VERSION=latest
      
      #httpd
      HTTPD_REPOSITORY=ds_httpd
      HTTPD_VERSION=latest
      
      #postgis
      POSTGIS_REPOSITORY=ds_postgis
      POSTGIS_VERSION=latest
      
      #customize
      CUSTOMIZE_REPOSITORY=ds_customize
      CUSTOMIZE_VERSION=latest
      
      #faissserver
      FAISSSERVER_REPOSITORY=ds_faissserver
      FAISSSERVER_VERSION=latest
      
      #vscode
      VSCODE_REPOSITORY=ds_vscode
      VSCODE_VERSION=latest
      
      # ak/sk for cluster resize, only resizing TASK nodes is supported now!!!
      EMR_AKID=AAAAAAAA
      EMR_AKSECRET=BBBBBBBB
      HOSTGROUPTYPE=TASK
      INSTANCETYPE=ecs.g6.8xlarge
      NODECOUNT=1
      SYSDISKCAPACITY=120
      SYSDISKTYPE=CLOUD_SSD
      DISKCAPACITY=480
      DISKTYPE=CLOUD_SSD
      DISKCOUNT=4
      
      # pvc_name, usually does not need to be changed, because EXP ensures that two or more experiments do not conflict.
      PVC_NAME="easyrec-volume"
      
      SAVEDMODELS_RESERVE_DAYS=3
      
      HIVEDB="jdbc:hive2://192.168.**.**:10000/zqkd"
      
      #eval threshold
      THRESHOLD=0.3
      
      #eval result key, 'auc' 'auc_ctcvr' 'recall@5'
      EVALRESULTKEY="recall@1"
      
      #eval hit rate for vector recall
      ITEM_EMB_TABLE_NAME=item_emb_table
      GT_TABLE_NAME=gt_table
      EMBEDDING_DIM=32
      RECALL_TYPE="u2i"
      TOPK=100
      NUM_INTERESTS=1
      KNN_METRIC=0
      
      # sms or dingding
      ALERT_TYPE=sms
      
      # sms alert
      SMS_AKID=AAAAAAAA
      SMS_AKSECRET=BBBBBBBB
      SMS_TEMPLATEDCODE=SMS_220default
      SMS_PHONENUMBERS="186212XXXXX,186211YYYYY"
      SMS_SIGNATURE="mysignature"
      
      # dingtalk alert
      ACCESS_TOKEN=AAAAAAAA
      
      EAS_AKID=AAAAAAAA
      EAS_AKSECRET=BBBBBBBB
      EAS_ENDPOINT=pai-eas.cn-beijing.aliyuncs.com
      EAS_SERVICENAME=datascience_eastest
      
      # ak/sk for access oss
      OSS_AKID=AAAAAAAA
      OSS_AKSECRET=BBBBBBBB
      OSS_ENDPOINT=oss-cn-huhehaote-internal.aliyuncs.com
      OSS_BUCKETNAME=emrtest-huhehaote
      # !!! Do not change !!!
      OSS_OBJECTNAME=%%EXP%%_faissserver/item_embedding.faiss.svm
      
      # ak/sk for access holo
      HOLO_AKID=AAAAAAAA
      HOLO_AKSECRET=BBBBBBBB
      HOLO_ENDPOINT=hgprecn-cn-default-cn-beijing-vpc.hologres.aliyuncs.com
      
      #tensorboard
      TENSORBOARDPORT=6006
      
      #nni port
      NNIPORT=38080
      
      #jupyter password
      JUPYTER_PASSWORD=emr-datascience
      
      #enable_overwrite
      ENABLE_OVERWRITE=true
      
      #For some users who are running pyspark & machine learning jobs in jupyter notebook.
      #ports for mapping when the notebook is enabled; multiple users on the same node will conflict.
      HOSTNETWORK=true
      MAPPING_JUPYTER_NOTEBOOK_PORT=16200
      MAPPING_NNI_PORT=16201
      MAPPING_TENSORBOARD_PORT=16202
      MAPPING_VSCODE_PORT=16203
    3. Run the following command to log on to your registry for easy image pushing:
      docker login --username=<Username> <your_REGISTRY>-registry.cn-beijing.cr.aliyuncs.com
      Note You need to enable anonymous access and set the Default Repository Type parameter to Public for easy image pulling. <Username> indicates the access credential that you configured in the Container Registry console. For more information about how to configure an access credential, see Configure an access credential. <your_REGISTRY> indicates the registry address that you obtained in the preceding step.
  4. Configure a Network Address Translation (NAT) gateway to access Container Registry resources. For more information, see Create and manage Internet NAT gateways.
  5. Prepare test data.
    Important You can write test data to the Hadoop Distributed File System (HDFS) service of your Data Science cluster or to a self-managed HDFS based on your business requirements. You must make sure that the network connection is normal during the write operation.
    sh allinone.sh

    Select ppd) Prepare data as prompted.
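
After you modify the config file, you can optionally sanity-check the values that later steps depend on. The following is a minimal sketch that uses the configobj package installed earlier in this step; the file path and key names are taken from the sample config shown above, so adjust them if your copy differs. Note that allinone.sh also provides a built-in vc) verifyconfigfile option.

  # Minimal sketch: load the config file with configobj (installed in substep 1)
  # and print the values that later steps depend on. The path and key names are
  # taken from the sample config above; adjust them if your copy differs.
  from configobj import ConfigObj

  config = ConfigObj("/root/dsdemo/ml_on_ds/config")

  for key in ("EXP", "REGISTRY", "NAMESPACE", "KUBERNETES_NAMESPACE",
              "HDFSADDR", "MODELDIR", "PVC_NAME"):
      print(key, "=", config.get(key, "<missing>"))

  # Later steps require a new, unique EXP and a public ACR namespace.
  assert config.get("EXP"), "EXP must be set to a new, unique experiment name"
  assert config.get("NAMESPACE"), "NAMESPACE must match your ACR namespace"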

Step 2: Submit tasks

Important In the config file, replace the repository address with the address of your Container Registry repository, and modify the name of the experiment namespace and the image versions. Also make sure that the train and test data referenced by HDFSADDR exists in HDFS; a sketch for checking this is provided at the end of this step.
  1. Run the allinone.sh file.
    sh allinone.sh -d
    The following output is returned:
    loading ./config
    
    You are now working on k8snamespace: default
    
    *** Welcome to DataScience ***
    0)        Exit                                                                 k8s: default
    ppd)      Prepare data           ppk)      Prepare DS/K8S config   cacr)     checking ACR
    1|build)  build training image   bnt)      build notebook image    buildall) build all images(slow)
    dck)      deletecheckpoint       ser)      showevalresult
    apc)      applyprecheck          dpc)      deleteprecheck
    2)        applytraining          3)        deletetraining
    4)        applyeval              5)        deleteeval
    4d)       applyevaldist          5d)       deleteevaldist
    4hr)      applyevalhitrate       5hr)      deleteevalhitrate
    6)        applyexport            7)        deleteexport
    8)        applyserving           9)        deleteserving
    10)       applypredict           11)       deletepredict
    12)       applyfeatureselection  13)       deletefeatureselection
    14)       applycustomizeaction   15)       deletecustomizeaction
    16)       applypytorchtraining   17)       deletepytorchtraining
    mt)       multidaystraining      dmt)      deletemultidaystraining
    me)       multidayseval          dme)      deletemultidayseval
    cnt)      createnotebook         dnt)      deletenotebook          snt)      shownotebooklink
    cft)      createsftp             dft)      deletesftp              sft)      showsftplink
    che)      createhue              dhe)      deletehue               she)      showhuelink
    chd)      createhttpd            dhd)      deletehttpd             shd)      showhttpdlink
    cvs)      createvscode           dvs)      deletevscode            svs)      showvscodelink
    a)        kubectl get tfjobs     b)        kubectl get sdep        c)        kubectl get pytorchjobs
    mp|mpl)   compile mlpipeline     bp|bpl)   compile bdpipeline      bu)       bdupload
    tb)       tensorboard            vc)       verifyconfigfile        spl)      showpaireclink
    tp)       kubectl top pods       tn)       kubectl top nodes       util)     show nodes utils
    logs)     show pod logs          setnl)    set k8s node label
    e|clean)  make clean             cleanall) make cleanall           sml)      showmilvuslink
    sall)     show KubeFlow/Grafana/K8SOverview/Spark/HDFS/Yarn/EMR link
    99)       kubectl get pods       99l)      kubectl get pods along with log url
    >
  2. Enter an option and press Enter to run it. Repeat this for each option that you want to run.
    You can use TensorBoard to view the AUC curve during training:
    1. Run the following command to access the ml_on_ds directory:
      cd /root/dsdemo/ml_on_ds
    2. Run the following command to run TensorBoard:
      sh run_tensorboard.sh

      Select tb to display the TensorBoard information at a checkpoint in the current experiment. Alternatively, run the sh run_tensorboard.sh 20211209 command to view the TensorBoard information at a checkpoint in the training task that was run on December 9, 2021.

      Note
      • By default, the model directory that is specified by the TODAY_MODELDIR parameter in the config file is used. You can also run a command such as sh run_tensorboard.sh hdfs://192.168.**.**:9000/user/easy_rec/20210923/ to specify a date-specific model directory.
      • You can modify parameters in the run_tensorboard.sh script based on your business requirements.
    3. Open your browser, enter http://<yourPublicIPAddress>:6006 in the address bar, and then press Enter to view the AUC curve on the page that appears.
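
Before you apply a training task from the menu, you can confirm that the train and test data and the _SUCCESS flags referenced by HDFSADDR and SUCCESSFILE exist. The following is a minimal sketch that shells out to the hdfs CLI; the address is the masked example from the sample config, so replace it with your own value.

  # Minimal sketch: check that the train/test data and _SUCCESS flags exist in
  # HDFS before you apply a training task. Replace HDFSADDR with the value from
  # your config file; the address below is the masked example from this topic.
  import subprocess

  HDFSADDR = "hdfs://192.168.**.**:9000/user/easy_rec/metric_learning_i2i"

  def hdfs_exists(path):
      # `hdfs dfs -test -e` returns exit code 0 when the path exists.
      result = subprocess.run(["hdfs", "dfs", "-test", "-e", path],
                              stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
      return result.returncode == 0

  for sub in ("train/_SUCCESS", "test/_SUCCESS"):
      path = HDFSADDR + "/" + sub
      print(path, "OK" if hdfs_exists(path) else "MISSING")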

(Optional) Step 3: Create an image for Hive CLI, Spark CLI, ds-controller, Hue, a notebook server, or HTTPd

Note
  • The purpose of creating an image for Hive CLI or Spark CLI is to submit Hive or Spark tasks for big data processing and generate the training data. If you have prepared the required data, you can skip this step. For Spark tasks, the Spark service in the Data Science cluster is used. For Hive tasks, a separate Hadoop or Hive service is required.
  • An image that is created for ds-controller is used for dynamic scaling.
  • Hive CLI
    Go to the directory where Hive CLI is installed and make an image.
    cd hivecli && make
  • Spark CLI
    Go to the directory where Spark CLI is installed and make an image.
    cd sparkcli && make
  • dscontroller
    Go to the directory where ds-controller is installed and make an image.
    cd dscontroller && make
  • Hue
    Go to the directory where Hue is deployed and make an image.
    cd hue && make
  • notebook
    Go to the directory where a notebook server is deployed and make an image.
    cd notebook && make
  • httpd
    Go to the directory where HTTP Daemon (HTTPd) is installed and make an image.
    cd httpd && make
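
Each make target above builds a Docker image and pushes it to your Container Registry repository, presumably tagged with the REGISTRY, NAMESPACE, *_REPOSITORY, and *_VERSION values from the config file. If you want to script this yourself, the following is a rough sketch that uses the Docker SDK for Python (pip3.7 install docker), which is not part of dsdemo; the build directory and tag values are assumptions based on the sample config.

  # Rough sketch using the Docker SDK for Python (pip3.7 install docker). This is
  # not part of dsdemo; it only illustrates how an image tag can be assembled from
  # the REGISTRY/NAMESPACE/REPOSITORY:VERSION values in the config file. The build
  # directory is an assumption; use the directory of the component that you build.
  import docker

  REGISTRY = "registry-vpc.cn-beijing.aliyuncs.com"  # config: REGISTRY
  NAMESPACE = "dsexperiment"                         # config: NAMESPACE
  REPOSITORY = "ds_hivecli"                          # config: HIVE_REPOSITORY
  VERSION = "latest"                                 # config: HIVE_VERSION

  tag = "{}/{}/{}:{}".format(REGISTRY, NAMESPACE, REPOSITORY, VERSION)

  client = docker.from_env()
  # Build from the hivecli directory, then push the image to ACR.
  image, build_logs = client.images.build(path="hivecli", tag=tag)
  for line in client.images.push("{}/{}/{}".format(REGISTRY, NAMESPACE, REPOSITORY),
                                 tag=VERSION, stream=True, decode=True):
      print(line)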

Step 4: Build a pipeline

For more information about the pipeline code, see the mlpipeline.py file. An illustrative sketch of a minimal pipeline is provided at the end of this step.

  1. Run the following command to access the ml_on_ds directory:
    cd /root/dsdemo/ml_on_ds
  2. Run the following command to build a pipeline:
    make mpl
    Note You can also run the sh allinone.sh command and select mpl to build a pipeline.

    After the pipeline is built, a file named ***_mlpipeline.tar.gz is generated. Use SSH Secure File Transfer Client to download the ***_mlpipeline.tar.gz file to your on-premises machine.
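
The package produced by make mpl is a pipeline definition compiled with the Kubeflow Pipelines (kfp) SDK. The following is a minimal, illustrative sketch of such a definition; it is not the actual mlpipeline.py, and the pipeline name, image, and commands are placeholders.

  # Minimal sketch of a kfp (v1 SDK) pipeline compiled into a .tar.gz package.
  # This is NOT the actual mlpipeline.py; the image and commands are placeholders.
  import kfp
  from kfp import dsl

  @dsl.pipeline(name="easyrec-demo-pipeline",
                description="Illustrative training/evaluation pipeline.")
  def easyrec_pipeline():
      train = dsl.ContainerOp(
          name="train",
          image="registry-vpc.cn-beijing.aliyuncs.com/dsexperiment/tf-easyrec-training:latest",
          command=["sh", "-c"],
          arguments=["echo run the training step here"],
      )
      evaluate = dsl.ContainerOp(
          name="evaluate",
          image="registry-vpc.cn-beijing.aliyuncs.com/dsexperiment/tf-easyrec-training:latest",
          command=["sh", "-c"],
          arguments=["echo run the evaluation step here"],
      )
      evaluate.after(train)  # run evaluation only after training finishes

  if __name__ == "__main__":
      # Produces demo_mlpipeline.tar.gz, analogous to the ***_mlpipeline.tar.gz
      # file that is generated by `make mpl`.
      kfp.compiler.Compiler().compile(easyrec_pipeline, "demo_mlpipeline.tar.gz")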

Step 5: Upload the ***_mlpipeline.tar.gz file

  1. In the Instance Info section of the Cluster Overview page, view the public IP address of the master node.
  2. Open your browser, enter http://<yourPublicIPAddress>:31380 in the address bar, and then press Enter.
    Note Replace <yourPublicIPAddress> with the public IP address that you obtained in the preceding step.
    The homepage of Kubeflow appears. Use the default anonymous namespace.
  3. In the left-side navigation pane, click Pipelines.
  4. In the upper-right corner of the Pipelines page, click Upload pipeline.
  5. On the Upload Pipeline or Pipeline Version page, configure the Pipeline Name and Pipeline Description parameters, select Upload a file, and then select the ***_mlpipeline.tar.gz file.
  6. Click Create.
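
As an alternative to the web UI, the package can also be uploaded from the master node with the kfp SDK that was installed in Step 1. The following is a minimal sketch; the host address, file name, and pipeline name are placeholders.

  # Minimal sketch: upload the compiled package with the kfp (v1) SDK instead of
  # the web UI. The host, file name, and pipeline name are placeholders.
  import kfp

  client = kfp.Client(host="http://<yourPublicIPAddress>:31380/pipeline")
  pipeline = client.upload_pipeline(
      pipeline_package_path="demo_mlpipeline.tar.gz",
      pipeline_name="easyrec-demo-pipeline",
  )
  print(pipeline.id, pipeline.name)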

Step 6: Create and run an experiment

  1. In the left-side navigation pane, click Experiments.
  2. In the upper-right corner of the page that appears, click Create experiment.
  3. On the New experiment page, specify Experiment name.
  4. Click Next.
  5. On the Start a run page, configure parameters.
    1. Select the ***_mlpipeline.tar.gz file that you downloaded to your on-premises machine in Step 4: Build a pipeline.
    2. Select Recurring for Run Type.
  6. Click Start.
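
The experiment and the recurring run can also be created with the kfp SDK. The following is a minimal sketch that assumes the package uploaded in the previous step; the names, host address, and cron expression are placeholders.

  # Minimal sketch: create an experiment and a recurring run with the kfp (v1)
  # SDK, mirroring the Recurring run type selected in the UI. The names, host,
  # and cron expression are placeholders.
  import kfp

  client = kfp.Client(host="http://<yourPublicIPAddress>:31380/pipeline")

  experiment = client.create_experiment(name="easyrec-demo-experiment")
  job = client.create_recurring_run(
      experiment_id=experiment.id,
      job_name="easyrec-demo-daily",
      cron_expression="0 0 2 * * ?",   # every day at 02:00
      pipeline_package_path="demo_mlpipeline.tar.gz",
  )
  print("created recurring run:", job.id)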

(Optional) Step 7: View the status of the pipeline

You can view the status of the pipeline on the Experiments page.

Step 8: Perform model prediction

  • (Recommended) Use the HTTP request method

    All development languages are supported for this method. The predict_rest.sh file contains the prediction code. A Python sketch that sends the same request is provided after this list.

    Run the following command to perform model prediction:
    Important In the command, default indicates the default namespace. easyrec-tfserving is the default name that is used to deploy the Serving service. You can modify configurations based on your business requirements.
    #!/bin/sh
    curl -X POST http://127.0.0.1:31380/seldon/default/easyrec-tfserving/api/v1.0/predictions -H 'Content-Type: application/json' -d '
    { 
    "jsonData": { 
        "inputs": {
            "app_category":["10","10"],
            "app_domain":["1005","1005"],
            "app_id":["0","0"],
            "banner_pos":["85f751fd","4bf5bbe2"],
            "c1":["c4e18dd6","6b560cc1"],
            "c14":["50e219e0","28905ebd"],
            "c15":["0e8e4642","ecad2386"],
            "c16":["b408d42a","7801e8d9"],
            "c17":["09481d60","07d7df22"],
            "c18":["a99f214a","a99f214a"],
            "c19":["5deb445a","447d4613"],
            "c20":["f4fffcd0","cdf6ea96"],
            "c21":["1","1"],
            "device_conn_type":["0","0"],
            "device_id":["2098","2373"],
            "device_ip":["32","32"],
            "device_model":["5","5"],
            "device_type":["238","272"],
            "hour":["0","3"],
            "site_category":["56","5"],
            "site_domain":["0","0"],
            "site_id":["5","3"]
        }
    }
    }'
    The following output is returned:
    {"jsonData":{"outputs":{"logits":[-7.20718098,-4.15874624],"probs":[0.000740694755,0.0153866885]}},"meta":{}}
  • Use Seldon Core
    Run the following command to perform model prediction over the REST protocol:
    python3.7 predict_rest.py
    The following output is returned:
    Response:
    {'jsonData': {'outputs': {'logits': [-2.66068792, 0.691401482], 'probs': [0.0653333142, 0.66627866]}}, 'meta': {}}
    Note For information about the prediction code, see the predict_rest.py file.
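
The same request can also be sent from Python with the requests library. The following is a minimal sketch that mirrors the curl example above; it is not the actual predict_rest.py, and the host, namespace (default), and deployment name (easyrec-tfserving) follow the defaults used in this topic.

  # Minimal sketch: send the same prediction request as the curl example with the
  # requests library. This is NOT the actual predict_rest.py; host, namespace, and
  # deployment name follow the defaults used in this topic.
  import requests

  URL = "http://127.0.0.1:31380/seldon/default/easyrec-tfserving/api/v1.0/predictions"

  payload = {
      "jsonData": {
          "inputs": {
              "app_category": ["10", "10"],
              "app_domain": ["1005", "1005"],
              "app_id": ["0", "0"],
              # ... remaining features, identical to the curl example above ...
              "site_id": ["5", "3"],
          }
      }
  }

  response = requests.post(URL, json=payload, timeout=10)
  response.raise_for_status()
  print(response.json())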

Feedback

If you have any questions when you use a Data Science cluster, contact technical support for further assistance. You can also join the DingTalk group numbered 32497587 for feedback or communication.