The Data Science service is built on E-MapReduce (EMR) and integrates several artificial intelligence (AI) components provided by Alibaba Cloud Machine Learning Platform for Artificial Intelligence (PAI). These components include Alink (a machine learning algorithm platform), Faiss (a vector calculation engine), and the deep learning frameworks TensorFlow and PyTorch.

Background information

A Data Science cluster provides various features in big data and AI scenarios:
  • Supports distributed model training based on TensorFlow and PAI-TF and provides a built-in EasyRec algorithm package.
  • Supports Alink-based stream processing, batch processing, and machine learning algorithms.
  • Supports the JupyterHub and Zeppelin services.
  • Supports Spark.
  • Provides an AutoML package to support machine learning.
  • Provides various tools, such as EASCMD, Redis, Hive CLI, and Faiss-Server.

Intended users

The Data Science service is intended for the following users:
  • Users of open source big data systems
  • Users of intelligent recommendation and risk management solutions powered by the AI technology of Alibaba Cloud

Create a cluster

Log on to the EMR console and create a Data Science cluster. For more information, see Create a cluster.
Note If the cluster is not created within 15 minutes, click History in the upper-right corner of the Cluster Management page to view the error. If you cannot resolve the error, contact O&M personnel or submit a ticket.
  1. In the Software Settings step, select a zone, an EMR version, and optional services.
    • Zone: Only some zones are supported. You can view the supported zones of each region on the buy page.
    • EMR Version: The latest EMR version is displayed by default.

      EMR V3.29.1 is used as an example in this topic.

    • Optional Services: Select optional services, such as TensorFlow, based on your business requirements.
  2. In the Hardware Settings step, specify a VPC, a vSwitch, and a security group. To create a security group, go to the ECS console.

    You must enable port 8443 to allow access to web UIs of related components. For more information, see Configure security group rules.

  3. In the Basic Settings step, add a Knox account, which is used to log on to the Knox service.

    The added Knox account is a RAM user within your Alibaba Cloud account. For more information about how to add a Knox account, see Manage user accounts.

View cluster details

After the cluster is created, you can view its running status on the Cluster Management page.

Enable port 8443

Enable port 8443 of the created cluster to access the web UIs of services, such as YARN and HDFS. For more information about how to enable the port, see Access the web UIs of open source components.
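
To confirm from your own machine that the security group rule has taken effect, you can probe the port before you open a browser. The following is a minimal sketch, not part of the EMR tooling: check_port is a hypothetical helper, and it assumes that Bash (for its /dev/tcp pseudo-device) and the timeout utility from GNU coreutils are available locally.

```shell
#!/usr/bin/env bash
# check_port HOST PORT
# Hypothetical helper: returns 0 if a TCP connection to HOST:PORT can be
# opened within 3 seconds, and a non-zero status otherwise.
check_port() {
  local host="$1" port="$2"
  # /dev/tcp/HOST/PORT is a Bash feature. The redirection fails if the
  # connection is refused, and timeout ends an attempt that gets no reply.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

For example, `check_port <master-public-ip> 8443 && echo reachable` prints reachable only if the connection succeeds. <master-public-ip> is a placeholder for the public IP address of your master node.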

View log data

You can access the web UI of a service to view the service log data. Example:

  1. Access the web UI of YARN.

    For more information, see Access the web UIs of open source components.

  2. Click History in the row where your application is located.
  3. Click Logs.

    You can view log data.
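
If you prefer a terminal to the web UI, the same logs can usually be retrieved with the YARN CLI after you log on to the cluster in SSH mode. The sketch below wraps yarn logs -applicationId. The names fetch_yarn_logs and DRY_RUN are hypothetical and introduced only for illustration, and the command assumes that YARN log aggregation is enabled on the cluster.

```shell
#!/usr/bin/env bash
# fetch_yarn_logs APP_ID [OUT_FILE]
# Hypothetical helper: saves the aggregated logs of one YARN application
# to OUT_FILE (default: <APP_ID>.log).
fetch_yarn_logs() {
  local app_id="$1" out="${2:-${1}.log}"
  # Loose sanity check: YARN application IDs look like
  # application_<clusterTimestamp>_<sequence>.
  case "$app_id" in
    application_[0-9]*_[0-9]*) ;;
    *) echo "invalid application ID: $app_id" >&2; return 2 ;;
  esac
  # DRY_RUN=1 (an assumption of this sketch) prints the command instead
  # of running it, which is useful on a machine without the yarn client.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "yarn logs -applicationId $app_id > $out"
    return 0
  fi
  yarn logs -applicationId "$app_id" > "$out"
}
```

The application ID is the value shown in the ID column of the YARN web UI.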

Resize a disk

You can resize the data disks of a cluster if the disk space is insufficient. For more information, see Expand disks.

Log on to a worker node

  1. Log on to your cluster in SSH mode.

    For more information, see Connect to the master node of an EMR cluster in SSH mode.

  2. Switch to the hadoop user.
    su hadoop
  3. Obtain the IP address of the worker node.
    cat /etc/hosts | grep worker
    Information similar to the following output is returned:
    192.168.**.**    emr-worker-2.cluster-20**** emr-worker-2 emr-header-3.cluster-20**** emr-header-3 iZbp19nv7e19wx1ub0t****
    192.168.**.**    emr-worker-1.cluster-20**** emr-worker-1 emr-header-2 emr-header-2.cluster-20**** iZbp19nv7e19wx1ub0t****
                            
    Note 192.168.**.** indicates the IP address of the worker node.
  4. Log on to the worker node without a password.
    ssh <yourWorkIp>
    Note yourWorkIp indicates the IP address that you obtained.
  5. Use the sudo command on the worker node to run commands as the root user. For example, to install a Python package:
    sudo pip3.7 install xxx
    Note xxx indicates the name of the package that you want to install.
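
When the same command must run on several worker nodes, the lookup in step 3 can be scripted. The following sketch assumes the /etc/hosts layout shown in the sample output above. The name list_workers is a hypothetical helper, not an EMR command.

```shell
#!/usr/bin/env bash
# list_workers [HOSTS_FILE]
# Hypothetical helper: prints "<ip> <short-hostname>" for every worker
# entry in HOSTS_FILE (default: /etc/hosts).
list_workers() {
  local hosts_file="${1:-/etc/hosts}"
  # Skip comment lines, then look for an alias of the form emr-worker-<n>
  # among the hostname columns and print it next to the IP address.
  awk '!/^[[:space:]]*#/ {
         for (i = 2; i <= NF; i++)
           if ($i ~ /^emr-worker-[0-9]+$/) { print $1, $i; break }
       }' "$hosts_file"
}
```

To run a command on every worker, pipe the result into a loop such as `list_workers | while read -r ip name; do ssh -n "$ip" hostname; done`. The -n option keeps ssh from consuming the input of the loop.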

Use EasyRec

The EasyRec algorithm library includes mainstream algorithms, such as DeepFM, DIN, and MultiTower. An EMR Data Science cluster provides a built-in EasyRec algorithm library for you to use. For more information about EasyRec, see EasyRec.

Assign a public IP address to a worker node

  1. On the Instances page of the EMR console, click the ID of the worker node in the ECS ID/Instance Name column.
  2. On the Instances page of the ECS console, click Bind EIP.
  3. In the Bind EIP dialog box, select an existing elastic IP address (EIP) or click Create EIP to create an EIP.

    For more information about how to create an EIP, see Apply for new EIPs.