Overview

Last Updated: Dec 20, 2023

The Data Science service is built on top of E-MapReduce (EMR) and integrates some artificial intelligence (AI) components provided by Alibaba Cloud Machine Learning Platform for AI (PAI). These components include Faiss (a vector calculation engine) and the TensorFlow and PyTorch deep learning frameworks.

Background information

A Data Science cluster provides various features in big data and AI scenarios:

  • Supports distributed model training based on TensorFlow and PAI-TF and provides a built-in EasyRec algorithm package.

  • Supports the JupyterHub and Zeppelin services.

  • Supports Spark.

  • Provides various tools, such as EASCMD, Redis, Hive CLI, and Faiss-Server.

Intended users

The Data Science service is intended for the following users:

  • Users of open source big data systems

  • Users of intelligent recommendation and risk management solutions powered by the AI technology of Alibaba Cloud

Create a cluster

Log on to the EMR console and create a Data Science cluster.

Note

If the cluster is not created within 15 minutes, click History in the upper-right corner of the Cluster Management page to view the error. If you cannot resolve the error, contact the O&M personnel.

When you create a cluster, take note of the following points:

  • Select a region where you want to create a cluster and an EMR version.

    • Region: You can select a region in the top navigation bar after you log on to the EMR console. You can view the supported regions on the buy page.

    • EMR version: The latest EMR version is displayed by default.

      EMR V3.35.7 is used as an example in this topic.

    • Optional services (optional): Select optional services based on your business requirements. For example, select TensorFlow from the optional services.

  • In the Hardware Settings step, specify a VPC, a vSwitch, and a security group. To create a new security group, go to the ECS console.

    You must enable port 8443 to allow access to web UIs of related components.

  • In the Basic Settings step, add a Knox account, which is used to log on to the Knox service.

    The added Knox account is a RAM user within your Alibaba Cloud account. For more information about how to add a Knox account, see Manage user accounts.

View cluster details

After a cluster is created, you can view the status of the cluster on the Cluster Management page.

Enable port 8443

Enable port 8443 of the cluster that you created to access the web UIs of services, such as YARN and HDFS.
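After you enable the port, a quick reachability check from your own machine can confirm that the security group rule took effect. The following is a minimal sketch, not an official EMR tool: the function name port_open is hypothetical, and the check assumes that bash (with its /dev/tcp redirection) and the coreutils timeout command are available locally.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within the timeout,
# and a non-zero status otherwise.
# Assumption: bash /dev/tcp redirection and coreutils `timeout` are available.
port_open() {
  local host="$1" port="$2" seconds="${3:-3}"
  timeout "$seconds" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Example call (replace the placeholder with the public IP address of your node):
# port_open <yourPublicIp> 8443 && echo "port 8443 is reachable"
```

A non-zero exit status only means that the connection could not be established. The cause may be the security group rule, a missing public IP address, or a local firewall.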

View log data

You can access the web UI of a service to view the service log data. Example:

  1. Access the web UI of YARN.

  2. Click History in the row where your application is located.

  3. Click the link of Log URL.

  4. Search for log data from the bottom upwards. Click log in the row where the log data that you want to view is located.

    View log details.
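If you prefer the command line, the same bottom-up search can be approximated on a log file that you saved locally. The following is a minimal sketch under stated assumptions: the function name last_errors and the file name app.log are hypothetical, the log was saved with the Hadoop command yarn logs -applicationId <applicationId> (which requires log aggregation to be enabled), and GNU tac and grep are available.

```shell
#!/usr/bin/env bash
# Print the most recent lines that mention an error or exception,
# newest first, by reading the log file from the bottom upwards.
# Arguments: $1 = log file, $2 = maximum number of matches (default 5).
last_errors() {
  local log_file="$1" max="${2:-5}"
  tac "$log_file" | grep -m "$max" -i -E 'error|exception'
}

# Example calls (app.log is a hypothetical file name):
# yarn logs -applicationId <applicationId> > app.log
# last_errors app.log 10
```

Reading from the bottom first is useful because the last error in the log is usually the one that ended the application.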

Log on to a worker node

  1. Log on to your cluster in SSH mode.

  2. Switch to the hadoop user.

    su hadoop

  3. Obtain the IP address of the desired worker node.

    cat /etc/hosts | grep worker

    Information similar to the following output is returned:

    192.168.**.**    emr-worker-2.cluster-20**** emr-worker-2 emr-header-3.cluster-20**** emr-header-3 iZbp19nv7e19wx1ub0t****
    192.168.**.**    emr-worker-1.cluster-20**** emr-worker-1 emr-header-2 emr-header-2.cluster-20**** iZbp19nv7e19wx1ub0t****

    Note

    192.168.**.** indicates the IP address of the worker node.

  4. Log on to the worker node in password-free mode.

    ssh <yourWorkIp>

    Note

    <yourWorkIp> indicates the IP address that you obtained in the previous step.

  5. Use the sudo command on the worker node to run commands as the root user. Example:

    sudo pip3.7 install xxx

    Note

    In this example, xxx indicates the name of the package that you want to install. You can run other commands with sudo in the same way.
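When the same command must be run on every worker node, the steps above can be scripted. The following is a minimal sketch, not part of EMR: the function names list_worker_ips and run_on_workers and the HOSTS_FILE and DRY_RUN variables are hypothetical. It assumes that worker entries in /etc/hosts contain a host name of the form emr-worker-N, as in the sample output, and that you run it as the hadoop user so that password-free SSH works.

```shell
#!/usr/bin/env bash
# Print the IP address of every worker node listed in a hosts file.
# Assumption: a worker entry has a host name of the form emr-worker-<number>.
list_worker_ips() {
  local hosts_file="${1:-/etc/hosts}"
  awk '{
    for (i = 2; i <= NF; i++)
      if ($i ~ /^emr-worker-[0-9]+$/) { print $1; break }
  }' "$hosts_file" | sort -u
}

# Run a command on every worker node over password-free SSH.
# Set DRY_RUN=1 to print the ssh commands instead of running them.
# Set HOSTS_FILE to read a file other than /etc/hosts.
run_on_workers() {
  local ip
  for ip in $(list_worker_ips "${HOSTS_FILE:-/etc/hosts}"); do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "ssh $ip $*"
    else
      ssh "$ip" "$@"
    fi
  done
}

# Example call (replace xxx with the name of the package):
# DRY_RUN=1 run_on_workers sudo pip3.7 install xxx
```

Running with DRY_RUN=1 first lets you review which nodes would be touched before anything is installed.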

Use EasyRec

The EasyRec algorithm library includes mainstream algorithms, such as DeepFM, DIN, and MultiTower. An EMR Data Science cluster provides a built-in EasyRec algorithm library for you to use.

Assign a public IP address to a worker node

  1. On the Instances page of the EMR console, click the ID of the desired worker node in the ECS ID/Instance Name column.

  2. On the Instances page of the ECS console, click Bind EIP.

  3. In the Bind EIP dialog box, select an existing elastic IP address (EIP) or click Create EIP to create an EIP.

    For more information about how to create an EIP, see Apply for an EIP.