Run Spark on Amazon EC2

The ec2 directory in Spark contains the spark-ec2 script, which helps you launch, manage, and shut down Spark clusters on Amazon EC2. The script automatically sets up Spark and HDFS on the EC2 cluster. This topic describes how to use the spark-ec2 script to launch and shut down a cluster and how to submit jobs to the cluster. Note that you must first register an EC2 account on the Amazon Web Services site.

 

spark-ec2 can manage multiple named clusters. You can use it to launch a new cluster (you provide the cluster size and a cluster name), shut down an existing cluster, or log on to a cluster. The machines in each cluster are placed in separate EC2 security groups whose names are derived from the cluster name. For example, for a cluster named test, the master node is assigned to a security group called test-master, while the slave nodes are assigned to test-slaves. The spark-ec2 script creates these security groups automatically based on the cluster name you provide, and you can use the names to find the machines in the Amazon EC2 Console.

  • First, create an Amazon EC2 key pair. To do this, log on to your Amazon Web Services account, open the AWS console, click Key Pairs in the sidebar, create a key pair, and download the private key file. You must also set permission 600 on the private key file (that is, readable and writable only by you) so that you can use it to log on over ssh (see the example after this list).
  • Before using spark-ec2, set the two environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to your Amazon EC2 access key ID and secret access key. You can find them on the AWS homepage under Account > Security Credentials > Access Credentials.
  • Switch to the ec2 directory in the Spark distribution that you downloaded.
  • Run ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>, where <keypair> is the name of your Amazon EC2 key pair (the name you specified when you created the key pair), <key-file> is the private key file of that key pair, <num-slaves> is the number of slave nodes (at least 1), and <cluster-name> is the name of the cluster.
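
For instance, if you saved the private key as awskey.pem (the file name used in the example below), you can set the required permissions like this:

chmod 600 awskey.pem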

Example:

export AWS_SECRET_ACCESS_KEY=AaBbCcDdEeFGgHhIiJjKkLlMmNnOoPpQqRrSsTtU
export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123

./spark-ec2 --key-pair=awskey \
--identity-file=awskey.pem \
--region=us-west-1 \
--zone=us-west-1a \
launch my-spark-cluster
  • After the cluster is launched, check that the cluster scheduler has started and that all slave nodes appear correctly in its web UI. The web UI link is printed on screen after the script finishes (it is usually http://<master-hostname>:8080).
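
As a quick sanity check, you can fetch the master web UI from your own machine and confirm that the worker (slave) nodes are listed; this is just a sketch, and <master-hostname> stands for the hostname printed by the script:

curl -s http://<master-hostname>:8080 | grep -i worker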

You can run ./spark-ec2 --help to view more options. The following are some of the important ones:

  • --instance-type=<instance-type>: specifies the instance type of the EC2 machines. Currently, the script only supports 64-bit instance types.
  • --region=<ec2-region>: specifies the region in which to deploy the EC2 cluster. The default region is us-east-1.
  • --zone=<ec2-zone>: specifies the availability zone in which to launch the EC2 instances. Note that you may sometimes need to launch the cluster in a different zone because capacity can be insufficient in some zones.
  • --ebs-vol-size=<GB>: attaches an EBS volume of the given total capacity to each node. These volumes are persistent and survive cluster restarts.
  • --spot-price=<price>: launches the worker nodes as Spot Instances, which are allocated on demand, bidding up to the given maximum Spot price (in US dollars).
  • --spark-version=<version>: preloads the specified version of Spark into the cluster. The version can be a release number (such as 0.7.3) or a git hash. By default, the latest release is used.
  • --spark-git-repo=<repository url>: specifies a custom git repository from which to check out and build a specific version of Spark. By default, the Apache GitHub mirror is used. When this option is combined with --spark-version, the version cannot be a release number; it must be the hash of a git commit (for example, 317e114) in that repository.
  • If the launch fails for some reason (for example, the private key file does not have the correct permissions), you can use the --resume option to restart the deployment process for the existing cluster.
  • To launch inside a VPC, run ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --vpc-id=<vpc-id> --subnet-id=<subnet-id> launch <cluster-name>, where <keypair> is your EC2 key pair (created earlier), <key-file> is the private key file of that key pair, <num-slaves> is the number of slave nodes (if this is your first time, you can start with 1), <vpc-id> is the ID of your VPC, <subnet-id> is the ID of your subnet, and <cluster-name> is the name of your cluster.

Example:

export AWS_SECRET_ACCESS_KEY=AaBbCcDdEeFGgHhIiJjKkLlMmNnOoPpQqRrSsTtU
export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123

./spark-ec2 --key-pair=awskey \
--identity-file=awskey.pem \
--region=us-west-1 \
--zone=us-west-1a \
--vpc-id=vpc-a28d24c7 \
--subnet-id=subnet-4eb27b39 \
--spark-version=1.1.0 \
launch my-spark-cluster
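
For comparison, the following sketch combines several of the other options listed above; the instance type, EBS volume size, spot price, and slave count are illustrative values only:

./spark-ec2 --key-pair=awskey \
--identity-file=awskey.pem \
--region=us-west-1 \
--zone=us-west-1a \
--instance-type=m3.large \
--ebs-vol-size=50 \
--spot-price=0.15 \
--slaves=2 \
launch my-spark-cluster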
  • Go to the ec2 directory in the Spark distribution that you downloaded.
  • Run ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name> to log on to your EC2 cluster remotely, where <keypair> and <key-file> are as described earlier in this topic. (This is provided for convenience; you can also use the EC2 console.)
  • If you need to deploy code or data to the EC2 cluster, log on and use the ~/spark-ec2/copy-dir script, specifying the directory that you want to RSYNC to all the slave nodes (see the sketch after this list).
  • If your application needs to access a large dataset, the fastest way is to load it from Amazon S3 or an Amazon EBS device into HDFS on your cluster. The spark-ec2 script has already set up an HDFS instance for you; it is installed in /root/ephemeral-hdfs, and you can use the bin/hadoop script in that directory to access it. Note that data in this HDFS is deleted when the cluster is stopped or restarted.
  • The cluster also has a persistent HDFS, installed in /root/persistent-hdfs. Data stored in this HDFS is retained even if the cluster is restarted. However, by default this HDFS has relatively little space on each node (about 3 GB); you can use the spark-ec2 option --ebs-vol-size to specify how much space each node uses for persistent HDFS.
  • Finally, if your application fails, you can check its logs on the slave nodes. The logs are located in the scheduler's work directory (/root/spark/work). You can also check the cluster status in the web UI (http://<master-hostname>:8080).
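
For example, after logging on to the master node, the following sketch copies an application directory to the slaves and loads a dataset from S3 into the ephemeral HDFS. The application path, bucket, and dataset names are hypothetical, and the S3 step assumes the cluster can authenticate to S3 (for example, via the --copy-aws-credentials launch option described later):

# sync application code under /root/my-app (hypothetical path) to all slave nodes
~/spark-ec2/copy-dir /root/my-app
# pull a dataset from S3 into the ephemeral HDFS (bucket and path are hypothetical)
/root/ephemeral-hdfs/bin/hadoop distcp s3n://my-bucket/dataset /dataset
# verify that the data is now in HDFS
/root/ephemeral-hdfs/bin/hadoop fs -ls /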

You can edit the /root/spark/conf/spark-env.sh file on each node to set Spark configuration options (such as JVM parameters). Once this file is changed, it must be copied to every node in the cluster. The simplest way is to use the copy-dir script: first edit spark-env.sh on the master node, then run ~/spark-ec2/copy-dir /root/spark/conf to RSYNC the conf directory to all the worker nodes.
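
A minimal sketch of that workflow, run on the master node (the configuration line shown is illustrative only):

# append a setting to spark-env.sh on the master (the value is just an example)
echo 'export SPARK_JAVA_OPTS="-verbose:gc"' >> /root/spark/conf/spark-env.sh
# propagate the updated conf directory to every worker node
~/spark-ec2/copy-dir /root/spark/conf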

The configuration guide section describes the available options.

Please note that there is no way to recover data from an EC2 node after it has been shut down, so make sure that all important data is copied or backed up before you shut nodes down.

  • Switch to the ec2 directory under Spark.
  • Run ./spark-ec2 destroy <cluster-name>.
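
For example, using the region and cluster name from the earlier examples:

./spark-ec2 --region=us-west-1 destroy my-spark-cluster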

The spark-ec2 script also supports suspending a cluster. In this case, the virtual machines backing the cluster instances are stopped but not destroyed. Data on temporary (ephemeral) disks is lost, but data on the root partitions and in the persistent HDFS (persistent-hdfs) is kept. Stopped instances do not consume EC2 cycles (which means you do not pay for the instances), but you continue to pay for EBS storage.

  • To stop a cluster, switch to the ec2 directory and run ./spark-ec2 --region=<ec2-region> stop <cluster-name>.
  • If you want to restart it later, run ./spark-ec2 -i <key-file> --region=<ec2-region> start <cluster-name> (see the sketch after this list).
  • To destroy the cluster and stop consuming EBS storage, run ./spark-ec2 --region=<ec2-region> destroy <cluster-name> (as described in the previous section).
  • to "cluster computing" support have a limit-cannot specify a local group. However, you can manually start some slave nodes in the-slaves group, and then use the spark-ec2 launch - resume command to form a cluster.

If you find some new restrictions or have any suggestions, please contribute to the community.

The Spark file interface allows you to access data on Amazon S3 through the same URI formats that Hadoop supports. You can specify an S3 path in the form s3n://<bucket>/path. When launching a Spark cluster, you can use the --copy-aws-credentials option to copy the AWS credentials needed to access S3. For the full set of Hadoop libraries required for S3 access, see the Hadoop S3 page.

In addition, when accessing S3, you can supply not only a single file path but also the path of an entire directory.
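
For instance, launching with the --copy-aws-credentials option mentioned above, and then referring to S3 data by file or by directory, might look like the following sketch (the bucket and paths are hypothetical):

./spark-ec2 --key-pair=awskey \
--identity-file=awskey.pem \
--copy-aws-credentials \
launch my-spark-cluster
# inside a job, data can then be referenced as a single object or a whole directory, e.g.
#   s3n://my-bucket/data/part-00000
#   s3n://my-bucket/data/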

  
