
Practices of Simulating IDC Spark Read and Write MaxCompute

This article uses EMR (Cloud Hadoop) to simulate a local Hadoop cluster accessing MaxCompute data.

By Yueyi Yang

1. Background

1.1 Background Information

The existing lakehouse architecture is centered on MaxCompute, which reads and writes Hadoop cluster data. In some on-premises IDC scenarios, customers are unwilling to expose internal cluster information to the public network, so access to cloud data has to be initiated from the Hadoop cluster. This article uses EMR (Hadoop on the cloud) to simulate a local Hadoop cluster accessing MaxCompute data.

1.2 Basic Architecture

[Figure: basic architecture]

2. Construction of Development Environment

2.1 Prepare the EMR Environment

(1) Purchase

① Log on to the Alibaba Cloud console and click Console in the upper-right corner.

② On the navigation page, click Products, then E-MapReduce. You can also search for it.

③ On the E-MapReduce homepage, click EMR on ECS and create a cluster.

Please refer to the official document for specific purchase details: https://www.alibabacloud.com/help/en/e-mapreduce/latest/getting-started#section-55q-jmm-3ts

④ Click the cluster ID to view the basic information, cluster services, node management, and other modules.

(2) Log In

Please refer to the official document for more information about how to log on to the cluster: https://www.alibabacloud.com/help/en/e-mapreduce/latest/log-on-to-a-cluster

This article logs on through the ECS instance as an example.

① In the Alibaba Cloud console, click ECS.

② Click the instance name, click Remote Connection, and then select Workbench Remote Connection.

2.2 Local IDEA Preparation

(1) Install Maven

Please see the article below for more information: https://blog.csdn.net/l32273/article/details/123684435 (Article in Chinese).

(2) Create a Scala Project

① Download the Scala plugin

② Install the Scala SDK

③ Create a Scala project

2.3 Data Preparation of MaxCompute

(1) Project

Please refer to the official document for more information about how to create a MaxCompute project: https://www.alibabacloud.com/help/en/maxcompute/latest/create-a-maxcompute-project

(2) AccessKey

The AccessKey (AK) pair used to access Alibaba Cloud APIs includes the AccessKey ID and AccessKey secret. After you create an Alibaba Cloud account on the official site (alibabacloud.com), an AccessKey pair is generated on the AccessKey Management page. AccessKey pairs are used to identify users and verify the signature of requests for accessing MaxCompute or other Alibaba Cloud services or connecting to third-party tools. Keep your AccessKey Secret confidential to prevent credential leaks. If there is a leak, disable or update your AccessKey immediately.

You can create and manage AccessKey pairs on the AccessKey Management page: https://ram.console.aliyun.com/manage/ak
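Because the test commands later in this article pass the AccessKey pair as command-line arguments, one way to keep the secret out of scripts is to read it from environment variables. The following is a minimal sketch, not part of the original sample code: the variable names ALIYUN_ACCESS_KEY_ID and ALIYUN_ACCESS_KEY_SECRET and the function name require_access_key are assumptions chosen for illustration.

```shell
# Hypothetical helper: fail early if the AccessKey pair is not in the environment.
# The variable names are illustrative; MaxCompute itself does not require them.
require_access_key() {
    if [ -z "${ALIYUN_ACCESS_KEY_ID:-}" ] || [ -z "${ALIYUN_ACCESS_KEY_SECRET:-}" ]; then
        echo "set ALIYUN_ACCESS_KEY_ID and ALIYUN_ACCESS_KEY_SECRET first" >&2
        return 1
    fi
}
```

A script can call require_access_key before submitting a job and then pass "$ALIYUN_ACCESS_KEY_ID" and "$ALIYUN_ACCESS_KEY_SECRET" as arguments instead of hard-coding the values.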

(3) Endpoint

The endpoint is the connection address of the MaxCompute service. It varies based on the region and the network connection mode.

Please see the official document for more information about the region endpoint: https://www.alibabacloud.com/help/en/maxcompute/latest/prepare-endpoints.
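To illustrate how the endpoint depends on region and network mode, here is a small sketch. It assumes the URL pattern of the endpoint used later in this article (service.cn-beijing.maxcompute.aliyun-inc.com) and assumes the public form follows the same pattern with aliyun.com. The function name mc_endpoint and the mode names are hypothetical, so always confirm the actual address against the official endpoint list.

```shell
# Hypothetical helper: build an endpoint URL from a region ID and a network mode.
# The URL pattern is an assumption based on the endpoint shown in this article;
# verify real endpoints against the official endpoint documentation.
mc_endpoint() {
    region="$1"
    mode="${2:-public}"
    case "$mode" in
        public)   domain="aliyun.com" ;;      # access over the Internet
        internal) domain="aliyun-inc.com" ;;  # the -inc form used in YARN mode
        *) echo "unknown network mode: $mode" >&2; return 1 ;;
    esac
    echo "http://service.${region}.maxcompute.${domain}/api"
}
```

For example, mc_endpoint cn-beijing internal prints the -inc address that the YARN-mode test uses.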

(4) Table

Please see the official document for more information about how to create a MaxCompute table: https://www.alibabacloud.com/help/en/maxcompute/latest/ddl-sql-table-operations

The tests in this article require one partitioned table and one non-partitioned table.
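As a rough illustration of the two table types, the sketch below prints DDL statements that can be pasted into a MaxCompute SQL client. The table names, columns, and the partition column pt are purely illustrative assumptions. The schema that the sample tests actually expect is defined in the test code, so adjust the statements accordingly.

```shell
# Hypothetical example: print DDL for one non-partitioned and one partitioned table.
# Table names, columns, and the partition column are illustrative only.
print_test_ddl() {
    cat <<'SQL'
CREATE TABLE IF NOT EXISTS mc_test_table (id BIGINT, name STRING);
CREATE TABLE IF NOT EXISTS mc_test_pt_table (id BIGINT, name STRING)
PARTITIONED BY (pt STRING);
SQL
}
```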

3. Code Testing

3.1 Prerequisites

(1) Prepare the project, AccessKey information, and table data on MaxCompute

(2) Prepare the E-MapReduce cluster

(3) Connect a terminal to the E-MapReduce node (the ECS instance)

(4) Configure the Scala and Maven environment variables in IDEA and download the Scala plugin

3.2 Sample Code Description

https://github.com/aliyun/aliyun-maxcompute-data-collectors/blob/master/spark-datasource-v3.1/src/test/scala/PartitionDataReaderTest.scala

3.3 Package and Upload

(1) After Writing the Code Locally, Package It with Maven

(2) Compile the jar Package Locally

① Enter the project directory:

cd ${project.dir}/spark-datasource-v3.1

② Run the mvn command to build spark-datasource:

mvn clean package jar:test-jar

③ Check that the target directory contains the jar-with-dependencies.jar and tests.jar files
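The check in step ③ can also be scripted. The following is a minimal sketch. The function name check_build_jars is hypothetical, and the file suffixes are taken from the jar names used in the spark-submit commands in this article.

```shell
# Hypothetical helper: verify that the Maven build produced both jars.
check_build_jars() {
    target_dir="$1"
    status=0
    for suffix in jar-with-dependencies.jar tests.jar; do
        # ls exits non-zero when the glob matches no file.
        if ! ls "$target_dir"/*-"$suffix" >/dev/null 2>&1; then
            echo "missing: *-$suffix in $target_dir" >&2
            status=1
        fi
    done
    return "$status"
}
```

Run it from the project directory as check_build_jars target. It returns a non-zero status and names the missing file if either jar is absent.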

(3) Upload the jar Package to the Server

① Upload with the scp command:

scp [local jar package path] root@[ecs instance public IP]:[server storing jar package path]

② Check the uploaded files on the server

③ Copy the jar packages to the other nodes:

scp -r [path of this server to store jar packages] root@[ecs instance private IP]:[address of the receiving server to store jar packages]
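When the cluster has several nodes, the node-to-node copy has to be repeated for each private IP. The sketch below only prints the scp commands so that they can be reviewed before running. The function name print_scp_cmds is hypothetical, and the paths and IP addresses in the usage note are placeholders.

```shell
# Hypothetical helper: print one scp command per target node.
# Arguments: source directory, destination directory, then the private IPs.
print_scp_cmds() {
    src_dir="$1"
    dest_dir="$2"
    shift 2
    for ip in "$@"; do
        echo scp -r "$src_dir" "root@${ip}:${dest_dir}"
    done
}
```

For example, print_scp_cmds /root/jars /root/jars 10.0.0.2 10.0.0.3 prints two commands, one for each address.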

3.4 Test

(1) Operation Mode

Local Mode: Specify the master parameter as local:

./bin/spark-submit \
    --master local \
    --jars ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar,${project.dir}/spark-datasource-v2.3/libs/cupid-table-api-1.1.5-SNAPSHOT.jar,${project.dir}/spark-datasource-v2.3/libs/table-api-tunnel-impl-1.1.5-SNAPSHOT.jar \
    --class DataReaderTest \
    ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-tests.jar \
    ${maxcompute-project-name} \
    ${aliyun-access-key-id} \
    ${aliyun-access-key-secret} \
    ${maxcompute-table-name}

YARN Mode: Specify the master parameter as yarn, and in the code use an endpoint whose domain contains -inc (aliyun-inc.com):

Code: val ODPS_ENDPOINT = "http://service.cn-beijing.maxcompute.aliyun-inc.com/api"

./bin/spark-submit \
    --master yarn \
    --jars ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar,${project.dir}/spark-datasource-v2.3/libs/cupid-table-api-1.1.5-SNAPSHOT.jar,${project.dir}/spark-datasource-v2.3/libs/table-api-tunnel-impl-1.1.5-SNAPSHOT.jar \
    --class DataReaderTest \
    ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-tests.jar \
    ${maxcompute-project-name} \
    ${aliyun-access-key-id} \
    ${aliyun-access-key-secret} \
    ${maxcompute-table-name}
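The submit commands in this article differ only in the master, the test class, and the trailing arguments, so they can be generated by one small wrapper. This is a sketch under assumptions: the function name print_submit_cmd and the PROJECT_DIR variable (standing in for ${project.dir}) are hypothetical. The command is only printed, so nothing is submitted.

```shell
# Hypothetical wrapper: print the spark-submit command for a master and a test class.
# PROJECT_DIR is an assumed variable that holds the path written as ${project.dir} above.
# Note: the printed line includes every argument, including the AccessKey secret.
print_submit_cmd() {
    master="$1"
    class="$2"
    shift 2
    v31="${PROJECT_DIR}/spark-datasource-v3.1/target"
    libs="${PROJECT_DIR}/spark-datasource-v2.3/libs"
    jars="${v31}/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar"
    jars="${jars},${libs}/cupid-table-api-1.1.5-SNAPSHOT.jar"
    jars="${jars},${libs}/table-api-tunnel-impl-1.1.5-SNAPSHOT.jar"
    echo ./bin/spark-submit --master "$master" --jars "$jars" --class "$class" \
        "${v31}/spark-datasource-1.0-SNAPSHOT-tests.jar" "$@"
}
```

For example, print_submit_cmd yarn DataReaderTest followed by the project name, AccessKey ID, AccessKey secret, and table name prints the YARN-mode command. Removing the echo turns the dry run into a real submission.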

(2) Read Non-Partition Table Test

① Command

# First, enter the Spark execution environment.
cd /usr/lib/spark-current
# Submit the task.
./bin/spark-submit \
    --master local \
    --jars ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar,${project.dir}/spark-datasource-v2.3/libs/cupid-table-api-1.1.5-SNAPSHOT.jar,${project.dir}/spark-datasource-v2.3/libs/table-api-tunnel-impl-1.1.5-SNAPSHOT.jar \
    --class DataReaderTest \
    ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-tests.jar \
    ${maxcompute-project-name} \
    ${aliyun-access-key-id} \
    ${aliyun-access-key-secret} \
    ${maxcompute-table-name}

② Execution Interface

③ Execution Results

(3) Read Partition Table Test

① Command

# First, enter the Spark execution environment.
cd /usr/lib/spark-current
# Submit the task.
./bin/spark-submit \
    --master local \
    --jars ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar,${project.dir}/spark-datasource-v2.3/libs/cupid-table-api-1.1.5-SNAPSHOT.jar,${project.dir}/spark-datasource-v2.3/libs/table-api-tunnel-impl-1.1.5-SNAPSHOT.jar \
    --class PartitionDataReaderTest \
    ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-tests.jar \
    ${maxcompute-project-name} \
    ${aliyun-access-key-id} \
    ${aliyun-access-key-secret} \
    ${maxcompute-table-name} \
    ${partition-description}

② Execution Interface

③ Execution Results

(4) Write Non-Partition Table Test

① Command

./bin/spark-submit \
    --master local \
    --jars ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar,${project.dir}/spark-datasource-v2.3/libs/cupid-table-api-1.1.5-SNAPSHOT.jar,${project.dir}/spark-datasource-v2.3/libs/table-api-tunnel-impl-1.1.5-SNAPSHOT.jar \
    --class DataWriterTest \
    ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-tests.jar \
    ${maxcompute-project-name} \
    ${aliyun-access-key-id} \
    ${aliyun-access-key-secret} \
    ${maxcompute-table-name}

② Execution Interface

③ Execution Results

(5) Write Partition Table Test

① Command

./bin/spark-submit \
    --master local \
    --jars ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar,${project.dir}/spark-datasource-v2.3/libs/cupid-table-api-1.1.5-SNAPSHOT.jar,${project.dir}/spark-datasource-v2.3/libs/table-api-tunnel-impl-1.1.5-SNAPSHOT.jar \
    --class PartitionDataWriterTest \
    ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-tests.jar \
    ${maxcompute-project-name} \
    ${aliyun-access-key-id} \
    ${aliyun-access-key-secret} \
    ${maxcompute-table-name} \
    ${partition-description}

② Execution Interface

③ Execution Results

3.5 Performance Testing

The test environment consists of EMR and MaxCompute (MC), both of which run in the cloud and are connected over the cloud network. When an IDC connects to the cloud instead, performance depends on the Tunnel resources or the leased-line bandwidth.

(1) Large Table Read Test

  • Size: 4,829,258,484 bytes (about 4.5 GiB)
  • Partitions: 593
  • Partition read: 20170422
  • Time consumption: 0.850871 s

(2) Large Table Write Test

① Write tens of thousands of records to a partition

  • Duration: 2.5 s

  • Result

② Write 100,000 records to a partition

  • Duration: 8.44 s

  • Result

③ Write millions of records to a partition

  • Duration: 73.28 s

  • Result
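The write timings above can be turned into a rough throughput figure. The sketch below uses a hypothetical helper named throughput. Only the 100,000-record run states an exact record count, so that is the only run used in the example.

```shell
# Hypothetical helper: records per second, rounded to the nearest integer.
# Arguments: record count, duration in seconds.
throughput() {
    awk -v n="$1" -v s="$2" 'BEGIN {
        if (s <= 0) exit 1        # reject a zero or negative duration
        printf "%.0f\n", n / s
    }'
}

throughput 100000 8.44   # prints 11848 (records per second)
```

This figure reflects the cloud-to-cloud test environment only. Over an IDC link it will depend on the Tunnel resources or leased-line bandwidth mentioned at the start of this section.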
