Practices of Simulating IDC Spark Read and Write MaxCompute

By Yueyi Yang

1. Background

1.1 Background Information

The existing lakehouse architecture is centered on MaxCompute, which reads and writes data in Hadoop clusters. In some on-premises IDC scenarios, customers are unwilling to expose the cluster's internal information to the public network and instead need to initiate access to cloud data from the Hadoop cluster. This article uses EMR (Hadoop on the cloud) to simulate an on-premises Hadoop cluster that accesses MaxCompute data.

1.2 Basic Architecture

2. Construction of Development Environment

2.1 Prepare the EMR Environment

(1) Purchase

① Log on to the Alibaba Cloud console and click the console option in the upper right corner.

② Go to the navigation page and click cloud products – E-MapReduce. You can also search for it.

③ Go to the E-MapReduce homepage, click EMR on ECS, and create a cluster.

Please refer to the official document for specific purchase details: https://www.alibabacloud.com/help/en/e-mapreduce/latest/getting-started#section-55q-jmm-3ts

④ Click the cluster ID to view the basic information, cluster services, node management, and other modules.

(2) Log In

Please refer to the official document for more information about how to log on to the cluster: https://www.alibabacloud.com/help/en/e-mapreduce/latest/log-on-to-a-cluster

This article uses logging on through the ECS instance as an example.

① Click on the Alibaba Cloud Console - ECS

② Click the Instance Name - Remote Connection - Workbench Remote Connection

2.2 Local IDEA Preparation

(1) Install Maven

Please see the article below for more information: https://blog.csdn.net/l32273/article/details/123684435 (Article in Chinese).

(2) Create a Scala Project

① Download the Scala Plugin:

② Install Scala JDK

We recommend downloading the *.zip file.
Configure the Scala environment variables.
Press Win + R, open cmd, and run scala -version to check that the Scala version is displayed.
Please see the article below for more information: https://blog.csdn.net/m0_59617823/article/details/124310663 (Article in Chinese)

③ Create a Scala Project:

2.3 Data Preparation of MaxCompute

(1) Project

Please refer to the official document for more information about how to create a MaxCompute project: https://www.alibabacloud.com/help/en/maxcompute/latest/create-a-maxcompute-project

(2) AccessKey

An AccessKey (AK) pair, which consists of an AccessKey ID and an AccessKey secret, is used to access Alibaba Cloud APIs. After you create an Alibaba Cloud account on the official site (alibabacloud.com), you can create an AccessKey pair on the AccessKey Management page. AccessKey pairs identify users and verify the signatures of requests that access MaxCompute or other Alibaba Cloud services, or that connect through third-party tools. Keep your AccessKey secret confidential to prevent credential leaks. If a leak occurs, disable or update the AccessKey pair immediately.

You can view and manage AccessKey pairs on the AccessKey Management page: https://ram.console.aliyun.com/manage/ak
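One way to keep the AccessKey secret out of scripts is to read it from a variable and fail early when it is missing. The following is a minimal sketch, not part of the original setup: the variable name ALIBABA_CLOUD_ACCESS_KEY_ID and the helpers require_env and mask_secret are assumptions chosen for illustration.

```shell
#!/bin/sh
# require_env NAME: succeed only if the variable NAME is set and non-empty.
# (Hypothetical helper for illustration; not part of any Alibaba Cloud tool.)
require_env() {
    eval "value=\${$1:-}"
    if [ -z "$value" ]; then
        echo "missing variable: $1" >&2
        return 1
    fi
}

# mask_secret VALUE: print only the first 4 characters so that logs
# never contain the full credential.
mask_secret() {
    printf '%s****\n' "$(printf '%s' "$1" | cut -c1-4)"
}

# Demo with a placeholder value (not a real credential).
ALIBABA_CLOUD_ACCESS_KEY_ID="demo-key-id"
require_env ALIBABA_CLOUD_ACCESS_KEY_ID && mask_secret "$ALIBABA_CLOUD_ACCESS_KEY_ID"
```

In a real session, you would export the variables once in the shell and pass them to the test programs instead of typing the values inline.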

(3) Endpoint

The endpoint is the connection address of the MaxCompute service. It varies based on the region and the network connection mode.

Please see the official document for more information about the region endpoint: https://www.alibabacloud.com/help/en/maxcompute/latest/prepare-endpoints.
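To illustrate how the endpoint depends on the region and the network mode, the sketch below builds the address from both. The URL pattern is inferred from the cn-beijing internal endpoint used later in this article, and the helper name maxcompute_endpoint is hypothetical, so check the result against the endpoint document above before relying on it.

```shell
#!/bin/sh
# maxcompute_endpoint REGION NETWORK
# NETWORK is "public" (Internet) or "internal" (Alibaba Cloud internal network).
# Assumption: the pattern follows the example endpoint shown in this article.
maxcompute_endpoint() {
    case "$2" in
        public)   domain="aliyun.com" ;;
        internal) domain="aliyun-inc.com" ;;
        *) echo "unknown network type: $2" >&2; return 1 ;;
    esac
    printf 'http://service.%s.maxcompute.%s/api\n' "$1" "$domain"
}

maxcompute_endpoint cn-beijing internal
```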

(4) Table

Please see the official document for more information about how to create a MaxCompute table: https://www.alibabacloud.com/help/en/maxcompute/latest/ddl-sql-table-operations

Prepare one partitioned table and one non-partitioned table for the tests in this article.

3. Code Testing

3.1 Prerequisites

(1) Prepare the project, AK information, and table data on MaxCompute

(2) Prepare the E-MapReduce cluster

(3) The terminal connects to the E-MapReduce node (the ECS instance)

(4) Configure Scala and Maven environment variables in IDEA and download the Scala plug-in
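Before building, it can help to confirm that the required tools are on the PATH. The sketch below assumes a POSIX shell is available (for example, Git Bash on Windows or the EMR node itself); the helper missing_tools is hypothetical.

```shell
#!/bin/sh
# missing_tools TOOL...: print each tool that is not found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# The local build machine is expected to have java, mvn, and scala.
missing=$(missing_tools java mvn scala)
if [ -n "$missing" ]; then
    echo "not found on PATH:" $missing
else
    echo "all build tools found"
fi
```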

3.2 Sample Code Description

https://github.com/aliyun/aliyun-maxcompute-data-collectors/blob/master/spark-datasource-v3.1/src/test/scala/PartitionDataReaderTest.scala

3.3 Package and Upload

(1) After Writing the Code Locally, Package It with Maven

(2) Compile the jar Package Locally

① Enter the project directory:

cd ${project.dir}/spark-datasource-v3.1

② Run the mvn command to build spark-datasource:

mvn clean package jar:test-jar

③ Check that the target directory contains the jar-with-dependencies.jar and tests.jar files:
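The check in step ③ can also be scripted. This sketch assumes the build produces files ending in -jar-with-dependencies.jar and -tests.jar, as in the spark-submit commands later in this article; the helper has_build_jars is hypothetical, and the demo runs against a temporary directory rather than a real build.

```shell
#!/bin/sh
# has_build_jars DIR: succeed if DIR holds both a *-jar-with-dependencies.jar
# file and a *-tests.jar file; otherwise report what is missing.
has_build_jars() {
    for suffix in jar-with-dependencies.jar tests.jar; do
        found=no
        for f in "$1"/*-"$suffix"; do
            if [ -f "$f" ]; then found=yes; fi
        done
        if [ "$found" = no ]; then
            echo "missing: $1/*-$suffix" >&2
            return 1
        fi
    done
}

# Demo against a temporary directory that stands in for the target directory.
demo=$(mktemp -d)
touch "$demo/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar" \
      "$demo/spark-datasource-1.0-SNAPSHOT-tests.jar"
has_build_jars "$demo" && echo "both jars present"
rm -rf "$demo"
```

After a real build, you would point the function at the spark-datasource-v3.1/target directory of the project.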

(3) Upload the jar Package to the Server

① Upload the jar package with the scp command:

scp [local jar package path] root@[ecs instance public IP]:[server storing jar package path]

② View the uploaded files on the server

③ Upload jar packages between nodes:

scp -r [path of this server to store jar packages] root@[ecs instance private IP]:[address of the receiving server to store jar packages]
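When the cluster has several nodes, the same scp command has to be repeated for each one. The sketch below only prints the commands so that it is safe to run anywhere. The paths and private IP addresses are hypothetical, and the helper print_scp_commands is not part of the original workflow.

```shell
#!/bin/sh
# print_scp_commands SRC DEST IP...: print one scp command per target node.
# Nothing is copied; review the output, then run the commands you need.
print_scp_commands() {
    src=$1
    dest=$2
    shift 2
    for ip in "$@"; do
        printf 'scp -r %s root@%s:%s\n' "$src" "$ip" "$dest"
    done
}

# Hypothetical paths and private IP addresses, for illustration only.
print_scp_commands /root/jars /root/jars 192.168.0.11 192.168.0.12
```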

3.4 Test

(1) Operation Mode

① Local Mode: Specify the master parameter as local:

./bin/spark-submit \
    --master local \
    --jars ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar,${project.dir}/spark-datasource-v2.3/libs/cupid-table-api-1.1.5-SNAPSHOT.jar,${project.dir}/spark-datasource-v2.3/libs/table-api-tunnel-impl-1.1.5-SNAPSHOT.jar \
    --class DataReaderTest \
    ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-tests.jar \
    ${maxcompute-project-name} \
    ${aliyun-access-key-id} \
    ${aliyun-access-key-secret} \
    ${maxcompute-table-name}

② Yarn Mode: Specify the master parameter as yarn, and use an endpoint in the code whose domain contains -inc (the internal network endpoint):

Code: val ODPS_ENDPOINT = "http://service.cn-beijing.maxcompute.aliyun-inc.com/api"

./bin/spark-submit \
    --master yarn \
    --jars ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar,${project.dir}/spark-datasource-v2.3/libs/cupid-table-api-1.1.5-SNAPSHOT.jar,${project.dir}/spark-datasource-v2.3/libs/table-api-tunnel-impl-1.1.5-SNAPSHOT.jar \
    --class DataReaderTest \
    ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-tests.jar \
    ${maxcompute-project-name} \
    ${aliyun-access-key-id} \
    ${aliyun-access-key-secret} \
    ${maxcompute-table-name}
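The --jars value is the longest part of both commands and is identical in each mode. As a convenience, it can be assembled from the project directory. This is a sketch only: the helper jars_arg and the /opt/project path are hypothetical, while the jar file names are the ones used in the commands above.

```shell
#!/bin/sh
# jars_arg PROJECT_DIR: print the comma-separated --jars value used by the
# spark-submit commands in this article.
jars_arg() {
    printf '%s,%s,%s\n' \
        "$1/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar" \
        "$1/spark-datasource-v2.3/libs/cupid-table-api-1.1.5-SNAPSHOT.jar" \
        "$1/spark-datasource-v2.3/libs/table-api-tunnel-impl-1.1.5-SNAPSHOT.jar"
}

# Hypothetical project location, for illustration only.
jars_arg /opt/project
```

In a spark-submit call, the result could then be passed as --jars "$(jars_arg "$PROJECT_DIR")", where PROJECT_DIR is a variable you define yourself.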

(2) Read Non-Partition Table Test

① Command

# First, enter the Spark execution environment.
cd /usr/lib/spark-current
# Submit a task.
./bin/spark-submit \
    --master local \
    --jars ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar,${project.dir}/spark-datasource-v2.3/libs/cupid-table-api-1.1.5-SNAPSHOT.jar,${project.dir}/spark-datasource-v2.3/libs/table-api-tunnel-impl-1.1.5-SNAPSHOT.jar \
    --class DataReaderTest \
    ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-tests.jar \
    ${maxcompute-project-name} \
    ${aliyun-access-key-id} \
    ${aliyun-access-key-secret} \
    ${maxcompute-table-name}

② Execution Interface

③ Execution Results

(3) Read Partition Table Test

① Command

# First, enter the Spark execution environment.
cd /usr/lib/spark-current
# Submit a task.
./bin/spark-submit \
    --master local \
    --jars ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar,${project.dir}/spark-datasource-v2.3/libs/cupid-table-api-1.1.5-SNAPSHOT.jar,${project.dir}/spark-datasource-v2.3/libs/table-api-tunnel-impl-1.1.5-SNAPSHOT.jar \
    --class PartitionDataReaderTest \
    ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-tests.jar \
    ${maxcompute-project-name} \
    ${aliyun-access-key-id} \
    ${aliyun-access-key-secret} \
    ${maxcompute-table-name} \
    ${partition-description}

② Execution Interface

③ Execution Results

(4) Write Non-Partition Table Test

① Command

./bin/spark-submit \
    --master local \
    --jars ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar,${project.dir}/spark-datasource-v2.3/libs/cupid-table-api-1.1.5-SNAPSHOT.jar,${project.dir}/spark-datasource-v2.3/libs/table-api-tunnel-impl-1.1.5-SNAPSHOT.jar \
    --class DataWriterTest \
    ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-tests.jar \
    ${maxcompute-project-name} \
    ${aliyun-access-key-id} \
    ${aliyun-access-key-secret} \
    ${maxcompute-table-name}

② Execution Interface

③ Execution Results

(5) Write Partition Table Test

① Command

./bin/spark-submit \
    --master local \
    --jars ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-jar-with-dependencies.jar,${project.dir}/spark-datasource-v2.3/libs/cupid-table-api-1.1.5-SNAPSHOT.jar,${project.dir}/spark-datasource-v2.3/libs/table-api-tunnel-impl-1.1.5-SNAPSHOT.jar \
    --class DataWriterTest \
    ${project.dir}/spark-datasource-v3.1/target/spark-datasource-1.0-SNAPSHOT-tests.jar \
    ${maxcompute-project-name} \
    ${aliyun-access-key-id} \
    ${aliyun-access-key-secret} \
    ${maxcompute-table-name} \
    ${partition-description}

② Execution Interface

③ Execution Results

3.5 Performance Testing

The test environment consists of EMR and MaxCompute (MC), both of which run on the cloud and are connected over the cloud network. When an IDC connects to the cloud, performance depends on the Tunnel resources or the leased-line bandwidth.

(1) Large Table Read Test

Table size: 4829258484 bytes
Number of partitions: 593
Partition read: 20170422
Time consumption: 0.850871 s

(2) Large Table Write Test

① Write tens of thousands of records to a partition

Duration: 2.5s

Result

② Write 100,000 records to a partition

Duration: 8.44s

Result

③ Write millions of records to a partition

Duration: 73.28s

Result
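To compare runs, the durations above can be converted into a throughput figure. Only the second write test states an exact record count (100,000 records in 8.44 s), so the sketch below uses that one. The helper records_per_second is hypothetical, and the result is a rough average that ignores job startup time.

```shell
#!/bin/sh
# records_per_second RECORDS SECONDS: print the average write throughput,
# rounded to the nearest whole record.
records_per_second() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.0f\n", n / t }'
}

# Second write test from this article: 100,000 records in 8.44 s.
records_per_second 100000 8.44   # prints 11848
```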
