Elastic Container Instance: Access data in an OSS bucket from Elastic Container Instance

Last Updated: Jan 02, 2024

When you use data processing frameworks such as Hadoop and Spark to process batch jobs, you can use Object Storage Service (OSS) buckets to store data. This topic uses a Spark application as an example to describe how to upload a file to an OSS bucket and access the data in the file from the application.

Prepare and upload a file to an OSS bucket

  1. Log on to the Object Storage Service (OSS) console.

  2. Create a bucket. For more information, see Create buckets.

  3. Upload a file to OSS. For more information, see Simple upload.

    After you upload the file, record the OSS path and the endpoint of the file. Example: oss://test***-hust/test.txt and oss-cn-hangzhou-internal.aliyuncs.com.
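
    If you prefer to upload the file programmatically instead of using the console, the following is a minimal sketch that uses the OSS SDK for Java. It assumes that the aliyun-sdk-oss dependency is on the classpath; the endpoint, credentials, bucket name, and local file path are placeholders that you must replace with your actual values.

    import com.aliyun.oss.OSS;
    import com.aliyun.oss.OSSClientBuilder;
    import java.io.File;

    public class OssUploadExample {
        public static void main(String[] args) {
            // Placeholder endpoint and credentials: replace with your actual values.
            OSS ossClient = new OSSClientBuilder().build(
                    "oss-cn-hangzhou.aliyuncs.com",
                    "{your AccessKey ID}",
                    "{your AccessKey Secret}");
            try {
                // Upload the local file as object test.txt in the bucket test***-hust.
                ossClient.putObject("test***-hust", "test.txt", new File("/path/to/test.txt"));
            } finally {
                ossClient.shutdown();
            }
        }
    }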

Access the data in the file in a Spark application

  1. Develop a Spark application.

    // Create the Spark context.
    SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Read the input file from OSS. The second argument is the number of partitions.
    JavaRDD<String> lines = sc.textFile("oss://test***-hust/test.txt", 250);
    ...
    // Write the word count result back to OSS.
    wordsCountResult.saveAsTextFile("oss://test***-hust/test-result");
    sc.close();
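
    The elided part ("...") is the word count logic itself. The following is a rough sketch of what it typically looks like in the Spark 2.x Java API; it requires imports for java.util.Arrays, scala.Tuple2, and org.apache.spark.api.java.JavaPairRDD. The intermediate variable words is illustrative, and wordsCountResult matches the variable used above.

    // Split each line into words, then count the occurrences of each word.
    JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
    JavaPairRDD<String, Integer> wordsCountResult = words
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);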
  2. Configure the information about the file in the Spark application.

    Note

    Replace the endpoint, AccessKey ID, and AccessKey secret with your actual values.

    • Method 1: Use a static configuration file

      Modify the core-site.xml file and store the modified file in the resources directory of the application project.

      <?xml version="1.0" encoding="UTF-8"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      <!--
        Licensed under the Apache License, Version 2.0 (the "License");
        you may not use this file except in compliance with the License.
        You may obtain a copy of the License at
          http://www.apache.org/licenses/LICENSE-2.0
        Unless required by applicable law or agreed to in writing, software
        distributed under the License is distributed on an "AS IS" BASIS,
        WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        See the License for the specific language governing permissions and
        limitations under the License. See accompanying LICENSE file.
      -->
      <!-- Put site-specific property overrides in this file. -->
      <configuration>
          <!-- OSS configurations -->
          <property>
              <name>fs.oss.impl</name>
              <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
          </property>
          <property>
              <name>fs.oss.endpoint</name>
              <value>oss-cn-hangzhou-internal.aliyuncs.com</value>
          </property>
          <property>
              <name>fs.oss.accessKeyId</name>
              <value>{your AccessKey ID}</value>
          </property>
          <property>
              <name>fs.oss.accessKeySecret</name>
              <value>{your AccessKey Secret}</value>
          </property>
          <property>
              <name>fs.oss.buffer.dir</name>
              <value>/tmp/oss</value>
          </property>
          <property>
              <name>fs.oss.connection.secure.enabled</name>
              <value>false</value>
          </property>
          <property>
              <name>fs.oss.connection.maximum</name>
              <value>2048</value>
          </property>
      </configuration>
    • Method 2: Configure the settings dynamically when you submit the Spark application

      Example:

      hadoopConf:
          # OSS
          "fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
          "fs.oss.endpoint": "oss-cn-hangzhou-internal.aliyuncs.com"
          "fs.oss.accessKeyId": "your AccessKey ID"
          "fs.oss.accessKeySecret": "your AccessKey Secret"
  3. Package the JAR file.

    The packaged JAR file must contain all dependencies. Sample content of the pom.xml file of the Spark application:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <modelVersion>4.0.0</modelVersion>

        <groupId>com.aliyun.liumi.spark</groupId>
        <artifactId>SparkExampleJava</artifactId>
        <version>1.0-SNAPSHOT</version>

        <dependencies>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.12</artifactId>
                <version>2.4.3</version>
            </dependency>

            <dependency>
                <groupId>com.aliyun.dfs</groupId>
                <artifactId>aliyun-sdk-dfs</artifactId>
                <version>1.0.3</version>
            </dependency>

        </dependencies>

        <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <appendAssemblyId>false</appendAssemblyId>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.aliyun.liumi.spark.example.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>assembly</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
        </build>
    </project>
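
    With this pom.xml, a standard Maven build such as mvn clean package should produce SparkExampleJava-1.0-SNAPSHOT.jar with the dependencies bundled, because the make-assembly execution is bound to the package phase and appendAssemblyId is set to false. This is the JAR file that the Dockerfile in the next step copies into the image.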
  4. Write a Dockerfile.

    # spark base image
    FROM registry.cn-beijing.aliyuncs.com/eci_open/spark:2.4.4
    RUN rm $SPARK_HOME/jars/kubernetes-client-*.jar
    ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.4.2/kubernetes-client-4.4.2.jar $SPARK_HOME/jars
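    # Copy the application JAR and OSS connector JARs into /opt/spark/jars
    # (assumed to be $SPARK_HOME/jars in this base image) so Spark can load them at run time.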
    RUN mkdir -p /opt/spark/jars
    COPY SparkExampleJava-1.0-SNAPSHOT.jar /opt/spark/jars
    # OSS dependency JAR packages
    COPY aliyun-sdk-oss-3.4.1.jar /opt/spark/jars
    COPY hadoop-aliyun-2.7.3.2.6.1.0-129.jar /opt/spark/jars
    COPY jdom-1.1.jar /opt/spark/jars
    Note

    For information about how to download the OSS dependency JAR packages, see Use HDP 2.6-based Hadoop to read and write OSS data.

  5. Build a Spark application image.

    docker build -t registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example -f Dockerfile .
  6. Push the image to an image repository that is provided by Alibaba Cloud Container Registry.

    docker push registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example

After you complete the preceding operations, the Spark application image is ready. You can use the image to deploy the Spark application in a Kubernetes cluster.