Feature description

Last Updated: Apr 09, 2025

Data Lake Analytics (DLA) integrates with Marmaray to transmit Apache Parquet files to AnalyticDB for MySQL. Marmaray JAR files that contain user-defined transmission configurations run in the runtime environment provided by DLA, which transmits the Parquet files in the data source to the destination AnalyticDB for MySQL cluster. This topic provides a general overview of the feature.

Data types

Table 1. Data type mappings

  Category                 | Apache Parquet | AnalyticDB for MySQL 3.0
  -------------------------+----------------+-------------------------
  Primitive data types     | INT32          | INT
                           | INT64          | BIGINT
                           | FLOAT          | FLOAT
                           | DOUBLE         | DOUBLE
  Non-primitive data types | STRING         | VARCHAR
                           | DATE           | DATE
                           | DECIMAL        | DECIMAL
                           | TIMESTAMP      | TIMESTAMP

Features

  • Field mappings

    • The fields of the schema in the source Parquet file can be mapped to the fields of a table in the destination AnalyticDB for MySQL cluster. Example:

      The schema of the Parquet file consists of the fields a, b, and c, which correspond to the a1, b1, and c1 fields in the AnalyticDB for MySQL table. The fields are in the following mapping relationship: a|a1, b|b1, and c|c1.
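      The following minimal Java sketch illustrates how such a mapping string could be parsed into source-to-destination field pairs. The mapping-string format is taken from the example above; the class and method names are assumptions for illustration, not part of the documented DLA or Marmaray API.

        import java.util.LinkedHashMap;
        import java.util.Map;

        public final class FieldMappingExample {
            // Parses a mapping string such as "a|a1,b|b1,c|c1" into
            // source-to-destination field pairs.
            static Map<String, String> parse(String mapping) {
                Map<String, String> result = new LinkedHashMap<>();
                for (String pair : mapping.split(",")) {
                    String[] parts = pair.split("\\|", 2);
                    result.put(parts[0].trim(), parts[1].trim()); // source -> destination
                }
                return result;
            }

            public static void main(String[] args) {
                // Maps the Parquet fields a, b, and c to the AnalyticDB
                // for MySQL columns a1, b1, and c1.
                System.out.println(parse("a|a1,b|b1,c|c1")); // {a=a1, b=b1, c=c1}
            }
        }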

  • Troubleshooting

      • Troubleshooting at the record level

        If an error occurs during a write operation, the write is retried. The retry interval doubles after each failed retry: the first retry is performed 1 second after the write operation fails, at most 12 retries are performed, and the maximum retry interval is about 34 minutes. After the maximum number of retries is exceeded, the subtask fails, and troubleshooting at the subtask level is initiated. The following array lists all retry intervals in milliseconds:

        [1000,2000,4000,8000,16000,32000,64000,128000,256000,512000,1024000,2048000]
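        The array above is a doubling sequence. The following minimal Java sketch reproduces the documented schedule:

          public final class RetryBackoff {
              public static void main(String[] args) {
                  // First retry occurs 1 second after the write fails; the
                  // interval doubles after each failed retry, up to 12 retries.
                  long intervalMs = 1000;
                  for (int attempt = 1; attempt <= 12; attempt++) {
                      System.out.printf("retry %2d after %7d ms%n", attempt, intervalMs);
                      intervalMs *= 2; // final interval: 2048000 ms, about 34 minutes
                  }
              }
          }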
      • Troubleshooting at the subtask level

        When one of the concurrently executed subtasks of a transmission task fails, the subtask is rerun on another compute node. You can configure the maximum number of retries based on your needs.

        The default maximum number of retries is 4. You can customize this number by adding the spark.task.maxFailures parameter to the task configuration file and setting its value.

        After this number is exceeded, the subtask fails, and the data of the subtask is not included in the data of the transmission task.
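        The following minimal Java sketch raises the limit to 8 by using the standard Spark property spark.task.maxFailures. It assumes the property can also be supplied through a SparkConf; the application name and value are illustrative only.

          import org.apache.spark.SparkConf;
          import org.apache.spark.sql.SparkSession;

          public final class TaskRetryConfig {
              public static void main(String[] args) {
                  // spark.task.maxFailures is a standard Spark property;
                  // its default value is 4.
                  SparkConf conf = new SparkConf()
                          .setAppName("parquet-to-adb") // illustrative name
                          .set("spark.task.maxFailures", "8");
                  SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
                  // ... run the transmission task ...
                  spark.stop();
              }
          }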

  • Concurrent stream processing

    • A task is divided into subtasks by the Apache Spark engine, and files are concurrently read from multiple partitions.

      The maximum amount of data that can be read from each partition is 128 MB by default. You can customize this value by adding the spark.sql.files.maxPartitionBytes parameter to the task configuration file and setting its value.

      For more information about parameter settings, see the Apache Spark official documentation.
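      The following minimal Java sketch raises the limit to 256 MB by using the standard Spark property spark.sql.files.maxPartitionBytes. It assumes the property can also be supplied through a SparkSession builder; the application name and input path are illustrative only.

        import org.apache.spark.sql.SparkSession;

        public final class PartitionSizeConfig {
            public static void main(String[] args) {
                // spark.sql.files.maxPartitionBytes is a standard Spark
                // property; its default value is 128 MB (134217728 bytes).
                SparkSession spark = SparkSession.builder()
                        .appName("parquet-to-adb") // illustrative name
                        .config("spark.sql.files.maxPartitionBytes", "268435456") // 256 MB
                        .getOrCreate();
                // Each partition read by a subtask now holds at most 256 MB.
                spark.read().parquet("oss://examplebucket/path/").show(); // illustrative path
                spark.stop();
            }
        }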

  • Visualized task execution

    • The serverless Spark engine of DLA is used to execute tasks. You can view the task progress in real time on the task monitoring page of the Spark web UI. Examples:

      • Event Timeline

        You can view a timeline that displays the points in time when executors are added and when jobs are completed.

      • Job Overview

        You can view the tasks that have been completed and those that are in progress.

      • Job Details

        You can view the directed acyclic graph (DAG) and completed stages of a task.

Limits

If a task fails, you must rerun the task.