Data Lake Analytics (DLA) is integrated with Marmaray to transmit Apache Parquet files to AnalyticDB for MySQL. Marmaray JAR files that contain user-defined transmission configurations are run in the runtime environment provided by DLA, which allows Parquet files in the data source to be transmitted to the destination AnalyticDB for MySQL cluster. This topic provides a general overview of the feature.

Data types

Table 1. Data type mappings
Category                 | Apache Parquet | AnalyticDB for MySQL 3.0
-------------------------|----------------|-------------------------
Primitive data types     | INT32          | INT
Primitive data types     | INT64          | BIGINT
Primitive data types     | FLOAT          | FLOAT
Primitive data types     | DOUBLE         | DOUBLE
Non-primitive data types | STRING         | VARCHAR
Non-primitive data types | DATE           | DATE
Non-primitive data types | DECIMAL        | DECIMAL
Non-primitive data types | TIMESTAMP      | TIMESTAMP
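
Before you configure a transmission task, you may want to confirm which of the preceding types a source file actually contains. The following Scala sketch assumes a Spark runtime and uses a hypothetical source path; it reads a Parquet file and prints its schema so that each field can be checked against Table 1.

    import org.apache.spark.sql.SparkSession

    object InspectParquetSchema {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("inspect-parquet-schema")
          .getOrCreate()

        // Hypothetical source path; replace it with the actual location of your Parquet files.
        val df = spark.read.parquet("oss://examplebucket/path/to/parquet/")

        // Print the schema so that each field can be checked against Table 1.
        // For example, INT64 surfaces as LongType in Spark and maps to BIGINT in AnalyticDB for MySQL 3.0.
        df.printSchema()

        spark.stop()
      }
    }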

Features

Field mappings

The fields of a schema in the source Parquet file can be mapped to the fields of a table in the destination AnalyticDB for MySQL cluster. Example:

The schema of the Parquet file consists of the fields a, b, and c, which correspond to the fields a1, b1, and c1 in the AnalyticDB for MySQL table. This results in the following mapping relationships: a|a1, b|b1, and c|c1.
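
The mapping itself is declared in the transmission configuration that is packaged into the Marmaray JAR file. The following Scala sketch is not that configuration syntax; it only illustrates the same a|a1, b|b1, c|c1 mapping in Spark terms by renaming the source columns to the destination field names. The source path and mapping values are assumptions for illustration.

    import org.apache.spark.sql.SparkSession

    object FieldMappingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("field-mapping-sketch")
          .getOrCreate()

        // Hypothetical mapping that mirrors the example above: source field -> destination field.
        val mapping = Seq("a" -> "a1", "b" -> "b1", "c" -> "c1")

        // Hypothetical source path.
        val source = spark.read.parquet("oss://examplebucket/path/to/parquet/")

        // Rename each source column to the name of the destination field in AnalyticDB for MySQL.
        val mapped = mapping.foldLeft(source) { case (df, (src, dst)) =>
          df.withColumnRenamed(src, dst)
        }

        mapped.printSchema()
        spark.stop()
      }
    }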

Troubleshooting
  • Troubleshooting at the record level

    After an error occurs during a write operation, the write is retried, and the retry interval doubles after each failed attempt. The first retry is performed 1 second after the write operation fails. The maximum number of retries is 12, and the maximum retry interval is 2,048 seconds (about 34 minutes). After the maximum number of retries is exceeded, the subtask fails, and troubleshooting at the subtask level is initiated. The following array provides all intervals between retries in milliseconds (the doubling pattern is also shown in the sketch after this list):

    [1000,2000,4000,8000,16000,32000,64000,128000,256000,512000,1024000,2048000]
  • Troubleshooting at the subtask level

    When one of the concurrently executed subtasks of a transmission task fails, the subtask is retried on another compute node. You can configure the maximum number of retries based on your needs.

    The default maximum number of retries is 4. You can customize this number by adding the spark.task.maxFailures parameter to the task configuration file and specifying a value.

    After the maximum number of retries is exceeded, the subtask fails, and its data is not included in the data written by the transmission task.
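
The following Scala sketch illustrates both retry mechanisms described above. The first part reproduces the doubling retry intervals; the second part shows how the spark.task.maxFailures property, a standard Spark configuration item, might be set. The value 8 is only an example.

    import org.apache.spark.SparkConf

    object RetrySettingsSketch {
      def main(args: Array[String]): Unit = {
        // Record-level retries: the interval doubles after each failure, which
        // reproduces the array of intervals listed above (in milliseconds).
        val retryIntervalsMs = Seq.iterate(1000L, 12)(_ * 2)
        println(retryIntervalsMs.mkString("[", ",", "]"))
        // => [1000,2000,...,2048000]; 2048000 ms is about 34 minutes.

        // Subtask-level retries: spark.task.maxFailures defaults to 4.
        // The value 8 is only an example.
        val conf = new SparkConf().set("spark.task.maxFailures", "8")
        println(conf.get("spark.task.maxFailures"))
      }
    }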

Concurrent stream processing

The Apache Spark engine divides a transmission task into multiple partitions, and files are read concurrently from these partitions.

The maximum amount of data that can be read from each partition is 128 MB by default. You can customize this limit by adding the spark.sql.files.maxPartitionBytes parameter to the task configuration file and specifying a value.
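
The following Scala sketch shows how the spark.sql.files.maxPartitionBytes property, a standard Spark SQL configuration item, might be set when the Spark session is created. The 256 MB value is only an example; the default is 128 MB.

    import org.apache.spark.sql.SparkSession

    object PartitionSizeSketch {
      def main(args: Array[String]): Unit = {
        // Example value: raise the per-partition read limit from the 128 MB default to 256 MB.
        val spark = SparkSession.builder()
          .appName("partition-size-sketch")
          .config("spark.sql.files.maxPartitionBytes", (256L * 1024 * 1024).toString)
          .getOrCreate()

        println(spark.conf.get("spark.sql.files.maxPartitionBytes"))
        spark.stop()
      }
    }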

For more information about parameter settings, see the official Apache Spark documentation.

Visualized task execution

The serverless Spark engine of DLA is used to execute tasks. You can view the task progress in real time on the task monitoring page of the Spark web UI. Examples:

  • Event Timeline

    You can view a timeline that displays the points in time when executors are added and when jobs are completed.

  • Job Overview

    You can view the tasks that have been completed and those that are in progress.

  • Job Details

    You can view the directed acyclic graph (DAG) and completed stages of a task.

Limits

If a transmission task fails, you must rerun the entire task.

How to use

For information about how to use this feature, see Procedure.