Data Lake Analytics (DLA) is integrated with Marmaray to transmit Apache Parquet files to AnalyticDB for MySQL. Marmaray JAR files that contain user-defined transmission configurations are run in the runtime environment provided by DLA. This allows Parquet files in the data source to be transmitted to the destination AnalyticDB for MySQL cluster. This topic provides a general overview of the feature.
Data types
Category | Apache Parquet | AnalyticDB for MySQL 3.0
---|---|---
Primitive data types | INT32 | INT
Primitive data types | INT64 | BIGINT
Primitive data types | FLOAT | FLOAT
Primitive data types | DOUBLE | DOUBLE
Non-primitive data types | STRING | VARCHAR
Non-primitive data types | DATE | DATE
Non-primitive data types | DECIMAL | DECIMAL
Non-primitive data types | TIMESTAMP | TIMESTAMP
Features
- Field mappings
The fields of the schema in a source Parquet file can be mapped to the fields of a table in the destination AnalyticDB for MySQL cluster. For example, suppose that the schema of the Parquet file consists of the fields a, b, and c, and that they correspond to the fields a1, b1, and c1 in the AnalyticDB for MySQL table. The mappings are then written as a|a1, b|b1, and c|c1, with the source field on the left of the vertical bar and the destination field on the right.
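The a|a1 notation above can be illustrated with a minimal Python sketch. It is not the Marmaray implementation: the function names parse_field_mapping and rename_fields are hypothetical, and the assumption that the mappings arrive as one comma-separated string is made only for illustration.

```python
from typing import Any, Dict


def parse_field_mapping(spec: str) -> Dict[str, str]:
    """Parse 'a|a1, b|b1' into {'a': 'a1', 'b': 'b1'} (hypothetical helper)."""
    mapping: Dict[str, str] = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma or empty entry
        source, sep, target = pair.partition("|")
        source, target = source.strip(), target.strip()
        if not sep or not source or not target:
            raise ValueError(f"invalid mapping entry: {pair!r}")
        mapping[source] = target
    return mapping


def rename_fields(record: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    """Rename the mapped source fields of one record; unmapped fields are dropped."""
    return {mapping[name]: value for name, value in record.items() if name in mapping}


mapping = parse_field_mapping("a|a1, b|b1, c|c1")
row = rename_fields({"a": 1, "b": "x", "c": 2.5}, mapping)
```

Dropping unmapped fields is a choice made in this sketch, not documented behavior of the feature.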
- Troubleshooting
  - Troubleshooting at the record level
    If an error occurs during a write operation, the operation is retried, and the interval between retries doubles after each failed retry. The first retry is performed 1 second after the write operation fails. A maximum of 12 retries are performed, and the longest retry interval is 2,048 seconds (about 34 minutes). If the last retry also fails, the subtask fails and troubleshooting at the subtask level is initiated. The following array lists all intervals between retries, in milliseconds:
    [1000,2000,4000,8000,16000,32000,64000,128000,256000,512000,1024000,2048000]
  - Troubleshooting at the subtask level
    When one of the concurrently executed subtasks of a transmission task fails, the subtask is rerun on another compute node. The default maximum number of retries is 4. To change this number, add the spark.task.maxFailures parameter to the task configuration file and set it to the desired value. After the maximum number of retries is exceeded, the subtask fails, and the data of this subtask is not included in the data of the transmission task.
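The record-level retry schedule described above is a doubling backoff that starts at 1 second and stops after 12 retries. The following Python sketch reproduces the documented interval array and shows one way such a retry loop could be structured. The names retry_intervals_ms and write_with_retries are hypothetical, and the loop is an illustration rather than the code that DLA or Marmaray runs.

```python
from typing import Callable, List


def retry_intervals_ms(first_ms: int = 1000, max_retries: int = 12) -> List[int]:
    """Wait time before each retry: starts at first_ms and doubles every time."""
    return [first_ms * 2 ** attempt for attempt in range(max_retries)]


def write_with_retries(write: Callable[[], None],
                       sleep_ms: Callable[[int], None]) -> bool:
    """Run `write` once, then retry it on the schedule above.

    Returns True as soon as a write succeeds. Returns False when every retry
    has failed, which is the point where subtask-level handling would start.
    """
    try:
        write()
        return True
    except Exception:
        pass  # the first attempt failed, fall through to the retries
    for wait in retry_intervals_ms():
        sleep_ms(wait)  # injected so that callers and tests control the waiting
        try:
            write()
            return True
        except Exception:
            continue
    return False


intervals = retry_intervals_ms()
longest_minutes = max(intervals) / 1000 / 60  # 2,048 seconds, about 34 minutes
```

Passing the sleep function as a parameter keeps the sketch testable without waiting; a real caller could pass `lambda ms: time.sleep(ms / 1000)`.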
- Concurrent stream processing
The Apache Spark engine splits a task into partitions, and the files are read concurrently from multiple partitions. By default, a maximum of 128 MB of data can be read from each partition. To change this value, add the spark.sql.files.maxPartitionBytes parameter to the task configuration file and set it to the desired value. For more information about parameter settings, see the official Apache Spark documentation.
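To get a rough feel for how the per-partition limit affects concurrency, the following Python sketch computes a lower bound on the number of read partitions for a set of Parquet files. It is a simplification: Spark also takes other settings into account when it plans file splits, so the real partition count can be higher. The function name min_partitions is hypothetical.

```python
from typing import Iterable

# Default per-partition limit stated above: 128 MB.
DEFAULT_MAX_PARTITION_BYTES = 128 * 1024 * 1024


def min_partitions(file_sizes: Iterable[int],
                   max_partition_bytes: int = DEFAULT_MAX_PARTITION_BYTES) -> int:
    """Lower bound on the partition count: ceil(total bytes / per-partition limit)."""
    if max_partition_bytes <= 0:
        raise ValueError("max_partition_bytes must be positive")
    total = sum(file_sizes)
    return -(-total // max_partition_bytes)  # integer ceiling division


MB = 1024 * 1024
# One 300 MB file cannot fit in fewer than three 128 MB partitions.
at_default = min_partitions([300 * MB])
# Halving the limit to 64 MB raises the bound and allows more parallel readers.
at_64_mb = min_partitions([300 * MB], max_partition_bytes=64 * MB)
```

A smaller limit yields more, smaller partitions and therefore more concurrent reads, at the cost of more scheduling overhead per task.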
- Visualized task execution
The serverless Spark engine of DLA is used to execute tasks. You can view the task progress in real time on the task monitoring page of the Spark web UI. Examples:
  - Event Timeline
    You can view a timeline that displays the points in time when executors are added and when jobs are completed.
  - Job Overview
    You can view the tasks that have been completed and those that are in progress.
  - Job Details
    You can view the directed acyclic graph (DAG) and completed stages of a task.
Limits
If a task fails, you must rerun the task.
How to use
For information about how to use this feature, see Procedure.