FAQ related to data lake management

Last Updated: Sep 07, 2021

This topic provides answers to some frequently asked questions about data lake management of Data Lake Analytics (DLA).

Lakehouse-related issues

What is a lakehouse?

A lakehouse is a data warehouse that is built on top of a data lake. It is a new big data paradigm that is used to resolve various issues that you may encounter when you use data lakes. For example, data lakes do not support upsert operations, multi-version concurrency control (MVCC), and extract, transform, load (ETL) operations for incremental data. When you use data lakes, a large number of small files are generated, and files in data lakes are not suitable for data analytics. In addition, metadata in DLA is inconsistent with that in data sources, some schemas are not supported, and no appropriate data indexes exist in data lakes.

Lakehouses provide upper-layer extensibility for data lake storage systems, such as Object Storage Service (OSS). Lakehouses allow you to write files of different formats to data lakes. The formats are categorized into object management formats and object formats. Object management formats include Apache Hudi and Delta Lake, and object formats include Parquet and ORC. When you write data to data lakes, operations such as upserts, small file merging, MVCC, and snapshot reading are supported. In addition, lakehouses maintain large amounts of table metadata in databases and provide a unified data access view for the upper-layer SQL and analytical engines. In this way, lakehouses use features of data warehouses to provide capabilities that data lakes alone cannot provide.

Does data ingestion into a lakehouse cause heavy loads on ApsaraDB RDS data sources? How do I perform throttling for data warehousing?

If you use real-time data ingestion, the lakehouse obtains data by using Data Transmission Service (DTS) and uses binary log files to write data to data lakes. This does not cause heavy loads on ApsaraDB RDS data sources.
If you use data ingestion for data warehousing, ApsaraDB RDS data sources may be overloaded. In this case, you can configure related parameters to control the loads.
For more information, see Advanced options.

The lakehouse workload fails to run and no Spark logs can be queried. Why?

In the workload details dialog box, if the error message "specified driver resource spec:[xlarge]+executor resource spec:[xlarge] * executor instances:[2] exceed the max CPU:[16.0] or max Memory:[64.0], errorCode=InvalidParameter" or a similar error message is displayed for the job status, the number of Spark compute units (CUs) that you configured when you created the workload exceeds the maximum number of CUs that can be deployed in the virtual cluster.

Solutions:

  • Adjust the maximum number of Spark CUs that can be deployed in a virtual cluster.

  • Reduce the number of Spark CUs that you configured for the workload so that this number does not exceed the maximum number of CUs that can be deployed in the virtual cluster.
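The check behind this error message can be approximated as follows. This is a minimal sketch, not the DLA implementation: the SPEC_RESOURCES table (CPU cores and memory in GB for each specification name) and both function names are assumptions made for illustration. Replace the values with the specifications that apply to your virtual cluster.

```python
# Hypothetical mapping from a resource spec name to (CPU cores, memory in GB).
# These sizes are assumptions for illustration and are not taken from this FAQ.
SPEC_RESOURCES = {
    "small": (1, 4),
    "medium": (2, 8),
    "large": (4, 16),
    "xlarge": (8, 32),
}


def requested_resources(driver_spec, executor_spec, executor_instances):
    """Return the total (cpu, memory) of one driver plus all executors."""
    driver_cpu, driver_mem = SPEC_RESOURCES[driver_spec]
    executor_cpu, executor_mem = SPEC_RESOURCES[executor_spec]
    total_cpu = driver_cpu + executor_cpu * executor_instances
    total_mem = driver_mem + executor_mem * executor_instances
    return total_cpu, total_mem


def fits_virtual_cluster(driver_spec, executor_spec, executor_instances,
                         max_cpu, max_memory):
    """Return True if the request stays within both virtual cluster limits."""
    cpu, mem = requested_resources(driver_spec, executor_spec, executor_instances)
    return cpu <= max_cpu and mem <= max_memory
```

Under these assumed sizes, the configuration in the error message (an xlarge driver and two xlarge executors) does not fit a virtual cluster that is limited to 16 CPU cores and 64 GB of memory, whereas a medium driver with two medium executors does.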

Issues related to metadata discovery

After a metadata discovery task is configured and manually triggered, the incremental data that is synchronized after the task is run cannot be queried. Why?

Cause: After you manually trigger the metadata discovery task, only the existing data can be queried. The incremental data that is synchronized after you run the task cannot be queried.

Solution: Change the trigger mode of the metadata discovery task from manual to automatic and specify the trigger time. After the change, the incremental data that is synchronized after the task is run can be queried.

What are the differences between the data warehouse mode and free mode of OSS? In which scenarios can I apply the two modes?

| OSS configuration mode | Scenario | OSS path format | Discovery accuracy | Performance |
| --- | --- | --- | --- | --- |
| Data warehouse mode | Users directly upload data to OSS and plan to create a standard data warehouse that can be used to compute and analyze data. | Database name/Table name/File name or Database name/Table name/Partition name/.../Partition name/File name | High | High |
| Free mode | OSS data already exists, but the path of the OSS data is not configured in a specified format. Users plan to create databases, tables, and partitions that can be analyzed. | No requirements | Medium | Medium |
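To check whether existing object paths follow the layout that the data warehouse mode expects, you can split each path into its components. The following sketch is an illustration only: the function name parse_warehouse_path and the sample path are hypothetical, and the sketch validates only the number of path segments.

```python
def parse_warehouse_path(object_key):
    """Split 'database/table[/partition...]/file' into its components."""
    parts = [part for part in object_key.strip("/").split("/") if part]
    if len(parts) < 3:
        # The layout needs at least a database, a table, and a file name.
        raise ValueError("expected at least database/table/file: %r" % object_key)
    return {
        "database": parts[0],
        "table": parts[1],
        "partitions": parts[2:-1],  # empty list for non-partitioned tables
        "file": parts[-1],
    }
```

For example, the hypothetical path sales_db/orders/dt=2021-09-01/part-0001.csv yields the database sales_db, the table orders, one partition, and the file part-0001.csv. A path with fewer than three segments raises ValueError.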

Why is the CSV file exported from Excel not discovered?

Metadata discovery tasks can discover the file only if it is an actual CSV file. Make sure that the file exported from Excel is saved in the CSV format and not in an Excel workbook format.

Note

A metadata discovery task samples files to identify the schema of a CSV file. Then, the task reads the first 1,000 rows of data in the file and checks whether the fields and delimiters of these rows are the same. If they are the same, the metadata discovery task considers that the file is a CSV file.
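You can run a similar consistency check locally before you upload a file. The following sketch is not the actual sampling logic of DLA. It only mirrors the description in the preceding note: it reads up to the first 1,000 rows and checks whether every row has the same number of fields for a given delimiter. The function name and the default delimiter are assumptions.

```python
import csv
import io


def looks_like_csv(text, delimiter=",", max_rows=1000):
    """Return True if the first max_rows rows all have the same field count."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    field_counts = set()
    for index, row in enumerate(reader):
        if index >= max_rows:
            break
        if not row:
            # csv.reader returns an empty list for a blank line; ignore it.
            continue
        field_counts.add(len(row))
    # Exactly one distinct field count means that the sampled rows are consistent.
    return len(field_counts) == 1
```

Note that a file parsed with the wrong delimiter can still pass this check, because every row then contains a single field. Pass the delimiter that the file actually uses.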

The JSON files that are stored in a directory have the same schema but no tables are created. Why?

Metadata discovery tasks can discover the directories that contain only files. If a directory contains both files and subdirectories, this directory cannot be discovered.
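To find the directories that meet this rule, you can analyze a listing of object paths. The following sketch assumes that you have already obtained the listing as a list of strings. The function name and the sample paths in the usage note are hypothetical.

```python
def discoverable_directories(object_keys):
    """Return directories that directly contain files and no subdirectories."""
    has_files = set()
    has_subdirs = set()
    for key in object_keys:
        parts = key.strip("/").split("/")
        # The parent directory directly contains this file.
        has_files.add("/".join(parts[:-1]))
        # Every ancestor of the parent directory contains a subdirectory.
        for depth in range(len(parts) - 1):
            has_subdirs.add("/".join(parts[:depth]))
    return sorted(has_files - has_subdirs)
```

For the paths logs/a.json, logs/2021/b.json, and events/c.json, the function returns events and logs/2021. The logs directory is excluded because it contains both a file and a subdirectory.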

When a metadata discovery task is triggered to discover all data that is shipped from Log Service to OSS, the discovery result shows that some Logstores of Log Service fail to generate tables. Why?

Check whether the Logstores that fail to generate tables are configured to ship data to OSS.

The following error message appears when a Log Service metadata discovery task is running: "Make sure the bucket you specified be in the same region with DLA." Why?

Cause: Log Service ships data to an OSS bucket that is not deployed in the same region as DLA. As a result, DLA cannot access the OSS bucket. When a Log Service metadata discovery task is triggered, the following error message may appear: "OSS object [oss://test-bucket/] is invalid. Make sure the bucket you specified be in the same region with DLA. Detailed message: The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint."

Solution: In the region of the OSS bucket to which Log Service data is shipped, use an OSS metadata discovery task instead of the Log Service metadata discovery task. After the task automatically creates a table in DLA, you can query the data that is shipped from Log Service.

The OSS metadata discovery task does not discover tables from the CSV files that are stored in OSS. Why?

Causes:

  • The size of a field value exceeds 4,096 bytes, which interrupts file sampling.

  • An OSS metadata discovery task infers different schemas based on the sampled files in the same directory. As a result, the task cannot determine the schema that can be used.

  • The file format is not CSV.
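To rule out the first cause, you can scan a CSV file for field values that exceed 4,096 bytes. The following sketch assumes that the file is UTF-8 encoded and that sizes are measured on the encoded bytes. Both are assumptions, and the function name is hypothetical.

```python
import csv
import io

MAX_FIELD_BYTES = 4096  # limit stated in the list of causes above


def oversized_fields(text, delimiter=",", encoding="utf-8"):
    """Yield (row_number, column_index, size_in_bytes) for oversized values."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for row_number, row in enumerate(reader, start=1):
        for column_index, value in enumerate(row):
            size = len(value.encode(encoding))
            if size > MAX_FIELD_BYTES:
                yield row_number, column_index, size
```

Row numbers start at 1 and column indexes start at 0. An empty result means that no sampled value exceeds the limit, in which case check the other two causes.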

After a CSV file is stored in OSS, the data type of a field discovered by the OSS metadata discovery task is not the same as the data type of the field in the CSV file. Why?

Causes:

  • This field has a null value. The task infers that the data type of the field is STRING based on the file sampling result.

  • A field in the data of the CSV file is not of the STRING type.

  • The format of the delimiter in the CSV file is incorrect. As a result, the OSS metadata discovery task incorrectly considers that the data types of all the fields in the CSV file are STRING.

In a CSV file, the first row is a description row, the second row is a header row, and the third row is a data row. Can an OSS metadata discovery task discover this CSV file?

No, an OSS metadata discovery task cannot discover this CSV file.

OSS metadata discovery tasks can discover a CSV file only if the file meets one of the following conditions:

  1. The file contains only data rows and does not contain header rows.

  2. The first row is a header row.

CSV files that do not meet the preceding conditions cannot be discovered.
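If a file starts with a description row, one possible workaround is to remove that row before you upload the file to OSS, so that the header row becomes the first row. The following sketch is an illustration, not a DLA feature. It assumes that the description occupies complete physical lines and contains no embedded line breaks. The function name is hypothetical.

```python
def strip_description_row(text, description_rows=1):
    """Remove the leading description rows so the header row comes first."""
    lines = text.splitlines(keepends=True)
    return "".join(lines[description_rows:])
```

For example, a file that contains a description line, the header row id,name, and the data row 1,alice is reduced to the header row followed by the data row, which meets the second condition.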

Issues related to data warehousing based on multi-database merging and one-click data warehousing

The error message "Access denied for user" appears when I connect DLA to a database during one-click data warehousing. What do I do?

Cause: The username or password that is used to log on to the database is incorrect.

Solution: Check whether the username or password that is used to log on to the database is changed. If the username or password is changed, log on to the DLA console. In the left-side navigation pane, click Metadata management. On the Metadata management page, find your one-click data warehousing task and click Library table details in the Actions column. On the page that appears, click Configuration and then Update to modify the configurations of the task.

The error message "because Could not create connection to database server. Attempted reconnect 3 times. Giving up" appears when I connect DLA to a database during one-click data warehousing. What do I do?

Cause: The source ApsaraDB RDS database is overloaded.

Solutions:

  • Adjust the time at which data is synchronized for the one-click data warehousing task. This limits the number of read and write operations that run in parallel on the instance.

  • Adjust the advanced configurations of the one-click data warehousing task to reduce the parallelism. For example, if you set the connections-per-job parameter to 10 and the total-allowed-connections parameter to 30, a table can be read by 10 requests in parallel, and the data of up to three tables can be read in parallel. For more information, see Advanced options.

  • Adjust the specifications of your ApsaraDB RDS instance.
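The relationship between the two parameters in the preceding example can be expressed as a small calculation. The following sketch only reproduces the arithmetic of that example. The function name is hypothetical, and rounding down when the values do not divide evenly is an assumption.

```python
def max_parallel_tables(connections_per_job, total_allowed_connections):
    """Return how many tables can be read at the same time."""
    if connections_per_job < 1:
        raise ValueError("connections_per_job must be at least 1")
    # Each table uses connections_per_job connections, and the total is capped.
    return total_allowed_connections // connections_per_job
```

With connections-per-job set to 10 and total-allowed-connections set to 30, the result is 3, which matches the example. Lowering either value reduces the number of connections that are opened against the source database.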

The following error message appears when I connect DLA to a database during one-click data warehousing: "because Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes." What do I do?

Cause: The overall load of DLA Public Edition is excessively high at that time.

Solution: Click Execute again to retry the one-click data warehousing task. If this issue persists, contact DLA technical support personnel.

When I create a data warehouse by merging databases or run a one-click data warehousing task, the error message "No such instance with ID: rm-xxxxxx" appears. What do I do?

This error message indicates that the ApsaraDB RDS instance rm-xxxxxx that is involved in the current data warehousing task has been released.

The error message "because Communications link failure" appears when I run a one-click data warehousing task. What do I do?

Detailed error message: "The last packet successfully received from the server was 900,120 milliseconds ago. The last packet sent successfully to the server was 900,120 milliseconds ago.), 23 out of 110 jobs are finished."

Causes:

  • The database has views.

  • The source ApsaraDB RDS database is overloaded. As a result, the ApsaraDB RDS database fails to restart or does not respond.

Solution: Delete the views from the database or reduce the parallelism.

Which regions support Data Lake Formation?

All regions in mainland China support Data Lake Formation. The following table describes the regions outside mainland China that support Data Lake Formation.

| Region | Supports Data Lake Formation |
| --- | --- |
| China (Hong Kong) | Yes |
| Japan (Tokyo) | Yes |
| Singapore | Yes |
| US (Silicon Valley) | Yes |
| US (Virginia) | Yes |
| UK (London) | Yes |
| Germany (Frankfurt) | Yes |
| Australia (Sydney) | Yes |
| Malaysia (Kuala Lumpur) | Yes |
| Indonesia (Jakarta) | Yes |
| India (Mumbai) | Yes |

The one-click data warehousing task is successful, but some tables are not synchronized. Why?

Some tables may be skipped when the task is running. You can check the number of skipped tables in the status of the task, and view the skipped tables in extraWarnTips of the task details. For example, when you run a one-click data warehousing task, tables whose table names and field names are in Chinese are skipped.

If the table names and field names of the skipped tables are not in Chinese, check whether the Java Database Connectivity (JDBC) account has permissions to access these tables.

The one-click data warehousing task fails to run and the error message "mppErrorCode=30101, because Direct buffer memory..." appears. What do I do?

Cause: The public cluster of the serverless Presto engine is overloaded.

Solution: Manually retry the one-click data warehousing task. If this issue persists, contact DLA technical support personnel.

The one-click data warehousing task fails to run and the error message "because Failed to rename <OSS Path>" appears. What do I do?

Cause: The storage class of the OSS bucket that you configured when you created the data warehouse is Archive. As a result, DLA fails to rename the OSS path.

Solution: Create another data warehousing task and configure an OSS bucket of a non-Archive storage class.

The one-click data warehousing task fails to run for ApsaraDB for MongoDB and the error message "Cannot allocate slice larger than XXXXX bytes" appears. What do I do?

Detailed error message:

"because Error reading from mongodb: Cannot allocate slice larger than 2147483639bytes."

Cause: The size of some fields in the data that the serverless Presto engine of DLA reads from ApsaraDB for MongoDB is excessively large.

Solution: Modify the advanced configurations of the one-click data warehousing task and use sensitive-columns to filter out the fields with large sizes. For more information, see Advanced options.

Why does the one-click data warehousing task fail after it runs for a dozen hours?

Causes:

  • Thousands of tables are stored in the database, which may cause a synchronization timeout.

  • The tables that encounter the synchronization timeout issue do not contain index fields or UNIQUE KEY fields of integer data types. As a result, the split calculation task that runs on the serverless Presto engine times out.

Solution: Set the connections-per-job parameter to 1. For more information, see Advanced options.