The storage format of a dataset determines how much storage space it occupies and how much time and cost Data Lake Analytics (DLA) requires to scan it. Scanning data in the CSV format is slower than scanning the same data in the ORC or Parquet format. To improve scan performance and reduce scan costs, you can convert text data such as CSV into ORC or Parquet before you use DLA to scan it.

Most Alibaba Cloud users store CSV data in Object Storage Service (OSS). Without DLA, improving scan performance requires you to convert the file format with a third-party tool and then upload the converted data to OSS, which is a complex process. To reduce your workload, you can use DLA to convert file formats.

Assume that 1.2 GB of data is stored in OSS in the CSV, TSV, or LOG format. The following table lists the storage space that the data occupies after DLA converts it to each format.

| Storage format | Data source and characteristics | Data volume change | Data volume |
| --- | --- | --- | --- |
| JSON | JSON data generated by a large number of applications. The data contains a large amount of redundant data. | Increase by 151.7% | 3.02 GB |
| AVRO | Data in the Hadoop ecosystem, which is generated by legacy systems. | Increase by 8.3% | 1.3 GB |
| RCFile | Data in the Hadoop ecosystem, which is generated by legacy systems. | Decrease by 2.5% | 1.17 GB |
| Parquet | Data in the Hadoop ecosystem. Adopts high-performance column-oriented storage to improve data query performance. Supports nested data models. Provides functions that return tuple-level statistical data. | Decrease by 53.3% | 560 MB |
| ORC | Data in the Hadoop ecosystem. Provides functions that return tuple-level statistical data. Supports a high compression ratio. | Decrease by 80.4% | 235 MB |
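
The percentages in the table above can be reproduced from the 1.2 GB baseline. The following is a minimal Python sketch of that arithmetic, not part of DLA. It assumes that 1 GB is counted as 1,000 MB, which is the convention that makes the published figures agree. The names `BASELINE_MB`, `converted_mb`, and `percent_change` are illustrative only.

```python
# Baseline from the text: 1.2 GB of CSV, TSV, or LOG data in OSS.
# Assumption: 1 GB = 1000 MB, which matches the figures in the table.
BASELINE_MB = 1200.0

# Converted sizes in MB, copied from the "Data volume" column of the table.
converted_mb = {
    "JSON": 3020.0,
    "AVRO": 1300.0,
    "RCFile": 1170.0,
    "Parquet": 560.0,
    "ORC": 235.0,
}


def percent_change(new_mb, base_mb=BASELINE_MB):
    """Return the signed size change relative to the baseline, in percent."""
    return (new_mb - base_mb) / base_mb * 100.0


# Round to one decimal place, as the table does.
changes = {fmt: round(percent_change(size), 1) for fmt, size in converted_mb.items()}

for fmt, change in changes.items():
    direction = "Increase" if change > 0 else "Decrease"
    print(f"{fmt}: {direction} by {abs(change):.1f}%")
```

A negative value means that the converted data is smaller than the source. Under these assumptions, the column-oriented formats Parquet and ORC are the only ones that cut the volume by more than half.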