The same data occupies different amounts of storage depending on the format in which it is stored, so the time and cost required for Data Lake Analytics (DLA) to scan the data also vary. Scanning data in the CSV format is slower than scanning the same data in the ORC or Parquet format. To improve scan performance and reduce scan costs, you can convert text data, such as CSV data, into ORC or Parquet data before you use DLA to scan it.
Most Alibaba Cloud users store CSV data in Object Storage Service (OSS). To improve scan performance, you would otherwise need to use a third-party tool to convert the file format and then upload the converted data to OSS, which is a cumbersome process. To reduce your workload, you can use DLA to convert file formats instead.
Assume that 1.2 GB of data is stored in OSS in the CSV, TSV, or LOG format. The following table lists the storage space that the data occupies after DLA converts it to each format.
Storage format | Data source and characteristics | Change in data volume | Data volume after conversion |
JSON | JSON data generated by a large number of applications. The data contains a large amount of redundant data. | Increase by 151.7% | 3.02 GB |
AVRO | Data in the Hadoop ecosystem, which is generated by legacy systems. | Increase by 8.3% | 1.3 GB |
RCFile | Data in the Hadoop ecosystem, which is generated by legacy systems. | Decrease by 2.5% | 1.17 GB |
Parquet | Data in the Hadoop ecosystem. | Decrease by 53.3% | 560 MB |
ORC | Data in the Hadoop ecosystem. | Decrease by 80.4% | 235 MB |
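
The percentages in the table can be reproduced from the listed sizes. The following Python snippet is a minimal sketch of that arithmetic. It assumes that 1 GB equals 1,000 MB, which is the convention that matches the figures in the table. The names `ORIGINAL_MB`, `CONVERTED_MB`, `percent_change`, and `describe` are illustrative and are not part of DLA.

```python
# Original size of the CSV, TSV, or LOG data in MB (1.2 GB, assuming 1 GB = 1000 MB).
ORIGINAL_MB = 1200.0

# Sizes after conversion in MB, taken from the table.
CONVERTED_MB = {
    "JSON": 3020.0,
    "AVRO": 1300.0,
    "RCFile": 1170.0,
    "Parquet": 560.0,
    "ORC": 235.0,
}


def percent_change(original_mb, converted_mb):
    """Return the signed size change as a percentage of the original size."""
    return (converted_mb - original_mb) / original_mb * 100.0


def describe(fmt):
    """Format the change for one storage format in the wording used by the table."""
    change = percent_change(ORIGINAL_MB, CONVERTED_MB[fmt])
    direction = "Increase" if change > 0 else "Decrease"
    return f"{fmt}: {direction} by {abs(change):.1f}%"


# Print one line per storage format.
for fmt in CONVERTED_MB:
    print(describe(fmt))
```

For example, the ORC row works out to (235 - 1200) / 1200, which is a decrease of about 80.4%. The data that DLA must scan in that case is roughly one fifth of the original volume.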