After log data is shipped from Log Service to Object Storage Service (OSS), the log data can be stored in different formats. This topic describes the Parquet format.

Parameters

The following table describes the parameters that you must configure when you set the storage format to Parquet. For more information, see Configure a data shipping rule.

[Figure: Parquet Fields]

Parameter | Description
Key Name | The name of the log field that you want to ship to OSS. You can view log fields on the Raw Logs tab of a Logstore. We recommend that you add log fields one at a time. The fields are stored in the Parquet file in the order in which you add them, and each field name becomes the name of a column in the Parquet file. You can ship the fields in the log content and reserved fields such as __time__, __topic__, and __source__. For more information about reserved fields, see Reserved fields. The value of a column in the Parquet file is null in the following two scenarios:
  • The log field does not exist in the log.
  • The value of the log field fails to be converted from the string type to a non-string type, such as double or Int64.
Note The names of the log fields that you add in the Parquet Fields field must be unique.
Type | The data type of the column in the Parquet file. Supported data types are string, Boolean, Int32, Int64, float, and double.

The values of log fields are converted from strings to the data types that you specify. If a value fails to be converted from the string type to a non-string type, the value of the corresponding column in the Parquet file is null.
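The conversion rule can be illustrated with a minimal Python sketch. It is not the implementation that Log Service uses: the function name `convert_field`, the lowercase type names, and the accepted Boolean literals are assumptions made for illustration, and the sketch does not check Int32 range limits.

```python
from typing import Optional, Union

Value = Union[str, bool, int, float]


def convert_field(raw: Optional[str], parquet_type: str) -> Optional[Value]:
    """Convert a raw log field value (a string) to the configured type.

    Returns None, which stands in for a null column value, when the field
    is missing from the log or the conversion fails.
    """
    if raw is None:  # the log field does not exist in the log
        return None
    try:
        if parquet_type == "string":
            return raw
        if parquet_type in ("int32", "int64"):
            return int(raw)
        if parquet_type in ("float", "double"):
            return float(raw)
        if parquet_type == "boolean":
            # Assumption: only "true" and "false" are valid literals.
            lowered = raw.strip().lower()
            if lowered in ("true", "false"):
                return lowered == "true"
            raise ValueError(raw)
    except ValueError:
        return None  # conversion failed, so the column value is null
    raise KeyError(f"unsupported type: {parquet_type}")


print(convert_field("200", "int64"))   # an integer value
print(convert_field("abc", "int64"))   # None: conversion failed
print(convert_field(None, "string"))   # None: field is missing
```

Under these assumptions, a field such as status with the value "200" becomes an Int64 value, while a value that cannot be parsed yields a null column value instead of failing the whole record.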

Directories of OSS buckets

After log data is shipped from Log Service to OSS, the log data is stored in OSS buckets. The following table lists the directories of the OSS buckets.

Compression type | File extension | Directory example | Description
Uncompressed | .parquet | oss://oss-shipper-shenzhen/ecs_test/2016/01/26/20/54_1453812893059571256_937.parquet | You can download the OSS files to your computer and use the parquet-tools utility to open the files. For more information about the parquet-tools utility, visit parquet-tools.
Snappy | .snappy.parquet | oss://oss-shipper-shenzhen/ecs_test/2016/01/26/20/54_1453812893059571256_937.snappy.parquet | You can download the OSS files to your computer and use the parquet-tools utility to open the files. For more information about the parquet-tools utility, visit parquet-tools.
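As a small illustration of the naming convention in the table, the following Python sketch splits an object path into its bucket and object key and infers the compression type from the file extension. The function name `describe_object` and the returned dictionary layout are hypothetical; only the two extensions listed in the table are recognized.

```python
from urllib.parse import urlparse

# Extensions from the table, longest first so that ".snappy.parquet"
# is not mistaken for plain ".parquet".
EXTENSIONS = (
    (".snappy.parquet", "snappy"),
    (".parquet", "uncompressed"),
)


def describe_object(path: str) -> dict:
    """Return the bucket, object key, and compression type of a shipped file."""
    parsed = urlparse(path)
    if parsed.scheme != "oss":
        raise ValueError(f"not an OSS path: {path}")
    for extension, compression in EXTENSIONS:
        if parsed.path.endswith(extension):
            return {
                "bucket": parsed.netloc,
                "key": parsed.path.lstrip("/"),
                "compression": compression,
            }
    raise ValueError(f"not a Parquet file: {path}")


info = describe_object(
    "oss://oss-shipper-shenzhen/ecs_test/2016/01/26/20/"
    "54_1453812893059571256_937.snappy.parquet"
)
print(info["bucket"], info["compression"])
```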

Data consumption

  • You can use E-MapReduce, Spark, or Hive to consume data that is shipped to OSS. For more information, see LanguageManual DDL.
  • You can use inspection tools to consume data.
    The parquet-tools utility provided by the open source community can be used to inspect Parquet files, view the schema of the data stored in the files, and read the data. You can compile the utility yourself or download the parquet-tools-1.6.0rc3-SNAPSHOT build that Log Service provides.
    • View the schema of the data that is stored in a Parquet file
      $ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar schema -d 00_1490803532136470439_124353.snappy.parquet | head -n 30
      message schema {
        optional int32 __time__;
        optional binary ip;
        optional binary __source__;
        optional binary method;
        optional binary __topic__;
        optional double seq;
        optional int64 status;
        optional binary time;
        optional binary url;
        optional boolean ua;
      }
      creator: parquet-cpp version 1.0.0
      file schema: schema
      --------------------------------------------------------------------------------
      __time__: OPTIONAL INT32 R:0 D:1
      ip: OPTIONAL BINARY R:0 D:1
      .......
    • View the data that is stored in a Parquet file
      $ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar head -n 2 00_1490803532136470439_124353.snappy.parquet
      __time__ = 1490803230
      ip = 10.200.98.220
      __source__ = *.*.*.*
      method = POST
      __topic__ =
      seq = 1667821.0
      status = 200
      time = 30/Mar/2017:00:00:30 +0800
      url = /PutData?Category=YunOsAccountOpLog&AccessKeyId=*************&Date=Fri%2C%2028%20Jun%202013%2006%3A53%3A30%20GMT&Topic=raw&Signature=********************************* HTTP/1.1
      __time__ = 1490803230
      ip = 10.200.98.220
      __source__ = *.*.*.*
      method = POST
      __topic__ =
      seq = 1667822.0
      status = 200
      time = 30/Mar/2017:00:00:30 +0800
      url = /PutData?Category=YunOsAccountOpLog&AccessKeyId=*************&Date=Fri%2C%2028%20Jun%202013%2006%3A53%3A30%20GMT&Topic=raw&Signature=********************************* HTTP/1.1

    You can run the java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar -h command to view more information about commands.
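To consume the shipped files from Hive, you typically declare an external table whose columns match the Parquet schema. The following Python sketch generates such a statement from the column names and types that parquet-tools prints. It is a sketch under stated assumptions: the function name `hive_ddl`, the table name `sls_logs`, and the choice of an hour-level directory as the table location are hypothetical, binary columns are assumed to hold UTF-8 text, and the cluster is assumed to be configured to read oss:// paths.

```python
# Parquet primitive types (as printed by parquet-tools) mapped to Hive types.
# Assumption: binary columns hold UTF-8 text and can be read as STRING.
PARQUET_TO_HIVE = {
    "int32": "INT",
    "int64": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
    "binary": "STRING",
}


def hive_ddl(table: str, columns: list, location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for Parquet files.

    columns is a list of (column_name, parquet_type) pairs in file order.
    Column names are quoted with backticks because names such as __time__
    start with an underscore.
    """
    column_lines = ",\n".join(
        f"  `{name}` {PARQUET_TO_HIVE[parquet_type]}"
        for name, parquet_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE {table} (\n{column_lines}\n)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Columns taken from the schema output shown above.
schema = [
    ("__time__", "int32"),
    ("ip", "binary"),
    ("__source__", "binary"),
    ("method", "binary"),
    ("__topic__", "binary"),
    ("seq", "double"),
    ("status", "int64"),
    ("time", "binary"),
    ("url", "binary"),
    ("ua", "boolean"),
]
print(hive_ddl("sls_logs", schema, "oss://oss-shipper-shenzhen/ecs_test/2016/01/26/20/"))
```

Review the generated statement against the Hive DDL reference before you run it, because the supported syntax and type handling depend on the Hive version.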