Jindo SQL is a tool provided by JindoFS. You can use Jindo SQL to analyze metadata, JindoFS access logs, and Object Storage Service (OSS) access logs. This topic describes how to perform these analyses.

Prerequisites

  • Java Development Kit (JDK) 8 is installed on your computer.
  • An E-MapReduce (EMR) cluster is created. For more information, see Create a cluster.

Background information

You can use Jindo SQL to perform the following operations:

  • Analyze JindoFS access logs
  • Analyze metadata
  • Analyze OSS access logs

Limits

Jindo SQL is supported only in EMR V3.36.0 and later minor versions, and EMR V5.2.0 and later minor versions.

Start Jindo SQL

  1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
  2. Run the following command to start Jindo SQL:
    jindo sql
    The following table describes the common parameters that can be used in the jindo sql command.
    Parameter | Description
    --- | ---
    -f | Specifies the SQL file to run.
    -i | Specifies an SQL script that is automatically executed for initialization after the jindo sql command starts.
    -d | Specifies parameters as key-value pairs, such as -d A=B.
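    For example, the following invocation is a minimal sketch that combines these parameters. The file path and the variable name day are hypothetical; replace them with your own values.
    # Run the statements in /tmp/query.sql and pass a key-value pair to the session
    jindo sql -f /tmp/query.sql -d day=2020-10-20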

Structures of built-in tables in Jindo SQL

  • audit_log_source (partitioned table)
    The audit_log_source table is used to store raw access logs of JindoFS.
    Parameter | Description
    --- | ---
    datetime | The time when the operation is performed. Format: yyyy-MM-dd HH:mm:ss.
    allowed | Indicates whether the operation is allowed. Valid values: true (the operation is allowed) and false (the operation is not allowed).
    ugi | The user who performs the operation, including information about the authentication method.
    ip | The IP address of the client.
    ns | The name of the namespace in block storage mode.
    cmd | The operation command.
    src | The source path.
    dst | The destination path. This parameter can be left empty.
    perm | The operation permissions on the file.
    date (partition key column) | The date when the log entry is generated. Format: yyyy-MM-dd.
  • audit_log
    The audit_log table is used to store JindoFS access logs. You can use the partition key column to filter data in the table.
    Parameter | Description
    --- | ---
    datetime | The time when the operation is performed. Format: yyyy-MM-dd HH:mm:ss.
    allowed | Indicates whether the operation is allowed. Valid values: true (the operation is allowed) and false (the operation is not allowed).
    ugi | The user who performs the operation, including information about the authentication method.
    ip | The IP address of the client.
    ns | The name of the namespace in block storage mode.
    cmd | The operation command.
    src | The source path.
    dst | The destination path. This parameter can be left empty.
    perm | The operation permissions on the file.
    date (partition key column) | The date when the log entry is generated. Format: yyyy-MM-dd.
  • fs_image (partitioned table)
    The fs_image table is used to store dumped metadata image information.
    Parameter | Description
    --- | ---
    atime | The time when the inode was last accessed.
    attr | File-related attributes.
    etag | The ETag value of OSS.
    id | The ID of the inode.
    mtime | The time when the inode was last modified.
    name | The name of the inode.
    owner | The name of the owner.
    ownerGroup | The name of the group to which the owner belongs.
    parentId | The ID of the parent inode.
    permission | The operation permissions on the file.
    size | The size of the inode.
    state | The status of the inode.
    type | The type of the inode.
    storagePolicy | The storage policy.
    namespace (partition key column) | The name of the namespace.
    datetime (partition key column) | The time when the metadata is dumped.
  • oss_access_log_source
    If the table partitioning mode is enabled, the oss_access_log_source table is a partitioned table. This table is used to store raw access logs of OSS.
    Parameter | Description
    --- | ---
    line | A raw log entry.
    bucket (partition key column) | The name of the bucket.
    partition_date (partition key column) | The date when the log entry is generated. Format: yyyy-MM-dd.
  • oss_access_log
    If the table partitioning mode is enabled, you can use the partition key columns to filter data in the oss_access_log table. This table is used to store OSS access logs.
    Parameter | Description
    --- | ---
    Remote_IP | The IP address of the requester.
    Reserved | A reserved field. Fixed value: -.
    Reserved1 | A reserved field. Fixed value: -.
    Time | The time when OSS receives the request.
    Request_URI | The request URI, including the query string. OSS ignores query string parameters that start with x- but still records them in access logs. Therefore, you can use a query string parameter that starts with x- as the tag of a request and then use the tag to find the log entry that corresponds to the request.
    HTTP_Status | The HTTP status code that is returned by OSS.
    SentBytes | The downstream traffic generated by the request. Unit: bytes.
    RequestTime | The time required to process the request. Unit: milliseconds.
    Referer | The HTTP Referer header of the request.
    User_Agent | The User-Agent header of the HTTP request.
    HostName | The destination domain name in the request.
    Request_ID | The ID of the request.
    LoggingFlag | Indicates whether log dump is enabled.
    Requester | The user ID of the requester. A value of - indicates anonymous access.
    Operation | The request type.
    Bucket | The name of the destination bucket.
    Key | The name of the destination object.
    ObjectSize | The size of the destination object. Unit: bytes.
    Server_Cost_Time | The time required by OSS to process the request. Unit: milliseconds.
    ErrorCode | The error code that is returned by OSS. A value of - indicates that no error code is returned.
    RequestLength | The size of the request. Unit: bytes.
    UserID | The ID of the bucket owner.
    Delta_DataSize | The change in the bucket size. A value of - indicates that the request does not involve write operations on objects.
    SyncRequest | Indicates whether the request is an Alibaba Cloud CDN back-to-origin request. Valid values: cdn (the request is a CDN back-to-origin request) and - (the request is not a CDN back-to-origin request).
    StorageClass | The storage class of the destination object. Valid values: Standard, IA, Archive, Cold Archive, and - (no storage class obtained).
    TargetStorageClass | The storage class to which the object is converted after a lifecycle rule is triggered or the CopyObject operation is called. Valid values: Standard, IA, Archive, Cold Archive, and - (no storage class conversion involved).
    TransmissionAccelerationAccessPoint | The accelerate endpoint that is used to access the destination bucket over the transfer acceleration feature. A value of - indicates that no accelerate endpoint is used or that the accelerate endpoint is in the same region as the destination bucket. For example, if the requester accesses the destination bucket by using the accelerate endpoint of the China (Hangzhou) region, the value of this parameter is cn-hangzhou.
    AccessKeyID | The AccessKey ID that is used to access the bucket.
    bucket (partition key column) | The name of the bucket.
    partition_date (partition key column) | The date when the log entry is generated. Format: yyyy-MM-dd.
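
The following sample queries sketch how the partition key columns of these tables can be used to narrow a scan. The namespace, bucket, and date values are hypothetical; replace them with values from your own environment.

    -- JindoFS access logs for a single day
    select * from audit_log where date='2020-10-20' limit 10;
    -- Dumped metadata of a single namespace
    select name, size, owner from fs_image where namespace='test' limit 10;
    -- Raw OSS access logs of one bucket on one day (table partitioning mode enabled)
    select line from oss_access_log_source where bucket='examplebucket' and partition_date='2021-03-01' limit 10;

Filtering on partition key columns prunes entire partitions instead of scanning all log files, which is why the table partitioning mode improves query efficiency.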

Analyze JindoFS access logs

JindoFS allows you to use SQL queries to analyze the JindoFS access logs that are stored in OSS. For example, you can identify the most frequently used commands or the most active IP addresses in the related tables. You can run the jindo sql command to analyze the data.
Note Make sure that the AuditLog feature is enabled. For more information, see Use Jindo AuditLog.
You can use the following commands related to Jindo SQL:
  • Run the following command to query tables:
    show tables;
    Note For more information about the table structures, see Structures of built-in tables in Jindo SQL.
    The information shown in the following figure is returned.
  • Run the following command to query partitions:
    show partitions audit_log_source;
    Information similar to the output in the following figure is returned.
  • Run the following commands to query data:
    • select * from audit_log_source limit 10;
      Information similar to the output in the following figure is returned.
    • select * from audit_log limit 10;
      Information similar to the output in the following figure is returned.
  • Collect statistics about how frequently different commands were run on October 20, 2020. A sample query is shown after this list.
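
The following query is a minimal sketch of such a statistic. It counts how often each command appears in the audit_log table for the date partition 2020-10-20 and sorts the commands by frequency; the date value is only an example.

    select cmd, count(*) as cnt from audit_log where date='2020-10-20' group by cmd order by cnt desc;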

Analyze metadata

JindoFS allows you to use SQL queries to analyze the metadata files that are stored in JindoFS. You can run the jindo sql command to analyze the related tables.
Note Make sure that the AuditLog feature is enabled. For more information, see Use Jindo AuditLog.
  1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
  2. Run the following command to start Jindo SQL:
    jindo sql
  3. Query the tables that can be analyzed by using Jindo SQL.
    • Run the show tables command to query the tables that can be analyzed. Jindo SQL provides two built-in modules: audit_log and fs_image, which are used for auditing and metadata analysis.
      The following figure shows an example.
    • Run the show partitions fs_image command to view the partition information of the fs_image table. Each partition contains the data generated by using the jindo jfs -dumpMetadata command.
      The following figure shows an example.
  4. Query and analyze metadata.
    Jindo SQL uses Spark SQL syntax. You can use Jindo SQL to query and analyze data in the fs_image table.
    The following figure shows an example.
    For example, you can obtain the number of directories in a specific namespace based on the dumped metadata, as shown in the sample query after this procedure.
    Note The namespace and datetime columns are added during the dump. The namespace column contains the namespace name, and the datetime column contains the timestamp when the metadata is uploaded.
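
A hedged sketch of such a query follows. It assumes a namespace named test and that directory inodes carry a distinct value in the type column; the exact value (DIR is used here) is an assumption, so check the contents of the type column in your dump first.

    -- Count directory inodes in one namespace; 'DIR' is an assumed type value
    select count(*) from fs_image where namespace='test' and type='DIR';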

Analyze OSS access logs

Notice To analyze OSS access logs, you must specify a log directory and whether to enable the table partitioning mode. If you enable the table partitioning mode, OSS archives logs in the partitioned table by bucket and partition_date, and you can execute statements that query only a specific partition. This improves query efficiency. If you enable the table partitioning mode, you must use this mode each time you query data in the table. Otherwise, some files may already be archived in the specified log directory, and some data cannot be queried.
JindoFS allows you to use SQL queries to analyze the OSS access log files that are stored in OSS. You can run the jindo sql command to analyze the related tables.
Note Make sure that log dump is enabled. For more information, see Logging.
  1. Log on to your cluster in SSH mode. For more information, see Log on to a cluster.
  2. Run the following command to start Jindo SQL:
    jindo sql
  3. Run the following command to specify the storage path of the OSS access logs and whether to enable the table partitioning mode:
    jindo sql -d access_log_path=oss://test-sh/oss-accesslog -d partition.table.enabled=true

    access_log_path specifies the storage path, and partition.table.enabled specifies whether the table partitioning mode is enabled.
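
    After Jindo SQL starts with these settings, you can query a specific partition of the oss_access_log table. The following query is a sketch; the bucket name and the date are hypothetical and must match an existing partition.
    -- Count requests per operation type for one bucket on one day
    select Operation, count(*) as cnt from oss_access_log where bucket='examplebucket' and partition_date='2021-03-01' group by Operation order by cnt desc;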

FAQ

  • Q: How do I modify the startup parameters of Jindo SQL?
    Jindo SQL is developed based on Spark, and the default resources allocated to it may be insufficient. If you want to increase the resources of Jindo SQL, modify its startup parameters by using the JINDO_SPARK_OPTS environment variable:
    export JINDO_SPARK_OPTS="--conf spark.driver.memory=4G --conf spark.executor.instances=20 --conf spark.executor.cores=5 --conf spark.executor.memory=20G"
  • Q: How do I use Hive to analyze tables?

    Tables in the default database are invisible to Hive jobs. This prevents Hive metadata from being polluted. If you want to use Hive to analyze the tables, execute the show create table {table_name} statement to obtain the statement that was used to create a specific table, and then use that statement to create a corresponding external table in Hive.
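    The following sketch shows the idea for the audit_log table. The column list follows the table structure described in this topic, but the row format and the LOCATION path are assumptions; take the authoritative definition from the output of show create table and adapt it.
    -- In Jindo SQL: print the statement that was used to create the built-in table
    show create table audit_log;
    -- In Hive: recreate the table as an external table (row format and path are hypothetical)
    CREATE EXTERNAL TABLE audit_log_hive (
      `datetime` STRING, allowed STRING, ugi STRING, ip STRING, ns STRING,
      cmd STRING, src STRING, dst STRING, perm STRING
    )
    PARTITIONED BY (`date` STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 'oss://examplebucket/jindo-audit-log/';
    After you create the external table, register its partitions in Hive, for example by running MSCK REPAIR TABLE audit_log_hive, before you query it.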