Jindo SQL is a tool provided by JindoFS. This topic describes how to use Jindo SQL to analyze metadata, JindoFS access logs, and Object Storage Service (OSS) access logs.
Prerequisites
- Java Development Kit (JDK) 8 is installed on your computer.
- An E-MapReduce (EMR) cluster is created. For more information, see Create a cluster.
Background information
Limits
Jindo SQL is supported only in EMR V3.36.0 and later minor versions, and EMR V5.2.0 and later minor versions.
Start Jindo SQL
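Jindo SQL is started from the command line. A minimal sketch, assuming you are logged on to a node of the EMR cluster and the jindo command is on the PATH:

```shell
# Start the interactive Jindo SQL shell on an EMR cluster node.
jindo sql
```

After the shell starts, you can run the SQL statements shown in the following sections.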
Structures of built-in tables in Jindo SQL
- audit_log_source (partitioned table)
The audit_log_source table is used to store raw access logs of JindoFS.
| Parameter | Description |
| --- | --- |
| datetime | The time when the operation is performed. Format: yyyy-MM-dd HH:mm:ss. |
| allowed | Indicates whether the current operation is allowed. Valid values: true (the operation is allowed) and false (the operation is not allowed). |
| ugi | The user who performs the operation, including information about the authentication method. |
| ip | The client IP address. |
| ns | The name of the namespace in block storage mode. |
| cmd | The operation command. |
| src | The source path. |
| dst | The destination path. This parameter can be left empty. |
| perm | The operation permissions on the file. |
| date (partition key column) | The date when the log entry is generated. Format: YYYY-mm-DD. |
- audit_log
The audit_log table is used to store JindoFS access logs. You can use the partition key column to filter data in the table.
| Parameter | Description |
| --- | --- |
| datetime | The time when the operation is performed. Format: yyyy-MM-dd HH:mm:ss. |
| allowed | Indicates whether the current operation is allowed. Valid values: true (the operation is allowed) and false (the operation is not allowed). |
| ugi | The user who performs the operation, including information about the authentication method. |
| ip | The client IP address. |
| ns | The name of the namespace in block storage mode. |
| cmd | The operation command. |
| src | The source path. |
| dst | The destination path. This parameter can be left empty. |
| perm | The operation permissions on the file. |
| date (partition key column) | The date when the log entry is generated. Format: YYYY-mm-DD. |
- fs_image (partitioned table)
The fs_image table is used to store dumped image information.
| Parameter | Description |
| --- | --- |
| atime | The time when the inode was last accessed. |
| attr | File-related attributes. |
| etag | The ETag value of OSS. |
| id | The ID of the inode. |
| mtime | The time when the inode was last modified. |
| name | The name of the inode. |
| owner | The name of the owner. |
| ownerGroup | The name of the group to which the owner belongs. |
| parentId | The ID of the parent node. |
| permission | The operation permissions on the file. |
| size | The size of the inode. |
| state | The status of the inode. |
| type | The type of the inode. |
| storagePolicy | The storage policy. |
| namespace (partition key column) | The name of the namespace. |
| datetime (partition key column) | The time when the information is dumped. |
- oss_access_log_source
If the table partitioning mode is enabled, the oss_access_log_source table is a partitioned table. This table is used to store raw access logs of OSS.
| Parameter | Description |
| --- | --- |
| line | A raw log entry. |
| bucket (partition key column) | The name of the bucket. |
| partition_date (partition key column) | The date when the log entry is generated. Format: YYYY-mm-DD. |
- oss_access_log
If the table partitioning mode is enabled, you can use the partition key columns to filter data in the oss_access_log table. This table is used to store OSS access logs.
| Parameter | Description |
| --- | --- |
| Remote_IP | The IP address of the requester. |
| Reserved | A reserved field. Fixed value: -. |
| Reserved1 | A reserved field. Fixed value: -. |
| Time | The time when OSS receives the request. |
| Request_URI | The request URL, including the query string. OSS ignores query string parameters that start with x-, but such parameters are recorded in access logs. You can therefore use a query string parameter that starts with x- as the tag of a request and use the tag to query the log that corresponds to the request. |
| HTTP_Status | The HTTP status code returned by OSS. |
| SentBytes | The downstream traffic generated by the request. Unit: bytes. |
| RequestTime | The time required to process the request. Unit: milliseconds. |
| Referer | The HTTP Referer of the request. |
| User_Agent | The User-Agent header in the HTTP request. |
| HostName | The destination domain name in the request. |
| Request_ID | The ID of the request. |
| LoggingFlag | Indicates whether log dump is enabled. |
| Requester | The user ID of the requester. A value of - indicates anonymous access. |
| Operation | The request type. |
| Bucket | The name of the destination bucket. |
| Key | The name of the destination object. |
| ObjectSize | The size of the destination object. Unit: bytes. |
| Server_Cost_Time | The time required by OSS to process the request. Unit: milliseconds. |
| ErrorCode | The error code returned by OSS. A value of - indicates that no error code is returned. |
| RequestLength | The size of the request. Unit: bytes. |
| UserID | The ID of the bucket owner. |
| Delta_DataSize | The variation of the bucket size. A value of - indicates that the request does not involve write operations on objects. |
| SyncRequest | Indicates whether the request is an Alibaba Cloud CDN back-to-origin request. Valid values: cdn (the request is a CDN back-to-origin request) and - (the request is not a CDN back-to-origin request). |
| StorageClass | The storage class of the destination object. Valid values: Standard, IA, Archive, Cold Archive, and - (no storage class obtained). |
| TargetStorageClass | The storage class to which the object is converted after a lifecycle rule is triggered or the CopyObject operation is called. Valid values: Standard, IA, Archive, Cold Archive, and - (no storage class conversion involved). |
| TransmissionAccelerationAccessPoint | The accelerate endpoint that is used to access the destination bucket over the transfer acceleration feature. A value of - indicates that no accelerate endpoint is used or that the accelerate endpoint is in the same region as the destination bucket. For example, if the requester accesses the destination bucket by using the accelerate endpoint of the China (Hangzhou) region, the value of this parameter is cn-hangzhou. |
| AccessKeyID | The AccessKey ID that is used to access the bucket. |
| bucket (partition key column) | The name of the bucket. |
| partition_date (partition key column) | The date when the log entry is generated. Format: YYYY-mm-DD. |
Analyze JindoFS access logs
Run the jindo sql command to start Jindo SQL. Then, perform the following operations to analyze the access logs:
- Run the following command to query tables:
```sql
show tables;
```
Note For more information about the table structures, see Structures of built-in tables in Jindo SQL.
The information shown in the following figure is returned.
- Run the following command to query partitions:
```sql
show partitions audit_log_source;
```
Information similar to the output in the following figure is returned.
- Run the following commands to query data:
```sql
select * from audit_log_source limit 10;
```
Information similar to the output in the following figure is returned.

```sql
select * from audit_log limit 10;
```
Information similar to the output in the following figure is returned.
- Run the command shown in the following figure to collect statistics about how frequently each command was run on October 20, 2020.
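A query of the following shape produces such statistics. It is a sketch, not the exact command from the figure; it uses the cmd column and the date partition key of the audit_log table described earlier, with the example date from the text:

```sql
-- Count how often each command was run on 2020-10-20,
-- using the audit_log table's cmd column and date partition key.
select cmd, count(*) as cnt
from audit_log
where date = '2020-10-20'
group by cmd
order by cnt desc;
```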
Analyze metadata
Run the jindo sql command to start Jindo SQL. Then, run SQL statements on the fs_image table to analyze the metadata files.
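For example, a query of the following shape summarizes one dumped image by inode type. The namespace and datetime partition values below are hypothetical placeholders; replace them with real values returned by show partitions fs_image:

```sql
-- Count inodes of each type and sum their sizes in one dumped image.
-- 'test-ns' and '2020-10-20-00-00' are hypothetical partition values;
-- replace them with real ones from: show partitions fs_image;
select type, count(*) as cnt, sum(size) as total_size
from fs_image
where namespace = 'test-ns' and datetime = '2020-10-20-00-00'
group by type;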
Analyze OSS access logs
Run the jindo sql command to start Jindo SQL. Then, run SQL statements on the oss_access_log table to analyze the log files.
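For example, a query of the following shape counts requests by request type, assuming the table partitioning mode is enabled. The bucket and partition_date values are hypothetical placeholders:

```sql
-- Count OSS requests by request type for one bucket and day.
-- 'examplebucket' and '2020-10-20' are hypothetical placeholder values.
select Operation, count(*) as requests
from oss_access_log
where bucket = 'examplebucket' and partition_date = '2020-10-20'
group by Operation
order by requests desc;
```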
FAQ
- Q: How do I modify the startup parameters of Jindo SQL?
Jindo SQL is developed based on Spark. Therefore, the initial resources of Jindo SQL may be insufficient. To increase the resources, modify the startup parameters of Jindo SQL by using the JINDO_SPARK_OPTS environment variable. Example:

```shell
export JINDO_SPARK_OPTS="--conf spark.driver.memory=4G --conf spark.executor.instances=20 --conf spark.executor.cores=5 --conf spark.executor.memory=20G"
```
- Q: How do I use Hive to analyze tables?
Tables in the default database are invisible to Hive jobs. This prevents Hive metadata from being polluted. If you want to use Hive to analyze the tables, you can execute the show create table {table_name} statement to query the statement that was used to create a specific table, and then execute a similar SQL statement to create the table in Hive. In addition, the tables must be created as external tables in Hive.
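For example, using the built-in audit_log table described earlier, you can obtain its definition in Jindo SQL and then adapt it into a CREATE EXTERNAL TABLE statement for Hive:

```sql
-- Show the CREATE TABLE statement of the built-in audit_log table.
show create table audit_log;
```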