In E-MapReduce (EMR) V3.30.0 or later, JindoFS in block storage mode allows you to dump the metadata of an entire namespace to Object Storage Service (OSS) and use Jindo SQL to analyze the metadata.

Background information

In Hadoop Distributed File System (HDFS), metadata is stored in a snapshot file named fsimage. This file contains the complete file system namespace, including files, directories, block information, and file system quotas. In HDFS, you can run commands to download the fsimage file in the XML format to your on-premises machine and then inspect the file to analyze the metadata offline. JindoFS does not require you to download metadata to your on-premises machine.
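
For comparison, the standard HDFS workflow relies on the dfsadmin and Offline Image Viewer (oiv) tools. The commands below are standard HDFS tooling; the local paths and the fsimage file name are illustrative:

# Download the most recent fsimage snapshot from the NameNode into a local directory.
hdfs dfsadmin -fetchImage /tmp

# Convert the downloaded binary fsimage to XML for offline analysis.
hdfs oiv -p XML -i /tmp/fsimage_0000000000000001234 -o /tmp/fsimage.xml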

Upload HDFS metadata to OSS

Run the following command to upload metadata of a specific namespace to OSS:
jindo jfs -dumpMetadata <nsName>

<nsName> indicates the name of the namespace in block storage mode.

For example, run the following command to upload the metadata of the test-block namespace to OSS for offline analysis:
jindo jfs -dumpMetadata test-block
If the following information appears, the metadata is uploaded to OSS and stored as a file in the JSON format:
Sucessfully upload namespace metadata to OSS.
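
If you want to verify the upload, you can list the destination directory with any OSS-capable client. For example, with the Hadoop shell on the EMR cluster (the bucket path is illustrative and follows the path structure described in the next section):

hadoop fs -ls oss://abc/metadataDump/test-block/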

Metadata upload path

The metadata is uploaded to the metadataDump subdirectory of the JindoFS system information (sysinfo) path.

For example, if namespace.sysinfo.oss.uri is set to oss://abc/, metadata is uploaded to the oss://abc/metadataDump directory.
Parameter                            | Description
namespace.sysinfo.oss.uri            | The path of the OSS bucket, for example, oss://abc/.
namespace.sysinfo.oss.endpoint       | The endpoint of OSS. You can specify the endpoint of a different region.
namespace.sysinfo.oss.access.key     | The AccessKey ID of your Alibaba Cloud account.
namespace.sysinfo.oss.access.secret  | The AccessKey secret of your Alibaba Cloud account.
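
For example, in properties-style notation (the values below are illustrative; configure the parameters for your JindoFS namespace service), the settings might look like this:

namespace.sysinfo.oss.uri = oss://abc/
namespace.sysinfo.oss.endpoint = oss-cn-hangzhou.aliyuncs.com
namespace.sysinfo.oss.access.key = <yourAccessKeyId>
namespace.sysinfo.oss.access.secret = <yourAccessKeySecret>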
Batch information: HDFS metadata changes as the file system is used. Therefore, each time you want to analyze the metadata, you must dump a new snapshot of it. Each time you run a Jindo command to upload metadata, a batch number that is generated based on the upload time is used as the root directory that stores the metadata. This ensures that the data uploaded in earlier batches is not overwritten. You can delete historical batches based on your requirements. The upload path uses the following structure:

<system information path>/<namespace>/<batch number>

  • <system information path> is the system information path configured in OSS, for example, oss://abc/metadataDump.
  • <namespace> is the name of the namespace.
  • <batch number> is the batch number that is generated based on the upload time.

Metadata schema

HDFS metadata that is uploaded to OSS is stored as a file in the JSON format. The file uses the following schema:
{
  "type":"string",           /*Inode type: FILE or DIRECTORY*/
  "id":"string",             /*Inode ID*/
  "parentId":"string",       /*ID of the parent inode*/
  "name":"string",           /*Inode name*/
  "size":"int",              /*Inode size, BIGINT*/
  "permission":"int",        /*Permission, INT*/
  "owner":"string",          /*Name of the owner*/
  "ownerGroup":"string",     /*Name of the owner group*/
  "mtime":"int",             /*Time when the inode was last modified, BIGINT*/
  "atime":"int",             /*Time when the inode was last accessed, BIGINT*/
  "attributes":"string",     /*File-related attributes*/
  "state":"string",          /*Inode state*/
  "storagePolicy":"string",  /*Storage policy*/
  "etag":"string"            /*ETag*/
}
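
For reference, a record in the dumped file might look like the following. All values are illustrative and do not represent actual output:

{"type":"DIRECTORY","id":"16386","parentId":"16385","name":"user","size":0,"permission":493,"owner":"hadoop","ownerGroup":"hadoop","mtime":1604995200000,"atime":1604995200000,"attributes":"","state":"","storagePolicy":"","etag":""}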

Use Jindo SQL to analyze metadata

  1. Run the jindo sql command to start Jindo SQL.
  2. Query the tables whose data can be analyzed by using Jindo SQL.
    • Run the show tables command to view the tables whose data can be analyzed. Jindo SQL provides two built-in tables, audit_log and fs_image, which are used for audit log analysis and metadata analysis, respectively.
    • Run the show partitions fs_image command to view the partition information of the fs_image table. Each partition contains the data generated by one run of the jindo jfs -dumpMetadata command.
  3. Query and analyze metadata.
    Jindo SQL uses the Spark SQL syntax. You can use Jindo SQL to query and analyze data in the fs_image table.

    Compared with the metadata schema, the fs_image table contains two additional columns: namespace and datetime. They indicate the name of the namespace and the timestamp at which the metadata was uploaded.

    Example: obtain the number of directories in a specific namespace based on the dumped metadata, as shown in the sketch after this list.
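
    The following queries are a minimal sketch of this workflow. The partition values are illustrative; run show partitions fs_image to obtain the actual namespace and datetime values of your dumps:

    -- Inspect a few records of the fs_image table.
    SELECT * FROM fs_image LIMIT 10;

    -- Count the directories in one dump of the test-block namespace.
    SELECT COUNT(*) FROM fs_image
    WHERE type = 'DIRECTORY'
      AND namespace = 'test-block'
      AND datetime = '20201113161000';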

Use Hive to analyze metadata

  1. Create a table schema in Hive.

    You can use the following command to create a table schema of the metadata that is used for queries in Hive:

    CREATE EXTERNAL TABLE `table_name` 
    (`type` string,
     `id` string,
     `parentId` string,
     `name` string,
     `size` bigint, 
     `permission` int,
     `owner` string,
     `ownerGroup` string,
     `mtime` bigint, 
     `atime` bigint,
     `attributes` string,
     `state` string,
     `storagePolicy` string,
     `etag` string) 
     ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' 
     STORED AS TEXTFILE 
     LOCATION '<the OSS path to which the metadata file is uploaded>';
  2. Use Hive to analyze data offline.
    After you create the Hive table, you can use Hive SQL to analyze the metadata. For example, the following statement returns the first 200 records:
    select * from table_name limit 200;
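
    Any Hive SQL statement that fits the schema can be used from this point. As a minimal sketch, the following query aggregates the file count and total size per owner; table_name is the table created in step 1:

    -- Aggregate file count and total size by owner, largest consumers first.
    SELECT owner,
           COUNT(*)    AS file_count,
           SUM(`size`) AS total_size
    FROM table_name
    WHERE `type` = 'FILE'
    GROUP BY owner
    ORDER BY total_size DESC;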