EMR Doctor provides a storage analysis feature for data stored in Object Storage Service (OSS). You can use this feature to analyze the usage and health status of OSS storage resources and to govern the data stored in OSS efficiently.

Background information

OSS provides the bucket inventory feature. If you enable this feature, OSS periodically generates inventory lists for a bucket. The inventory lists store information about objects, such as the number and size of the objects. Based on the inventory lists that are generated for a bucket, EMR Doctor analyzes the usage and health status of the data in the bucket and the association of that data with Hive storage resources.

Before you can use the storage analysis feature, you must enable the bucket inventory feature for a bucket. For more information about the bucket inventory feature, see Bucket inventory.

Precautions

You are charged when you use the bucket inventory feature. For more information, see Bucket inventory.

Enable the bucket inventory feature

Perform the following steps in the OSS console to enable the bucket inventory feature for a bucket whose storage resources you want to analyze. If your cluster uses multiple OSS buckets, repeat the steps for each bucket that you want to analyze.

  1. Log on to the OSS console.
  2. In the left-side navigation pane, click Buckets. On the Buckets page, click the name of the desired bucket.
  3. In the left-side navigation pane, choose Data Management > Bucket Inventory.
  4. On the Bucket Inventory page, click Create Inventory.
  5. In the Create Inventory panel, configure the parameters. For more information, see Bucket inventory.
    Important
    • Make sure that the bucket that you select for Inventory Storage Bucket is the bucket for which you want to enable the bucket inventory feature.
    • If more than 10 billion objects are stored in the bucket, we recommend that you select Weekly for the Frequency parameter. If 10 billion or fewer objects are stored in the bucket, you can select Daily.
    • Make sure that you select Object Size and Storage Class for the Optional Fields parameter.
  6. Read and select I understand the terms and agree to authorize Alibaba Cloud OSS to access the resources in my buckets. Then, click OK.
    A long period of time may be required to generate inventory lists for a large number of objects. The following structure shows the directories in which generated inventory lists are stored.
    dest_bucket
        └──destination-prefix/
            └──src_bucket/
                └──inventory_id/
                    ├──YYYY-MM-DDTHH-MMZ/
                    │   ├──manifest.json
                    │   └──manifest.checksum
                    └──data/
                        └──745a29e3-bfaa-490d-9109-47086afcc****.csv.gz

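    dest_bucket indicates the bucket in which the inventory lists are stored. In this topic, it is the same bucket for which the bucket inventory feature is enabled. destination-prefix indicates the directory in which the generated inventory lists are stored. src_bucket indicates the bucket for which the inventory lists are generated. inventory_id indicates the name of the inventory that you configured.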

    In this example, the path1/doctor-hive-oss-test1/oss-manifest directory is generated.
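
    The relationship between the directory structure and the example path can be sketched in Python. This is a minimal illustration only: the function name manifest_dir and its parameters are hypothetical and are not part of EMR Doctor or OSS, and skipping an empty prefix is an assumption made for the sketch.

```python
def manifest_dir(destination_prefix: str, src_bucket: str, inventory_id: str) -> str:
    """Build destination-prefix/src_bucket/inventory_id from its three parts.

    Hypothetical helper for illustration; not an EMR Doctor or OSS API.
    """
    # Strip stray slashes so "path1/" and "path1" produce the same result.
    cleaned = [part.strip("/") for part in (destination_prefix, src_bucket, inventory_id)]
    if not cleaned[1] or not cleaned[2]:
        raise ValueError("src_bucket and inventory_id must not be empty")
    # Assumption: an empty prefix is simply left out of the path.
    return "/".join(part for part in cleaned if part)


# Values taken from the example in this topic.
print(manifest_dir("path1", "doctor-hive-oss-test1", "oss-manifest"))
```

    With the values from this topic, the sketch prints path1/doctor-hive-oss-test1/oss-manifest, which matches the directory in the preceding example.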

Configure the storage analysis feature

The storage analysis feature depends on the inventory lists that are generated when you use the bucket inventory feature. You must configure the following parameters. For more information, see Configuration.
  • collect.oss.bucket: the name of the bucket whose objects you want to analyze.
  • collect.oss.manifest.dir: the directory in which the generated inventory lists are stored. You can configure the directory based on the directory structure that is described in the Enable the bucket inventory feature section of this topic. You need to configure only the destination-prefix/src_bucket/inventory_id/ part of the path, which is path1/doctor-hive-oss-test1/oss-manifest in the preceding example.
Important If your cluster uses multiple buckets and you enabled the bucket inventory feature for each of them, specify all bucket names in collect.oss.bucket and all inventory list directories in collect.oss.manifest.dir. Separate the bucket names and the directories with commas (,). Make sure that the order of the bucket names matches the order of the directories.

Configuration for a single bucket

In this example, the doctor-hive-oss-test1 bucket is used. The following configuration is used for storage analysis.
collect.oss.bucket: doctor-hive-oss-test1
collect.oss.manifest.dir: path1/doctor-hive-oss-test1/oss-manifest

Configuration for multiple buckets

In this example, the doctor-hive-oss-test1 and doctor-hive-oss-test2 buckets are used. The following configuration is used for storage analysis.
collect.oss.bucket: doctor-hive-oss-test1,doctor-hive-oss-test2
collect.oss.manifest.dir: path1/doctor-hive-oss-test1/oss-manifest,path2/doctor-hive-oss-test2/test
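
Because the two values are matched by position, it can help to check them before you save the configuration. The following Python sketch shows one way to do this. The function name pair_buckets_with_dirs is hypothetical, and the check only illustrates the ordering rule described in this topic. It is not how EMR Doctor itself parses the values.

```python
def pair_buckets_with_dirs(buckets_value: str, dirs_value: str) -> dict:
    """Pair each bucket name with its inventory directory by position.

    Hypothetical helper for illustration; not part of EMR Doctor.
    """
    # Split the comma-separated values and drop surrounding whitespace.
    buckets = [item.strip() for item in buckets_value.split(",") if item.strip()]
    dirs = [item.strip() for item in dirs_value.split(",") if item.strip()]
    # Each bucket needs exactly one directory, in the same order.
    if len(buckets) != len(dirs):
        raise ValueError(
            f"{len(buckets)} bucket name(s) but {len(dirs)} directory path(s)"
        )
    return dict(zip(buckets, dirs))


# Values taken from the multiple-bucket example in this topic.
pairs = pair_buckets_with_dirs(
    "doctor-hive-oss-test1,doctor-hive-oss-test2",
    "path1/doctor-hive-oss-test1/oss-manifest,path2/doctor-hive-oss-test2/test",
)
for bucket, directory in pairs.items():
    print(bucket, "->", directory)
```

A mismatch in the number of entries raises a ValueError. The sketch cannot detect entries that are listed in the wrong order, so review the printed pairs manually.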