EMR Doctor provides a storage analysis feature for data stored in Object Storage Service (OSS). You can use this feature to analyze the usage and health status of OSS storage resources and to govern data stored in OSS efficiently.
Background information
OSS provides the bucket inventory feature. If you enable this feature, OSS periodically generates inventory lists for a bucket. The inventory lists store information about objects, such as the number and size of the objects. Based on the inventory lists that are generated for a bucket, EMR Doctor analyzes the usage and health status of data in the bucket and the association of that data with Hive storage resources.
Before you can use the storage analysis feature, you must enable the bucket inventory feature for a bucket. For more information about the bucket inventory feature, see Bucket inventory.
Precautions
You are charged when you use the bucket inventory feature. For more information, see Bucket inventory.
Enable the bucket inventory feature
If your cluster uses multiple OSS buckets and you want to analyze storage resources in the buckets, perform the following steps to enable the bucket inventory feature for the buckets in the OSS console.
- Log on to the OSS console.
- In the left-side navigation pane, click Buckets. On the Buckets page, click the name of the desired bucket.
- In the left-side navigation pane, go to the Bucket Inventory page.
- On the Bucket Inventory page, click Create Inventory.
- In the Create Inventory panel, configure the parameters. For more information, see Bucket inventory.
  Important
  - Make sure that the bucket that you select for Inventory Storage Bucket is the bucket for which you want to enable the bucket inventory feature.
  - If more than 10 billion objects are stored in a bucket, we recommend that you select Weekly for the Frequency parameter. If the number of objects that are stored in a bucket is less than or equal to 10 billion, you can select Daily for the Frequency parameter.
  - Make sure that you select Object Size and Storage Class for the Optional Fields parameter.
- Read and select I understand the terms and agree to authorize Alibaba Cloud OSS to access the resources in my buckets. Then, click OK.

A long period of time may be required to generate inventory lists for a large number of objects. The following structure shows the directories in which generated inventory lists are stored.

dest_bucket
 └──destination-prefix/
     └──src_bucket/
         └──inventory_id/
             ├──YYYY-MM-DDTHH-MMZ/
             │   ├──manifest.json
             │   └──manifest.checksum
             └──data/
                 └──745a29e3-bfaa-490d-9109-47086afcc****.csv.gz

- dest_bucket indicates the current bucket for which the bucket inventory feature is enabled.
- destination-prefix indicates the directory in which the generated inventory lists are stored.
- inventory_id indicates the names of the inventories that you configured.

In this example, the path1/doctor-hive-oss-test1/oss-manifest directory is generated.
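The Frequency recommendation and the directory layout described above can be sketched in Python. This is only an illustration: the function names `recommended_frequency` and `manifest_dir` are hypothetical helpers, not part of OSS or EMR Doctor, and the sketch assumes that all three path components are non-empty, as in the example in this topic.

```python
# Threshold from the recommendation in this topic:
# more than 10 billion objects -> Weekly, otherwise Daily is acceptable.
WEEKLY_THRESHOLD = 10_000_000_000


def recommended_frequency(object_count: int) -> str:
    """Return the Frequency value suggested in this topic (hypothetical helper)."""
    if object_count < 0:
        raise ValueError("object_count must be non-negative")
    return "Weekly" if object_count > WEEKLY_THRESHOLD else "Daily"


def manifest_dir(destination_prefix: str, src_bucket: str, inventory_id: str) -> str:
    """Join the components into destination-prefix/src_bucket/inventory_id.

    Assumption: all three components are non-empty, as in the example above.
    """
    # Strip stray slashes so that the joined path has exactly one "/" per level.
    parts = [p.strip("/") for p in (destination_prefix, src_bucket, inventory_id)]
    if not all(parts):
        raise ValueError("all three components must be non-empty")
    return "/".join(parts)


if __name__ == "__main__":
    print(recommended_frequency(25_000_000_000))
    # Prints: path1/doctor-hive-oss-test1/oss-manifest
    print(manifest_dir("path1", "doctor-hive-oss-test1", "oss-manifest"))
```

The path that `manifest_dir` returns is the value that the storage analysis configuration expects for the inventory directory of a bucket.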
Configure the storage analysis feature
- collect.oss.bucket: the name of the bucket whose objects you want to analyze.
- collect.oss.manifest.dir: the directory in which the generated inventory lists are stored. Configure the directory based on the directory structure that is described in the Enable the bucket inventory feature section of this topic. You only need to specify the destination-prefix/src_bucket/inventory_id/ directory, which is the path1/doctor-hive-oss-test1/oss-manifest directory in the preceding example.
Configuration for a single bucket
In this example, the doctor-hive-oss-test1 bucket is used. The following result shows the configuration for storage analysis.
collect.oss.bucket: doctor-hive-oss-test1
collect.oss.manifest.dir: path1/doctor-hive-oss-test1/oss-manifest
Configuration for multiple buckets
In this example, the doctor-hive-oss-test1 and doctor-hive-oss-test2 buckets are used. The following result shows the configuration for storage analysis.
collect.oss.bucket: doctor-hive-oss-test1,doctor-hive-oss-test2
collect.oss.manifest.dir: path1/doctor-hive-oss-test1/oss-manifest,path2/doctor-hive-oss-test2/test
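The two configuration values can also be generated from a mapping of bucket names to inventory directories, which helps keep the two comma-separated lists the same length and in the same order. The following Python sketch is illustrative only: `storage_analysis_config` is a hypothetical helper, and it assumes that the n-th directory in collect.oss.manifest.dir belongs to the n-th bucket in collect.oss.bucket, which the preceding example suggests but this topic does not state explicitly.

```python
from typing import Dict


def storage_analysis_config(buckets: Dict[str, str]) -> Dict[str, str]:
    """Build both settings from a {bucket name: manifest directory} mapping.

    Assumption: entries are paired by position, so both values are
    emitted in the insertion order of the mapping.
    """
    if not buckets:
        raise ValueError("at least one bucket is required")
    for name, directory in buckets.items():
        if not name or not directory:
            raise ValueError("bucket names and directories must be non-empty")
        # A comma inside a value would corrupt the comma-separated lists.
        if "," in name or "," in directory:
            raise ValueError("bucket names and directories must not contain commas")
    return {
        "collect.oss.bucket": ",".join(buckets.keys()),
        "collect.oss.manifest.dir": ",".join(buckets.values()),
    }


if __name__ == "__main__":
    config = storage_analysis_config({
        "doctor-hive-oss-test1": "path1/doctor-hive-oss-test1/oss-manifest",
        "doctor-hive-oss-test2": "path2/doctor-hive-oss-test2/test",
    })
    for key, value in config.items():
        print(f"{key}: {value}")
```

Passing a mapping with a single entry produces the single-bucket configuration shown earlier in this topic.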