EMR Doctor enables you to analyze data stored in OSS. By activating the storage analysis feature, you gain deeper insights into the usage and health of your OSS storage resources, leading to more effective data governance.
Background information
OSS offers a bucket inventory feature that, when enabled, allows OSS to periodically generate inventory lists for a bucket. These lists contain details about the objects, including their number and size. EMR Doctor utilizes these inventory lists to assess the usage and health of data within the bucket, along with its relationship to Hive storage resources.
To use the storage analysis feature, you must first activate the bucket inventory feature for a bucket. For more information, see the referenced document.
Precautions
Enabling the bucket inventory feature may incur additional fees. For details, see the referenced document.
Enable the bucket inventory feature
Follow these steps in the OSS console to enable the bucket inventory feature for the bucket that you want to analyze. If your cluster uses multiple OSS buckets and you want to analyze the storage resources in all of them, repeat the steps for each bucket.
Log on to the OSS console.
In the left-side navigation pane, click Buckets. On the Buckets page, find and click the desired bucket.
In the left-side navigation pane, select .
On the Bucket Inventory page, click Create Inventory.
In the Set Inventory Report Rule panel, configure the necessary parameters. For more information, see the referenced document.
Important
- Make sure that the Inventory Bucket is the same as the bucket for which you are enabling the inventory feature.
- If your OSS stores many files (over 10 billion), consider setting the Inventory Report Export Cycle to weekly. For fewer files, a daily cycle may suffice.
- Ensure that the Optional Information For Inventory Content includes both Object Size and Storage Class.
Select I Acknowledge And Agree To Grant Alibaba Cloud OSS Service Permission To Access Bucket Resources, and then click OK.
Configure the storage analysis feature
The storage analysis feature depends on the inventory lists created by the bucket inventory feature. Configure the necessary parameters on the configuration page of the TAIHAODOCTOR service in the EMR console. For detailed steps and additional configurations, see EMR Doctor configuration instructions.
Configuration item | Description |
collect.oss.bucket | The name of the OSS bucket to be analyzed. |
collect.oss.manifest.dir | The directory in which the generated inventory lists are stored. The format is: inventory_path/inventory_bucket/inventory_name |
For instance, suppose the configuration parameters for your OSS bucket inventory are as follows: the inventory report storage path (inventory_path) is reports, the name of the OSS bucket to be analyzed (inventory_bucket) is my-bucket, and the inventory name (inventory_name) is my-inventory. Then the directory where the inventory lists are stored (collect.oss.manifest.dir) would be: reports/my-bucket/my-inventory.
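The path construction described above can be sketched in a few lines of Python. This is only an illustration: the helper name manifest_dir is hypothetical and is not part of EMR Doctor or any OSS SDK, and trimming of leading and trailing slashes is an assumption made here for convenience.

```python
def manifest_dir(inventory_path: str, inventory_bucket: str, inventory_name: str) -> str:
    """Join the three inventory settings into a collect.oss.manifest.dir value.

    Hypothetical helper for illustration only. Leading and trailing slashes
    are trimmed from each part (an assumption, not documented behavior).
    """
    parts = [inventory_path, inventory_bucket, inventory_name]
    cleaned = [part.strip("/") for part in parts]
    if not all(cleaned):
        raise ValueError("inventory_path, inventory_bucket and inventory_name must be non-empty")
    return "/".join(cleaned)


# Values taken from the example in the text.
print(manifest_dir("reports", "my-bucket", "my-inventory"))
# prints "reports/my-bucket/my-inventory"
```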
If your cluster uses multiple buckets and you have activated the inventory feature for each, list the bucket names in collect.oss.bucket and the corresponding inventory directories in collect.oss.manifest.dir, separated by commas. Make sure the order of the bucket names matches the order of the inventory directories.
Single bucket configuration example
For a bucket named my-bucket, the storage analysis configuration would be as follows.
collect.oss.bucket: my-bucket
collect.oss.manifest.dir: reports/my-bucket/my-inventory
Multiple bucket configuration example
For buckets named my-bucket1 and my-bucket2, the storage analysis configuration would be as follows.
collect.oss.bucket: my-bucket1,my-bucket2
collect.oss.manifest.dir: reports1/my-bucket1/my-inventory1,reports2/my-bucket2/my-inventory2
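Because the two comma-separated values must stay in the same order, it can help to generate them from a single list of pairs rather than typing them by hand. The following is a minimal sketch under that assumption. The function names build_storage_analysis_config and pair_buckets_with_dirs are hypothetical and are not part of EMR Doctor; only the two configuration keys come from this document.

```python
from typing import Dict, List, Tuple


def build_storage_analysis_config(pairs: List[Tuple[str, str]]) -> Dict[str, str]:
    """Build both configuration values from (bucket, manifest_dir) pairs.

    Hypothetical helper: keeping each bucket next to its directory means the
    two comma-separated lists cannot drift out of order.
    """
    if not pairs:
        raise ValueError("at least one (bucket, manifest_dir) pair is required")
    for bucket, directory in pairs:
        if not bucket or not directory:
            raise ValueError("bucket names and directories must be non-empty")
        if "," in bucket or "," in directory:
            raise ValueError("values must not contain commas")
    return {
        "collect.oss.bucket": ",".join(bucket for bucket, _ in pairs),
        "collect.oss.manifest.dir": ",".join(directory for _, directory in pairs),
    }


def pair_buckets_with_dirs(config: Dict[str, str]) -> Dict[str, str]:
    """Split the two values again and check that their lengths match."""
    buckets = config["collect.oss.bucket"].split(",")
    dirs = config["collect.oss.manifest.dir"].split(",")
    if len(buckets) != len(dirs):
        raise ValueError("number of buckets and number of directories differ")
    return dict(zip(buckets, dirs))


# Values taken from the multiple bucket example in the text.
config = build_storage_analysis_config([
    ("my-bucket1", "reports1/my-bucket1/my-inventory1"),
    ("my-bucket2", "reports2/my-bucket2/my-inventory2"),
])
for key, value in config.items():
    print(f"{key}: {value}")
```

Running pair_buckets_with_dirs on an existing pair of values is a quick way to catch a missing or extra entry before you save the configuration.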