Dump JindoFS metadata for offline analysis with Jindo SQL and Hive

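In E-MapReduce (EMR) 3.30.0 and later, you can use JindoFS in block storage mode to dump the metadata of an entire namespace to Object Storage Service (OSS) and analyze it directly with Jindo SQL.

Background information

With HDFS, offline analysis requires you to download the fsimage file in XML format. With JindoFS, you can analyze the metadata directly in the cloud without downloading it.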

Dump file system metadata to OSS

To dump the metadata of a namespace to OSS, run the following command:

jindo jfs -dumpMetadata <nsName>

<nsName> is the name of a namespace in block storage mode.

For example, to dump the metadata of the test-block namespace for analysis:

jindo jfs -dumpMetadata test-block

:bin/xxx jindo jfs -dumpMetadata test-block
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/xxx code/bigboot-3rdparty/bigboot/output/sdk/lib/bigboot-emr-cli.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/xxx code/bigboot-3rdparty/bigboot/output/sdk/lib/jindo-auditlog-full.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/xxx code/bigboot-3rdparty/bigboot/output/sdk/lib/jboot.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/xxx code/bigboot-3rdparty/bigboot/output/sdk/lib/jindo-distcp-2.7.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Successfully upload namespace metadata to OSS.

The following message indicates that the metadata was uploaded to OSS as JSON files:

Successfully upload namespace metadata to OSS.

Metadata upload path

Metadata is uploaded to the metadataDump subdirectory of the sysinfo directory configured for JindoFS.

For example, if namespace.sysinfo.oss.uri is set to oss://abc/, JindoFS uploads the files to the oss://abc/metadataDump subdirectory.

Parameter	Description
namespace.sysinfo.oss.uri	The destination OSS bucket and path.
namespace.sysinfo.oss.endpoint	The endpoint of the OSS bucket. Cross-region access is supported.
namespace.sysinfo.oss.access.key	The AccessKey ID of your Alibaba Cloud account.
namespace.sysinfo.oss.access.secret	The AccessKey Secret of your Alibaba Cloud account.

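Batch information: Because the metadata of a distributed file system changes over time, each analysis uses the metadata snapshot taken when the dump command ran. Each dump command creates a new subdirectory named with a batch number generated from the upload timestamp, and this subdirectory serves as the root directory of that upload. A new upload therefore never overwrites an earlier one, and you can delete old data as needed.

The metadata upload path uses the following format: oss://{Bucket}/sysinfo/metadataDump/{namespace}/{batch_number}/. Example: oss://emr-xxx-test/sysinfo/metadataDump/test-block/2020_09_14_18_58_16/.

sysinfo/metadataDump is the base path for metadata system information.
{namespace} is the namespace.
{batch_number} is the batch number.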
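The path format above can be sketched in a few lines of Python. This is a minimal illustration, not part of JindoFS: the function names are hypothetical, the timestamp layout is inferred from the example batch number 2020_09_14_18_58_16, and the time zone of the batch number is not stated in this document, so the parsed value is left as a naive datetime.

```python
from datetime import datetime

# Assumed layout, inferred from the example batch number 2020_09_14_18_58_16.
BATCH_FORMAT = "%Y_%m_%d_%H_%M_%S"


def dump_path(bucket: str, namespace: str, batch_number: str) -> str:
    """Build the upload path in the documented format (hypothetical helper)."""
    return f"oss://{bucket}/sysinfo/metadataDump/{namespace}/{batch_number}/"


def parse_batch_number(batch_number: str) -> datetime:
    """Convert a batch number to a naive datetime (time zone unknown)."""
    return datetime.strptime(batch_number, BATCH_FORMAT)


def split_dump_path(path: str):
    """Split an upload path into (bucket, namespace, batch datetime)."""
    prefix = "oss://"
    if not path.startswith(prefix):
        raise ValueError(f"not an OSS path: {path}")
    parts = path[len(prefix):].strip("/").split("/")
    if len(parts) != 5 or parts[1:3] != ["sysinfo", "metadataDump"]:
        raise ValueError(f"unexpected metadata dump path: {path}")
    bucket, _, _, namespace, batch_number = parts
    return bucket, namespace, parse_batch_number(batch_number)


example = dump_path("emr-xxx-test", "test-block", "2020_09_14_18_58_16")
print(example)
print(split_dump_path(example))
```

Because the batch number is zero-padded, sorting batch directory names as plain strings also orders them chronologically, which can help when deciding which old dumps to delete.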

Metadata schema

JindoFS uploads file system metadata to OSS as JSON files. The schema is as follows:

{
  "type":"string",          /* inode type: FILE or DIRECTORY */
  "id": "string",            /* inode ID */
  "parentId" :"string",         /* parent node ID */
  "name":"string",         /* inode name */
  "size": "int",         /* inode size (bigint) */
  "permission":"int",         /* permission, stored as an integer */
  "owner":"string",          /* owner name */
  "ownerGroup":"string",     /* owner group name */
  "mtime":"int",              /* inode modification time (bigint) */
  "atime":"int",              /* inode last access time (bigint) */
  "attributes":"string",       /* file-related attributes */
  "state":"string",            /* inode state */
  "storagePolicy":"string",    /* storage policy */
  "etag":"string"           /* etag */
}
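As a rough illustration of how one record in this schema might be read, the Python sketch below decodes a single JSON object. Several points are assumptions rather than documented behavior: that each record is one JSON object, that mtime is in epoch milliseconds (the sample values later in this document, such as 1599545017632, are consistent with that), and that the permission integer holds the usual nine POSIX mode bits (so 420 is octal 644). The record itself and the helper names are hypothetical.

```python
import json
from datetime import datetime, timezone


def permission_text(mode: int) -> str:
    """Render the low nine permission bits as an rwx string."""
    flags = "rwxrwxrwx"
    return "".join(
        flag if mode & (1 << (8 - i)) else "-" for i, flag in enumerate(flags)
    )


def describe_inode(line: str) -> dict:
    """Decode one JSON record into a few human-readable fields."""
    record = json.loads(line)
    mode = int(record["permission"])
    return {
        "name": record["name"],
        # Compare case-insensitively: the schema says FILE/DIRECTORY,
        # while the sample output in this document shows File/Directory.
        "is_directory": str(record["type"]).upper() == "DIRECTORY",
        "permission_octal": format(mode, "o"),
        "permission_text": permission_text(mode),
        # Assumption: mtime is epoch milliseconds; sub-second part is dropped.
        "modified": datetime.fromtimestamp(
            int(record["mtime"]) // 1000, tz=timezone.utc
        ),
    }


# Hypothetical record shaped like the schema above.
sample_line = json.dumps({
    "type": "File", "id": "5", "parentId": "4",
    "name": "/uttest/oss/file1", "size": 0, "permission": 420,
    "owner": "caojie", "ownerGroup": "staff",
    "mtime": 1599545017632, "atime": 1599545017632,
    "attributes": "", "state": "Finalized",
    "storagePolicy": "WARM", "etag": "",
})

info = describe_inode(sample_line)
print(info["name"], info["permission_octal"], info["permission_text"], info["modified"])
```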

Analyze metadata with Jindo SQL

Run the following command to start Jindo SQL:

jindo sql

After startup, the Spark master is displayed as yarn. At the jindo-sql prompt, you can then run commands to view table information.

Query the analysis tables.
- Use show tables to list the tables available for analysis. Jindo SQL provides the built-in audit_log and fs_image tables for analyzing audit information and metadata information.
- Use show partitions fs_image to view the partitions of the fs_image table. Each partition corresponds to the data from one run of the jindo jfs -dumpMetadata command.
  Example:
```
jindo-sql> show partitions fs_image;
partition
namespace=xxx/datetime=2020_10_20_10_47_14
namespace=xxx/datetime=2020_10_20_10_50_36
namespace=xxx/datetime=2020_10_20_10_52_06
Time taken: 0.045 seconds, Fetched 3 row(s)
```
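When several dumps exist, a query usually targets one of them, often the most recent. The Python sketch below shows one way to pick the latest partition from partition strings like those above and to format a matching filter. The helper names are hypothetical, and the approach assumes the datetime values keep the zero-padded layout shown, so that plain string comparison matches chronological order.

```python
def parse_partition(spec: str) -> dict:
    """Turn 'namespace=xxx/datetime=...' into a dict of partition keys."""
    return dict(part.split("=", 1) for part in spec.split("/"))


def latest_partition(specs, namespace: str):
    """Return the newest partition of a namespace, or None if there is none."""
    candidates = [
        p for p in map(parse_partition, specs) if p.get("namespace") == namespace
    ]
    if not candidates:
        return None
    # Zero-padded timestamps sort chronologically as strings.
    return max(candidates, key=lambda p: p["datetime"])


def partition_filter(partition: dict) -> str:
    """Format a WHERE-clause fragment for the chosen partition."""
    return 'namespace="{namespace}" and datetime="{datetime}"'.format(**partition)


specs = [
    "namespace=xxx/datetime=2020_10_20_10_47_14",
    "namespace=xxx/datetime=2020_10_20_10_50_36",
    "namespace=xxx/datetime=2020_10_20_10_52_06",
]
newest = latest_partition(specs, "xxx")
print(partition_filter(newest))
```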

Query and analyze the metadata.

Jindo SQL uses Spark SQL syntax. You can query and analyze the fs_image table with SQL.

Example:

[root@emr-worker-2 hadoop]# jindo sql
Spark master: yarn, Application Id: app
jindo-sql> show tables;
database  tableName        isTemporary
default   audit_log        false
default   audit_log_source false
default   fs_image         false
Time taken: 0.345 seconds, Fetched 3 row(s)
jindo-sql> select * from fs_image limit 10;
atime  ctime  etag  id  mtime  name  owner  ownerGroup  parentId  permission  size  state  storagePolicy  type  name
0      5855433 489  0  7311076005051899448  1603084070081  /tpcds/orc/5000/web_returns/wr_returned_date_sk=2450819  root  xxx  334790833296
0      5855433 489  0  16534448041906675495  1603084071350  /tpcds/orc/5000/web_returns/wr_returned_date_sk=2450820  root  xxx  334790833296
...
Time taken: 6.764 seconds, Fetched 10 row(s)

Jindo SQL adds two columns, namespace and datetime, which correspond to the namespace name and the metadata upload timestamp.

For example, to count the directories in a namespace based on a specific metadata dump:

jindo-sql>  select count(*) from fs_image where type = "Directory" and namespace="kugou" and datetime="2020_10_20_10_47_14";
count(1)
11837
Time taken: 6.852 seconds, Fetched 1 row(s)
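For a small sample of records, the same kind of count can be cross-checked outside the cluster. The Python sketch below is only an illustration under assumptions: the records are hypothetical dictionaries shaped like the metadata schema, the IDs are made up, and the type comparison is case-insensitive because this document shows both DIRECTORY and Directory spellings.

```python
from collections import Counter


def count_by_type(records) -> Counter:
    """Count inodes per type, ignoring letter case."""
    return Counter(str(r["type"]).upper() for r in records)


def children_per_parent(records) -> Counter:
    """Count how many inodes list each parentId as their parent."""
    return Counter(r["parentId"] for r in records)


# Hypothetical records; IDs are made up for illustration.
records = [
    {"type": "Directory", "id": "3", "parentId": "0", "name": "/uttest"},
    {"type": "Directory", "id": "4", "parentId": "3", "name": "/uttest/oss"},
    {"type": "Directory", "id": "6", "parentId": "4", "name": "/uttest/oss/dir"},
    {"type": "File", "id": "5", "parentId": "4", "name": "/uttest/oss/file1"},
    {"type": "File", "id": "7", "parentId": "6", "name": "/uttest/oss/dir/file2"},
]

type_counts = count_by_type(records)
print("directories:", type_counts["DIRECTORY"])
print("busiest parent:", children_per_parent(records).most_common(1))
```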

Analyze metadata with Hive

Create a Hive table.

In Hive, create an external table for querying the metadata. You can use the following DDL statement as a template to create the file system metadata table.

CREATE EXTERNAL TABLE `table_name` 
(`type` string,
 `id` string,
 `parentId` string,
 `name` string,
 `size` bigint, 
 `permission` int,
 `owner` string,
 `ownerGroup` string,
 `mtime` bigint, 
 `atime` bigint,
 `attr` string,
 `state` string,
 `storagePolicy` string,
 `etag` string) 
 ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' 
 STORED AS TEXTFILE 
 LOCATION 'OSS_PATH_TO_UPLOADED_FILES';

Analyze data with Hive.

After you create the Hive table, you can analyze the metadata with Hive SQL.

select * from table_name limit 200;

Example:

hive> select * from inode_metadata_test8 limit 100;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_xxx
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1xxx       , Tracking URL = http://emr-heade xxx          :20888/proxy/applicxxx
Kill Command = /usr/lib/hadoop-current/bin/hadoop job  -kill job_1599xxx
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-09-08 14:57:26,112 Stage-1 map = 0%,  reduce = 0%
2020-09-08 14:57:31,263 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.22 sec
MapReduce Total cumulative CPU time: 1 seconds 220 msec
Ended Job = job_xxx
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.22 sec   HDFS Read: 6867 HDFS Write: 1524 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 220 msec
OK
Directory      1127433438684721971 4   1127433438684721971 3   /uttest/oss       0     511     caojie  staff   1599545017615   1599545017615              Finalized       WARM
Directory      1127433438684721971 9   1127433438684721971 3   /uttest/oss2      0     511     caojie  staff   1599545017654   1599545017654              Finalized       WARM
Directory      1127433438684721971 6   1127433438684721971 4   /uttest/oss/dir   0     511     caojie  staff   1599545017636   1599545017636              Finalized       WARM
File           1127433438684721971 5   1127433438684721971 4   /uttest/oss/file1 0     420     caojie  staff   1599545017632   1599545017632              Finalized       WARM
File           1127433438684721971 7   1127433438684721971 6   /uttest/oss/dir/file2 0  420     caojie  staff   1599545017642   1599545017642              Finalized       WARM
File           1127433438684721971 8   1127433438684721971 6   /uttest/oss/dir/file3 0  420     caojie  staff   1599545017651   1599545017651              Finalized       WARM
Directory      1127433438684721972 0   1127433438684721971 9   /uttest/oss2/dir  0     511     caojie  staff   1599545017654   1599545017654              Finalized       WARM
File           1127433438684721972 1   1127433438684721972 0   /uttest/oss2/dir/file2 0 420     caojie  staff   1599545017658   1599545017658              Finalized       WARM
File           1127433438684721972 2   1127433438684721972 0   /uttest/oss2/dir/file3 0 420     caojie  staff   1599545017666   1599545017666              Finalized       WARM
Directory      1127433438684721971 3   1767282355743385190 5   /uttest 0         511     caojie  staff   1599545017615   1599545017615              Finalized       WARM
Time taken: 10.734 seconds, Fetched: 10 row(s)
hive>