Optimize OSS Costs with JindoTable Tiered Storage

Prerequisites

Before you begin, ensure that you have:

Java Development Kit (JDK) 8 installed on your on-premises machine
An E-MapReduce (EMR) cluster of version 3.30.0 or later. For more information, see Create a cluster.

How it works

JindoTable commands follow a storage management workflow:

Run -accessStat to identify which tables or partitions are accessed most frequently.
Run -cache to pull hot data onto local disks, or -archive to move cold data to a lower-cost storage class.
Run -status to verify the current storage state of a table or partition.
Run -optimize to optimize the data organization of tables at the storage layer.

Important: Specify tables in the format database.table. Specify partitions as comma-separated key-value pairs in the format partitionCol1=val1,partitionCol2=val2,...
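The table and partition formats can be checked before a value is passed to any command. The following is a minimal sketch in POSIX shell. The helpers `is_table_name` and `is_partition_spec` are hypothetical and not part of JindoTable. They check only the shape of the string, not whether the table or partition exists.

```shell
# Hypothetical helpers (not part of JindoTable) that check argument shapes.

# Succeeds if $1 looks like database.table: exactly one dot, no empty side.
is_table_name() {
  case "$1" in
    *.*.*|.*|*.|*[[:space:]]*) return 1 ;;  # extra dot, empty side, or whitespace
    *.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds if $1 is a comma-separated list of non-empty col=val pairs.
is_partition_spec() {
  _rest=$1
  while :; do
    _pair=${_rest%%,*}                      # text before the first comma
    case "$_pair" in
      =*|*=|*[[:space:]]*) return 1 ;;      # empty column, empty value, or whitespace
      *=*) ;;                               # looks like col=val
      *) return 1 ;;                        # no "=" at all, or an empty pair
    esac
    if [ "$_pair" = "$_rest" ]; then
      return 0                              # no comma left: every pair was checked
    fi
    _rest=${_rest#*,}                       # drop the pair that was just checked
  done
}

if is_table_name db1.t1 && is_partition_spec date=2020-03-16,category=1; then
  echo "arguments look well-formed"
fi
```

A script could call these checks first and stop with an error message instead of sending a malformed value to `jindo table`.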

Command reference

JindoTable provides 11 commands. In the syntax descriptions, {-flag} indicates a required parameter and [-flag] indicates an optional parameter.

Command	Description
`-accessStat`	Query the most-accessed tables or partitions in a time range
`-cache`	Cache table or partition data to local disks
`-uncache`	Remove cached data from local disks
`-archive`	Move data to Archive or Infrequent Access (IA) storage class
`-unarchive`	Restore archived data to Standard storage class
`-status`	View the storage status of a table or partition
`-optimize`	Optimize table data organization at the storage layer
`-showTable`	List all partitions in a partitioned table, or show storage details of a non-partitioned table
`-showPartition`	Show storage details of a specific partition
`-listTables`	List all tables in a database
`-dumpmc`	Dump a MaxCompute table to an EMR cluster or Object Storage Service (OSS)

-accessStat

Query the tables or partitions with the most access records in a specified time range.

Syntax

jindo table -accessStat {-d} <days> {-n} <topNums>

Parameters

Parameter	Required	Description
`-d <days>`	Yes	Number of days to look back. Must be a positive integer. If set to `1`, all access records from 00:00 local time on the current day to the current time are returned.
`-n <topNums>`	Yes	Number of top results to return. Must be a positive integer.

Example

Return the top 20 most-accessed tables or partitions in the last 7 days:

jindo table -accessStat -d 7 -n 20
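Because both `-d` and `-n` must be positive integers, a wrapper script can reject bad values before JindoTable is called. The following is a minimal sketch. The helper names `is_positive_int` and `access_stat_cmd` are hypothetical, and the function only prints the command line instead of running it.

```shell
# Hypothetical helpers (not part of JindoTable).

# Succeeds if $1 consists only of digits and is greater than zero.
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit such as "-"
  esac
  case "$1" in
    *[1-9]*) return 0 ;;       # at least one non-zero digit
    *) return 1 ;;             # only zeros, such as 0 or 000
  esac
}

# Prints the -accessStat command for <days> and <topNums> without running it.
access_stat_cmd() {
  if is_positive_int "$1" && is_positive_int "$2"; then
    printf 'jindo table -accessStat -d %s -n %s\n' "$1" "$2"
  else
    printf 'days and topNums must be positive integers\n' >&2
    return 1
  fi
}

access_stat_cmd 7 20   # prints: jindo table -accessStat -d 7 -n 20
```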

-cache

Cache data of a table or partition from OSS or JindoFileSystem (JindoFS) to local disks.

Syntax

jindo table -cache {-t} <dbName.tableName> [-p] <partitionSpec> [-pin]

Parameters

Parameter	Required	Description
`-t <dbName.tableName>`	Yes	Table to cache. Format: `database.table`.
`-p <partitionSpec>`	No	Partition to cache. Format: `partitionCol1=val1,partitionCol2=val2,...`. If omitted, the entire table is cached.
`-pin`	No	Pin the cached data. If cache space is insufficient, the pinned data is not deleted where possible.

Example

Cache the date=2020-03-16 partition of db1.t1:

jindo table -cache -t db1.t1 -p date=2020-03-16

-uncache

Remove cached data of a table or partition from local disks.

Syntax

jindo table -uncache {-t} <dbName.tableName> [-p] <partitionSpec>

Parameters

Parameter	Required	Description
`-t <dbName.tableName>`	Yes	Table whose cache to remove. Format: `database.table`.
`-p <partitionSpec>`	No	Partition whose cache to remove. Format: `partitionCol1=val1,partitionCol2=val2,...`. If omitted, the entire table's cache is removed.

Examples

Remove the cached data of the entire db1.t2 table:

jindo table -uncache -t db1.t2

Remove the cached data of the date=2020-03-16,category=1 partition of db1.t1:

jindo table -uncache -t db1.t1 -p date=2020-03-16,category=1

-archive

Move data of a table or partition to a lower-cost storage class. The default target is the Archive storage class. Add -i to use Infrequent Access (IA) instead.

Syntax

jindo table -archive [-a|-i] {-t} <dbName.tableName> [-p] <partitionSpec>

Parameters

Parameter	Required	Description
`-a`	No	Archive to the Archive storage class (default behavior).
`-i`	No	Archive to the Infrequent Access (IA) storage class instead of Archive.
`-t <dbName.tableName>`	Yes	Table to archive. Format: `database.table`.
`-p <partitionSpec>`	No	Partition to archive. Format: `partitionCol1=val1,partitionCol2=val2,...`. If omitted, the entire table is archived.

Example

Archive the date=2020-10-12 partition of db1.t1:

jindo table -archive -t db1.t1 -p date=2020-10-12
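To archive several partitions of the same table, one option is to generate one command per partition value. The sketch below assumes a hypothetical helper, `archive_cmds`, that reads partition values from standard input and prints the commands without running them. Its third argument (`archive` or `ia`) is an illustrative convention that maps to the documented `-a` and `-i` flags.

```shell
# Hypothetical helper (not part of JindoTable).
# Usage: archive_cmds <dbName.tableName> <partitionCol> [archive|ia] < values
archive_cmds() {
  _table=$1
  _col=$2
  _class=${3:-archive}
  case "$_class" in
    archive) _flag=-a ;;   # Archive storage class (the documented default)
    ia)      _flag=-i ;;   # Infrequent Access storage class
    *) printf 'unknown storage class: %s\n' "$_class" >&2; return 1 ;;
  esac
  while IFS= read -r _val; do
    [ -n "$_val" ] || continue   # skip blank lines
    printf 'jindo table -archive %s -t %s -p %s=%s\n' \
      "$_flag" "$_table" "$_col" "$_val"
  done
}

# Print the commands for three daily partitions of db1.t1.
printf '%s\n' 2020-10-10 2020-10-11 2020-10-12 | archive_cmds db1.t1 date
```

Printing the commands first makes it possible to review them before they are run, for example by piping the reviewed output to `sh`.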

-unarchive

Restore archived data to the Standard storage class. Add -o to temporarily restore archived objects instead, or -i to change them to the Infrequent Access (IA) storage class.

Syntax

jindo table -unarchive [-o|-i] {-t} <dbName.tableName> [-p] <partitionSpec>

Parameters

Parameter	Required	Description
`-o`	No	Temporarily restore an archived object.
`-i`	No	Change an archived object to IA storage class.
`-t <dbName.tableName>`	Yes	Table to unarchive. Format: `database.table`.
`-p <partitionSpec>`	No	Partition to unarchive. Format: `partitionCol1=val1,partitionCol2=val2,...`. If omitted, the entire table is unarchived.

Examples

Temporarily restore the date=2020-03-16,category=1 partition of db1.t1 from Archive:

jindo table -unarchive -o -t db1.t1 -p date=2020-03-16,category=1

Change the entire db1.t2 table from Archive to IA:

jindo table -unarchive -i -t db1.t2

-status

View the data storage status of a table or partition.

Syntax

jindo table -status {-t} <dbName.tableName> [-p] <partitionSpec>

Parameters

Parameter	Required	Description
`-t <dbName.tableName>`	Yes	Table to inspect. Format: `database.table`.
`-p <partitionSpec>`	No	Partition to inspect. Format: `partitionCol1=val1,partitionCol2=val2,...`. If omitted, the status of the entire table is returned.

Examples

View the storage status of the entire db1.t2 table:

jindo table -status -t db1.t2

View the storage status of the date=2020-03-16 partition of db1.t1:

jindo table -status -t db1.t1 -p date=2020-03-16
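In the workflow described in How it works, `-status` follows a `-cache` or `-archive` call to confirm the result. The following is a minimal sketch of that pairing. The helper `tier_plan` and its `hot`/`cold` labels are hypothetical, and the function only prints the two commands.

```shell
# Hypothetical helper (not part of JindoTable).
# Usage: tier_plan <hot|cold> <dbName.tableName> <partitionSpec>
# Prints the tiering command, then the -status command that checks it.
tier_plan() {
  _tier=$1
  _table=$2
  _part=$3
  case "$_tier" in
    hot)  _action=-cache ;;     # pull frequently accessed data onto local disks
    cold) _action=-archive ;;   # move rarely accessed data to a lower-cost class
    *) printf 'tier must be hot or cold\n' >&2; return 1 ;;
  esac
  printf 'jindo table %s -t %s -p %s\n' "$_action" "$_table" "$_part"
  printf 'jindo table -status -t %s -p %s\n' "$_table" "$_part"
}

tier_plan hot db1.t1 date=2020-03-16
```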

-optimize

Optimize the data organization of a table at the storage layer to improve query performance.

Syntax

jindo table -optimize {-t} <dbName.tableName>

Parameters

Parameter	Required	Description
`-t <dbName.tableName>`	Yes	Table to optimize. Format: `database.table`.

Example

Optimize the data organization of db1.t1:

jindo table -optimize -t db1.t1

-showTable

Display all partitions in a partitioned table, or show the data storage details of a non-partitioned table.

Syntax

jindo table -showTable {-t} <dbName.tableName>

Parameters

Parameter	Required	Description
`-t <dbName.tableName>`	Yes	Table to display. Format: `database.table`.

Example

Display all partitions in db1.t1:

jindo table -showTable -t db1.t1

-showPartition

Display the data storage details of a specific partition.

Syntax

jindo table -showPartition {-t} <dbName.tableName> [-p] <partitionSpec>

Parameters

Parameter	Required	Description
`-t <dbName.tableName>`	Yes	Table that contains the partition. Format: `database.table`.
`-p <partitionSpec>`	No	Partition to display. Format: `partitionCol1=val1,partitionCol2=val2,...`.

Example

Display the storage details of the date=2020-10-12 partition in db1.t1:

jindo table -showPartition -t db1.t1 -p date=2020-10-12

-listTables

List all tables in a database.

Syntax

jindo table -listTables [-db] <dbName>

Parameters

Parameter	Required	Description
`-db <dbName>`	No	Database to list tables from. If omitted, tables in the default database are listed.

Examples

List all tables in the default database:

jindo table -listTables

List all tables in db1:

jindo table -listTables -db db1

-dumpmc

Dump a MaxCompute table to an EMR cluster or OSS. Supported output formats are CSV and TFRECORD.

Syntax

jindo table -dumpmc {-i} <accessId> {-k} <accessKey> {-m} <numMaps> {-t} <tunnelUrl> {-project} <projectName> {-table} <tableName> [-p] <partitionSpec> {-f} <csv|tfrecord> {-o} <outputPath>

Parameters

Parameter	Required	Description
`-i <accessId>`	Yes	AccessKey ID of your Alibaba Cloud account.
`-k <accessKey>`	Yes	AccessKey secret of your Alibaba Cloud account.
`-m <numMaps>`	Yes	Number of map tasks.
`-t <tunnelUrl>`	Yes	VPC Tunnel endpoint of MaxCompute.
`-project <projectName>`	Yes	Name of the MaxCompute project.
`-table <tableName>`	Yes	Name of the MaxCompute table.
`-p <partitionSpec>`	No	Partition to dump. Example: `pt=xxx`. If the table has multiple partition columns, separate the key-value pairs with commas: `pt=xxx,dt=xxx`.
`-f <csv|tfrecord>`	Yes	Output file format. Valid values: `csv`, `tfrecord`.
`-o <outputPath>`	Yes	Destination path. Use a local EMR path (for example, `/tmp/output`) or an OSS path (for example, `oss://bucket/path`).

Examples

Dump a MaxCompute table to an EMR cluster in TFRECORD format:

jindo table -dumpmc -m 10 -project mctest_project -table t1 -t http://dt.xxx.maxcompute.aliyun-inc.com -k xxxxxxxxx -i XXXXXX -o /tmp/outputtf1 -f tfrecord

Dump a MaxCompute table to OSS in CSV format:

jindo table -dumpmc -m 10 -project mctest_project -table t1 -t http://dt.xxx.maxcompute.aliyun-inc.com -k xxxxxxxxx -i XXXXXX -o oss://bucket1/tmp/outputcsv -f csv
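Because `-dumpmc` takes many required parameters, a small wrapper can keep the calls consistent. The following is a minimal sketch, not an official tool. The function name `dump_mc_table` and the variables `MC_ACCESS_ID`, `MC_ACCESS_KEY`, `NUM_MAPS`, and `JINDO_BIN` are assumptions made for illustration. Only the flags passed to `jindo table -dumpmc` come from the parameter table above.

```shell
# Hypothetical wrapper (not part of JindoTable).
# Usage: dump_mc_table <projectName> <tableName> <tunnelUrl> <outputPath> <csv|tfrecord>
# Reads the AccessKey pair from MC_ACCESS_ID and MC_ACCESS_KEY (assumed names).
dump_mc_table() {
  if [ -z "${MC_ACCESS_ID:-}" ] || [ -z "${MC_ACCESS_KEY:-}" ]; then
    printf 'MC_ACCESS_ID and MC_ACCESS_KEY must be set\n' >&2
    return 1
  fi
  case "$5" in
    csv|tfrecord) ;;   # the two documented output formats
    *) printf 'format must be csv or tfrecord\n' >&2; return 1 ;;
  esac
  # JINDO_BIN defaults to jindo; NUM_MAPS defaults to 10 map tasks.
  "${JINDO_BIN:-jindo}" table -dumpmc -m "${NUM_MAPS:-10}" \
    -project "$1" -table "$2" -t "$3" \
    -k "$MC_ACCESS_KEY" -i "$MC_ACCESS_ID" \
    -o "$4" -f "$5"
}

# Example call (commented out because it needs a real cluster and credentials):
# dump_mc_table mctest_project t1 http://dt.xxx.maxcompute.aliyun-inc.com /tmp/outputtf1 tfrecord
```

Setting `JINDO_BIN=echo` prints the arguments instead of starting the dump, which can help when checking a command before it is run.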