EMR V2.4.0 and later versions support unified metadata management. In versions earlier than V2.4.0, each cluster stores Hive metadata in a local MySQL database. EMR V2.4.0 and later versions provide a unified, highly reliable Hive metadatabase instead.

Overview

Hive metadatabases

The following features are supported by unified metadata management:

  • Persistent metadata storage

    In earlier versions, metadata is stored in a MySQL database that is deployed on the cluster and is deleted when the cluster is released. This matters most for pay-as-you-go clusters, which are typically released when they are no longer needed: to retain the metadata, you must log on to the cluster and export the metadata manually before the release. Unified metadata management provides persistent metadata storage, which eliminates the need for manual metadata export.

  • Separation of computing from storage

    EMR stores data in Alibaba Cloud OSS, which reduces usage costs significantly for large volumes of data. EMR clusters are mainly used as computing resources and can be released if they are no longer needed. You do not need to migrate metadata before cluster release because data is stored in OSS.

  • Data sharing
    If all data is stored in OSS, all clusters can access data without the need to migrate or restructure metadata. This way, EMR clusters that process different services can directly share data.
    Note: Before unified metadata management was supported, metadata was stored in the local MySQL database of each cluster and was cleared when the cluster was released. Unified metadata management does not clear the metadata of released clusters. Therefore, before you delete data in OSS or in the HDFS of a cluster, or before you release a cluster, delete the corresponding metadata first, that is, drop the tables and databases that describe the data. This prevents dirty metadata from building up in the metadatabase.

Limits

Currently, a metadatabase can be accessed only by using a public IP address, which means that the cluster must have a public IP address. Do not change the public IP address after the cluster is created. Otherwise, the database whitelist becomes invalid and the cluster can no longer access the metadatabase.

The table management feature can be used only after the unified metadatabase feature is enabled. A local metadatabase does not support table management in the console. You can use the Hue tool on a cluster for table management.

EMR implements unified metadata management by using ApsaraDB for RDS in the background. This built-in metadatabase is suitable only for users who require a small storage capacity. If you need a large storage capacity, we recommend that you use your own ApsaraDB for RDS instance for unified metadata management. The default limits are as follows:

  • Total capacity: 200 MiB
  • Maximum number of queries per hour: 720,000
  • Maximum number of updates per hour: 144,000
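To get a feel for these hourly quotas, the following sketch converts them into average per-second rates and checks a workload against them. The function name and the workload figures are hypothetical; only the three limit values come from the list above.

```python
# Default limits quoted in the list above.
CAPACITY_LIMIT_MIB = 200
QUERIES_PER_HOUR_LIMIT = 720_000
UPDATES_PER_HOUR_LIMIT = 144_000


def fits_default_limits(size_mib, queries_per_hour, updates_per_hour):
    """Return True if a (hypothetical) workload stays within all three default limits."""
    return (
        size_mib <= CAPACITY_LIMIT_MIB
        and queries_per_hour <= QUERIES_PER_HOUR_LIMIT
        and updates_per_hour <= UPDATES_PER_HOUR_LIMIT
    )


# Average sustained rates implied by the hourly quotas.
avg_queries_per_second = QUERIES_PER_HOUR_LIMIT / 3600  # 200.0
avg_updates_per_second = UPDATES_PER_HOUR_LIMIT / 3600  # 40.0

print(avg_queries_per_second, avg_updates_per_second)
print(fits_default_limits(150, 500_000, 100_000))  # made-up workload within the limits
print(fits_default_limits(250, 500_000, 100_000))  # made-up workload over the capacity limit
```

In other words, the hourly quotas correspond to an average of 200 queries and 40 updates per second. Whether short bursts above that average are tolerated is not stated here.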

Create a cluster that uses unified metadata

  • Use web pages

    When you create a cluster, set Type to Unified Metabases in the Basic Settings step. For information about how to create a cluster, see Create a cluster.

  • Use APIs

    See the description of the CreateClusterV2 operation. Set the value of the useLocalMetaDb parameter to false.
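As a rough illustration of the API route, the sketch below only prepares the request parameters and does not call any SDK. The useLocalMetaDb name comes from the text above; the Name field and the helper function are hypothetical, and a real CreateClusterV2 request requires many more parameters than shown here.

```python
import json


def with_unified_metadata(params):
    """Return a copy of the request parameters with the local metadatabase disabled."""
    updated = dict(params)
    updated["useLocalMetaDb"] = False  # parameter name taken from the text above
    return updated


# Hypothetical, incomplete parameter set used only for illustration.
request = with_unified_metadata({"Name": "demo-cluster"})
print(json.dumps(request, sort_keys=True))
```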

Manage metadata

Note: The unified metadata management page in the EMR console applies only to available clusters for which Type is set to Unified Metabases.
  • Database operations

    You can search for databases by name. After you click a database name, you can view all tables stored in the database.

    Notice: Before you delete a database, you must delete all tables stored in the database.
  • Table operations

    Create a table: Only external tables and partitioned tables can be created. The field separator can be a comma, a space, or one of the following special characters: TAB, ^A, ^B, and ^C. You can also specify a custom separator.

  • Table details

    View partitions: For partitioned tables, you can click Partition Information on the Table Details tab to view partition lists. A maximum of 10,000 partitions can be displayed.
    Notice:
    • Databases and tables can be created and deleted only in EMR clusters.
    • HDFS is the internal file system of a cluster and does not support cross-cluster communication without special network settings. Therefore, the EMR table management feature supports the creation of databases and tables only on the OSS file system.
    • The location of a database or table must be a directory under an OSS bucket, not the root of the OSS bucket.
  • Metadatabase information

    On the Metabase Information page, you can view the usage and limits of the current ApsaraDB for RDS instance. Submit a ticket if you need to modify the information.
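The location rule from the table notice above (a directory under an OSS bucket, not the bucket itself) can be checked before a table is created. The following is a minimal sketch; the function name is hypothetical and the check is purely syntactic, so it does not verify that the path actually exists in OSS.

```python
from urllib.parse import urlparse


def is_valid_table_location(location):
    """Return True for oss://<bucket>/<directory>[/...], False otherwise."""
    parsed = urlparse(location)
    if parsed.scheme != "oss" or not parsed.netloc:
        return False
    # At least one non-empty path segment must follow the bucket name.
    segments = [s for s in parsed.path.split("/") if s]
    return len(segments) >= 1


print(is_valid_table_location("oss://your_bucket/your_folder"))   # directory under a bucket
print(is_valid_table_location("oss://your_bucket"))               # bucket root only
print(is_valid_table_location("hdfs://yourhost:9000/warehouse"))  # not an OSS path
```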

Troubleshooting

  • Wrong FS: oss://yourbucket/xxx/xxx/xxx or Wrong FS: hdfs://yourhost:9000/xxx/xxx/xxx

    This error occurs because the table data in OSS is deleted but the metadata of the table is not. The table schema persists, but the actual data does not exist or is moved to another location. In this case, change the table location to an existing path and delete the table again.

    alter table test set location 'oss://your_bucket/your_folder'

    This operation can be completed in the EMR interactive console.

    Note: oss://your_bucket/your_folder must be an existing OSS path.
  • The message "java.lang.IllegalArgumentException: java.net.UnknownHostException: xxxxxxx" is displayed when a Hive database is deleted.

    This error occurs because a Hive database is created in the HDFS of a cluster but it is not cleared when the cluster is released. As a result, the database data in the HDFS of the released cluster cannot be accessed if another cluster is created. Therefore, remember to clear the databases and tables that are manually created in the HDFS of a cluster when you release the cluster.

    Solution:

    Log on to the master node of the cluster from the command line, and find the address, username, and password used to access the Hive metadatabase in $HIVE_CONF_DIR/hive-site.xml.

    javax.jdo.option.ConnectionUserName //the database username;
    javax.jdo.option.ConnectionPassword //the database password;
    javax.jdo.option.ConnectionURL //the database connection URL and database name;
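    Connect to the Hive metadatabase from the master node. For the -h option, use the host name contained in the connection URL. Press Enter after the command, and then enter ${ConnectionPassword} at the password prompt.

    mysql -h ${DBConnectionURL} -u ${ConnectionUserName} -p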
    After you log on to the Hive metadatabase, change the location of the affected Hive database to an OSS path in the region.
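The lookup in hive-site.xml described in the solution above can also be scripted. The sketch below assumes the standard Hadoop configuration layout of property elements with name and value children, and a MySQL JDBC URL of the form jdbc:mysql://host:port/database. The sample content, host name, and credentials are made up for illustration. Because mysql -h expects a host name rather than a full JDBC URL, the second helper splits the URL into its parts.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

WANTED = (
    "javax.jdo.option.ConnectionURL",
    "javax.jdo.option.ConnectionUserName",
    "javax.jdo.option.ConnectionPassword",
)


def read_metastore_settings(xml_text):
    """Extract the three connection properties from hive-site.xml content."""
    root = ET.fromstring(xml_text)
    settings = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name in WANTED:
            settings[name] = prop.findtext("value", default="").strip()
    return settings


def jdbc_host_and_database(jdbc_url):
    """Split a jdbc:mysql://host:port/db?... URL into (host, port, database)."""
    plain = jdbc_url[len("jdbc:"):] if jdbc_url.startswith("jdbc:") else jdbc_url
    parsed = urlparse(plain)
    return parsed.hostname, parsed.port, parsed.path.lstrip("/")


# Made-up sample content; on a cluster, read $HIVE_CONF_DIR/hive-site.xml instead.
SAMPLE = """<configuration>
  <property><name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore.example.com:3306/hivemeta?createDatabaseIfNotExist=true</value></property>
  <property><name>javax.jdo.option.ConnectionUserName</name><value>hive_user</value></property>
  <property><name>javax.jdo.option.ConnectionPassword</name><value>secret</value></property>
</configuration>"""

settings = read_metastore_settings(SAMPLE)
host, port, database = jdbc_host_and_database(settings["javax.jdo.option.ConnectionURL"])
print(host, port, database)
```

The host value is what you would pass to mysql -h, and the database value is the name of the metadatabase to select after you log on.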