E-MapReduce supports unified metadata management in V2.4.0 and later. Before V2.4.0, each cluster stores the metadata for Hive in a local MySQL database. E-MapReduce V2.4.0 and later versions provide a unified, highly reliable Hive metadatabase.

Overview


Hive metadatabases
Unified metadata management provides the following features:
  • Persistent metadata storage.

    Previously, metadata was stored in a MySQL database deployed on the cluster and was deleted when the cluster was released. For example, after using a pay-as-you-go cluster you may want to release it; to retain the metadata, you had to log on to the cluster and export it manually beforehand. Unified metadata management eliminates this issue.

  • Separation of computing and storage.

    E-MapReduce supports storing data in Alibaba Cloud OSS, which reduces usage costs, especially when the data volume is large. Meanwhile, E-MapReduce clusters are used mainly as computing resources and can be released at any time after use. Because the data is stored in OSS, no metadata migration issue arises.

  • Data sharing.

    With a central metadatabase, if all data is stored in OSS, clusters can access the data directly without migrating or rebuilding metadata. This allows E-MapReduce clusters to provide different services while still sharing data directly.

Notice

Before unified metadata management was supported, metadata was stored in the local MySQL database of each cluster and was lost when the cluster was released. With unified metadata management, releasing a cluster does not clean up its metadata. Before you delete data in OSS or in the HDFS of a cluster, or release a cluster, make sure that the corresponding metadata has already been deleted, that is, the tables and databases that reference the data have been dropped. This prevents dirty metadata from accumulating in the metadatabase.
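For example, before you release a cluster, you can drop the objects that reference its data in the Hive CLI. The database and table names below are placeholders:

```sql
-- Drop the tables first, then the database that contained them.
-- "sales_db" and "orders" are example names.
DROP TABLE IF EXISTS sales_db.orders;
DROP DATABASE IF EXISTS sales_db;
```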

Create a cluster using unified metadata

  • Use webpages

    On the Basic Settings page, turn on Unified Metadatabases.

  • Use APIs

    See the description of the CreateClusterV2 operation. Set the value of the useLocalMetaDb parameter to false.

Operate metadata

  • Prerequisites

    The unified metadata management page in the EMR console supports only clusters that use unified metadata.

  • Operate databases
    You can search for databases by name. Click a database name to view all tables in the database.
    Notice Before deleting a database, you must first delete all tables stored in it.
  • Operate tables

    Create a table: currently, only external tables and partitioned tables can be created. The field separator can be a common character such as a comma or space, a special character such as TAB, ^A, ^B, or ^C, or a custom separator.
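    As a sketch, creating a comma-separated external table on OSS might look as follows in HiveQL; the bucket, folder, table, and column names are placeholders:

    ```sql
    -- Hypothetical external, partitioned table whose data lives on OSS.
    CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
      user_id STRING,
      url     STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'oss://your_bucket/your_folder/page_views';
    ```

    Note that the location is a directory under the bucket rather than the bucket root.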
  • Table details

    View partitions: For partition tables, you can go to Table Details > Partitions to view the partition lists. A maximum of 10,000 partitions can be displayed.
    Notice
    • Databases and tables can be created and deleted only in E-MapReduce clusters.
    • HDFS is the internal file system of a cluster and does not support cross-cluster access without special network settings. Therefore, the E-MapReduce table management function supports creating databases and tables only on the OSS file system.
    • The location of a database or table must be a directory under an OSS bucket, not the bucket root.
  • Metadatabase information

    On the metadatabase information page, you can view the usage and restrictions of the current RDS instance. Submit a ticket if you need to modify this information.

Migration

  • Migrate data from an EMR unified metadatabase to a user-created RDS instance
    1. Purchase an RDS instance and make sure that the master node of the cluster can connect to the RDS instance. We recommend that you add the RDS instance to the security group of the ECS instances in the EMR cluster so that you can use a private address to connect to the RDS instance.
      Note For security reasons, you need to add the IP addresses of the EMR nodes that are allowed to access the RDS instances to the whitelist.
    2. Create a database in the RDS instance and name it hivemeta. Create a user and grant the user the read/write permission to hivemeta.
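      If you create the database and account through the MySQL CLI rather than the RDS console, step 2 can be sketched as follows; the user name and password are placeholders:

      ```sql
      -- Example only: on ApsaraDB RDS you would usually create the
      -- database and account in the RDS console instead.
      CREATE DATABASE hivemeta DEFAULT CHARACTER SET utf8;
      CREATE USER 'hive'@'%' IDENTIFIED BY 'YourStrongPassword';
      GRANT ALL PRIVILEGES ON hivemeta.* TO 'hive'@'%';
      FLUSH PRIVILEGES;
      ```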
    3. Export the unified metadata. We recommend that you stop the Hive metastore service before exporting the data to avoid metadata changes during the export process.
    4. Log on to the Alibaba Cloud E-MapReduce console. Click the Cluster Management tab to go to the Cluster Management page.
    5. Click the cluster ID to go to the Clusters and Services page. On the Services list, click Hive > Configure to go to the Hive configuration page. Locate the values of the following configuration items. For earlier versions of EMR clusters that do not support configuration management, locate the configuration items in the $HIVE_CONF_DIR/hive-site.xml file.
      javax.jdo.option.ConnectionUserName //the database username;
      javax.jdo.option.ConnectionPassword //the database password;
      javax.jdo.option.ConnectionURL //the database connection URL and database name;
    6. Log on to the master node of the cluster using SSH and run the following command to export the metadata.
      mysqldump -t hivemeta -h <unified metadatabase connection URL> -u <unified metadatabase username> -p > /tmp/metastore.sql

      Enter the value of javax.jdo.option.ConnectionPassword as the password.

    7. In the /usr/local/emr/emr-agent/run/meta_db_info.json file on the master node of the cluster, set the value of use_local_meta_db to false, and set the connection URL, username, and password of the RDS instance for the metadatabase.
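      A sketch of what meta_db_info.json might contain after this step; apart from use_local_meta_db, the key names are assumptions that may differ across EMR versions, so check the keys in the existing file before editing:

      ```json
      {
        "use_local_meta_db": false,
        "db_conn_url": "jdbc:mysql://<RDS connection URL>/hivemeta",
        "db_user_name": "hive",
        "db_password": "YourStrongPassword"
      }
      ```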
    8. On the Hive configuration page, set the connection URL, username, and password of the RDS instance for the metadatabase. For earlier versions of EMR clusters that do not support configuration management, locate the configuration items in the $HIVE_CONF_DIR/hive-site.xml file.
    9. On the master node, set the connection URL, username, and password of the RDS instance for the metadatabase in the hive-site.xml file. Run the init schema command.
      cd /usr/lib/hive-current/bin
      ./schematool -initSchema -dbType mysql
    10. Import the exported metadata to the RDS instance. Use the CLI to log on to MySQL.
      mysql -h <RDS connection URL> -u <RDS username> -p

      In the MySQL CLI, run the source /tmp/metastore.sql command.

    11. On the Clusters and Services page, click RESTART ALL COMPONENTS to restart all Hive components. Then run the hive command on the master node and execute a simple query, such as show databases;, to check whether the metadata was migrated correctly.

FAQ

  • Wrong FS: oss://yourbucket/xxx/xxx/xxx or Wrong FS: hdfs://yourhost:9000/xxx/xxx/xxx

    This error occurs when the table data on OSS is deleted while the metadata of the table is not. The table schema persists, while the actual data does not exist or is moved to another location. In this case, you can change the table location to an existing path and delete the table again.

    alter table test set location 'oss://your_bucket/your_folder';

    This can be completed on the E-MapReduce interactive console.

    Note oss://your_bucket/your_folder must be an existing OSS path.
  • The message “java.lang.IllegalArgumentException: java.net.UnknownHostException: xxxxxxx” is displayed when the Hive database is deleted.

    This error occurs because the Hive database is created in the HDFS of a cluster and it is not cleaned up when the cluster is released. As a result, its data in the HDFS of the released cluster cannot be accessed after a new cluster is created. Therefore, when releasing a cluster, remember to clean up the databases and tables that are manually created in the HDFS of the cluster.

    Solutions:

    Log on to the master node of the cluster using the command line, and find the address, username, and password for accessing the Hive metadatabase in $HIVE_CONF_DIR/hive-site.xml.

    javax.jdo.option.ConnectionUserName //the database username;
    javax.jdo.option.ConnectionPassword //the database password;
    javax.jdo.option.ConnectionURL //the database connection URL and database name;


    Connect to the Hive metadatabase on the master node:
    mysql -h ${DBConnectionURL} -u ${ConnectionUserName} -p
    Enter ${ConnectionPassword} when prompted for the password.
    After logging on to the Hive metadatabase, change its location to an existing OSS path in the region:
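    As a sketch, the change can be made against the standard Hive metastore schema, where the DBS table stores each database's location; the Hive database name and OSS path below are placeholders:

    ```sql
    -- Repoint the Hive database at an existing OSS path so it can be dropped cleanly.
    UPDATE DBS
    SET DB_LOCATION_URI = 'oss://your_bucket/your_folder'
    WHERE NAME = 'your_hive_db';
    ```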