This topic describes how to create and run a Hive job in an E-MapReduce (EMR) cluster.

Use Hive to process OSS data

If you want to use Hive to read data from or write data to an Object Storage Service (OSS) bucket, you must first create an external table. The following statement creates an external table whose data is stored in OSS:
CREATE EXTERNAL TABLE eusers (
  userid INT
)
LOCATION 'oss://emr/users';
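After the table is created, you can query the OSS data as if it were a regular Hive table. A minimal sanity check, assuming the OSS path already contains data in the expected format:
SELECT COUNT(*) FROM eusers;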
If the preceding method is not supported, or if you want to use the AccessKey pair of another Alibaba Cloud account to access OSS data in another location, specify the credentials and endpoint directly in the table location:
CREATE EXTERNAL TABLE eusers (
  userid INT
)
LOCATION 'oss://${accessKeyId}:${accessKeySecret}@${bucket}.${endpoint}/users';
Parameters in the command:
  • ${accessKeyId}: the AccessKey ID of the Alibaba Cloud account.
  • ${accessKeySecret}: the AccessKey secret that matches the AccessKey ID.
  • ${bucket}: the name of the OSS bucket.
  • ${endpoint}: the network endpoint that is used to access OSS. The endpoint varies by region, and the OSS bucket must reside in the same region as your cluster.

    For more information, see OSS endpoints.
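
For example, with a bucket named mybucket in the China (Hangzhou) region, the statement might look like the following. The AccessKey pair and bucket name are placeholders for illustration only:
CREATE EXTERNAL TABLE eusers (
  userid INT
)
LOCATION 'oss://LTAI****:****@mybucket.oss-cn-hangzhou.aliyuncs.com/users';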

Examples

The following examples show how to create and run a Hive job:
  • Example 1
    1. Write the following script, save it as hiveSample1.sql, and then upload it to OSS.

      For more information, see Upload objects.

      USE DEFAULT;
      SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
      SET hive.stats.autogather=false;
      DROP TABLE IF EXISTS emrusers;
      CREATE EXTERNAL TABLE emrusers (
        userid INT,
        movieid INT,
        rating INT,
        unixtime STRING
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE
      LOCATION 'oss://${bucket}/yourpath';
      SELECT COUNT(*) FROM emrusers;
      SELECT * FROM emrusers LIMIT 100;
      SELECT movieid, COUNT(userid) AS usercount FROM emrusers GROUP BY movieid ORDER BY usercount DESC LIMIT 50;
    2. Prepare test data.

      You can download the following test data and upload it to the destination OSS directory.

      Test data: Public test data
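
      The table definition in hiveSample1.sql expects tab-separated lines with four fields: userid, movieid, rating, and unixtime. Rows in that format look like the following (the values are illustrative):
      196	242	3	881250949
      186	302	3	891717742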

    3. Create a job.

      Create a Hive job in the EMR console. For more information, see Configure a Hive job.

      Content of the job:
      -f ossref://${bucket}/yourpath/hiveSample1.sql

      In this example, ${bucket} indicates your OSS bucket, and yourpath indicates the path in the bucket where the Hive script is stored. Replace both placeholders with your actual values. The ossref:// prefix indicates that EMR downloads the script from OSS to the cluster before the job runs.

    4. Run the job.

      Click Run. You can associate the job with an existing cluster, or let the system automatically create a cluster and associate the job with it.

  • Example 2

    This example uses the scan workload of the HiBench benchmark suite.

    1. Write the following script, save it as scan.hive, and then upload it to OSS:
      USE DEFAULT;
      SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
      SET mapreduce.job.maps=12;
      SET mapreduce.job.reduces=6;
      SET hive.stats.autogather=false;
      DROP TABLE IF EXISTS uservisits;
      CREATE EXTERNAL TABLE uservisits (
        sourceIP STRING, destURL STRING, visitDate STRING,
        adRevenue DOUBLE, userAgent STRING, countryCode STRING,
        languageCode STRING, searchWord STRING, duration INT
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS SEQUENCEFILE
      LOCATION 'oss://${bucket}/sample-data/hive/Scan/Input/uservisits';

      For example, you can upload the script to oss://emr/jars/.
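
      The script only defines the external table. If you want a quick sanity check after the table is created, you can append a simple query such as the following to the end of scan.hive (this query is not part of the HiBench scan workload):
      SELECT COUNT(*) FROM uservisits;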

    2. Prepare test data.

      You can download the following test data and upload it to the destination OSS directory.

      Test data: uservisits.

    3. Create a Hive job in the EMR console. For more information, see Configure a Hive job.
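
      Content of the job, assuming that you uploaded the script to oss://emr/jars/ as described in the first step (replace the path with your actual location):
      -f ossref://emr/jars/scan.hive
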
    4. Run the job.

      Click Run. You can associate the job with an existing cluster, or let the system automatically create a cluster and associate the job with it.