This topic describes how to create and run a Hive job in an E-MapReduce (EMR) cluster.

Use Hive to process OSS data

If you want to use Hive to read data from or write data to an Object Storage Service (OSS) bucket, you must first run the following command to create an external table:
CREATE EXTERNAL TABLE eusers (
  userid INT)
LOCATION 'oss://emr/users';
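
After the external table is created, you can read from and write to the OSS path with ordinary Hive statements. The following is a minimal sketch; the inserted values are illustrative, and INSERT ... VALUES requires Hive 0.14 or later:

-- Write a few rows to the OSS bucket through the external table.
INSERT INTO TABLE eusers VALUES (1), (2), (3);
-- Read the rows back from OSS.
SELECT * FROM eusers;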

Examples

The following examples show how to create and run a Hive job:
  • Example 1
    1. Write the following script, save it as hiveSample1.sql, and then upload it to OSS.

      For more information, see Upload objects.

      USE DEFAULT;
      SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
      SET hive.stats.autogather=false;
      DROP TABLE IF EXISTS emrusers;
      CREATE EXTERNAL TABLE emrusers (
        userid INT,
        movieid INT,
        rating INT,
        unixtime STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE
      LOCATION 'oss://${bucket}/yourpath';
      SELECT COUNT(*) FROM emrusers;
      SELECT * FROM emrusers LIMIT 100;
      SELECT movieid, COUNT(userid) AS usercount FROM emrusers GROUP BY movieid ORDER BY usercount DESC LIMIT 50;
    2. Prepare test data.

      You can download the following test data and upload it to the destination OSS directory.

      Test data: Public test data

    3. Create a job.

      Create a Hive job in the EMR console. For more information, see Configure a Hive job.

      Content of the job:
      -f ossref://${bucket}/yourpath/hiveSample1.sql

      In this example, replace ${bucket} with the name of your OSS bucket and yourpath with the path in the bucket where the Hive script is stored.
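      For example, if the bucket is named my-bucket (a hypothetical name) and the script is stored in the hive/scripts/ directory, the job content would be:

      -f ossref://my-bucket/hive/scripts/hiveSample1.sql

      The ossref prefix instructs EMR to download the script from OSS to the cluster before the job runs.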

    4. Run the job.

      Click Run. You can associate the job with an existing cluster, or let the system automatically create a cluster and associate the job with it.

  • Example 2

    This example uses the scan workload in HiBench.

    1. Write the following script, save it as scan.hive, and then upload it to OSS:
      USE DEFAULT;
      SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
      SET mapreduce.job.maps=12;
      SET mapreduce.job.reduces=6;
      SET hive.stats.autogather=false;
      DROP TABLE IF EXISTS uservisits;
      CREATE EXTERNAL TABLE uservisits (
        sourceIP STRING,
        destURL STRING,
        visitDate STRING,
        adRevenue DOUBLE,
        userAgent STRING,
        countryCode STRING,
        languageCode STRING,
        searchWord STRING,
        duration INT)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS SEQUENCEFILE
      LOCATION 'oss://${bucket}/sample-data/hive/Scan/Input/uservisits';

      For example, you can upload the script to oss://emr/jars/.
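
      After the uservisits table is created, you can run scan-style queries that read the whole table from OSS. The following query is an illustrative sketch and is not part of the HiBench workload:

      -- Aggregate ad revenue per country across the full table.
      SELECT countryCode, SUM(adRevenue) AS totalRevenue
      FROM uservisits
      GROUP BY countryCode
      ORDER BY totalRevenue DESC
      LIMIT 10;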

    2. Prepare test data.

      You can download the following test data and upload it to the destination OSS directory.

      Test data: uservisits.

    3. Create a Hive job in the EMR console. For more information, see Configure a Hive job.
    4. Run the job.

      Click Run. You can associate the job with an existing cluster, or let the system automatically create a cluster and associate the job with it.