Hive development manual

Last Updated: Mar 31, 2017

Use OSS in Hive

To read and write OSS data in Hive, you need to add the AccessKeyId, AccessKeySecret, and endpoint to the OSS URI. The following sample shows how to create an external table:

  CREATE EXTERNAL TABLE eusers (
    userid INT)
  LOCATION 'oss://emr/users';

To ensure correct access to OSS, modify the OSS URI as follows:

  CREATE EXTERNAL TABLE eusers (
    userid INT)
  LOCATION 'oss://${AccessKeyId}:${AccessKeySecret}@${bucket}.${endpoint}/users';

Parameter description:

${AccessKeyId}: The AccessKeyId of your account.

${AccessKeySecret}: The secret corresponding to the AccessKeyId.

${bucket}: The name of the OSS bucket that holds the data.

${endpoint}: The endpoint used to access OSS. It depends on the region of your cluster, and the OSS bucket must be in the same region as the cluster.

Specific values can be found here.
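For illustration, a filled-in URI might look like the following sketch. The bucket name emr-sample and the Hangzhou public endpoint are assumptions for this example only; substitute your own credentials, bucket, and endpoint.

  -- A minimal sketch with placeholder values, not real credentials;
  -- the bucket emr-sample and endpoint oss-cn-hangzhou.aliyuncs.com are assumed.
  CREATE EXTERNAL TABLE eusers (
    userid INT)
  LOCATION 'oss://MyAccessKeyId:MyAccessKeySecret@emr-sample.oss-cn-hangzhou.aliyuncs.com/users';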

Use Tez as the execution engine

From E-MapReduce 2.1.0 onward, Tez is supported. Tez is a computing framework for optimizing and scheduling complex directed-acyclic-graph (DAG) jobs.

To use Tez in your Hive job, add the following setting:

  set hive.execution.engine=tez;
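For context, a minimal sketch of a script that runs a query on the Tez engine might look like this (the table name my_table is a hypothetical placeholder):

  -- Switch the execution engine to Tez for the queries that follow.
  set hive.execution.engine=tez;
  -- my_table is a hypothetical table used only to illustrate the setting.
  SELECT COUNT(*) FROM my_table;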

Sample 1

To run the sample, follow these steps:

  1. Write the following script, save it as hiveSample1.sql, and upload it to OSS.

    USE DEFAULT;
    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    set hive.stats.autogather=false;
    -- Drop any previous copy of the table, then map it onto the OSS data.
    DROP TABLE IF EXISTS emrusers;
    CREATE EXTERNAL TABLE emrusers (
      userid INT,
      movieid INT,
      rating INT,
      unixtime STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION 'oss://${AccessKeyId}:${AccessKeySecret}@${bucket}.${endpoint}/yourpath';
    -- Sanity checks and a simple aggregation over the data.
    SELECT COUNT(*) FROM emrusers;
    SELECT * FROM emrusers LIMIT 100;
    SELECT movieid, COUNT(userid) AS usercount FROM emrusers GROUP BY movieid ORDER BY usercount DESC LIMIT 50;
  2. Prepare the test data. You can download the required resources for the Hive job from the link below and store them in your OSS directory.

    Download resources: Public Testing Data

  3. Create a job. In E-MapReduce, create a new Hive job with the following parameter configuration:

    -f ossref://${bucket}/yourpath/hiveSample1.sql

    Here, ${bucket} refers to one of your OSS buckets, and yourpath refers to a path within that bucket. Fill in the address where the Hive script is saved.
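    For example, if the script were uploaded to a hypothetical bucket named emr-sample under the path hive/hiveSample1.sql, the parameter would read:

      -f ossref://emr-sample/hive/hiveSample1.sql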

  4. Create an execution plan and run it. You can associate the plan with an existing cluster, or create a cluster on demand and associate the plan with it. Save it as Run manually; you can then return to the interface and click Run now to run the job.

Sample 2

This sample takes the Scan workload in HiBench as an example. When the input and output reside on OSS, the following changes are needed to run the Hive job: the code must be modified to include the AccessKeyId, AccessKeySecret, and storage path. Note the OSS path format: oss://${AccessKeyId}:${AccessKeySecret}@${bucket}.${endpoint}/object/path.

To run the sample, follow these steps:

  1. Write the following script. (A sketch of the follow-up scan query used by the workload appears after these steps.)

    USE DEFAULT;
    set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
    set mapreduce.job.maps=12;
    set mapreduce.job.reduces=6;
    set hive.stats.autogather=false;
    -- Drop any previous copy of the table, then map it onto the OSS input data.
    DROP TABLE IF EXISTS uservisits;
    CREATE EXTERNAL TABLE uservisits (
      sourceIP STRING,
      destURL STRING,
      visitDate STRING,
      adRevenue DOUBLE,
      userAgent STRING,
      countryCode STRING,
      languageCode STRING,
      searchWord STRING,
      duration INT)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS SEQUENCEFILE
    LOCATION 'oss://${AccessKeyId}:${AccessKeySecret}@${bucket}.${endpoint}/sample-data/hive/Scan/Input/uservisits';
  2. Prepare the test data. You can download the required resources from the link below and store them in your OSS directory.

    Download resources: uservisits

  3. Create a job. Store the script written in Step 1 to OSS, for example, to oss://emr/jars/scan.hive, and then create the corresponding job in E-MapReduce:

    (Screenshot: basic job configurations)

  4. Create an execution plan and run it. You can associate the plan with an existing cluster, or create a cluster on demand and associate the plan with it. Save it as Run manually; you can then return to the console and click Run now to run the job.
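The script in Step 1 only defines the external table over the input data; in the HiBench Scan workload, the actual scan is a separate insert-select into an output table. The sketch below illustrates that follow-up query under the assumption that the output path mirrors the input path; it is not a HiBench-defined location.

  -- A minimal sketch of the follow-up scan query; the output location is an
  -- assumption chosen to mirror the input path, not a HiBench-defined value.
  DROP TABLE IF EXISTS uservisits_copy;
  CREATE EXTERNAL TABLE uservisits_copy (
    sourceIP STRING,
    destURL STRING,
    visitDate STRING,
    adRevenue DOUBLE,
    userAgent STRING,
    countryCode STRING,
    languageCode STRING,
    searchWord STRING,
    duration INT)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS SEQUENCEFILE
  LOCATION 'oss://${AccessKeyId}:${AccessKeySecret}@${bucket}.${endpoint}/sample-data/hive/Scan/Output/uservisits_copy';
  -- Copy every row from the input table into the output table.
  INSERT OVERWRITE TABLE uservisits_copy SELECT * FROM uservisits;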
