This topic describes how to create and run a Hive job in an E-MapReduce (EMR) cluster.
Use Hive to process OSS data
You can specify only the OSS directory in the table location:

```sql
CREATE EXTERNAL TABLE eusers (
  userid INT)
LOCATION 'oss://emr/users';
```

Alternatively, you can embed the AccessKey pair in the table location:

```sql
CREATE EXTERNAL TABLE eusers (
  userid INT)
LOCATION 'oss://${AccessKeyId}:${AccessKeySecret}@${bucket}.${endpoint}/users';
```
Parameter description:
- ${AccessKeyId}: the AccessKey ID of an Alibaba Cloud account.
- ${AccessKeySecret}: the AccessKey secret that matches the AccessKey ID.
- ${bucket}: the name of the OSS bucket.
- ${endpoint}: the network endpoint that is used to access OSS. The endpoint depends on the region where your cluster resides, and the OSS bucket must be in the same region. For more information, see OSS endpoints.
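To illustrate how these placeholders combine into a table location, the following Python sketch builds the URI from its parts. The AccessKey pair, bucket name, and endpoint below are hypothetical values for illustration only, not values from this topic.

```python
def oss_location(access_key_id, access_key_secret, bucket, endpoint, path):
    """Build an OSS LOCATION URI with embedded credentials, following the
    oss://${AccessKeyId}:${AccessKeySecret}@${bucket}.${endpoint}/path pattern."""
    return f"oss://{access_key_id}:{access_key_secret}@{bucket}.{endpoint}/{path}"

# Hypothetical values for illustration only.
print(oss_location("LTAIexample", "secretExample", "emr",
                   "oss-cn-hangzhou.aliyuncs.com", "users"))
# → oss://LTAIexample:secretExample@emr.oss-cn-hangzhou.aliyuncs.com/users
```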
Examples
- Example 1
- Write the following script, save it as hiveSample1.sql, and then upload it to OSS.
For more information, see Upload objects.
```sql
USE DEFAULT;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hive.stats.autogather=false;
DROP TABLE emrusers;
CREATE EXTERNAL TABLE emrusers (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 'oss://${bucket}/yourpath';
SELECT COUNT(*) FROM emrusers;
SELECT * FROM emrusers LIMIT 100;
SELECT movieid, COUNT(userid) AS usercount FROM emrusers GROUP BY movieid ORDER BY usercount DESC LIMIT 50;
```
- Prepare test data.
You can download the following test data and upload it to the destination OSS directory.
Test data: Public test data
- Create a job.
Create a Hive job in the EMR console. For more information, see Configure a Hive job.
Content of the job:

```
-f ossref://${bucket}/yourpath/hiveSample1.sql
```

In this example, ${bucket} indicates your OSS bucket, and yourpath indicates a path in the bucket. Replace yourpath with the path where the Hive script is stored.
- Run the job.
Click Run to run the job. You can associate the job with an existing cluster. You can also enable the system to automatically create a cluster and associate the job with the cluster.
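The final query in hiveSample1.sql counts ratings per movie and returns the 50 most-rated movies. To show what that aggregation does, here is a small Python sketch that applies the same logic to a few hypothetical (userid, movieid, rating, unixtime) rows; the rows are made up for illustration and are not part of the test data.

```python
from collections import Counter

# Hypothetical rows standing in for the emrusers test data.
rows = [
    (1, 101, 5, "874965758"),
    (2, 101, 3, "876893171"),
    (3, 202, 4, "878542960"),
    (4, 101, 2, "876893119"),
    (5, 202, 1, "889751712"),
]

# Equivalent of: SELECT movieid, COUNT(userid) AS usercount FROM emrusers
#                GROUP BY movieid ORDER BY usercount DESC LIMIT 50;
usercount = Counter(movieid for _, movieid, _, _ in rows)
top50 = usercount.most_common(50)
print(top50)  # → [(101, 3), (202, 2)]
```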
- Example 2
Use scan in HiBench as an example.
- Write the following script, save it as scan.hive, and then upload it to OSS:
```sql
USE DEFAULT;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapreduce.job.maps=12;
set mapreduce.job.reduces=6;
set hive.stats.autogather=false;
DROP TABLE uservisits;
CREATE EXTERNAL TABLE uservisits (
  sourceIP STRING,
  destURL STRING,
  visitDate STRING,
  adRevenue DOUBLE,
  userAgent STRING,
  countryCode STRING,
  languageCode STRING,
  searchWord STRING,
  duration INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE
LOCATION 'oss://${bucket}/sample-data/hive/Scan/Input/uservisits';
```
For example, you can upload the file to oss://emr/jars/.
- Prepare test data.
You can download the following test data and upload it to the destination OSS directory.
Test data: uservisits.
- Create a Hive job in the EMR console. For more information, see Configure a Hive job.
- Run the job.
Click Run to run the job. You can associate the job with an existing cluster. You can also enable the system to automatically create a cluster and associate the job with the cluster.
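The uservisits table above declares nine columns with FIELDS TERMINATED BY ','. As an illustration of how one record maps onto that schema, the following Python sketch splits a comma-delimited line into the declared columns; the sample record is hypothetical and does not come from the uservisits test data.

```python
# Column names from the uservisits table definition in scan.hive.
COLUMNS = ["sourceIP", "destURL", "visitDate", "adRevenue", "userAgent",
           "countryCode", "languageCode", "searchWord", "duration"]

def parse_record(line):
    """Split one comma-delimited record into a dict keyed by column name,
    casting adRevenue to float and duration to int as declared in the schema."""
    values = line.rstrip("\n").split(",")
    record = dict(zip(COLUMNS, values))
    record["adRevenue"] = float(record["adRevenue"])
    record["duration"] = int(record["duration"])
    return record

# A hypothetical record for illustration only.
sample = "10.0.0.1,http://example.com/a,2012-01-01,0.5,Mozilla,CN,zh-CN,hive,3"
print(parse_record(sample)["adRevenue"])  # → 0.5
```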