This topic describes how to create and run a Pig job in an E-MapReduce (EMR) cluster.

Use Pig to access OSS data

When you use Pig to access Object Storage Service (OSS) data, you must specify an OSS path in the following format:
oss://${accessKeyId}:${accessKeySecret}@${bucket}.${endpoint}/${path}
Parameters:
  • ${accessKeyId}: the AccessKey ID of your Alibaba Cloud account.
  • ${accessKeySecret}: the AccessKey secret that matches the AccessKey ID.
  • ${bucket}: the bucket that matches the AccessKey ID.
  • ${endpoint}: the network endpoint that is used to access OSS. It depends on the region where your cluster resides. The OSS bucket must be in the region where your cluster resides.

    For more information, see OSS endpoints.

  • ${path}: the path of a file in the bucket.

Procedure

Use the script1-hadoop.pig file in Pig as an example. Upload tutorial.jar and excite.log.bz2 in Pig to OSS. For example, the upload paths are oss://emr/jars/tutorial.jar and oss://emr/data/excite.log.bz2.

  1. Prepare a script.
    Modify the JAR file path and the input and output paths in the script based on the preceding OSS paths.
    /*
     * Licensed to the Apache Software Foundation (ASF) under one
    * or more contributor license agreements.  See the NOTICE file
    * distributed with this work for additional information
    * regarding copyright ownership.  The ASF licenses this file
    * to you under the Apache License, Version 2.0 (the
    * "License"); you may not use this file except in compliance
    * with the License.  You may obtain a copy of the License at
    *
    *     http://www.apache.org/licenses/LICENSE-2.0
    *
    * Unless required by applicable law or agreed to in writing, software
    * distributed under the License is distributed on an "AS IS" BASIS,
    * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    * See the License for the specific language governing permissions and
    * limitations under the License.
    */
    -- Query Phrase Popularity (Hadoop cluster)
    -- This script processes a search query log file from the Excite search engine and finds search phrases that occur with particular high frequency during certain times of the day.
    -- Register the tutorial JAR file so that the included UDFs can be called in the script.
    REGISTER oss://${AccessKeyId}:${AccessKeySecret}@${bucket}.${endpoint}/data/tutorial.jar;
    -- Use the PigStorage function to load the excite log file into the ▒raw▒ bag as an array of records.
    -- Input: (user,time,query)
    raw = LOAD 'oss://${AccessKeyId}:${AccessKeySecret}@${bucket}.${endpoint}/data/excite.log.bz2' USING PigStorage('\t') AS (user, time, query);
    -- Call the NonURLDetector UDF to remove records if the query field is empty or a URL.
    clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
    -- Call the ToLower UDF to change the query field to lowercase.
    clean2 = FOREACH clean1 GENERATE user, time,     org.apache.pig.tutorial.ToLower(query) as query;
    -- Because the log file only contains queries for a single day, we are only interested in the hour.
    -- The excite query log timestamp format is YYMMDDHHMMSS.
    -- Call the ExtractHour UDF to extract the hour (HH) from the time field.
    houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
    -- Call the NGramGenerator UDF to compose the n-grams of the query.
    ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
    -- Use the  DISTINCT command to get the unique n-grams for all records.
    ngramed2 = DISTINCT ngramed1;
    -- Use the  GROUP command to group records by n-gram and hour.
    hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
    -- Use the  COUNT function to get the count (occurrences) of each n-gram.
    hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
    -- Use the  GROUP command to group records by n-gram only.
    -- Each group now corresponds to a distinct n-gram and has the count for each hour.
    uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;
    -- For each group, identify the hour in which this n-gram is used with a particularly high frequency.
    -- Call the ScoreGenerator UDF to calculate a "popularity" score for the n-gram.
    uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));
    -- Use the  FOREACH-GENERATE command to assign names to the fields.
    uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
    -- Use the  FILTER command to move all records with a score less than or equal to 2.0.
    filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;
    -- Use the  ORDER command to sort the remaining records by hour and score.
    ordered_uniq_frequency = ORDER filtered_uniq_frequency BY hour, score;
    -- Use the  PigStorage function to store the results.
    -- Output: (hour, n-gram, score, count, average_counts_among_all_hours)
    STORE ordered_uniq_frequency INTO 'oss://${AccessKeyId}:${AccessKeySecret}@${bucket}.${endpoint}/data/script1-hadoop-results' USING PigStorage();

    Upload the script1-hadoop.pig script to an OSS path, for example, oss://emr/jars/.

  2. Create a job.

    Create a Pig job on the Data Platform tab of the EMR console. For more information, see Configure a Pig job.

    Content of the job:
    -f ossref://emr/jars/script1-hadoop.pig
  3. Run the job.

    Click Run to run the job. You can associate the job with an existing cluster. You can also enable the system to automatically create a cluster and associate the job with the cluster.