Pig job configuration

Last Updated: May 03, 2017

When you create a cluster in E-MapReduce, a Pig environment is provided by default, so you can use Pig directly to create and operate on tables and data. The procedure is as follows.

  1. Prepare the Pig script in advance, for example:

    ```pig
    /*
    * Licensed to the Apache Software Foundation (ASF) under one
    * or more contributor license agreements. See the NOTICE file
    * distributed with this work for additional information
    * regarding copyright ownership. The ASF licenses this file
    * to you under the Apache License, Version 2.0 (the
    * "License"); you may not use this file except in compliance
    * with the License. You may obtain a copy of the License at
    *
    * http://www.apache.org/licenses/LICENSE-2.0
    *
    * Unless required by applicable law or agreed to in writing, software
    * distributed under the License is distributed on an "AS IS" BASIS,
    * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    * See the License for the specific language governing permissions and
    * limitations under the License.
    */
    -- Query Phrase Popularity (Hadoop cluster)
    -- This script processes a search query log file from the Excite search engine and finds search phrases that occur with particularly high frequency during certain times of the day.
    -- Register the tutorial JAR file so the included UDFs can be called in the script.
    REGISTER oss://emr/checklist/jars/chengtao/pig/tutorial.jar;
    -- Use the PigStorage function to load the excite log file into the "raw" bag as an array of records.
    -- Input: (user,time,query)
    raw = LOAD 'oss://emr/checklist/data/chengtao/pig/excite.log.bz2' USING PigStorage('\t') AS (user, time, query);
    -- Call the NonURLDetector UDF to remove records if the query field is empty or a URL.
    clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
    -- Call the ToLower UDF to change the query field to lowercase.
    clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
    -- Because the log file only contains queries for a single day, we are only interested in the hour.
    -- The excite query log timestamp format is YYMMDDHHMMSS.
    -- Call the ExtractHour UDF to extract the hour (HH) from the time field.
    houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;
    -- Call the NGramGenerator UDF to compose the n-grams of the query.
    ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
    -- Use the DISTINCT command to get the unique n-grams for all records.
    ngramed2 = DISTINCT ngramed1;
    -- Use the GROUP command to group records by n-gram and hour.
    hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
    -- Use the COUNT function to get the occurrences of each n-gram.
    hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
    -- Use the GROUP command to group records by n-gram only.
    -- Now each group corresponds to a distinct n-gram and has the count for each hour.
    uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;
    -- For each group, identify the hour in which this n-gram is used with a particularly high frequency.
    -- Call the ScoreGenerator UDF to calculate a "popularity" score for the n-gram.
    uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));
    -- Use the FOREACH-GENERATE command to assign names to the fields.
    uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
    -- Use the FILTER command to remove all records with a score less than or equal to 2.0.
    filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;
    -- Use the ORDER command to sort the remaining records by hour and score.
    ordered_uniq_frequency = ORDER filtered_uniq_frequency BY hour, score;
    -- Use the PigStorage function to store the results.
    -- Output: (hour, n-gram, score, count, average_counts_among_all_hours)
    STORE ordered_uniq_frequency INTO 'oss://emr/checklist/data/chengtao/pig/script1-hadoop-results' USING PigStorage();
    ```
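
    Before submitting, you can optionally sanity-check the script. As a minimal sketch (assuming a local Pig installation, and that you have made a copy of the script with the oss:// JAR and data paths replaced by local file paths; the file name below is a hypothetical placeholder), you could run it in Pig's local mode:

    ```shell
    # Run the script against the local filesystem instead of a Hadoop cluster.
    # script1-hadoop-local.pig is a hypothetical copy of the script above with
    # the oss:// paths replaced by local paths.
    pig -x local script1-hadoop-local.pig
    ```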
  2. Save the script to a file, for example "script1-hadoop-oss.pig", and upload it to an OSS directory (for example: oss://path/to/script1-hadoop-oss.pig).
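
    For example, assuming you have the Alibaba Cloud ossutil tool installed and configured with your credentials (the destination path below reuses the placeholder path above), the upload could look like:

    ```shell
    # Copy the local script file to an OSS directory; the path is a placeholder.
    ossutil cp script1-hadoop-oss.pig oss://path/to/script1-hadoop-oss.pig
    ```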

  3. Log on to the Alibaba Cloud E-MapReduce console and open the Job List page.

  4. Click Create a job in the top right corner to enter the job creation page.

  5. Enter the job name.

  6. Select Pig as the job type to create a Pig job. This type of job is submitted in the background as a command of the following form:

    ```shell
    pig [user provided parameters]
    ```
  7. Fill in the Parameters field with the parameters that follow the pig command. For example, to use the Pig script uploaded to OSS, fill in the following:

    ```shell
    -x mapreduce ossref://emr/checklist/jars/chengtao/pig/script1-hadoop-oss.pig
    ```

    You can click Select OSS path to browse OSS and select the script; the system automatically completes the absolute path of the Pig script on OSS. Switch the prefix of the Pig script path to "ossref" (click Switch resource type) to ensure that E-MapReduce downloads the file correctly.
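
    Combining the submission form from step 6 with these parameters, the command that E-MapReduce effectively runs in the background would look like this (using the example path above):

    ```shell
    # -x mapreduce runs the script on the cluster's Hadoop MapReduce engine.
    pig -x mapreduce ossref://emr/checklist/jars/chengtao/pig/script1-hadoop-oss.pig
    ```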

  8. Select the policy for failed operations.

  9. Click OK to complete the Pig job definition.
