A scripting language for manipulating large datasets using Hadoop.
A domain-specific dataflow language.
No control-flow constructs (if/then/else).
Uses an existing Hadoop installation and requires minimal configuration to set up.
Supports both interactive and batch modes of execution.
Pig script usage with examples:
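Pig Latin Statements
A Pig Latin statement is an operator that takes a relation as input and produces another relation as output. Statements are generally organized in the following manner:
● A LOAD statement reads data from the file system.
● A series of transformation statements (for example FOREACH) process the data.
● A DUMP statement displays the result on screen, or a STORE statement writes it to the file system.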
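The load, transform, store shape of a typical script can be sketched as follows. This is a minimal sketch and not part of the original tutorial: the directory `pig_demo`, the files `visits.txt` and `frequent.pig`, the output name `frequent_out`, the field names, and the threshold are all hypothetical. The shell wrapper only writes the files; Pig runs only if a `pig` executable is found on the PATH.

```shell
# Keep the illustrative files together in one directory.
mkdir -p pig_demo

# Hypothetical tab-separated input: user name and visit count.
printf 'alice\t12\nbob\t3\ncarol\t7\n' > pig_demo/visits.txt

# A script with the usual three parts: LOAD, a transformation, STORE.
cat > pig_demo/frequent.pig <<'EOF'
-- 1. LOAD reads data from the file system (PigStorage splits on tabs by default).
visits = LOAD 'visits.txt' USING PigStorage() AS (name:chararray, hits:int);
-- 2. A transformation derives a new relation from an existing one.
frequent = FILTER visits BY hits > 5;
-- 3. STORE writes the resulting relation back to the file system.
STORE frequent INTO 'frequent_out' USING PigStorage();
EOF

# Run in local mode only when Pig is installed; otherwise just report.
if command -v pig >/dev/null 2>&1; then
    rm -rf pig_demo/frequent_out          # STORE refuses to overwrite existing output
    (cd pig_demo && pig -x local frequent.pig)
else
    echo "pig not found; wrote pig_demo/frequent.pig without running it"
fi
```

Replacing the STORE line with `DUMP frequent;` would print the filtered relation to the screen instead of writing it to `frequent_out`.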
1.1. Running Pig Latin
Using the Grunt shell or the command line
In mapreduce mode or local mode:
Pig scripts can be run in two modes.
Local mode: To run scripts in local mode, no Hadoop or HDFS installation is required. All files are installed and run from your local host and file system.
Mapreduce mode: To run scripts in mapreduce mode, you need access to a Hadoop cluster and an HDFS installation.
Either interactively or in batch
E.g.:
● Grunt shell - interactive, mapreduce mode (mapreduce mode is the default, so you do not need to specify it)
    $ pig
● Grunt shell - batch, local mode (see the exec and run commands)
    $ pig -x local
    grunt> exec myscript.pig;
  or
    grunt> run myscript.pig;
● Command line - batch, mapreduce mode
    $ pig myscript.pig
● Command line - batch, local mode
    $ pig -x local myscript.pig
1.2. Processing Pig Latin Statements
Pig validates the syntax and semantics of all statements.
If Pig encounters a DUMP or STORE, Pig will execute the statements.
E.g.: Pig will validate, but not execute, the LOAD and FOREACH statements.
    A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
    B = FOREACH A GENERATE name;
E.g.: Pig will validate and then execute the LOAD, FOREACH, and DUMP statements.
    A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
    B = FOREACH A GENERATE name;
    DUMP B;
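To try the two cases above outside the Grunt shell, the statements can be written to script files and run in batch, local mode. This is only a sketch: the directory `pig_lazy`, the script names, and the three sample rows are hypothetical, and the `pig` commands are skipped when Pig is not installed.

```shell
mkdir -p pig_lazy

# Hypothetical sample input matching the schema above: name, age, gpa (tab-separated).
printf 'John\t18\t4.0\nMary\t19\t3.8\nBill\t20\t3.9\n' > pig_lazy/student

# No DUMP or STORE here: Pig should only parse and validate these statements.
cat > pig_lazy/validate_only.pig <<'EOF'
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
EOF

# Same statements plus DUMP, which triggers execution of the whole chain.
cp pig_lazy/validate_only.pig pig_lazy/with_dump.pig
echo 'DUMP B;' >> pig_lazy/with_dump.pig

if command -v pig >/dev/null 2>&1; then
    # First run: nothing asks for output, so no tuples should be printed.
    (cd pig_lazy && pig -x local validate_only.pig)
    # Second run: DUMP asks for the contents of B, so the names should be printed.
    (cd pig_lazy && pig -x local with_dump.pig)
else
    echo "pig not found; scripts written to pig_lazy/ without running them"
fi
```

Keeping the two scripts identical except for the final DUMP line makes it easy to see that the output statement, not the LOAD or FOREACH, is what starts the work.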
1.3. Storing Intermediate Data
Pig stores the intermediate data generated between MapReduce jobs in a temporary location on HDFS. This location must already exist on HDFS prior to use. This location can be configured using the pig.temp.dir property. The property's default value is "/tmp".
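One way to point pig.temp.dir at a different location is sketched below. The HDFS path `/user/analyst/pig_tmp` and the file name `pig_tmp.properties` are hypothetical, and the sketch assumes the `hdfs` and `pig` commands are available; when they are not, it only writes the properties file. It passes the file to Pig with the `-P` (property file) option.

```shell
# Hypothetical HDFS location for Pig's intermediate data.
PIG_TMP_DIR=/user/analyst/pig_tmp

# Record the setting in a properties file that can be passed to Pig with -P.
printf 'pig.temp.dir=%s\n' "$PIG_TMP_DIR" > pig_tmp.properties

# The directory has to exist on HDFS before Pig uses it, so create it first.
if command -v hdfs >/dev/null 2>&1 && command -v pig >/dev/null 2>&1; then
    hdfs dfs -mkdir -p "$PIG_TMP_DIR"
    pig -P pig_tmp.properties myscript.pig
else
    echo "hdfs or pig not found; wrote pig_tmp.properties only"
fi
```

The same property could instead be placed in Pig's pig.properties file so that it applies to every run.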