This Data Analysis with Apache Pig training course covers how to use Pig as part of an ETL process in a Hadoop cluster. The course begins with manipulating semi-structured raw data files in Pig, using the Grunt shell and the Pig Latin programming language. Once the raw data has been shaped into structured tables, the tables are exported from Pig and imported into Hive.
By attending the Data Analysis with Apache Pig workshop, delegates will learn to:
- Define Apache Pig
- Describe how Apache Pig fits in the data pipeline
- Understand data types in Apache Pig
- Load data into Pig relations
- Examine data and debug scripts
- Use FOREACH ... GENERATE on data
- Store data for use with other applications
- Subset data with DISTINCT, FILTER, and SAMPLE
- Combine data with JOIN, UNION, and GROUP
- Manipulate data with ORDER, FLATTEN, and UDFs
Delegates attending this course should have:
- Basic Hadoop knowledge
- Basic to intermediate Linux skills, including familiarity with command-line commands such as ls, cd, cp, and su
- Familiarity with a high-level programming or query language such as Python or SQL
The Data Analysis with Apache Pig class is ideal for:
- Data Analysts, Data Scientists and Developers