Apache Spark for Data Scientists Training Course and Workshop in Bangalore, Mysore, Chennai, Hyderabad, Pune, Mumbai, Delhi, Noida, Gurgaon, Kolkata

Apache Spark is a powerful, open-source processing engine for data in a Hadoop cluster, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex iterative algorithms, enabling some applications, particularly in-memory workloads, to run up to 100x faster than traditional Hadoop MapReduce programs. With Spark, you can write sophisticated applications that support faster decisions and real-time actions across a wide variety of use cases, architectures, and industries.

This Apache Spark for Data Scientists training course explores using Spark for common data-related activities from a data science perspective. You will learn to build unified big data applications combining batch, streaming, and interactive analytics on your data.

By attending the Apache Spark for Data Scientists workshop, delegates will learn:

The essentials of Spark architecture and applications
To execute Spark Programs
To create and manipulate both RDDs (Resilient Distributed Datasets) and DataFrames
To integrate machine learning into Spark applications
To use Spark Streaming

Prerequisites

Knowledge of Java Programming
Knowledge of SQL (familiarity with SQL basics)
Basic knowledge of Statistics and Probability
Data Science background

Introduction

Data Science: The State of the Art
Hadoop, YARN, and Spark
Architectural Overview
Spark and Storm
MLlib and Mahout
Distributed vs. Local Run Modes
Hello, Spark

Spark Overview

Spark Core
Spark SQL
Spark and Hive
MLlib
Mahout
Spark Streaming
Spark API

DataFrames

DataFrames and Resilient Distributed Datasets (RDDs)
Partitions
DataFrame Types
DataFrame Operations
Map/Reduce with DataFrames

Spark SQL

Spark SQL Overview
Data stores: HDFS, Cassandra, HBase, Hive, and S3
Table Definitions
ETL in Spark
Queries

Spark MLlib

MLlib overview
MLlib Algorithms Overview

Spark Streaming

Streaming overview
Real-time data ingestion
State
Window Operations

Spark GraphX

GraphX overview
ETL with GraphX
Graph computation

Performance and Tuning

Broadcast variables
Accumulators
Memory Management

Cluster Mode

Standalone Cluster
Masters and Workers
Configurations
Working with large data sets

Encarta Labs Advantage

One-stop corporate training solution provider offering over 6,000 courses on a variety of subjects
All courses are delivered by Industry Veterans
Go from newbie to production-ready in a matter of a few days

Trained more than 50,000 Corporate executives across the Globe
All our trainings are conducted in workshop mode with more focus on hands-on sessions
