Call : (+91) 968636 4243

Apache Spark for Data Scientists

( Duration: 4 Days )

Apache Spark is a powerful, open-source processing engine for data in a Hadoop cluster, optimized for speed, ease of use, and sophisticated analytics. The Spark framework supports streaming data processing and complex iterative algorithms, enabling applications to run up to 100x faster than traditional Hadoop MapReduce programs. With Spark, you can write sophisticated applications that support faster decisions and real-time actions across a wide variety of use cases, architectures, and industries.

This Apache Spark for Data Scientists training course explores using Spark for common data-related activities from a data science perspective. You will learn to build unified big data applications combining batch, streaming, and interactive analytics on your data.

By attending the Apache Spark for Data Scientists workshop, delegates will learn:

  • The essentials of Spark architecture and applications
  • To execute Spark programs
  • To create and manipulate both RDDs (Resilient Distributed Datasets) and DataFrames
  • To integrate machine learning into Spark applications
  • To use Spark Streaming

Prerequisites

  • Knowledge of Java Programming
  • Knowledge of SQL (familiarity with SQL basics)
  • Basic knowledge of Statistics and Probability
  • Data Science background




  • Data Science: The State of the Art
  • Hadoop, YARN, and Spark
  • Architectural Overview
  • Spark and Storm
  • MLlib and Mahout
  • Distributed vs. Local Run Modes
  • Hello, Spark

Spark Overview

  • Spark Core
  • Spark SQL
  • Spark and Hive
  • MLlib
  • Mahout
  • Spark Streaming
  • Spark API


  • DataFrames and Resilient Distributed Datasets (RDDs)
  • Partitions
  • DataFrame Types
  • DataFrame Operations
  • Map/Reduce with DataFrames

Spark SQL

  • Spark SQL Overview
  • Data stores: HDFS, Cassandra, HBase, Hive, and S3
  • Table Definitions
  • ETL in Spark
  • Queries

Spark MLlib

  • MLlib Overview
  • MLlib Algorithms Overview

Spark Streaming

  • Streaming overview
  • Real-time data ingestion
  • State
  • Window Operations

Spark GraphX

  • GraphX overview
  • ETL with GraphX
  • Graph computation

Performance and Tuning

  • Broadcast variables
  • Accumulators
  • Memory Management

Cluster Mode

  • Standalone Cluster
  • Masters and Workers
  • Configurations
  • Working with large data sets

Encarta Labs Advantage

  • One-stop corporate training solution provider offering over 6,000 courses on a variety of subjects
  • All courses are delivered by Industry Veterans
  • Get jumpstarted from newbie to production ready in a matter of a few days
  • Trained more than 50,000 Corporate executives across the Globe
  • All our training sessions are conducted in workshop mode, with a strong focus on hands-on sessions

View our other course offerings by visiting

Contact us to have this course delivered as a public/open-house workshop or online training for a group of 10+ candidates.