Data Wrangling and Modelling with PySpark

Process big data at scale with Apache Spark and Python.


Data can be a valuable asset, especially when there’s a lot of it. Exploratory data analysis, business intelligence, and machine learning can benefit tremendously if such Big Data can be wrangled and modelled at scale. Apache Spark is an open-source distributed engine for querying and processing data. In this three-day hands-on workshop, you will learn how to leverage Spark and Python to process Big Data.

You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python and Jupyter Notebook environment for Spark. You’ll learn about different techniques for collecting and processing data. We’ll begin with Resilient Distributed Datasets (RDDs) and work our way up to DataFrames.

We provide examples of how to read data from files and how to specify schemas, either by reflection or programmatically. The concept of lazy execution is discussed in detail, and we demonstrate various transformations and actions specific to RDDs and DataFrames. We show you how DataFrames can be manipulated using SQL queries.

We’ll show you how to apply supervised machine learning models such as linear regression, logistic regression, decision trees, and random forests. You’ll also see unsupervised machine learning models such as K-means and hierarchical clustering.

By the end of this workshop, you will have a solid understanding of how to process data using PySpark and you will understand how to use Spark’s machine learning library to build and train various machine learning models.


Day 1

  • Introduction to Apache Spark
    • Setting up Spark
    • Spark fundamentals
    • Spark 2.0 Architecture
  • Resilient Distributed Datasets (RDDs)
    • Getting data into Spark
    • Actions
    • Transformations

Day 2

  • DataFrames
    • Speeding up PySpark with DataFrames
    • Creating DataFrames
    • Interoperating with RDDs
    • Querying with the DataFrame API
  • Querying DataFrames with SQL

Day 3

  • ML and MLlib packages
    • API Overview
    • Pipelines
    • Transformers
    • Estimators
  • Applying Machine Learning
    • Validation
    • Classification
    • Regression
    • Recommender system
  • Where to go from here

What You Will Learn

  • Understand Apache Spark, the Spark 2.0 architecture, and its components
  • Work with RDDs and lazy evaluation
  • Build and interact with Spark DataFrames using Spark SQL
  • Use Spark SQL and DataFrames to process data using traditional SQL queries
  • Apply a spectrum of supervised and unsupervised machine learning algorithms
  • Handle issues related to feature engineering, class imbalance, bias and variance, and cross-validation to build a well-fitting model


Participants are expected to be familiar with the following Python syntax and concepts:

  • assignment, arithmetic, boolean expression, tuple unpacking
  • bool, int, float, list, tuple, dict, str, type casting
  • in operator, indexing, slicing
  • if, elif, else, for, while
  • range(), len(), zip()
  • print(), str.format()
  • try, except, raise
  • def, keyword arguments, default values
  • import, import as, from import
  • lambda function, list comprehension
  • Jupyter Notebook


We have delivered this course (or a derivative of it) for the following clients.

KPN ICT Consulting

Photos and Testimonials

Laurens Koppenol
Lead Data Scientist, ProRail

Our DataLab team enjoyed a three-day PySpark course from Jeroen. Jeroen’s approach is personal and professional. I recommend Data Science Workshops to anyone in the field of data science.
