Data can be a valuable asset, especially when there’s a lot of it. Exploratory data analysis, business intelligence, and machine learning can benefit tremendously if such Big Data can be wrangled and modelled at scale. Apache Spark is an open-source distributed engine for querying and processing data. In this three-day hands-on workshop, you will learn how to leverage Spark and Python to process Big Data.
You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python and Jupyter Notebook environment for Spark. You’ll learn about different techniques for collecting and processing data. We’ll begin with Resilient Distributed Datasets (RDDs) and work our way up to DataFrames.
We provide examples of how to read data from files and how to specify schemas, either through reflection or programmatically. The concept of lazy execution is discussed in detail, and we demonstrate various transformations and actions specific to RDDs and DataFrames. We show you how DataFrames can be manipulated using SQL queries.
We’ll show you how to apply supervised machine learning models such as linear regression, logistic regression, decision trees, and random forest. You’ll also see unsupervised machine learning models such as K-means and hierarchical clustering.
By the end of this workshop, you will have a solid understanding of how to process data using PySpark and you will understand how to use Spark’s machine learning library to build and train various machine learning models.
- Introduction to Apache Spark
- Setting up Spark
- Spark fundamentals
- Spark 2.0 Architecture
- Resilient Distributed Datasets (RDDs)
- Getting data into Spark
- Speeding up PySpark with DataFrames
- Creating DataFrames
- Interoperating with RDDs
- Querying with the DataFrame API
- Querying DataFrames with SQL
- ML and MLlib packages
- API Overview
- Applying Machine Learning
- Recommender system
- Where to go from here
What You Will Learn
- Understand the Apache Spark 2.0 architecture and its components
- Work with RDDs and lazy evaluation
- Build and interact with Spark DataFrames using Spark SQL
- Use Spark SQL and DataFrames to process data using traditional SQL queries
- Apply a spectrum of supervised and unsupervised machine learning algorithms
- Handle feature engineering, class imbalance, bias and variance, and cross-validation to build a well-fitted model
Participants are expected to be familiar with the following Python syntax and concepts:
- assignment, arithmetic, boolean expression, tuple unpacking
- bool, int, float, list, tuple, dict, str, type casting
- in operator, indexing, slicing
- if, elif, else, for, while
- range(), len(), zip()
- print(), str.format()
- try, except, raise
- def, keyword arguments, default values
- import, import as, from import
- lambda function, list comprehension
- Jupyter Notebook
We have delivered this course (or a derivative) for the following clients.