This 4-hour hands-on course introduces Apache Spark v2.2, the open-source cluster computing framework whose in-memory processing makes analytics applications up to 100 times faster than technologies in wide deployment today. Versatile across many environments and built on a strong foundation in functional programming, Spark is known for its ease of use in creating exploratory code that scales up to production-grade quality relatively quickly (REPL-driven development).
The main focus will be on newly introduced features in Spark v2.x and on Spark’s integration with Kafka and Cassandra for streaming pipelines.
The plan is to follow the agenda below, but if participants want to dive deeper into high-complexity topics, I will focus on ad hoc live-coding demos instead.
1. The first part of the workshop covers Spark SQL with Scala, specifically the limited toy examples emphasized by Spark documentation and tutorials. Used in isolation, Spark SQL is realistically only suited to such didactic use cases. As a practitioner, I know from experience that Spark SQL very quickly shows its limitations when ingesting real-world datasets, so more powerful techniques are needed.
2. The second part of the workshop covers the techniques mentioned above, without which Spark SQL is largely ineffective. This section of the workshop is about sharing lessons learned the hard way, and experience gathered in the trenches of the real world.
3. The third part of the workshop, titled “Machine Learning By Example”, covers multiclass classification using SparkML’s Pipeline API with Scala. SparkML is the machine learning module that ships with Spark.
4. During the remaining time, we’ll focus on a Scala / Spark Streaming application that ingests data from Apache Kafka (an open-source, high-performance, distributed message queue), performs streaming analytics, then saves the analytics results back into Kafka as well as into a Cassandra datastore. This section will begin with an explanation of how to model Cassandra schemas for analytics.
All examples will be in Scala.
Please bring your laptop with you.
The workshop is free of charge and seating is first come, first served.
The workshop has some requirements. Please consider the following:
1. Bring your own laptop.
2. Have Docker already installed before the workshop.
3. Have the Docker image already pulled and available locally.
Here are the instructions for requirements 2 and 3 (prefix these commands with sudo if required):
2. Install Docker
Ubuntu: apt-get -y install docker.io
CentOS: yum -y install docker
Linux / Other: curl -fsSL https://get.docker.com/ | sh
Mac and Windows: https://www.docker.com/products/docker-toolbox
3. Pull the Docker image
docker pull dserban/dockersparknotebook
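To confirm before the workshop that requirements 2 and 3 are met, something like the following sketch may help. The check_prereqs function is a hypothetical helper, not part of the workshop materials, and it assumes your user can run docker without sudo (otherwise an image that is present may be reported as missing).

```shell
#!/bin/sh
# Hypothetical helper (not part of the workshop materials): checks that
# Docker is installed and that the workshop image is available locally.
IMAGE="dserban/dockersparknotebook"

check_prereqs() {
    # Requirement 2: the docker command must be on the PATH.
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker: not installed"
        return 1
    fi
    # Requirement 3: "docker images -q" prints an image ID only if the
    # image has already been pulled.
    if [ -z "$(docker images -q "$IMAGE" 2>/dev/null)" ]; then
        echo "image $IMAGE: not pulled"
        return 2
    fi
    echo "ready"
}

check_prereqs || echo "Please fix the item above before the workshop."
```

If the script reports that the image is not pulled even though you pulled it with sudo, run the script itself with sudo.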