Programming with RDDs


Introduction

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

Credit: Apache Spark documentation
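As a rough illustration of that description, the minimal Scala sketch below builds one RDD from a local collection and another from a file, persists an RDD in memory, and runs parallel operations on both. The application name, master setting, and file path are placeholders chosen for this example.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddIntro {
  def main(args: Array[String]): Unit = {
    // Local master and application name are placeholders for illustration.
    val conf = new SparkConf().setAppName("rdd-intro").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An RDD created from an existing Scala collection in the driver program.
    val numbers = sc.parallelize(1 to 1000)

    // An RDD created from a file; this path is hypothetical.
    val lines = sc.textFile("hdfs:///data/sample.txt")

    // Persist an RDD in memory so it can be reused across parallel operations.
    val lengths = lines.map(_.length).persist(StorageLevel.MEMORY_ONLY)

    // Parallel operations executed across the cluster's partitions.
    println(numbers.reduce(_ + _))
    println(lengths.reduce(_ + _))

    sc.stop()
  }
}

If one of the nodes holding partitions of lengths fails, Spark recomputes the lost partitions from the recorded lineage rather than requiring any manual recovery.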

We will examine the following; a short code sketch after the list illustrates the last three points.

  • What RDDs are

  • How to create RDDs

  • Transformations and Actions

  • Lazy Evaluation

  • Passing Functions to Spark
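The sketch below, assuming an existing SparkContext named sc (for example, the one provided by spark-shell), shows how transformations stay lazy, how actions trigger evaluation, and two ways of passing functions to Spark.

// Transformations such as filter and map are lazy: they only record a lineage
// of operations and return a new RDD without touching the data.
val words  = sc.parallelize(Seq("spark", "rdd", "lazy", "evaluation"))
val longer = words.filter(_.length > 4)      // passing an anonymous function
val upper  = longer.map(_.toUpperCase)       // still nothing has executed

// A named function can also be passed to Spark.
def addExclamation(s: String): String = s + "!"
val shouted = upper.map(addExclamation)

// Actions such as count and collect force evaluation of the whole lineage.
println(shouted.count())
println(shouted.collect().mkString(", "))

Because evaluation is deferred until an action runs, Spark can plan the whole chain of transformations at once instead of materializing each intermediate result.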

Basic RDDs - Slides and Notebooks

Paired RDDs - Slides and Notebooks
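As a taste of what the paired RDD material covers, here is a minimal sketch of key-value operations, again assuming an existing SparkContext named sc; the data is invented for illustration.

// An RDD of key-value pairs is a "pair RDD", which unlocks by-key operations.
val sales = sc.parallelize(Seq(("apples", 3), ("pears", 2), ("apples", 5)))

// reduceByKey combines values per key on each partition before shuffling.
val totals = sales.reduceByKey(_ + _)        // ("apples", 8), ("pears", 2)

// join combines two pair RDDs on their keys.
val prices = sc.parallelize(Seq(("apples", 1.2), ("pears", 0.9)))
val joined = totals.join(prices)             // RDD[(String, (Int, Double))]

joined.collect().foreach(println)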
