Apache Spark Ecosystem

Introduction to Apache Spark

Apache Spark™ is a unified analytics engine for large-scale data processing, originally developed at UC Berkeley in 2009. It has seen rapid adoption across a wide range of industries, particularly those that process data at massive scale. Spark has been used on clusters of more than 8,000 nodes to process multiple petabytes of data. It is an open source project with over 1,000 contributors from more than 250 organizations.

Slides

Slides (PDF)

Slides (PPTX)

Apache Spark is known for:

  • Speed

    Run workloads up to 100x faster than Hadoop MapReduce. Apache Spark achieves high performance for both batch and streaming data using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

  • Ease of Use

    Write applications quickly in Java, Scala, Python, R, and SQL (a minimal word-count sketch follows this list).

  • Generality

    Combine SQL, streaming, and complex analytics. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. You can combine these libraries seamlessly in the same application, as the sketch after this list illustrates.

  • Runs Everywhere

    Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and it can access diverse data sources.

    You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes, and access data in HDFS, Alluxio, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources. A deployment sketch follows this list.
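To make the "Ease of Use" and "Generality" points concrete, here is a minimal word-count sketch in Scala. The application name, the local master setting, and the input path `data/input.txt` are illustrative assumptions, not part of the original text; it simply shows how DataFrame/Dataset transformations and SQL can be mixed in one application.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point for the DataFrame and SQL APIs.
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")   // run on all local cores; on a cluster the master is set by spark-submit
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input path: any text file reachable by the application (local, HDFS, S3, ...).
    val lines = spark.read.textFile("data/input.txt")

    // Dataset transformations: split lines into words and count them.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy("value")
      .count()

    // The same data can be queried with SQL in the same application.
    counts.createOrReplaceTempView("word_counts")
    spark.sql("SELECT value AS word, count FROM word_counts ORDER BY count DESC LIMIT 10").show()

    spark.stop()
  }
}
```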
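For the "Runs Everywhere" point, the sketch below illustrates, under assumed paths and table names, that the same DataFrame code can read from different storage systems, while the choice of cluster manager is usually just the master URL passed to spark-submit. The HDFS path `hdfs:///data/events` and the Hive table `warehouse.users` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// The master URL selects the cluster manager; it is typically supplied by spark-submit
// rather than hard-coded, e.g. "spark://host:7077" (standalone), "yarn", "mesos://host:5050",
// or "k8s://https://host:6443" (Kubernetes).
val spark = SparkSession.builder()
  .appName("StorageAccess")
  .enableHiveSupport()   // lets Spark read Hive tables through the metastore
  .getOrCreate()

// Files in HDFS and Hive tables are read through the same DataFrame API;
// other sources such as Cassandra or HBase plug in via connector packages.
val events = spark.read.parquet("hdfs:///data/events")   // hypothetical HDFS path
val users  = spark.table("warehouse.users")               // hypothetical Hive table

events.join(users, "user_id").groupBy("country").count().show()
```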

Credit: Apache Spark
