Apache Spark is designed as an alternative to MapReduce that addresses some of MapReduce's performance problems. It is meant as a general-purpose framework for highly parallel data analytics.
MapReduce is a computation model popularized by Google for parallel operations on chunked data. It consists primarily of two kinds of tasks:

- Map tasks, which apply a function to each chunk of input independently and emit intermediate key-value pairs.
- Reduce tasks, which collect all intermediate values sharing a key and combine them into a final result.
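To make the two phases concrete, here is a rough sketch in plain Scala rather than the Hadoop API; the word-count job and the sample chunks are invented for illustration.

```scala
// Sketch of the two MapReduce phases for a hypothetical word-count job.
// Each "map task" turns one input chunk into (key, value) pairs;
// each "reduce task" folds all the values collected for one key.
def mapTask(chunk: String): Seq[(String, Int)] =
  chunk.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

def reduceTask(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)

// Driver logic: shuffle the intermediate pairs by key between the two phases.
val chunks   = Seq("spark builds on mapreduce", "mapreduce writes to disk")
val shuffled = chunks.flatMap(mapTask).groupBy(_._1)
val counts   = shuffled.map { case (word, pairs) => reduceTask(word, pairs.map(_._2)) }
// counts now maps "mapreduce" -> 2, "spark" -> 1, and so on.
```

In a real cluster the map and reduce tasks run on different nodes, with the intermediate pairs shuffled across the network between the two phases.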
MapReduce nodes share and store data using a distributed filesystem such as the Hadoop Distributed File System (HDFS). Every task stores its output on this filesystem. Because every task therefore begins and ends with disk I/O, running many tasks in sequence, as iterative algorithms must, incurs a large performance penalty.
Spark was designed to let iterative algorithms keep their intermediate results in memory rather than writing them back to disk between steps. Spark's computation is built around Resilient Distributed Datasets (RDDs). An RDD is a chunked collection of data whose pieces may reside on disk, in memory, or may not have been computed at all yet. By default, applying an operation like map to a dataset computes nothing immediately; it simply adds that map to a chain of pending computations, which is executed only when a result is actually needed.
RDDs are built simply by applying standard higher-order functions to parallel data structures!
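As a minimal sketch using the Spark Scala RDD API (the local master setting and the example data are assumptions for illustration): the map and filter calls below only record pending work, cache() asks Spark to keep the result in memory once it has been computed, and nothing actually runs until an action such as count or reduce is invoked.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
val sc   = new SparkContext(conf)

// Build an RDD with ordinary higher-order functions. No work happens yet:
// each call just extends the chain of pending computations.
val numbers = sc.parallelize(1 to 1000000)
val squares = numbers.map(n => n.toLong * n)
val evens   = squares.filter(_ % 2 == 0)

// Ask Spark to keep this dataset in memory once it is computed,
// so that repeated passes over it do not recompute the whole chain.
evens.cache()

// Actions finally trigger the computation; the second pass reads the cached data.
val howMany = evens.count()
val total   = evens.reduce(_ + _)

sc.stop()
```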
A standard example is PageRank: on each iteration, every page p sends a contribution of rank_p / |neighbors_p| to each of its neighbors, then rebuilds its own rank from the contributions it receives.
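A hedged sketch of that contribution step over plain RDDs, roughly what the "simple" implementation compared below does; the links and ranks names and the usual 0.15/0.85 damping constants are assumptions for illustration, not taken from the original.

```scala
import org.apache.spark.rdd.RDD

// One PageRank iteration over plain RDDs. Assumes:
//   links: RDD[(String, Seq[String])]  -- page -> its outgoing neighbors
//   ranks: RDD[(String, Double)]       -- page -> its current rank
def pageRankStep(links: RDD[(String, Seq[String])],
                 ranks: RDD[(String, Double)]): RDD[(String, Double)] = {
  // Each page p sends rank_p / |neighbors_p| to every one of its neighbors...
  val contributions = links.join(ranks).flatMap {
    case (_, (neighbors, rank)) =>
      neighbors.map(dest => (dest, rank / neighbors.size))
  }
  // ...and each page's new rank is rebuilt from the contributions it received.
  contributions
    .reduceByKey(_ + _)
    .mapValues(sum => 0.15 + 0.85 * sum)
}
```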
Comparing Bagel, Spark's implementation of Google's Pregel model, against a simple plain-RDD implementation of PageRank:

             Bagel          Simple
Memory:      8.8 GB/node    700 MB/node
Time:        168 sec        82 sec