Tuesday, April 28, 2015

Apache Ignite vs Apache Spark

Complementary to my earlier post on Apache Ignite's in-memory file-system and caching capabilities, I would like to cover the main points of differentiation between Ignite and Spark. I see questions like this coming up repeatedly. It is easier to have them answered in one place, so you don't need to fish around the Net for the answers.

 - The main difference is, of course, that Ignite is an in-memory computing system, i.e. one that treats RAM as the primary storage facility, whereas others - Spark included - only use RAM for processing. The former, memory-first approach is faster because the system can do better indexing, reduce the fetch time, avoid (de)serializations, etc.

 - Ignite's MapReduce is fully compatible with the Hadoop MR APIs, which lets everyone simply reuse existing legacy MR code, yet run it with a >30x performance improvement. Check this short video demoing an Apache Bigtop in-memory stack speeding up legacy MapReduce code.

 - Also, unlike Spark's, the streaming in Ignite isn't quantized by the size of an RDD. In other words, you don't need to form an RDD first before processing it; you can actually do real streaming, which means there are no delays in processing the stream content in the case of Ignite.

 - Spill-overs are a common issue for in-memory computing systems: after all, memory is limited. In Spark, where RDDs are immutable, if an RDD is created with a size greater than half of a node's RAM, then a transformation that generates the subsequent RDD' is likely to fill all of the node's memory and cause a spill-over, unless the new RDD is created on a different node. Tachyon was essentially an attempt to address this using old RAM-drive technology, with all its limitations.
Ignite doesn't have this issue with data spill-overs, as its caches can be updated in place in an atomic or transactional manner. However, spill-overs are still possible: the strategies to deal with them are explained here.

 - As one of its components, Ignite provides a first-class file-system caching layer. Note, I have already addressed the differences between that and Ignite, but for some reason my post got deleted from their user list. I wonder why? ;)

 - Ignite uses off-heap memory to avoid GC pauses, etc., and does so highly efficiently

 - Ignite guarantees strong consistency

 - Ignite supports full SQL99 as one of the ways to process the data, with full support for ACID transactions

 - Ignite supports in-memory SQL indexes, which make it possible to avoid full scans of data sets, directly leading to very significant performance improvements (also see the first paragraph)

 - With Ignite, a Java programmer doesn't need to learn the ropes of Scala. The programming model also encourages the use of Groovy. And I will withhold my professional opinion about the latter in order to keep this post focused and civilized ;)
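
To make the point about in-memory indexes and full scans a bit more concrete, here is a minimal sketch in plain Java. It is only an illustration under my own assumptions: it uses JDK collections rather than Ignite's SQL engine or API, and the `IndexSketch` class, the `Person` row type and the method names are all made up for this example. The idea is simply that a sorted index lets a range query visit only the matching keys instead of every row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustration only: plain JDK collections, NOT Ignite's SQL engine or API.
final class IndexSketch {

    /** Hypothetical row type, invented for this sketch. */
    static final class Person {
        final long id;
        final int age;

        Person(long id, int age) {
            this.id = id;
            this.age = age;
        }
    }

    /** Full scan: every row is inspected for every query. */
    static List<Person> fullScan(List<Person> rows, int minAge, int maxAge) {
        List<Person> out = new ArrayList<>();
        for (Person p : rows) {
            if (p.age >= minAge && p.age <= maxAge) {
                out.add(p);
            }
        }
        return out;
    }

    /** Builds a sorted in-memory index: age -> rows with that age. */
    static NavigableMap<Integer, List<Person>> buildAgeIndex(List<Person> rows) {
        NavigableMap<Integer, List<Person>> index = new TreeMap<>();
        for (Person p : rows) {
            index.computeIfAbsent(p.age, k -> new ArrayList<>()).add(p);
        }
        return index;
    }

    /** Indexed range query: only keys in [minAge, maxAge] are visited. Requires minAge <= maxAge. */
    static List<Person> indexedRange(NavigableMap<Integer, List<Person>> index, int minAge, int maxAge) {
        List<Person> out = new ArrayList<>();
        for (List<Person> bucket : index.subMap(minAge, true, maxAge, true).values()) {
            out.addAll(bucket);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Person> rows = Arrays.asList(
            new Person(1, 25), new Person(2, 31), new Person(3, 38), new Person(4, 52));
        NavigableMap<Integer, List<Person>> ageIndex = buildAgeIndex(rows);

        // Both queries return the same rows; the indexed one skips non-matching keys.
        System.out.println(fullScan(rows, 30, 40).size());
        System.out.println(indexedRange(ageIndex, 30, 40).size());
    }
}
```

The index costs extra memory and has to be maintained on every update, which is the usual trade-off; with the data already resident in RAM, that cost tends to be much easier to justify.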

I can keep on rambling for a long time, but you might consider reading this and that, where Nikita Ivanov - one of the founders of this project - has a good reflection on other key differences. Also, if you like what you read, consider joining the Apache Ignite (incubating) community and start contributing!


  1. Indeed the in-memory computing solution that Ignite offers seems unique through the combination of off-heap memory, guaranteed consistency and SQL99 access among other features.

    I am interested in implementing a solution for R's annoying issue of expecting all data to be loaded in memory first.
    See Ryan Rosario's, slide 2 for a glimpse.

    But lots of R packages use C++ for memory management.
    I see GridGain has portable objects, but I am wondering what the performance tradeoffs would be compared to a native C++ solution.

    Please provide any references to better learn about these aspects.

    1. I would recommend getting on the list and discussing the possibility of adding R bindings to Ignite. That'd be really great! Thanks!