For the past few months, Jacques-Alexandre has been working with a new customer who needs very deep analytics on fairly large datasets. By deep, I mean 4 to 5 levels of nested aggregations. By large, I mean 100 million records. This is something that ElasticSearch was not really designed to handle, and we’re starting to realize that we might need a backup plan.
In the meantime, I’ve been researching the space quite diligently, and I’ve convinced myself that the best way to handle such scenarios is an in-memory columnar database. Unfortunately, these aren’t particularly common, especially under an open-source license. Therefore, I’ve decided to build one by forking Datavore.
This in-memory columnar database would work by loading all the data required for an analytics session from the disk-based database (ElasticSearch) into memory. In a first version, the dataset would have to be small enough to fit within the 1.4GB heap limit of V8. Down the road, we could go beyond this limitation by distributing our dataset across a cluster of Node.js servers, then using simple MapReduce techniques to process our queries.
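To give a rough idea of what the MapReduce part could look like, here is a minimal sketch of a per-key average computed across shards. Everything in it is an assumption made for illustration: the `Row` shape, the `mapShard` and `mergePartials` names, and the idea that each Node.js server holds one shard. Nothing here comes from Datavore or ElasticSearch.

```typescript
// Hypothetical sketch: the shard layout and all names are assumptions.
type Row = { key: string; value: number };
type Acc = { sum: number; count: number };
type PartialAgg = Record<string, Acc>;

// "Map" step: each server aggregates only the rows it holds in memory.
function mapShard(rows: Row[]): PartialAgg {
  const partial: PartialAgg = Object.create(null);
  for (const row of rows) {
    if (!(row.key in partial)) partial[row.key] = { sum: 0, count: 0 };
    partial[row.key].sum += row.value;
    partial[row.key].count += 1;
  }
  return partial;
}

// "Reduce" step: merge the partial results, then derive the average per key.
function mergePartials(partials: PartialAgg[]): Record<string, number> {
  const merged: PartialAgg = Object.create(null);
  for (const partial of partials) {
    for (const key of Object.keys(partial)) {
      if (!(key in merged)) merged[key] = { sum: 0, count: 0 };
      merged[key].sum += partial[key].sum;
      merged[key].count += partial[key].count;
    }
  }
  const averages: Record<string, number> = Object.create(null);
  for (const key of Object.keys(merged)) {
    averages[key] = merged[key].sum / merged[key].count;
  }
  return averages;
}
```

The shards ship sums and counts rather than averages because averages of averages are wrong when shards have different sizes, whereas sums and counts can be merged in any order.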
If you play with this benchmark a little bit, you’ll find that Datavore is capable of processing an aggregation query on 5 million records in 100ms. And when you look at its code, you’ll see that such goodness is delivered with nothing more than 600 short lines of eminently readable code (550 if you remove the random number generator).
This level of performance boils down to a handful of things:
- Put the entire dataset in memory
- Use a columnar representation (much better for performing aggregations)
- Compress the data as much as possible (V8 does not like sparse arrays)
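The three points above can be sketched in a few lines. This is not Datavore's code or API, just a minimal illustration under my own assumptions: a string column is dictionary-encoded into a dense `Uint32Array`, the measure lives in a `Float64Array`, and a group-by sum is a single pass over both. The names `EncodedColumn`, `encodeColumn`, and `sumBy` are hypothetical.

```typescript
// Hypothetical sketch: names are illustrative and not Datavore's API.
type EncodedColumn = { dict: string[]; codes: Uint32Array };

// Dictionary-encode a string column: each distinct value gets a small
// integer code, and the column itself becomes a dense typed array.
function encodeColumn(values: string[]): EncodedColumn {
  const dict: string[] = [];
  const index: Record<string, number> = Object.create(null);
  const codes = new Uint32Array(values.length);
  values.forEach((v, i) => {
    let code: number | undefined = index[v];
    if (code === undefined) {
      code = dict.length;
      dict.push(v);
      index[v] = code;
    }
    codes[i] = code;
  });
  return { dict, codes };
}

// Group-by sum: one accumulator slot per dictionary entry, one pass
// over the rows, no per-row object allocation.
function sumBy(group: EncodedColumn, measure: Float64Array): Record<string, number> {
  const sums = new Float64Array(group.dict.length);
  for (let i = 0; i < group.codes.length; i++) {
    sums[group.codes[i]] += measure[i];
  }
  const out: Record<string, number> = Object.create(null);
  group.dict.forEach((key, code) => {
    out[key] = sums[code];
  });
  return out;
}
```

For example, `sumBy(encodeColumn(["EU", "US", "EU"]), new Float64Array([1, 2, 3]))` gives one total per region. The inner loop only touches two dense typed arrays, which is the part a columnar layout is meant to make cheap.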
What I find particularly promising about this approach is that we could create our own custom aggregate operators directly from our beloved datatype families. It would not be entirely trivial (cf. Extensibility), but it is definitely possible. We would therefore be able to take advantage of the fact that our data is very strongly typed, and that our datatype families can define their own custom aggregation models.
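As a thought experiment, a custom aggregate operator could be as small as three functions: one to create an empty state, one to fold a value into it, and one to produce the final result. The sketch below is purely hypothetical. The `AggregateOperator` interface, the `rangeOp` operator, and the `aggregate` helper are names I made up for illustration, and Datavore's actual extension mechanism may look quite different.

```typescript
// Hypothetical sketch: this interface is an assumption, not an existing API.
interface AggregateOperator<V, S, R> {
  init(): S;                   // create an empty accumulator
  add(state: S, value: V): S;  // fold one value into the accumulator
  result(state: S): R;         // turn the accumulator into the final answer
}

// Example operator a numeric datatype family might define:
// the spread between the largest and smallest value.
const rangeOp: AggregateOperator<number, { min: number; max: number }, number> = {
  init: () => ({ min: Infinity, max: -Infinity }),
  add: (s, v) => ({ min: Math.min(s.min, v), max: Math.max(s.max, v) }),
  result: (s) => s.max - s.min,
};

// Run any operator over a column in a single pass.
function aggregate<V, S, R>(column: ArrayLike<V>, op: AggregateOperator<V, S, R>): R {
  let state = op.init();
  for (let i = 0; i < column.length; i++) {
    state = op.add(state, column[i]);
  }
  return op.result(state);
}
```

With this shape, `aggregate(new Float64Array([3, 9, 4]), rangeOp)` returns the spread of the column, and a datatype family would only need to ship its own operator objects.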
Of course, I do not know what I do not know, and I am sure that I am missing many aspects of the problem that will make the development of a working solution a lot more complex than I am anticipating. That being said, ignorance is bliss, and we need to educate ourselves on the topic of real-time analytics. So we might as well develop a little prototype to gather a few datapoints.
Wish me luck!
Somehow, Tumblr has started to aggressively downsample my screenshots. I have no idea how to fix it. Please accept my apologies for their poor quality while I figure out a fix or eventually decide to migrate to another tool.
The last Excel function has been added to FormulaJS. Doing a build now…
Jim is almost done implementing the last missing functions in FormulaJS. Four to go…