My name is Ismael Chang Ghalimi. I build the STOIC platform. I am a stoic, and this blog is my agora.

Victory! After close to two years of work, we’ve re-implemented every single formula function offered by the mighty Microsoft Excel. Every single one. And for most of them, we’ve made sure to replicate Excel’s bugs, because some of them are codified in law (ACCRINT, for example). The last functions that we added are:

  • CHISQDISTRT
  • CHISQINVRT
  • CHISQTEST
  • CHOOSE
  • COLUMN
  • COLUMNS
  • FDISTRT
  • FINVRT
  • FTEST
  • GAMMADIST
  • GAMMAINV
  • GAMMALNPRECISE
  • GROWTH
  • ISERR
  • LOGEST
  • MATCH
  • NEGBINOMDIST
  • ROW
  • ROWS
  • TDIST2T
  • TDISTRT
  • TINV2T
  • TRANSPOSE
  • TREND
  • TTEST

Congratulations to Jim and Zhipeng for the very last (and painful) mile!
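For the curious, here is the flavor of the work involved: hedged sketches of two of the functions listed above, following Excel’s documented semantics. These are illustrative only, not our actual implementations.

```javascript
// CHOOSE(index, value1, value2, ...): returns the value at the 1-based
// index. Excel truncates a fractional index before validating it.
function CHOOSE(index, ...values) {
  const i = Math.trunc(index);
  if (i < 1 || i > values.length) return '#VALUE!';
  return values[i - 1];
}

// MATCH(lookup, array, 0): the exact-match variant; returns a 1-based
// position, or #N/A when the value is absent. The approximate-match
// variants (match_type 1 and -1) are omitted for brevity.
function MATCH(lookup, array, matchType = 0) {
  if (matchType === 0) {
    const i = array.indexOf(lookup);
    return i === -1 ? '#N/A' : i + 1;
  }
  return '#N/A';
}
```

Even for “easy” functions like these, matching Excel’s edge cases (truncation, error values, 1-based indexing) is where most of the effort goes.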

ColumnarJS

For the past few months, Jacques-Alexandre has been working with a new customer who needs very deep analytics on fairly large datasets. By deep, I mean 4 to 5 levels of nested aggregations. By large, I mean 100 million records. This is something that ElasticSearch was not really designed to handle, and we’re starting to realize that we might need a backup plan.

In the meantime, I’ve been researching the space quite diligently, and I managed to convince myself that the best way to handle such scenarios is to have an in-memory columnar database. Unfortunately, these aren’t particularly common, especially with an open source license. Therefore, I’ve decided to build one, by forking Datavore.

This in-memory columnar database would be used by loading all the data required for an analytics session from the disk-based database (ElasticSearch) into memory. In a first version, the dataset would have to be small enough to fit within the 1.4GB heap limit of V8. Down the road, we could go beyond this limitation by distributing the dataset across a cluster of Node.js servers, then using simple MapReduce techniques to process our queries.
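The MapReduce idea is simple enough to sketch in a few lines: each worker computes a partial aggregate over its shard of the dataset, and a coordinator merges the partials. The shards and the cluster plumbing are stand-ins here; a real version would use Node’s cluster module or some RPC layer.

```javascript
// "map": each shard produces a partial {sum, count} for its records
function partialAggregate(shard) {
  let sum = 0;
  for (const value of shard) sum += value;
  return { sum, count: shard.length };
}

// "reduce": merge the partials into a global average
function mergePartials(partials) {
  const sum = partials.reduce((s, p) => s + p.sum, 0);
  const count = partials.reduce((c, p) => c + p.count, 0);
  return sum / count;
}

// Simulate three Node.js workers, each holding one shard
const shards = [[1, 2, 3], [4, 5], [6]];
const partials = shards.map(partialAggregate);
console.log(mergePartials(partials)); // 3.5
```

The key property is that the partials are tiny compared to the shards, so only a few bytes per worker ever cross the wire.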

If you play with this benchmark a little bit, you’ll find that Datavore is capable of processing an aggregation query on 5 million records in 100ms. And when you look at its code, you’ll see that such goodness is delivered with nothing more than 600 short lines of eminently readable code (550 if you remove the random number generator).

This level of performance boils down to a handful of things:

  • Put the entire dataset in memory
  • Use a columnar representation (much better for performing aggregations)
  • Compress the data as much as possible (V8 does not like sparse arrays)
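A minimal sketch of why these three points matter: with a dense columnar layout, a grouped sum becomes a single pass over two flat typed arrays, with no per-record objects for V8 to chase. The column names and the integer encoding of categories are made up for the example.

```javascript
const N = 1_000_000;
const amount = new Float64Array(N); // one column, densely packed
const region = new Uint8Array(N);   // categories dictionary-encoded as ints

for (let i = 0; i < N; i++) {
  amount[i] = i % 100;
  region[i] = i % 4;
}

// GROUP BY region, SUM(amount) in one tight, cache-friendly loop
const sums = new Float64Array(4);
for (let i = 0; i < N; i++) {
  sums[region[i]] += amount[i];
}
```

Compare this with a row-oriented array of one million `{amount, region}` objects: same query, but every record is a heap allocation and a pointer dereference.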

What I find particularly promising with this approach is that we could create our own custom aggregate operators directly from our beloved datatype families. It would not be entirely trivial (Cf. Extensibility), but it is definitely possible. Therefore, we would be able to take advantage of the fact that our data is very strongly typed, and that our datatype families can define their own custom aggregation models.
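To make the idea concrete, here is a purely hypothetical sketch of what a datatype-driven aggregation registry could look like: each datatype family registers its own aggregate operators, and the query engine looks them up by type. None of these names come from our actual codebase.

```javascript
// Hypothetical registry: datatype family -> named aggregate operators
const aggregators = {
  Number:   { sum: xs => xs.reduce((a, b) => a + b, 0) },
  Duration: { total: xs => xs.reduce((a, b) => a + b, 0) }, // e.g. in ms
  Category: {
    // a custom aggregation model: record count per option
    histogram: xs => xs.reduce((h, x) => ((h[x] = (h[x] || 0) + 1), h), {})
  }
};

// The engine dispatches on the field's datatype rather than hard-coding
// a fixed set of SQL-style aggregates
function aggregate(datatype, op, values) {
  const family = aggregators[datatype];
  if (!family || !family[op]) throw new Error(`No ${op} for ${datatype}`);
  return family[op](values);
}
```

The appeal is that strongly typed fields get aggregates that make sense for them (a histogram for a Category, a total for a Duration) instead of being forced through a generic numeric sum.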

Of course, I do not know what I do not know, and I am sure that I am missing many aspects of the problem that will make the development of a working solution a lot more complex than I am anticipating. That being said, ignorance is bliss, and we need to educate ourselves on the topic of real-time analytics. So we might as well develop a little prototype to gather a few datapoints.

Wish me luck!

Bad screenshots

Somehow, Tumblr has started to aggressively downsample my screenshots, and I have no idea how to fix it. Please accept my apologies for their poor quality while I look for a fix or eventually migrate to another tool.

This one is huge: meet the new View Editor. It’s still very much a work in progress, but it’s already much better than the old one. You open it by simply clicking on the binoculars icon at the top right of the Object View. From there, you can configure filtering, sorting, and grouping rules, using any perspective (not just the Grid perspective). All this courtesy of François.

Now that our last major bug related to ElasticSearch has been fixed, we’ve started to work on the refactoring of our Advanced Relationships. We will split this datatype into two: Dynamic and Multiple. The former will deal with relationships that have multiple target objects but a single target record, while the latter will deal with relationships that have both multiple target objects and multiple target records.
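A speculative sketch of the two shapes, to make the distinction concrete. The field names are illustrative only, not our actual schema.

```javascript
// Dynamic: the target object is chosen among several allowed objects,
// but the relationship points at a single target record
const dynamicRel = {
  targetObject: 'Invoice',
  targetRecord: 'inv-00042'
};

// Multiple: many target records, possibly spread across many objects
const multipleRel = {
  targets: [
    { targetObject: 'Invoice',  targetRecord: 'inv-00042' },
    { targetObject: 'Contract', targetRecord: 'ctr-00007' }
  ]
};
```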

Both types of relationships will support the caching of the Name, Color, and Iconography of target records, so that joins are not required for displaying related lists. And they will both support the definition of custom relation attributes, which will allow us to support RDF triples. Hugues is away on a well-deserved two-week vacation, so this piece of work will start only when he is back. We should expect a first prototype by the second week of October.

We’ve recently added a couple of objects that can be used to externalize the definition of options for fields that are using the Category datatype. This will be used to let users customize or extend some fields without having to fork their definition in our meta-data, thereby making it easier to apply upgrade patches. It will also be used to support the definition of formula-driven categories, which will enable the definition of custom meta rules (buzzword alert!) for things like workflow actors and due dates. More on this next week…
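A hypothetical sketch of the externalization idea: options live in their own records instead of inside the field definition, so users can add options without forking the field’s metadata. Every name below is invented for illustration.

```javascript
// The field definition references an option set instead of inlining it
const fieldDefinition = {
  name: 'priority',
  datatype: 'Category',
  optionSet: 'priority-options'
};

// Option sets are ordinary records, so user additions survive upgrades
const optionSets = {
  'priority-options': [
    { value: 'low' },
    { value: 'medium' },
    { value: 'high' },
    { value: 'urgent', custom: true } // user-added, not part of the patch
  ]
};

function optionsFor(field) {
  return optionSets[field.optionSet].map(o => o.value);
}
```

When we ship an upgrade patch, only `fieldDefinition` changes; the externalized option set, including the user’s `urgent` entry, is left untouched.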