
Embedded Redis sort of working

Yves managed to compile an embedded version of Redis by changing just a single line in the original codebase. With it, we’re getting about 250,000 SET transactions per second, but it crashes or deadlocks if we try to go any faster. We suspect that this is due to some nasty multi-threading issues, since Redis is mostly single-threaded, yet uses a few extra threads for slow IO operations like disk snapshots.

At this stage of the game, we’re really not sure where we should go with this aspect of our project. On one hand, having an embedded version of Redis would reduce the latency of transactions from 250 microseconds to 1 microsecond. On the other hand, there are ways to work around this issue by using pipelining, Lua scripting, and custom commands.

We have yet to confirm this, but we strongly believe that we won’t face such issues if we use the embedded Redis data pathway only for READ transactions, while using the standard client-server pathway for all WRITE transactions. This might be good enough because most of the use cases that require low latency, such as graph traversals, usually require READ transactions only.

By the same token, such transactions would be a lot faster in native C or C++, and we’re likely to use a C++ library like Boost for them anyway. Therefore, the case for an embedded version of Redis is getting less and less obvious as our architecture solidifies.

For the time being, Yves will implement a custom data type for our hexastore and will migrate our handful of graph traversal functions from JavaScript to C. This will give us a good indication of how much performance improvement we can expect from native C for such applications, which will in turn help us decide whether embedding really makes sense or not.

Hypercube deployment

Following today’s work with Jim, Yves is working on turning our current codebase into a Node.js application that will then be packaged by Jim as a Docker container. This will allow us to deploy it as a set of workers on customer instances.

Check this out! Our new Pivot perspective. The data is still fake, but the rest is fully working. Once we have a working set of transient indexes in Hypercube, we will be able to populate it with real data. In the meantime, François and Florian will work on adding support for charting.

Hypercube dataflow

Jim and I spent the last six hours reviewing our architecture for Hypercube. As a result, we came up with some major simplifications of its dataflow. This will make its implementation and maintenance quite a bit simpler, without any major loss of functionality or performance.

Static Sharding

The first major decision was to have a static number of shards. This will make it easier to isolate core.stoic from our hypercube workers through a static set of queues (one queue per shard). And this will ensure that we do not have to implement complex resharding operations, which Redis is not really designed for.

Instead of redistributing records across shards (resharding) whenever a new shard needs to be added, shards will be redistributed across cores whenever more cores are needed. To do so, we will use presharding, which is described in this article. With such a model, we will initially deploy multiple shards per core, on an initial set of cores. Then, as shards increase in size, we will add more cores and move shards around, while keeping a constant number of shards.
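
To make the idea concrete, here is a minimal sketch of what presharding could look like, assuming a fixed shard count and a simple round-robin placement of shards over cores (the names and the placement function are illustrative, not our actual implementation):

    // Presharding sketch: the number of shards is fixed forever;
    // only the shard-to-core placement changes as cores are added.
    const SHARD_COUNT = 256; // e.g. a "medium" instance

    function placeShards(cores) {
      const placement = new Map();
      for (let shard = 0; shard < SHARD_COUNT; shard++) {
        placement.set(shard, cores[shard % cores.length]); // round-robin placement
      }
      return placement;
    }

    // Start with many shards per core on a small set of cores...
    let placement = placeShards(['core-1', 'core-2', 'core-3', 'core-4']);

    // ...then move shards onto new cores as datasets grow, never resharding records.
    placement = placeShards(['core-1', 'core-2', 'core-3', 'core-4',
                             'core-5', 'core-6', 'core-7', 'core-8']);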

In order to size things properly and reduce to the absolute minimum the number of occasions on which we would have to reshard, we will offer different classes of instances optimized for different dataset sizes:

  • Small: 10GB to 100GB with 16 shards
  • Medium: 100GB to 1TB with 256 shards
  • Large: 1TB to 10TB with 2,048 shards

According to this basic analysis, shards will never be smaller than 0.391GB and never larger than 6.250GB. Knowing that Node.js requires a minimum of 64MB and Redis requires just 1MB, our marginal overhead with the smallest shards will be 65MB over 391MB, or about 17%. With the largest shards, it will be just over 1%. This is essentially the price to pay for a full order of magnitude of dynamic elasticity.

This overhead could be further reduced by running multiple Redis servers for every Node.js worker, in which case we could easily increase the number of pre-allocated shards by an order of magnitude, thereby offering two full orders of magnitude of elasticity without requiring any kind of resharding. If we go for this solution, we will have multiple shards per worker; otherwise, we will have a single shard per worker.

Central Redis Queuing Server

The second major decision was to deploy a central Redis queuing server between core.stoic and our workers, without using the Bull job manager. Instead, we will use standard Redis lists to implement basic FIFO queues (first in, first out), following a common pattern described in this article. With regular hardware and no clustering, we should have no problem achieving a throughput of about 250,000 messages per second (one LPUSH and one RPOP per message), which should be plenty enough for most use cases.
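
For reference, here is a minimal sketch of such a FIFO queue, assuming a Node.js Redis client like ioredis (the client choice and key names are my own, not a settled decision):

    const Redis = require('ioredis');
    const redis = new Redis(); // connection to the central queuing server

    // Producer side (core.stoic): LPUSH adds the message at the head of the list.
    async function enqueue(queue, message) {
      await redis.lpush(queue, JSON.stringify(message));
    }

    // Consumer side (worker): RPOP removes from the tail, so the list behaves as FIFO.
    async function dequeue(queue) {
      const raw = await redis.rpop(queue);
      return raw ? JSON.parse(raw) : null;
    }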

According to this architecture, core.stoic will push updates into queues managed by this server, and workers will pull updates from these queues. Therefore, core.stoic won’t have to know anything about the workers, the workers won’t have to know anything about core.stoic, and the queues won’t have to know anything about anything.

To further simplify things, we’ve decided to use one queue (one Redis list) per shard. Since the number of shards will be set in advance and will never change, the number of queues will be the same. And since sharding will be done by simple UUID pooling, core.stoic will know which queue a record update should be pushed into, while workers will know which queues they should pull updates from. It really could not get any simpler.
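
As an illustration only (the actual UUID pooling scheme may end up being different), mapping a record to its shard and queue could be as simple as hashing its UUID and reducing it modulo the fixed shard count:

    const crypto = require('crypto');
    const SHARD_COUNT = 256; // fixed per instance class

    // Any stable function from UUID to [0, SHARD_COUNT) will do.
    function shardOf(uuid) {
      const digest = crypto.createHash('md5').update(uuid).digest();
      return digest.readUInt32BE(0) % SHARD_COUNT;
    }

    function queueOf(uuid) {
      return 'queue:' + shardOf(uuid); // one Redis list per shard
    }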

Updates Only

The third major decision was to only implement support for record-level updates, instead of supporting both record-level updates and bulk updates. The reason is that bulk updates including many records at once (millions or more) would require a different API with a direct connection from the workers to core.stoic. This would complicate our architecture while creating a tight coupling between core.stoic and the workers, which is something that we would like to avoid if possible.

Instead, we will take advantage of the fact that core.stoic can stream updates in batches, and we will batch our updates using pre-defined time windows (every second, for example). According to this model, core.stoic will push batches of updates to every shard-level queue at regular intervals. At full throttle, if we were to stream 250,000 records per second across 256 shards (medium instance), we would have about 1,000 records per message. Our recent tests show that we can index about 50,000 records per second on a single worker, so indexing 1,000 records will take about 20ms. As a result, such an architecture should allow us to get records from ElasticSearch to the Hypercube in about 1 second (latency), at a rate of a quarter million records per second.
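
A minimal sketch of this time-window batching, building on the hypothetical enqueue() and shardOf() helpers from the snippets above, could look like this:

    const buffers = new Map(); // shard index -> pending record updates

    function bufferUpdate(record) {
      const shard = shardOf(record.uuid);
      if (!buffers.has(shard)) buffers.set(shard, []);
      buffers.get(shard).push(record);
    }

    // Flush every shard-level buffer once per second, one message per shard.
    setInterval(() => {
      for (const [shard, records] of buffers.entries()) {
        if (records.length === 0) continue;
        buffers.set(shard, []);
        enqueue('queue:' + shard, { shard, records }).catch(console.error);
      }
    }, 1000);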

With this architecture, bulk loading of records after mass import will be done by streaming small batches of about 1,000 records at a time, and loading a billion records (1TB) should take no more than 4,000 seconds, or just over an hour. Clearly, we should not be the bottleneck here. Your network connection will be…

Another benefit of this architecture is that we will optimize our dataflow for batches of a few hundred records. To do so, we will refactor our stores and indexes loading API to take advantage of Redis pipelining, in case our attempt at developing an embedded version of Redis fails. Through pipelining, we will pay the 250 microsecond client-server overhead only once per batch of records. In other words, if we have a batch of 250 records, our average overhead per record will be about 1 microsecond, which is what it would have been with an embedded Redis. Conclusion: having an embedded Redis is just a nice-to-have now.
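
To give an idea of what that refactoring amounts to, here is a sketch of loading a batch through a single pipeline, again assuming ioredis and an illustrative key layout:

    const Redis = require('ioredis');
    const redis = new Redis();

    // The whole batch pays the 250 microsecond round trip only once.
    async function loadBatch(records) {
      const pipeline = redis.pipeline();
      for (const record of records) {
        // 'records' and the field layout are placeholders, not our actual store schema
        pipeline.hset('records', record.uuid, JSON.stringify(record));
      }
      await pipeline.exec(); // single client-server round trip for the entire batch
    }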

Dedicated Hexastore Worker

Finally, we will deploy a dedicated worker for Hexstore alongside our shard-related workers. This is due to the fact that graph structures are really difficult to parallelize, and complex graph traversal algorithms do not deal well with MapReduce (read this article for a striking example). As a result, our hexastore will be deployed using a single Redis server. On Amazon, this will give us up to 240GB of RAM. Knowing that our hexastore requires roughly 100 bytes per relation, this will allow us to store about 2.5B relations. And if you were to use a dedicated server with 2TB of RAM, you’d be able to manage 20B relations. Unless you’re Facebook or LinkedIn, this should be plenty enough to get started.

Let’s build all this now…

Joins with ElasticSearch

Hugues recently added a plugin to ElasticSearch allowing us to perform joins with a reasonable level of performance. This will allow us to do things like sorting on relationship fields. It will also improve the performance of the Grid perspective quite a bit.


Victory! We now have a complete hexastore fully powered by Redis. After some further thinking and a walk to Camera Lane in Melbourne, I have decided to implement its core data structures using standard Redis sets. This ensures that all operations can be performed in O(1), including deletes. The only drawback is a slight increase in memory usage, but only to the tune of log(n), n being the number of statements managed by the hexastore.

With such an approach, everything is reasonably fast, with the exception of bulk loading of multiple statements, which needs to be optimized. Presently, every statement is loaded individually. Loading statements in batches would be quite complex, and I don’t think that we will want to venture there. Instead, we should eventually create a custom Redis data type to be used by the six stores of our hexastore, and implement custom batch loading commands for it. But this is something that could be done much later.

For the time being, what we have is plenty good enough, and this was the last store that needed to be migrated to Redis. We do not have a single JavaScript in-memory data structure left anymore. Finally, the time has come to work on the indexes that are needed for our pivot table, namely Filindex, Incindex, and Pivindex.
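
For the curious, here is roughly what a set-based hexastore looks like, assuming ioredis; the key naming is mine, but the idea is one Redis set per pair of terms, for each of the six orderings:

    const Redis = require('ioredis');
    const redis = new Redis();

    // Each set is keyed by the first two terms of an ordering; its members are the third.
    // SADD and SREM are O(1) per member, which keeps adds and deletes cheap.
    async function addStatement(s, p, o) {
      await Promise.all([
        redis.sadd(`spo:${s}:${p}`, o), redis.sadd(`sop:${s}:${o}`, p),
        redis.sadd(`pso:${p}:${s}`, o), redis.sadd(`pos:${p}:${o}`, s),
        redis.sadd(`osp:${o}:${s}`, p), redis.sadd(`ops:${o}:${p}`, s),
      ]);
    }

    async function removeStatement(s, p, o) {
      await Promise.all([
        redis.srem(`spo:${s}:${p}`, o), redis.srem(`sop:${s}:${o}`, p),
        redis.srem(`pso:${p}:${s}`, o), redis.srem(`pos:${p}:${o}`, s),
        redis.srem(`osp:${o}:${s}`, p), redis.srem(`ops:${o}:${p}`, s),
      ]);
    }

    // Example query: all objects for a given subject and predicate.
    async function objectsOf(s, p) {
      return redis.smembers(`spo:${s}:${p}`);
    }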

Implementing a hexastore with Redis

Currently, our in-memory hexastore is made of six stores, each implemented in JavaScript as an object of objects of arrays. In other words, we could migrate this to Redis by implementing each store as a hash of hashes of arrays, thanks to our custom Redis Array data type.

The only problem with this basic mapping is that it would create many hashes and many arrays, which is something that Redis is not designed to handle particularly well. If we want to avoid implementing a custom data type for it, we could use a single Redis hash for every store of the hexastore. Keys for this hash would be the concatenation of the two keys for the first and second objects in our current in-memory stores, and values would be a string joining the values of the arrays in our current stores.
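
Here is a rough sketch of that single-hash-per-store mapping, assuming ioredis; the separator and key layout are illustrative:

    const Redis = require('ioredis');
    const redis = new Redis();

    // One Redis hash per store; field = "<key1>:<key2>", value = joined array.
    async function setValues(store, key1, key2, values) {
      await redis.hset(store, `${key1}:${key2}`, values.join(','));
    }

    async function getValues(store, key1, key2) {
      const raw = await redis.hget(store, `${key1}:${key2}`);
      return raw ? raw.split(',') : [];
    }

    // Deleting everything attached to key1 is the slow path: without a custom
    // data type, we have to scan all hash fields to find the matching prefixes.
    async function delKey(store, key1) {
      const fields = await redis.hkeys(store);
      const doomed = fields.filter((field) => field.startsWith(key1 + ':'));
      if (doomed.length > 0) await redis.hdel(store, ...doomed);
    }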

Such an approach would make get and set operations very fast, but it would dramatically slow down del operations. Since the former are much more frequent than the latter, this might be an acceptable trade-off, but it is certainly far from ideal.

I’ll give this one a few more hours before firing up my code editor…

The count of statements stored in our hexastore is now captured by Redis using the standard INCR command, which is nicely implemented because it automatically initializes new keys to 0. Right now, it looks like all our hexastore lifecycle management functions are idempotent, meaning that executing them twice has exactly the same result as executing them only once. I did not really expect that to be the case from the get go, but it looks like our original code was decently well written. Sweet…
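
As an illustration of why the INCR-based count can stay idempotent (assuming the set-based layout sketched earlier; the real lifecycle functions may guard differently), the counter only moves when SADD reports a genuinely new member:

    const Redis = require('ioredis');
    const redis = new Redis();

    // SADD returns 1 only when the member is new, so calling this twice with the
    // same statement leaves both the set and the counter unchanged the second time.
    async function addAndCount(s, p, o) {
      const added = await redis.sadd(`spo:${s}:${p}`, o);
      if (added === 1) await redis.incr('hexastore:count'); // INCR treats a missing key as 0
      return added === 1;
    }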


All our hash tables have been migrated to Redis, using both hashes and arrays. Once again, this is leading to a temporary performance reduction, which will be solved by using an embedded version of Redis and by implementing certain accessors and graph traversal functions in native C. It’s also worth noting that moving from synchronous to asynchronous functions reduced performance by about 4x. That being said, once most of the graph traversal functions have been re-implemented in C, this overhead should become negligible.


Here we go! A first data structure used by Hexstore has been migrated to Redis using regular hashes. As expected, this is slowing everything down (by a 10x to 20x factor, mind you), mostly because of the client-server overhead that we’ll be able to get rid of once we run an embedded version of Redis.

Finally! All hash tables of Hexstore have been converted into asynchronous functions. Now that we have a clean codebase, we can start migrating all its in-memory data structures to Redis. Doing so should reduce performance quite a bit until we have an embedded version of Redis. But even that won’t take us back to the level of performance that we’re enjoying with plain in-memory JavaScript objects. To get back to that level (and beyond), we will have to migrate our graph traversal functions to native C. We’ll do that only when we really have to though, because it will be quite a bit of work…


Here we go! The last function used to implement our hexastore has been made asynchronous, and everything seems to be working fine, even though our code is starting to look like some Persian carpet. The last step will be to turn three hash tables into asynchronous functions, at which time we will be able to fully migrate Hexstore to Redis.

Victory! All our graph traversal functions have been made asynchronous. Now, we need to do the same for all the functions of hexstore.js that deal with data structures that will be migrated to Redis. Once this is done, we can complete the migration of Hexstore from in-memory JavaScript to Redis. Quite frankly, I did not expect this part of the project to take so long, but a hexastore is actually quite complex from a data structure standpoint, with 10 different hash tables having to be constructed and maintained, 6 of them having three levels of nesting. When you do everything synchronously in memory, the code is quite straightforward. But when you need to turn all this into asynchronous calls to an external datastore, things get interesting…