My name is Ismael Chang Ghalimi. I build the STOIC platform. I am a stoic, and this blog is my agora.

SocketCluster

SocketCluster looks really cool.

"SocketCluster is a fast, highly scalable HTTP + WebSocket (engine.io) server which lets you build multi-process realtime systems/apps that make use of all CPU cores on a machine/instance. It removes the limitations of having to run your Node.js server as a single thread and makes your backend resilient (auto-respawn, aggregated error logging). SC is designed to scale horizontally too - By simply listening to socket events on the client-side, your clients can be made to automatically (and securely) subscribe to distributed message queues on the backend - SC only delivers particular events to clients who actually need them. You just have to hook up your store to a message queue like RabbitMQ."

Via Jima de Melbourne.
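To make the model a bit more concrete, here is a minimal client-side sketch based on the socketcluster-client package; the hostname, port, and event name are placeholders of my own, not something we have built:

```js
// Minimal sketch of the client-side model described in the quote above.
// The connection options and event name are purely illustrative.
var socketCluster = require('socketcluster-client');

var socket = socketCluster.connect({ hostname: 'localhost', port: 8000 });

socket.on('connect', function () {
  console.log('Connected to the SocketCluster server');
});

// By listening to an event on the client side, the client gets subscribed to
// the corresponding queue on the backend and only receives events it needs.
socket.on('recordUpdated', function (data) {
  console.log('Received update:', data);
});
```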

Refactoring of advanced relationships

After some discussions with Hugues and Pascal, we’ve decided to refactor the way we’re implementing advanced relationships. So far, these have been used for relationships that can have multiple target objects and/or multiple target records, and have been implemented by storing a dumb JSON object directly within records, without any native support at the middleware level. In other words, it was a temporary hack…

While this was enough for a while, it is creating some performance issues for certain types of queries, and it is limiting our ability to add support for more advanced features, such as the addition of custom attributes to relations in order to support triples (RDF's atomic data entity).

In order to work around these limitations, we have decided to implement advanced relationships with dedicated tables on Elasticsearch (one table per relationship). This will dramatically speed up the execution of complex queries on advanced relationships, while allowing us to process a brand new class of queries that cannot be handled with our current implementation. It will also allow us to add schema-driven relation attributes, thereby supporting the native import/export of RDF data structures (I know a few people who will go crazy about that one).
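To give a rough idea of what such a dedicated table could look like, here is a hypothetical Elasticsearch mapping for a single relationship; the index layout and field names are illustrative only, not our actual schema:

```js
// Hypothetical mapping for one advanced relationship, stored in its own
// Elasticsearch index. The "attributes" object is where schema-driven relation
// attributes would live, which is what makes RDF-style triples possible.
var relationshipMapping = {
  mappings: {
    relation: {
      properties: {
        sourceObject: { type: 'string', index: 'not_analyzed' },
        sourceRecord: { type: 'string', index: 'not_analyzed' },
        targetObject: { type: 'string', index: 'not_analyzed' },
        targetRecord: { type: 'string', index: 'not_analyzed' },
        attributes:   { type: 'object' }
      }
    }
  }
};
```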

Hugues will work on this as soon as he is done with permission restrictions.

Nashorn integrated into Elasticsearch

Victory! Hugues and our friend Kin Wah managed to integrate Nashorn into Elasticsearch.

Why does it matter? Well, we now have a powerful JavaScript runtime deployed on top of the Java virtual machine on which Elasticsearch is running. As a result, we will be able to execute our powerful FormulaJS expressions directly within the database. This should improve the performance of our authorization engine quite dramatically, while allowing us to perform queries for analytics that were simply impossible to handle before. Think of it as stored procedures on steroids, with a fully extensible functional language.
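For the sake of illustration, here is what a scripted query evaluated by a JavaScript engine inside Elasticsearch can look like; the index, fields, and expression are made up, and this is not our actual FormulaJS integration:

```js
// Illustrative request body: a script field evaluated by the JavaScript
// runtime embedded in Elasticsearch, alongside the regular query.
var request = {
  index: 'contacts',
  body: {
    query: { match_all: {} },
    script_fields: {
      fullName: {
        lang: 'javascript',
        script: "doc['firstName'].value + ' ' + doc['lastName'].value"
      }
    }
  }
};
```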

Great work team Singapore!

Making sense of our JS soup

If you’ve been reading this blog on a regular basis, you might be getting confused with the explosion of JS projects and repositories that we’ve been creating. There is some method to our madness though, and I actually think it all makes sense. Let’s take a look at our major initiatives:

  • FormulaJS, JavaScript implementation of all Excel formula functions
  • ExpressionJS, functional language built on top of FormulaJS
  • ProcessorJS, declarative event-driven controller built on top of ExpressionJS
  • CircularJS, server-side templating engine built on top of AngularJS and ProcessorJS
  • WorkerJS, asynchronous event processor built on top of Bull and ProcessorJS

As you can see, ProcessorJS is built on top of ExpressionJS, which itself is built on top of FormulaJS. On top of ProcessorJS, we build CircularJS for developing the front-end of web applications, and WorkerJS for developing their back-end.
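As a purely illustrative sketch of that layering (the package name and expression syntax are assumptions on my part, not our published APIs):

```js
// FormulaJS level: plain Excel-style functions implemented in JavaScript.
var formulajs = require('formulajs'); // hypothetical package name
var total = formulajs.SUM(1, 2, 3);   // 6

// ExpressionJS level (conceptual): the same computation written as an
// expression string that the engine would parse and evaluate, so that
// ProcessorJS could later trigger it in response to events.
var expression = 'SUM(1, 2, 3)';
```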

What’s so great about this architecture is that CircularJS and WorkerJS share the same ProcessorJS engine, which means that you have only one thing to learn, and we have only one pattern to support, especially when it comes to the development of graphical tools.

And when we start letting you reuse the same directives across AngularJS and CircularJS, while using FormulaJS expressions that can be applied at any level of the stack, our overall level of genericity should be pretty phenomenal…

Distributed architecture

A year ago, Hugues and I had a passionate discussion about the benefits of a distributed architecture, whereby a collection of small applications collaborate toward a common goal. Hugues was of the opinion that it was the only way to scale in the cloud. I was of the opinion that it did not really matter as long as you had proper support for clustering.

I was wrong.

Earlier today, we deployed a first version of CircularJS on Cloud Foundry for our new website. It turns out that our core server crashed soon thereafter, but CircularJS kept running just fine, because it does not rely on the server responsible for providing our web user interface. Instead, it connects directly to our Elasticsearch database, using the exact same middleware as the one used for our web application.

Today, I’m convinced that, with proper provisioning, lots of smaller applications are better than one big one.

Hugues: respect!

Integrating Bull

In order to support some critical use cases, we need to execute the FormulaJS expressions defined by the fields of imported objects. This could slow down the import process to the point where we need a proper job management system in order to keep things running smoothly.

Somehow, we managed to do everything we’re doing without such a queuing system, but I don’t think this is sustainable anymore, and this opinion is shared by Hugues as well. I did not get a chance to discuss it with Pascal or Jim, but I’d be surprised if they did not agree with our analysis.

In the spirit of keeping things as simple as possible, we decided to give the Bull Job Manager a spin. It’s powered by Redis, which we already use for managing user sessions, and it’s inspired by Kue, which we really liked when it came out.

Hugues will build a new application for it that can be used to deploy as many workers on a cluster as we need. This application will then be used to manage our batch import process for complex spreadsheets, as well as for implementing our email facet.
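To give a sense of what such a worker could look like, here is a minimal sketch using Bull on top of Redis; the queue name and job payload are hypothetical:

```js
// Minimal Bull sketch: one queue backed by Redis, one worker, one producer.
var Queue = require('bull');

// Bull persists its jobs in Redis, which we already run for user sessions.
var importQueue = Queue('spreadsheet-import', 6379, '127.0.0.1');

// Worker side: evaluate the FormulaJS expressions of each imported record.
importQueue.process(function (job, done) {
  console.log('Importing record', job.data.recordId);
  // ... evaluate field expressions here ...
  done();
});

// Producer side: the import process enqueues one job per record or batch.
importQueue.add({ recordId: 'contact-42' });
```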

If it works well, we will then refactor our Batch and Jobs objects to take advantage of it.

Thoughts on Git integration

Over the past couple of days, I’ve been using STOIC Pages to develop a blog publishing application and rewrite the documentation for Formula.js. As described in this earlier post, I’m thoroughly enjoying the experience of using a database-driven application to manage pages, templates, fragments, and their inter-relationships.

By the same token, I’m also experiencing the frustration of not having direct access to resources like JavaScript libraries, CSS stylesheets, and images through a regular filesystem. Of course, something like STOIC Drive will address this issue for binary resources like images, but it won’t do much good for text-based resources that currently require a database back-end.

What I’m really experiencing is the fundamental dichotomy that traditional IT systems usually create between flat files and database records. It’s as if you always have to choose one over the other. Either you get the convenience of files but you lose the power of a database, or you get the power of a database but you lose the convenience of files. No matter which one you pick, you can never have both at the same time.

Well, I don’t take no for an answer easily, and I really want both. To me, flat files and database records are nothing more than two materialized artifacts for the same abstract entities. If I’m dealing with pages, I’d like to manipulate them as files when I’m in the mood for some serious hacking, and I’d like to visualize them as records when I’m trying to make sense of their relationships. And I want to be able to switch back and forth between the two at any time. In other words, I want files and records to be two facets of the same entity.

What this means is that we really need a file-based datastore alongside our record-based ones. And we need these datastores to be synchronized with each other at all times. Here is how it would work: First, we would define a canonical mapping between database records and flat files. Second, we would create connectors for various file-based protocols like Git. Third, we would map our commit process to the commit processes of these protocols.

The canonical mapping between database records and flat files could look like this (see the sketch after the list):

  • Record name mapped to file name
  • Record parent mapped to parent folder when a hierarchy is defined for the object
  • Record owner mapped to file owner
  • Record update timestamp mapped to file update timestamp
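
As a sketch (with made-up helper and property names), this mapping could be expressed as a simple translation function:

```js
// Hypothetical translation of a database record into file metadata, following
// the canonical mapping listed above.
var path = require('path');

function recordToFile(record, ancestors) {
  return {
    // Record name mapped to file name.
    name: record.name,
    // Record parent mapped to parent folder, when the object defines a hierarchy.
    folder: ancestors ? ancestors.map(function (a) { return a.name; }).join(path.sep) : '.',
    // Record owner mapped to file owner.
    owner: record.owner,
    // Record update timestamp mapped to the file's modification time.
    mtime: record.updatedAt
  };
}
```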

What this means is that all the pages of my website that are currently managed by our Pages object and are stored in our database would also be available as files and folders on our filesystem. Whenever I would modify a page from our user interface, its record would be updated on the database, and its corresponding file would be refreshed on the file system. Similarly, if I were to edit the page directly from its file on the filesystem, its corresponding database record would be updated automatically.

From a connector standpoint, we should try to stay as close to a generic filesystem model as possible, instead of dealing with the idiosyncrasies of source control systems. The last thing we want to do is to develop an abstraction for various source control systems like Git or CVS, because such a thing does not exist. And if it did, it would be utterly useless. Instead, we have to consider that different datastores serve different purposes, and that data does not have to be replicated in full across the different datastores that make up our hybrid back-end.

As a result, all the business logic inherent to things like versioning and branch management should be kept within the source control system, and should be hidden from the rest of the platform. If I’m interested in these considerations, I’ll use my favorite Git client to manipulate my entities as files. But when I’m done and I switch to the STOIC user interface, I do not want to see anything related to versioning or branching, because with the hat that I’m wearing right now, these things do not make sense to me anymore.

Clearly, these ideas are still in their infancy, and we’ll need to refine our thinking before we start implementing anything. But the more I think about them, the more I believe that we’re onto something really interesting here. From an implementation standpoint, we would certainly start with Git, using GitHub as a testing target. And if we need a Git client, we would use js-git. Or we could ignore the source control system altogether, and just provide a file interface, using the regular file system as the interface between our middleware and any source control system you like.

Simple is beautiful…

Modularizing our spreadsheets

Originally, all our meta-data was defined and stored in a single spreadsheet. Over time, we felt that we had to modularize things a bit better, and we decided to use one spreadsheet per application. After a while, some of our meta-data became too large for the Platform spreadsheet, and we started to pull some of it out and store it in a separate Resources spreadsheet. Yesterday, we decided to fully modularize our spreadsheets and to allow users to externalize any piece of content they want into individual files.

To better understand how it will work, one needs to understand the different kinds of data and meta-data we have to handle. Now that we have a better grasp of it all, we’ve started to classify our structured data into four main categories:

  • Über-data (Objects, Fields, Relationships)
  • Meta-data (44 objects of Platform)
  • Reference data (Countries, Currencies, etc.)
  • Business data (Companies, Contacts, etc.)

With our current design, all über-data and meta-data is defined in the Platform spreadsheet. When packaged into a single Excel spreadsheet, its size is 765KB, which is not tiny, but is not large either. Using the ODS Open Document format, it’s even smaller, taking only 623KB of space. And if we were to deduplicate all Bootstrap fields, its size would be a third of what it is right now.

Then, we have 32 spreadsheets stored in a Resources folder on Google Drive that contain records for reference data objects, as well as test records for business data, which we refer to as test data. Because the reference data needs to be deployed on all customer instances, we reference it from the Platform spreadsheet, by using a new field of the Objects object called Datasource.

We deal with test data in a totally different manner, because it’s not really part of the product that we ship. Unlike reference data, it’s not referenced from the Platform spreadsheet. Instead, each file used to externalize test data references its related object, currently by storing a small JSON object as a note added to the A1 cell of the single sheet contained in each test data spreadsheet.
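To make both referencing mechanisms a bit more tangible, here are two hypothetical illustrations; the property names and file path are made up:

```js
// Reference data: the object definition points at the externalized spreadsheet
// through its new Datasource field.
var countriesObject = {
  name: 'Countries',
  datasource: 'Resources/Countries.xlsx'
};

// Test data: the externalized spreadsheet points back at its related object,
// through a small JSON note attached to the A1 cell of its single sheet.
var a1Note = {
  object: 'Companies'
};
```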

This modular packaging allows us to implement a very simple import process for our structured data, starting with the Platform spreadsheet which references its required reference data, then optionally adding all test data spreadsheets to our internal testing instances.

All this should work next week.

Distributed Connector Architecture

This morning, Jim and Jacques-Alexandre started prototyping a ground-breaking architecture for distributed connectors. The idea is that some connectors might require lots of hardware resources, and you don’t want to overload your primary cluster with them. In order to make them scale better, we defined an architecture allowing the deployment of individual connectors on external servers, either locally or remotely. And to make things even more scalable, connectors that need scheduling can run their own Cron scheduler, so that the scheduler of the main cluster does not become a bottleneck.
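As a sketch of the local scheduling idea (the cron package and the connector function are assumptions for illustration):

```js
// Each connector instance runs its own scheduler, locally or on a remote box,
// so the scheduler of the main cluster never becomes a bottleneck.
var CronJob = require('cron').CronJob;

function pollExternalSystem() {
  // Fetch new records from the external system and push them to the platform.
  console.log('Polling at', new Date().toISOString());
}

// Run the connector every five minutes, on its own server.
var job = new CronJob('*/5 * * * *', pollExternalSystem);
job.start();
```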

Fancy…

Live Updates Coming

Jim committed a piece of code that will allow us to push any changes made to data and meta-data onto all connected clients, instantly. Once we take advantage of this brand new feature from our user interface, it will give us a user experience similar to the one offered by Google Apps, where any change made by a user to a document is instantly visible to all other users looking at the same document. Our user interface will be refactored incrementally in order to implement this feature, and we hope to be done with it sometime in October or November.
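Purely as an illustration of the push model (this is not necessarily how Jim’s code is built, and socket.io is used here only as a stand-in transport):

```js
// Server side: whenever a record or piece of meta-data changes, broadcast the
// change to every connected client so they can patch their local view.
var io = require('socket.io').listen(3000);

function broadcastChange(objectName, record) {
  io.sockets.emit('change', { object: objectName, record: record });
}

// Client side (browser), conceptually:
// socket.on('change', function (msg) { applyChange(msg.object, msg.record); });
```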

Data vs. Meta-Data

Since we started working on the STOIC platform eighteen months ago, we’ve been very keen on making sure that meta-data behaves pretty much the same way as business data. In fact, for the longest time, there was no way to really distinguish one from the other.

As the platform matured though, their respective life-cycles started to diverge. For example, when we added support for meta-data caching, we had to explicitly indicate which objects would be included into this cache. This implicitly considered some objects as being part of the meta-data. Similarly, when we started to implement our Commit process, we had to identify a subset of these meta-data objects as special cases that require explicit commit operations.

Coming back to our original idea, treating meta-data and business data alike had clear benefits. For one, it allowed us to use the same canonical user interface for both. In other words, from the viewpoint of developers and users, meta-data and business data are the same thing. But from the viewpoint of the implementers of the platform (STOIC employees), they’re quite different, for rather good reasons. Clearly, we needed a way to reconcile both sets of requirements.

Today, we know that we want them to be both different and the same, all at once.

Then, as we started to implement our meta-data update framework to support cascading levels of meta-data custody, we realized that such a capability was required only for meta-data, not business data. The reason for it is very simple: while the platform vendor (STOIC), software vendors developing packaged applications on top of it, systems integrators customizing these applications to suit the needs of their customers, and customers configuring these applications could all make changes to meta-data, only customers (referred to as end custodians) would need to create and manage actual business data. As a result, the multi-custodian meta-data life-cycle could be applied to meta-data only, and we could pretty much ignore the concept of custodian for business data. This sudden reduction of scope opened the door to many opportunities for simplification and optimization, which we’re now taking full advantage of.

This is especially important because we’re doing all that work while finishing the implementation of our distributed meta-data cache and adding support for clustering. Taken individually, caching, clustering, and custody are hard enough to implement. Put together, they’re like rocket science, and the more you can simplify, the better a chance you have of ever making it work.

With that in mind, we’re now streamlining the end-to-end data lifecycle. Here is how it will work.

First, we’re separating data from meta-data entirely. Meta-data is defined as the records of the objects for which Cached (a field of the Objects object) is set to TRUE. All records of these objects will be part of our meta-data cache (mdCache). This cache will have two versions, one for servers containing all fields of cached objects, and one for clients only containing the fields for which Cached (a field of the Fields object) is set to TRUE.
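Here is a hypothetical sketch of how the two versions of the cache could be derived from these Cached flags; the object and record shapes are illustrative only:

```js
// Build the server and client versions of mdCache from the Cached flags.
function buildMdCache(objects, recordsByObject) {
  var serverCache = {};
  var clientCache = {};
  objects.filter(function (o) { return o.cached; }).forEach(function (o) {
    // Server cache: all fields of every record of a cached object.
    serverCache[o.name] = recordsByObject[o.name];
    // Client cache: only the fields that are themselves flagged as cached.
    var cachedFields = o.fields.filter(function (f) { return f.cached; })
                               .map(function (f) { return f.name; });
    clientCache[o.name] = recordsByObject[o.name].map(function (record) {
      var slim = {};
      cachedFields.forEach(function (name) { slim[name] = record[name]; });
      return slim;
    });
  });
  return { server: serverCache, client: clientCache };
}
```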

Second, we’re creating one schema on PostgreSQL or one index on Elasticsearch for each and every custodian, according to the architecture described in this previous post, but these schemas or indexes are used for meta-data only. We then create a separate schema or index for business data, used by the end custodian only.

Third, we acknowledge the fact that any changes made to meta-data by upstream custodians (custodians other than the end custodian) follow a different lifecycle than changes made by the end custodian. The former are traditionally called upgrades, while the latter are called configurations, customizations, or extensions. The former happen rather infrequently, in a very controlled environment, while the latter happen on a daily basis, in a very ad hoc fashion. For this reason, they can be implemented very differently: the former are implemented by simply replacing a schema or index with a new one, while the latter are implemented with incremental updates.

Fourth, we implement a cluster-friendly incremental update process for all updates made to meta-data by the end custodian. For these, we build an aggregated image of the meta-data by combining the meta-data schemas or indexes of all custodians, according to simple overloading rules. Usually, precedence is given to changes made by the custodian that is furthest downstream in the custody chain. Then, we deploy this meta-data in memory on all servers and clients, and make sure that they remain synchronized at all times.
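The overloading rule itself can be sketched as a simple ordered merge (the data shapes are, again, illustrative):

```js
// Merge custodian meta-data layers in custody order, from the most upstream
// (e.g. STOIC) to the most downstream (the end custodian); on conflicts, the
// most downstream layer wins.
function aggregateMetaData(custodianLayers) {
  return custodianLayers.reduce(function (aggregate, layer) {
    Object.keys(layer).forEach(function (key) {
      aggregate[key] = layer[key];
    });
    return aggregate;
  }, {});
}
```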

Fifth, to keep everything synchronized, incremental updates to meta-data are first applied to a persistent copy of the aggregated meta-data stored by PostgreSQL or Elasticsearch. The meta-data is kept consistent through locking, which today is implemented in an optimistic fashion through the use of Change Identifiers (CIDs), but might be migrated to a pessimistic locking mechanism if we decide that it would improve the overall end-user experience. And we make sure that the internal structure of our meta-data cache supports incremental updates in a robust, high-performance fashion, by getting rid of extraneous cross-references that were added to it.
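A hypothetical sketch of the optimistic check based on Change Identifiers (the store interface and CID arithmetic are made up for illustration):

```js
// Apply an incremental meta-data update only if the CID it was prepared
// against is still the current one; otherwise, reject it as stale.
function applyIncrementalUpdate(store, update) {
  var current = store.get(update.key);
  if (current.cid !== update.baseCid) {
    throw new Error('Stale update: expected CID ' + update.baseCid +
                    ', found ' + current.cid);
  }
  store.set(update.key, { value: update.value, cid: update.baseCid + 1 });
}
```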

As a result of this architecture, the complete refreshing of our meta-data cache will happen a lot less frequently than it has so far. In fact, it will be limited to instances where meta-data needs to be upgraded by upstream custodians, or when clients go back online after some period of offline activity (once we add full support for offline access). This should improve performance while reducing the latency of both server operations and client interactions.

That’s pretty much all for now. If you followed me so far, good for you. If you did not, don’t worry. You don’t really have to understand any of this, unless you’re planning to deploy the STOIC platform at a very large scale. All you should know is that this stuff is what makes it work.