The ABCs of Building IMDGs
Building Resilient In-Memory Data Grids with Hazelcast
Ranvirsinh Raol BlockedUnblockFollowFollowing Mar 29
In today's world, data is of paramount importance. As developers or data scientists, you may be sourcing data from various systems. Likewise, your data may be sourced by various systems. With the adoption of big data technologies, both of these scenarios simultaneously play out at large scale within enterprise systems. In-memory databases are one of the ways we can crunch large datasets and perform actions in milliseconds or less. A critical component for building any platform, in-memory data grids (IMDG from now on), are key to maintaining data resiliency.
There are several aspects to building highly resilient data solutions with them. And like a chain, a single weak link will compromise the entire length. In this blog post, I am going to talk about things you should consider while building highly resilient IMDG solutions. We will look into various resiliency aspects for a In-Memory Data Grid, like Infrastructure, Network, Data, Processes, Backup, Monitoring etc in detail below.
For this post, we'll also be considering a use case where "ABC Mega Corporation" has a significant user base in America, East and West. AWS Regions from North Virginia and California are being used as their datacenters. We will use this use case throughout the article to talk about various aspects of data resiliency.
对于这篇文章，我们还将考虑一个用例，其中"ABC Mega Corporation"在美国，东部和西部拥有重要的用户群。来自北弗吉尼亚州和加利福尼亚州的AWS区域被用作他们的数据中心。我们将在整篇文章中使用此用例来讨论数据弹性的各个方面。
But first, what is Hazelcast?
What is Hazelcast In-Memory Datagrid?
Let's see how Hazelcast themselves define it --- " Hazelcast IMDG® is the leading open source in-memory data grid (IMDG). IMDGs are designed to provide high-availability and scalability by distributing data across multiple machines. " It is well known as cache for SQL databases, however it is also an excellent solution for distributed cache, along with having the capability to compute where data is located.
So without further ado, let's build a resilient platform for Hazelcast IMDG.
As we are building clusters for ABC Mega Corporation, one of the first things to consider is how data is spread across the datacenter(s). Building a highly resilient solution means that you want to isolate your data from any datacenter outage. That way your system, along with its data, won't fall to its knees if a datacenter's connectivity is lost, or machines on that particular datacenter are affected by hardware issues.
在我们为ABC Mega Corporation构建集群时，首先要考虑的是数据如何在数据中心之间传播。构建高度灵活的解决方案意味着您希望将数据与任何数据中心中断隔离开来。这样，如果数据中心的连接丢失，或者该特定数据中心的计算机受到硬件问题的影响，您的系统及其数据将不会瘫痪。
Let's consider different options for building IMDG clusters for Hazelcast.
Single Cluster That Spans Across Datacenter(s)
Generally speaking, you want to avoid building a single 'cluster' spanning across datacenter(s). The biggest drawback is it's a 'single' cluster, which should ring alarm bells if your goal is building a resilient solution. In this setup, any issue that impacts the single cluster will mean your whole system, across both different locations, will be out of luck. You would need to build explicit backup configurations to make sure your primary and backup copies don't reside on a single datacenter.
Latency : If your datacenters are geographically distant from each other, your overall latency for both' 'read' and write' would be impacted as the primary copy of the data may not be in the same datacenter as the application requesting the data. if your system is configured to wait until data is written to both the primary and backup nodes, you would also incur high write latency.
Replication: Given you have only one cluster in this scenario, you do not have replication concerns.
Data Consistency : Given you have only one cluster in this scenario, you would normally have high data consistency (because again, there isn't any other cluster to compare against for inconsistencies).
Note --- As of the writing of this article, Hazelcast officially does not support this configuration, the main reason for this is 'distance brings unexpected behaviors.'
Single Cluster Per Datacenter
This solution is our other extreme, where each datacenter has its own IMDG cluster. This solution would provide multiple clusters and any issue impacting any single datacenter would not cause system-wide failures.
Latency : Your application should be configured to read/write from a cluster in the same datacenter. This should give you the best read and write latency. Given your backup copies would be maintained within the same datacenter, your 'write' latency would be manageable for synchronous backup writes.
Replication : On the flip side, you would need to replicate data between all the clusters. Replication itself is not necessarily bad, however you are using additional heap memory and utilizing the same network along with additional compute. Refer to 'WAN Replication' for more details.
Data Consistency : If your data can be updated in any of the clusters, then it would come in via the replication processes. You may run into data inconsistency for a small period of time, until data is replicated. This model would lead to eventual consistency. If all four clusters are taking equal traffic, any cluster can expect ¾~= 75% of the data updates to come via the replication process alone. Note --- you should take care of potential data collision during replication, which may cause data integrity issues.
Single Clusters Per Region
This solution lies in between the two options discussed above where you have multiple clusters where nodes are not too physically distant from each other. You would typically have low latency moving data from primary to backup nodes within the same 'region'. Thus your 'write' latency would not be severely impacted, and the number of clusters where data needs to be replicated would be considerably smaller.
Latency : Your application should be configured to talk to IMDG clusters in the same region, and should be given reasonable read/write latency. Given your backup copies are maintained within the same region, but in a different datacenter, your 'write' latency would fall in between the two options above for synchronous backup writes.
Replication : You would need to replicate data between all the clusters. Given you are only managing two clusters in two different regions, your overall impact on memory, network and CPU would be far less compared to the above option.
Data Consistency: If your data can be updated in any of the clusters, then it would come in via the replication processes. You may run into data inconsistency for small periods of time until the data is replicated. Given you only have one incoming replication stream, and if both the clusters share the same load, 50% of your data updates would come in via replication; less than the option above. This model would lead to eventual consistency. And, like the option above, you should note to take care of potential data collisions during replication, which may cause data integrity issues.
Data Resiliency --- Backup Count, Split Brain Protection and Reconciliation
Now, that we have built our cluster, let's protect the data inside. In Hazelcast IMDG, nodes constantly communicate with each other to understand which members are part of the cluster. If a node goes down, other members in the cluster will automatically detect the failure and adjust data across the remaining nodes. This process is referred to as "cluster rebalancing" .
Let's visit certain scenarios for ABC Mega Corporation which may arise due to different types of failures.
让我们访问ABC Mega Corporation的某些情况，这些情况可能是由于不同类型的故障而引起的。
Failure Scenario #1
Let's start with the simplest first. I will use the example of a cluster having eight nodes where one node goes down. Other members of the cluster will detect the change and the cluster would rebalance itself. The key configuration that allows your cluster to protect against a single or multi-node failure is backup count .
The key takeaway here is that you want to configure backup count per your expected node failure scenario. The higher the backup count, the more backup copies you would need available across machines, at the cost of additional space and network consumption.
Failure Scenario #2
Let's look into another type of failure. Communication failures within network may lead to parts of the network being unreachable to other parts. Let's say our cluster is comprised of eight nodes with a backup count of two. Somehow, connectivity between two groups of nodes is lost, and two groups of five nodes are formed.
Individually, each of those two groups will act as if some part of the cluster has been lost and will immediately start to rebalance data within the cluster. As you can see, it will result in lost data for those clusters and clients who are connecting to either group would see compromised data. The problem mentioned here is referred to as the Split Brain Problem and Hazelcast provides a feature called Quorum configuration to make sure you have the minimum number of machines available in network for the data structure to respond back. If cluster size drops below the configured number, it would result in 'QuorumException'.
Almost all highly resilient systems have some sort of self healing capabilities built in. When you are dealing with situations where data integrity is of the utmost importance, with multiple clusters spread across different geographic locations, you might want to look into data reconciliation options. Largely, there are two options available here.
- The first option is to reconcile the data periodically with the source system. For example, nightly batch process can keep the data synced in the Hazelcast cluster. This option would help reconcile any data which did not make it into the Hazelcast cluster in the first place.
- The second option is to leverage Hazelcast's reconciliation featurewhich allows you to verify if two clusters are in sync, and initiate a transfer of data if it sees data missing on either side.
So far, we have achieved infrastructure and data specific resiliency for ABC Mega Corporation's clusters. However, you still may have blind spots around availability of the Hazelcast cluster. This is especially the case if operational processes do not account for the availability required, and mandate that Hazelcast clusters be totally rebuilt for every change.
到目前为止，我们已经为ABC Mega Corporation的集群实现了基础架构和数据特定的弹性。但是，您仍然可能对Hazelcast群集的可用性存在盲点。如果操作流程不考虑所需的可用性，并且要求Hazelcast集群针对每个更改进行完全重建，则情况尤其如此。
Here are two different ways to achieve resilience during version upgrades and data upgrades.
- Leverage rolling upgrade feature: Refer to the documentation about how you can upgrade Hazelcast versions without having to kill the entire Hazelcast cluster first. Refer to the documentation here.
- Provide automatic failover configuration for clients: It helps to have High Availability(HA) Configuration built into the client's configuration. That way if the Hazelcast cluster where they are communicating changes state, it will automatically detect and connect to the HA cluster, which could be your another cluster in another location or DR region. This would not only save the system during planned scenarios but it comes in handy for unplanned system outages like EC2 failure where clients automatically reconnect to another cluster without downtime.
For our use case of the ABC Mega Corporation, client applications deployed in the US East region refer to the Hazelcast cluster deployed within AWS East as their primary server. During an event of server rehydration or planned deployment, the operations console would trigger the cluster configuration upgrade to point to the AWS West cluster. The operation console could be built in the application, which is triggered manually, or it could have logic to detect potential issues with the IMDG cluster.
对于ABC Mega Corporation的使用案例，部署在美国东部地区的客户端应用程序将部署在AWS East中的Hazelcast集群称为其主服务器。在服务器重新合并或计划部署的事件期间，操作控制台将触发集群配置升级以指向AWS West集群。操作控制台可以在应用程序中构建，可以手动触发，也可以使用逻辑来检测IMDG集群的潜在问题。
If your process mandates that you must rebuild an entire cluster, you should take into consideration how you bring clusters back up-to-date with data. There are crude solutions like running a separate process which reads the data from the system of record and adds it to the Hazelcast cluster. These may work. However, a simpler option would be to leverage the existing cluster and use it to replicate data in the newly built cluster (if your data size permits this).
If your data size is huge, a better option is to leverage the Hot Restart feature which allows the use of disk to bring data into memory upon restart or upgrade.
Data Format Upgrades
After taking into consideration all the options above and more, a critical piece to the puzzle is still missing for the ABC Mega Corporation's IMDG clusters --- how data structure updates are introduced. Let's run through a scenario to find out why it is so important.
在考虑了上述所有选项之后，ABC Mega公司的IMDG集群仍然缺少关于这个难题的关键部分 - 如何引入数据结构更新。让我们通过一个场景来找出它为何如此重要。
You have a map for storing 'transaction' information which includes typical fields like timestamp, amount, vendor, zip code etc. The system is running in BAU mode with multiple clusters handling the traffic. A change is introduced to allow the zip code to contain alphanumeric values. Cluster A is upgraded with zip codes in alphanumeric value while Cluster B is still using the old five digit numeric value notation. Now, as the data from Cluster A gets replicated into Cluster B, it generates data parsing exceptions.
The golden rule for such types of changes are, 'Add new Field/Don't modify existing'.
Storing data in proto format comes in handy in such scenarios. In the scenario above, if we have a new field added to the map's data structure, both the proto formats are still compatible and wouldn't cause replication to break when both versions of the map exist simultaneously.
If a persistence mechanism such as 'hot restart' is enabled, it becomes even more critical that such breaking changes with updates should be avoided.
Always Prepare for Disaster
Resiliency isn't just about preventing failures, it is also about the ability to recover from failure. So ABC Mega Corporation must build controls in place which would help them recover in a timely manner.
Clusters can be configured to generate backup at a different location which can then be used to mount/copy data into the cluster before bringing it up. Refer to the Hazelcast documentation here.
Seed Data From Another Cluster
Depending on your use case, if you have multiple clusters with replication enabled, you can afford to build a new cluster from scratch from another healthy cluster in a reasonable amount of time.
However, there are scenarios in which you still want to have backup/restore enabled. For example:
- Multi cluster scenario where your cluster contains data which is 'state' specific (means, you would like to keep the data in the same state and can't afford to load everything from another cluster or backup).
- Multi cluster scenario where cluster size prevents replication as a feasible option due to the size of data that needs to be synchronized.
Hazelcast clusters can be enabled to persist data on the disc, which then would be used to 'seed' data when the cluster comes back up. If the Hazelcast cluster is running on fixed machines where the same disc would always be available for the same instance, this is fairly simple and straightforward. However, for cloud applications you would need to detach the EBS volume before the cluster goes down, then attach the same when the new instances are back before loading the data.
Monitoring Systems --- Detect Failures
It is almost impossible for ABC Mega Corporation's IMDG clusters to claim that they have highly resilient systems without having adequate monitoring in place. Hazelcast has built a great console tool to help visualize internal details of the cluster, but each organization will have their favorite monitoring and alerting mechanism. Image taken from : https://docs.hazelcast.org/docs/management-center/latest/manual/html/#status-page
ABC Mega公司的IMDG集群几乎不可能声称他们拥有高弹性系统而没有适当的监控。 Hazelcast构建了一个很棒的控制台工具来帮助可视化集群的内部细节，但每个组织都有自己喜欢的监控和警报机制。
I have found the JMX beans to be quite useful, not only to understand the state of the cluster and the maps, but to observe the patterns as well. Of course, a lot of what JMX provides is also visible via the ManCenter console itself. However JMX, or a similar solution, would provide the integration required to integrate with your choice of monitoring tool. Things like, "Do I see a spike in the maximum latency for the 'Get' or 'Put' operations at a particular time?" or " When a particular process runs, Do I have memory use creeping up and will the cluster soon complain about not having enough native or heap memory?" There are so many data points available to consume here, and having JMX based beans provides a simpler mechanism for collecting them, which then can be fed into whatever monitoring systems has a plugin for JMX.
As I mentioned earlier, building a solid in-memory data grid should take into consideration all aspects of the grid from infrastructure creation to data location to backups within and without clusters to processes that are followed during processing and updates. Monitoring and alerting are essential to building resilient systems and should not be overlooked. These are just some of my thoughts based on my experience building IMDGs both with and without Hazelcast.
I want to thank Srinivas Alladi, Director of Software Engineering at Capital One, for all his reviews and expert comments.