The ABCs of Building IMDGs

Building Resilient In-Memory Data Grids with Hazelcast

Ranvirsinh Raol · Mar 29

In today's world, data is of paramount importance. As developers or data scientists, you may be sourcing data from various systems. Likewise, your data may be sourced by various systems. With the adoption of big data technologies, both of these scenarios play out simultaneously at large scale within enterprise systems. In-memory databases are one of the ways we can crunch large datasets and perform actions in milliseconds or less. In-memory data grids (IMDGs from now on) are a critical component of any such platform, and they are key to maintaining data resiliency.

There are several aspects to building highly resilient data solutions with them, and like a chain, a single weak link will compromise the whole. In this blog post, I am going to talk about things you should consider while building highly resilient IMDG solutions. We will look in detail at the various resiliency aspects of an In-Memory Data Grid: infrastructure, network, data, processes, backup, monitoring, and so on.

For this post, we'll also be considering a use case where "ABC Mega Corporation" has a significant user base on both the East and West coasts of America. The AWS Regions in Northern Virginia and California are used as their datacenters. We will use this use case throughout the article to talk about various aspects of data resiliency.

But first, what is Hazelcast?

What is Hazelcast In-Memory Data Grid?

Let's see how Hazelcast themselves define it: "Hazelcast IMDG® is the leading open source in-memory data grid (IMDG). IMDGs are designed to provide high-availability and scalability by distributing data across multiple machines." It is well known as a cache for SQL databases, but it is also an excellent distributed cache in its own right, with the added capability to run computations where the data is located.
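
To make the rest of the discussion concrete, here is a minimal sketch of what working with a Hazelcast map looks like from a client application. The map name and member address are made up for illustration; the API shown is the standard Hazelcast 3.x Java client.

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class TransactionCacheExample {
    public static void main(String[] args) {
        // Point the client at one (or more) members of the cluster.
        ClientConfig clientConfig = new ClientConfig();
        clientConfig.getNetworkConfig().addAddress("10.0.0.10:5701"); // hypothetical member address

        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);

        // A distributed map: entries are partitioned (and backed up) across the cluster.
        IMap<String, String> transactions = client.getMap("transactions");
        transactions.put("txn-1001", "{\"amount\": 42.50, \"vendor\": \"ABC\"}");
        System.out.println(transactions.get("txn-1001"));

        client.shutdown();
    }
}
```

The "compute where the data is located" part is exposed through constructs such as EntryProcessor and distributed executors, which ship the computation to the member that owns the key instead of moving the data to the caller.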

So without further ado, let's build a resilient platform for Hazelcast IMDG.

Infrastructure-Based Resiliency

As we are building clusters for ABC Mega Corporation, one of the first things to consider is how data is spread across the datacenter(s). Building a highly resilient solution means that you want to isolate your data from any datacenter outage. That way your system, along with its data, won't be brought to its knees if a datacenter loses connectivity, or machines in that particular datacenter are affected by hardware issues.

Let's consider different options for building IMDG clusters for Hazelcast.

Single Cluster That Spans Across Datacenter(s)

Generally speaking, you want to avoid building a single 'cluster' spanning across datacenter(s). The biggest drawback is that it's a 'single' cluster, which should ring alarm bells if your goal is building a resilient solution. In this setup, any issue that impacts the single cluster will mean your whole system, across both locations, will be out of luck. You would need to build explicit backup configurations to make sure your primary and backup copies don't reside in a single datacenter.
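
One way to keep primary and backup copies apart in such a stretched cluster is Hazelcast's partition grouping. Below is a minimal sketch, assuming your members carry availability-zone metadata that ZONE_AWARE grouping can use (for example via the AWS discovery plugin); treat the exact setup as illustrative rather than a complete configuration.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.PartitionGroupConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class ZoneAwareMember {
    public static void main(String[] args) {
        Config config = new Config();

        // Group members by availability zone so that a partition's backup
        // is never placed in the same zone/datacenter as its primary copy.
        config.getPartitionGroupConfig()
              .setEnabled(true)
              .setGroupType(PartitionGroupConfig.MemberGroupType.ZONE_AWARE);

        HazelcastInstance member = Hazelcast.newHazelcastInstance(config);
    }
}
```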

Latency: If your datacenters are geographically distant from each other, your overall latency for both 'read' and 'write' operations would be impacted, as the primary copy of the data may not be in the same datacenter as the application requesting it. If your system is configured to wait until data is written to both the primary and backup nodes, you would also incur high write latency.

Replication: Given you have only one cluster in this scenario, you do not have replication concerns.

Data Consistency: Given you have only one cluster in this scenario, you would normally have high data consistency (because again, there isn't any other cluster to compare against for inconsistencies).

Note --- As of the writing of this article, Hazelcast officially does not support this configuration; the main reason for this is that 'distance brings unexpected behaviors.'

Single Cluster Per Datacenter

This solution is our other extreme, where each datacenter has its own IMDG cluster. This solution would provide multiple clusters and any issue impacting any single datacenter would not cause system-wide failures.

Latency: Your application should be configured to read/write from a cluster in the same datacenter. This should give you the best read and write latency. Given your backup copies would be maintained within the same datacenter, your 'write' latency would be manageable for synchronous backup writes.

Replication: On the flip side, you would need to replicate data between all the clusters. Replication itself is not necessarily bad; however, you are consuming additional heap memory and network bandwidth, along with additional compute. Refer to 'WAN Replication' for more details.
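
For a sense of what that replication setup can look like programmatically, here is a rough sketch. WAN Replication is a Hazelcast Enterprise feature; the publisher class name, property keys, cluster names, and endpoint addresses below follow the 3.x documentation but should be treated as illustrative and checked against the docs for your version.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.WanPublisherConfig;
import com.hazelcast.config.WanReplicationConfig;
import com.hazelcast.config.WanReplicationRef;

public class WanReplicationExample {
    public static Config buildConfig() {
        Config config = new Config();

        // Publisher pointing at the other datacenter's cluster (illustrative values).
        WanPublisherConfig publisher = new WanPublisherConfig();
        publisher.setGroupName("cluster-west");
        publisher.setClassName("com.hazelcast.enterprise.wan.replication.WanBatchReplication");
        publisher.getProperties().put("endpoints", "10.1.0.10:5701,10.1.0.11:5701");

        WanReplicationConfig wanConfig = new WanReplicationConfig();
        wanConfig.setName("east-to-west");
        wanConfig.addWanPublisherConfig(publisher);
        config.addWanReplicationConfig(wanConfig);

        // Attach the WAN replication scheme to the map that should be replicated.
        WanReplicationRef ref = new WanReplicationRef();
        ref.setName("east-to-west");
        ref.setMergePolicy("com.hazelcast.map.merge.PassThroughMergePolicy");
        config.getMapConfig("transactions").setWanReplicationRef(ref);

        return config;
    }
}
```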

Data Consistency: If your data can be updated in any of the clusters, then updates would also come in via the replication processes. You may run into data inconsistency for a small period of time, until data is replicated. This model leads to eventual consistency. If all four clusters are taking equal traffic, any given cluster originates only ¼ of the updates, so it can expect roughly ¾ (about 75%) of the data updates to come via the replication process alone. Note --- you should take care of potential data collisions during replication, which may cause data integrity issues.

Single Cluster Per Region

This solution lies in between the two options discussed above: you have multiple clusters, but the nodes within each cluster are not too physically distant from each other. You would typically have low latency when moving data from primary to backup nodes within the same 'region'. Thus your 'write' latency would not be severely impacted, and the number of clusters to which data needs to be replicated would be considerably smaller.

Latency: Your application should be configured to talk to the IMDG cluster in the same region, and should see reasonable read/write latency. Given your backup copies are maintained within the same region, but in a different datacenter, your 'write' latency for synchronous backup writes would fall in between the two options above.

Replication: You would need to replicate data between all the clusters. Given you are only managing two clusters in two different regions, your overall impact on memory, network, and CPU would be far less compared to the above option.

Data Consistency: If your data can be updated in either cluster, then updates would also come in via the replication processes. You may run into data inconsistency for small periods of time until the data is replicated. Given you only have one incoming replication stream, and if both clusters share the same load, 50% of your data updates would come in via replication, less than in the option above. This model still leads to eventual consistency. And, like the option above, you should take care of potential data collisions during replication, which may cause data integrity issues.

Data Resiliency --- Backup Count, Split Brain Protection and Reconciliation

Now that we have built our cluster, let's protect the data inside. In Hazelcast IMDG, nodes constantly communicate with each other to understand which members are part of the cluster. If a node goes down, other members in the cluster will automatically detect the failure and adjust data across the remaining nodes. This process is referred to as "cluster rebalancing".

Let's visit certain scenarios for ABC Mega Corporation which may arise due to different types of failures.

Failure Scenario #1

Let's start with the simplest first. I will use the example of a cluster having eight nodes where one node goes down. The other members of the cluster will detect the change, and the cluster will rebalance itself. The key configuration that allows your cluster to protect against a single- or multi-node failure is the backup count.

The key takeaway here is that you want to configure the backup count per your expected node failure scenario. The higher the backup count, the more backup copies are kept across machines, at the cost of additional space and network consumption.
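
Here is a minimal sketch of how the backup count is configured per map; the map name and counts are illustrative. A synchronous backup is written before the operation returns, while an asynchronous backup trades a little safety for lower write latency.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;

public class BackupCountExample {
    public static void main(String[] args) {
        Config config = new Config();

        MapConfig mapConfig = config.getMapConfig("transactions");
        mapConfig.setBackupCount(2);       // two synchronous backup copies: survives two simultaneous node failures
        mapConfig.setAsyncBackupCount(0);  // no additional asynchronous copies

        Hazelcast.newHazelcastInstance(config);
    }
}
```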

Failure Scenario #2

Let's look into another type of failure. Communication failures within the network may leave parts of it unreachable from other parts. Let's say our cluster is comprised of eight nodes with a backup count of two. Somehow, connectivity between two groups of nodes is lost, and two groups of four nodes are formed.

Individually, each of those two groups will act as if some part of the cluster has been lost and will immediately start to rebalance data within itself. As you can see, this results in lost data, and clients connecting to either group would see compromised data. The problem described here is referred to as the Split Brain Problem, and Hazelcast provides a feature called Quorum configuration to make sure the minimum number of machines is available in the network before a data structure responds. If the cluster size drops below the configured number, operations result in a 'QuorumException'.
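
Below is a minimal sketch of how such a quorum could be configured programmatically with the Hazelcast 3.x API; the quorum name, size, and map name are illustrative.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.QuorumConfig;
import com.hazelcast.core.Hazelcast;

public class QuorumExample {
    public static void main(String[] args) {
        Config config = new Config();

        // Require at least 5 of the 8 members to be reachable for operations to proceed.
        QuorumConfig quorumConfig = new QuorumConfig("atLeastFiveMembers", true, 5);
        config.addQuorumConfig(quorumConfig);

        // Protect the 'transactions' map with this quorum rule.
        config.getMapConfig("transactions").setQuorumName("atLeastFiveMembers");

        Hazelcast.newHazelcastInstance(config);
    }
}
```

With a four/four split as described above, neither side reaches a quorum of five, so both sides reject operations with a QuorumException rather than silently serving divergent data.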

Almost all highly resilient systems have some sort of self healing capabilities built in. When you are dealing with situations where data integrity is of the utmost importance, with multiple clusters spread across different geographic locations, you might want to look into data reconciliation options. Largely, there are two options available here.

  • The first option is to reconcile the data periodically with the source system. For example, a nightly batch process can keep the data in the Hazelcast cluster in sync (a rough sketch of this follows after the list). This option would help reconcile any data which did not make it into the Hazelcast cluster in the first place.
  • The second option is to leverage Hazelcast's reconciliation feature, which allows you to verify whether two clusters are in sync, and initiate a transfer of data if it sees data missing on either side.
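
A rough sketch of the first option follows, assuming a hypothetical SourceSystemDao that can stream records from the system of record; everything except the Hazelcast client API here is made up for illustration.

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import java.util.Map;

public class NightlyReconciliationJob {

    /** Hypothetical accessor for the system of record. */
    interface SourceSystemDao {
        Map<String, String> loadAllTransactions();
    }

    public static void reconcile(SourceSystemDao dao) {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<String, String> transactions = client.getMap("transactions");

        // Re-publish the source of truth; putIfAbsent only fills gaps and
        // leaves entries that already made it into the grid untouched.
        for (Map.Entry<String, String> entry : dao.loadAllTransactions().entrySet()) {
            transactions.putIfAbsent(entry.getKey(), entry.getValue());
        }
        client.shutdown();
    }
}
```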

Operational Resiliency

So far, we have achieved infrastructure- and data-specific resiliency for ABC Mega Corporation's clusters. However, you may still have blind spots around the availability of the Hazelcast cluster. This is especially the case if operational processes do not account for the availability required and mandate that Hazelcast clusters be totally rebuilt for every change.

Here are two different ways to achieve resilience during version upgrades and data upgrades.

  • Leverage the rolling upgrade feature: this allows you to upgrade Hazelcast versions without having to kill the entire Hazelcast cluster first. Refer to the documentation here.
  • Provide automatic failover configuration for clients: It helps to have a High Availability (HA) configuration built into the client's configuration. That way, if the Hazelcast cluster it is communicating with changes state, the client will automatically detect this and connect to the HA cluster, which could be another cluster in a different location or DR region (a sketch follows after the list). This would not only save the system during planned scenarios, but it also comes in handy for unplanned outages such as an EC2 failure, where clients automatically reconnect to another cluster without downtime.
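
A minimal sketch of what such a client-side failover configuration could look like, assuming the blue/green client failover support introduced around Hazelcast 3.12 (an Enterprise feature); the cluster group names and member addresses are illustrative.

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.client.config.ClientFailoverConfig;
import com.hazelcast.core.HazelcastInstance;

public class FailoverClientExample {
    public static void main(String[] args) {
        // Primary cluster in AWS East.
        ClientConfig east = new ClientConfig();
        east.getGroupConfig().setName("abc-east");
        east.getNetworkConfig().addAddress("10.0.0.10:5701");

        // Alternative cluster in AWS West, used if East becomes unavailable.
        ClientConfig west = new ClientConfig();
        west.getGroupConfig().setName("abc-west");
        west.getNetworkConfig().addAddress("10.1.0.10:5701");

        ClientFailoverConfig failoverConfig = new ClientFailoverConfig();
        failoverConfig.addClientConfig(east);
        failoverConfig.addClientConfig(west);
        failoverConfig.setTryCount(3); // how many times to cycle through the alternatives before giving up

        HazelcastInstance client = HazelcastClient.newHazelcastFailoverClient(failoverConfig);
    }
}
```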

For our ABC Mega Corporation use case, client applications deployed in the US East region refer to the Hazelcast cluster deployed within AWS East as their primary server. During a server rehydration event or a planned deployment, the operations console would trigger a cluster configuration upgrade to point to the AWS West cluster. The operations console could be built into the application and triggered manually, or it could have logic to detect potential issues with the IMDG cluster.

Cluster Rehydration

If your process mandates that you must rebuild an entire cluster, you should take into consideration how you bring clusters back up-to-date with data. There are crude solutions like running a separate process which reads the data from the system of record and adds it to the Hazelcast cluster. These may work. However, a simpler option would be to leverage the existing cluster and use it to replicate data in the newly built cluster (if your data size permits this).

If your data size is huge, a better option is to leverage the Hot Restart feature, which allows the use of disk to bring data back into memory upon restart or upgrade.

Data Format Upgrades

After taking into consideration all the options above and more, a critical piece of the puzzle is still missing for ABC Mega Corporation's IMDG clusters --- how data structure updates are introduced. Let's run through a scenario to find out why it is so important.

You have a map for storing 'transaction' information, which includes typical fields like timestamp, amount, vendor, zip code, etc. The system is running in BAU (business as usual) mode with multiple clusters handling the traffic. A change is introduced to allow the zip code to contain alphanumeric values. Cluster A is upgraded so zip codes hold alphanumeric values, while Cluster B is still using the old five-digit numeric notation. Now, as the data from Cluster A gets replicated into Cluster B, it generates data parsing exceptions.

The golden rule for these types of changes is: 'Add new fields; don't modify existing ones.'

Storing data in proto format comes in handy in such scenarios. In the scenario above, if we add a new field to the map's data structure, both proto formats remain compatible and replication does not break when both versions of the map exist simultaneously.

If a persistence mechanism such as 'hot restart' is enabled, it becomes even more critical to avoid such breaking changes when rolling out updates.

Always Prepare for Disaster

Resiliency isn't just about preventing failures; it is also about the ability to recover from failure. So ABC Mega Corporation must put controls in place that would help them recover in a timely manner.

Backup/Restore

Clusters can be configured to generate a backup at a different location, which can then be used to mount/copy data into the cluster before bringing it up. Refer to the Hazelcast documentation here.

Seed Data From Another Cluster

Depending on your use case, if you have multiple clusters with replication enabled, you can afford to build a new cluster from scratch from another healthy cluster in a reasonable amount of time.

However, there are scenarios in which you still want to have backup/restore enabled. For example:

  1. A multi-cluster scenario where your cluster contains data that is 'state' specific (meaning you would like to keep the data in the same state and can't afford to load everything from another cluster or backup).
  2. A multi-cluster scenario where the size of the data that needs to be synchronized makes replication infeasible.

Hot-Restart

Hazelcast clusters can be configured to persist data on disk, which is then used to 'seed' data when the cluster comes back up. If the Hazelcast cluster is running on fixed machines where the same disk is always available to the same instance, this is fairly simple and straightforward. However, for cloud deployments you would need to detach the EBS volume before the cluster goes down, then attach it again once the new instances are back, before loading the data.
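
A minimal sketch of enabling Hot Restart persistence (a Hazelcast Enterprise HD feature) follows; the directory path and map name are illustrative, and the base directory would live on the EBS volume you detach and re-attach as described above.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.HotRestartPersistenceConfig;
import com.hazelcast.config.MapConfig;
import com.hazelcast.core.Hazelcast;
import java.io.File;

public class HotRestartExample {
    public static void main(String[] args) {
        Config config = new Config();

        // Member-level hot restart: where persisted partition data is written on disk.
        HotRestartPersistenceConfig hotRestart = config.getHotRestartPersistenceConfig();
        hotRestart.setEnabled(true);
        hotRestart.setBaseDir(new File("/mnt/hazelcast/hot-restart")); // e.g. a mounted EBS volume

        // Per-map opt-in: persist this map so it can be re-seeded on restart.
        MapConfig mapConfig = config.getMapConfig("transactions");
        mapConfig.getHotRestartConfig().setEnabled(true);

        Hazelcast.newHazelcastInstance(config);
    }
}
```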

Monitoring Systems --- Detect Failures

It is almost impossible for ABC Mega Corporation to claim that its IMDG clusters form a highly resilient system without having adequate monitoring in place. Hazelcast has built a great console tool to help visualize the internal details of the cluster, but each organization will have its own favorite monitoring and alerting mechanism.

Image taken from: https://docs.hazelcast.org/docs/management-center/latest/manual/html/#status-page

I have found the JMX beans to be quite useful, not only for understanding the state of the cluster and the maps, but for observing patterns as well. Of course, a lot of what JMX provides is also visible via the ManCenter console itself. However, JMX, or a similar solution, provides the hooks required to integrate with your monitoring tool of choice. It helps answer questions like, "Do I see a spike in the maximum latency for 'Get' or 'Put' operations at a particular time?" or "When a particular process runs, does memory use creep up, and will the cluster soon complain about not having enough native or heap memory?" There are many data points available to consume here, and the JMX-based beans provide a simple mechanism for collecting them, which can then be fed into whatever monitoring system has a plugin for JMX.
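
Enabling the JMX beans is a one-property change on the member; the sketch below also lists the published MBeans through the platform MBean server. The exact object name pattern can vary by version, so treat the query string as illustrative.

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

import javax.management.MBeanServer;
import javax.management.ObjectName;
import java.lang.management.ManagementFactory;
import java.util.Set;

public class JmxMonitoringExample {
    public static void main(String[] args) throws Exception {
        Config config = new Config();
        config.setProperty("hazelcast.jmx", "true"); // publish cluster/map statistics as MBeans

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        IMap<String, String> map = hz.getMap("transactions");
        map.put("txn-1001", "sample");

        // List the Hazelcast MBeans registered in this JVM (object name pattern is illustrative).
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Set<ObjectName> names = server.queryNames(new ObjectName("com.hazelcast:*"), null);
        names.forEach(name -> System.out.println(name));
    }
}
```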

Conclusion

As I mentioned earlier, building a solid in-memory data grid should take into consideration all aspects of the grid, from infrastructure creation to data location, to backups within and outside clusters, to the processes that are followed during processing and updates. Monitoring and alerting are essential to building resilient systems and should not be overlooked. These are just some of my thoughts based on my experience building IMDGs, both with and without Hazelcast.

I want to thank Srinivas Alladi, Director of Software Engineering at Capital One, for all his reviews and expert comments.
