Recommendations
This is a compilation of various recommendations to consider when deploying an Atoti cluster.
Keep in mind that there's no out-of-the-box universal configuration. An Atoti application depends on the dataset, how it is partitioned, how frequently it is updated, what calculations are run, and how many concurrent users start queries, but also on the underlying infrastructure: network, cloud provider, use of Kubernetes, and so on.
Before being deployed in production, a distributed application must be stress-tested under real-life conditions.
Avoid High-Cardinality Hierarchies
An Atoti cluster is composed of several data and query nodes. For the sake of simplicity, and because most use cases are built like this, this section describes a single query node.
The user sends queries to the query node. Based on its knowledge of the different data cube topologies, the query node determines which data cubes must calculate the intermediate results and propagates sub-queries to those nodes. Finally, it assembles the partial results before returning the complete result to the user.
In order to support interactive multidimensional analysis, the query node must have instant access to each data cube's hierarchies and their corresponding members. To make that possible, when the data nodes join the cluster they send the content of all their hierarchies to the query node.
For more information on this process, see the Communication flow topic.
For large clusters with tens of nodes and cubes with high-cardinality hierarchies, this initial phase can become resource-intensive, as it consumes CPU to serialize and deserialize the data, network bandwidth, and transient memory on the query node that temporarily holds many copies of the hierarchies.
When pushed too far, it can create cluster instability, so we recommend avoiding high-cardinality hierarchies (100K+ members) in clusters with more than ten nodes.
A solution is to entirely conceal hierarchies from the distributed application. In that case, the data cube adapts the discovery message sent to the query node so that the query node is not aware of their existence, and the members of these hierarchies are never sent over the network. This strongly improves the stability of the JGroups cluster by reducing the required network bandwidth and the transient memory used on the query node while it merges the contributions of many data cubes, within the limit set in the Pending Initial Discoveries property.
When the query cube forwards queries to a data node with concealed hierarchies, the data node fills the coordinates of the concealed hierarchies with the paths of their default members to create complete locations, then removes them just before returning the result.
We recommend concealing high-cardinality hierarchies when possible. We also recommend concealing any hierarchy that was only created to expose a field to a post-processor computation and that end users do not need to see.
You can also conceal measures that are only evaluated on the data nodes. This shortens the list of measures in the UI and prevents those measures from being evaluated in the query node.
The following code block shows how these concepts are defined in Atoti:
.asDataCube()
.withClusterDefinition()
.withClusterId("SupplyChainCubeCluster")
.withMessengerDefinition()
.withProtocolPath("jgroups-protocols/protocol-tcp.xml")
.end()
.withApplicationId("supply-chain")
.withConcealedHierarchies()
.withConcealedHierarchy("Truck", "TruckId")
.endHierarchyListing()
.withAllMeasures()
.withConcealedBranches()
.end()
.build();
Conceal Branches
To reduce the data sent by the data node to its query nodes, you can conceal branches: only the data of the master branch is then sent.
See the withConcealedBranches() call in the example above.
It is also possible to choose which branches should be exposed to the query nodes while the application is running:
final IMultiVersionDataActivePivot dataActivePivot =
(IMultiVersionDataActivePivot) manager.getActivePivot("pivotId");
// myBranch will be visible in the query cube
dataActivePivot.exposeBranch("myBranch");
Avoid Large Garbage Collection Pauses
In Java applications, memory allocation is managed by the JVM and its 'garbage collection' algorithms. Garbage collection regularly has to pause the application in order to reclaim unused memory.
Short pauses usually go unnoticed, but pauses that last more than a few seconds can cause service interruptions, as the application stops responding. During a pause, the whole JVM becomes unresponsive.
If a stand-alone Atoti instance stops responding for a few seconds during memory reclamation, there are no consequences, apart from an exceptionally long delay for the user.
However, within a distributed cluster, a long GC pause is much more serious: when a JVM is unresponsive, other nodes start suspecting that it is gone, and after a delay they update the state of the cluster.
See Removal Delay Optimization below for more details on the removalDelay property.
The query cube computes the contributions the unresponsive cube made to the application's global topology and removes them. If the JVM then starts working again, the cluster must be updated once more. At scale this creates cluster-wide instability, so garbage collection must be monitored and long pauses prevented.
To operate a stable cluster, long GC pauses (> 5-10 sec.) must be avoided at all costs. This might require tuning the JVMs of the data nodes.
The query node doesn't use off-heap memory: it only stores the content of the hierarchies (on heap) and the transient data needed to handle queries. For that reason, and because long GC pauses must be avoided, do not over-allocate memory to the JVM it runs on. Java garbage collectors struggle to handle hundreds of gigabytes of heap without long pauses, even with G1GC. Therefore, do not set the -Xmx parameter above 100 GB for the query node.
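As an illustration only, the launch command below sketches possible query node JVM options under these constraints: the heap is capped (here at 100 GB), G1GC is used with a pause-time goal, and unified GC logging (JDK 11+) is enabled so pause durations can be monitored. The jar name and the figures are placeholders; derive the actual values from stress tests on your own workload.
# Sketch of query node JVM options; query-node.jar and the values below are placeholders.
# -Xmx caps the heap, -XX:MaxGCPauseMillis is a pause-time goal (not a guarantee),
# and -Xlog:gc* writes a GC log that can be checked for long pauses.
java -Xms100g -Xmx100g \
     -XX:+UseG1GC -XX:MaxGCPauseMillis=500 \
     -Xlog:gc*:file=gc.log:time,uptime,level,tags \
     -jar query-node.jar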
Removal Delay Optimization
Even with good tuning and monitoring, a data node may occasionally drop out of the cluster: because of a network issue, because it was saturated with calculations, or because it entered a garbage collection pause and stopped answering heartbeats for too long. When the data node is back to normal, it has to rejoin the cluster and resend its hierarchies to the query node. If several data nodes go through this at the same time, it can create a peak of activity and make the cluster unstable.
Atoti has a mechanism to alleviate the problem. When a data node drops out of the cluster,
the query node maintains its hierarchies for a period of time, called removalDelay.
After this period, the query cube removes the failing cube's contributions. As this operation can be costly, we want to avoid triggering it as much as possible.
If the data node rejoins within that period and the query node can tell from the data node's latest epoch that nothing has changed, the hierarchies do not need to be resent.
When the removal delay is set to zero, the mechanism is deactivated, which is not recommended for a large cluster.
You should set the removalDelay property higher than the longest GC pause that has been
witnessed across the cluster.
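As a sketch only: assuming a query cube builder analogous to the data cube builder shown earlier, the removalDelay could be set through withProperty in the cluster definition. The asQueryCube() entry point, the exact placement of the property, and its unit are assumptions to verify against the property reference of your Atoti version.
.asQueryCube()
.withClusterDefinition()
.withClusterId("SupplyChainCubeCluster")
.withMessengerDefinition()
.withProtocolPath("jgroups-protocols/protocol-tcp.xml")
// Keep a lost data node's hierarchies for 120 seconds before removing its contributions.
// The placement of the property and its unit are assumptions; check your version's property reference.
.withProperty("removalDelay", "120")
.end()
.withApplicationId("supply-chain")
.end()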
Take Charge of JGroups Configuration
Atoti queries and loads are highly parallel and can saturate every CPU core for long periods. During those peaks, the OS scheduler must still allocate time slices to the JGroups protocol threads that send heartbeats, verify suspicions, and install views. Long JVM pauses (GC) or CPU starvation can delay those threads enough for failure detectors to suspect the node. Do not rely on any sample JGroups configuration. Tune it to your deployment and workload.
Below are some of the core JGroups protocols and how they relate to Atoti:
- pbcast.GMS is the group membership service. It installs views (the list of members) and drives join/leave/merge handling. Its timeouts decide how quickly Atoti reacts to a network partition or crash, and overly short values can force unnecessary cube rediscovery during CPU spikes.
- VERIFY_SUSPECT confirms suspicions before a view change. Keeping it just below pbcast.GMS in the protocol file reduces false resets when a node is briefly busy (e.g., parallel query fan-out, GC).
- FD_SOCK uses a dedicated socket to check liveness. It detects crashes quickly and is resilient to query-induced heartbeat jitter, but it requires stable, reachable TCP ports between members.
- FD_ALL exchanges periodic heartbeats over the regular channels. It fits environments where extra ports are blocked or remapped (NAT, Kubernetes hostPort/NodePort, strict firewalls), but it is slower to detect failures and more sensitive to GC or CPU saturation.
Additionally, there are special JGroups stack components for specific environments, such as AWS, Azure, GCP, or Kubernetes.
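For example, on Kubernetes the static member discovery of a plain TCP stack can be replaced with the KUBE_PING protocol from the jgroups-kubernetes extension, which looks up cluster members through the Kubernetes API. The snippet below is only a sketch: the namespace and label values are placeholders, and the extension must be on the classpath with RBAC permission to list pods.
<!-- Sketch: Kubernetes-based member discovery; "atoti" and "app=atoti-node" are placeholders. -->
<org.jgroups.protocols.kubernetes.KUBE_PING
namespace="atoti"
labels="app=atoti-node"/>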
Prevent Excessive MergeView
Sometimes, the cluster may split into subgroups that will need to be merged back afterward.
In such cases, a MergeView will be received by the application.
This behavior can be caused by network partitioning, network issues, or by large workloads causing
nodes to become unresponsive (e.g. huge data loading in the data nodes).
MergeView events that can be linked to high CPU usage in a data node are a strong indicator of a poorly configured JGroups failure-detection stack.
Atoti handles MergeView by applying a FullReset, that is, all members of the new view are
considered as new members. This results in applying the entire discovery process again.
This recovery process is extremely costly and should be avoided as much as possible.
When observing a lot of MergeView or FullReset, which can be spotted in the logs with the
keyword MergeView, tune the JGroups failure detector and membership protocols:
- pbcast.GMS drives view installation; set its timeouts high enough to absorb transient CPU or GC pauses caused by parallel query bursts.
- VERIFY_SUSPECT should sit immediately below pbcast.GMS so that suspicions are confirmed before views are installed.
- Enabling both FD_SOCK and FD_ALL is rarely useful; pick one based on your network constraints and tune its timeouts generously.
To increase the timeouts and number of retries, start from one of the following stack fragments,
keeping the order pbcast.GMS → VERIFY_SUSPECT → failure detector:
<!-- Favor this stack when direct TCP connectivity between members is stable. -->
<pbcast.GMS
print_local_addr="true"
join_timeout="10000"
leave_timeout="10000"
merge_timeout="10000"
num_prev_mbrs="200"
view_ack_collection_timeout="10000"/>
<VERIFY_SUSPECT timeout="10000" num_msgs="5"/>
<FD_SOCK
get_cache_timeout="10000"
cache_max_elements="300"
cache_max_age="60000"
suspect_msg_interval="10000"
num_tries="10"
sock_conn_timeout="10000"/>
<FD_HOST interval="10000" timeout="35000"/>
<!-- Prefer this variant when node-to-node sockets are constrained (e.g. NAT, container port
mapping). It piggybacks on regular messaging and avoids extra ports, but detection is slower
and more sensitive to GC pauses or CPU spikes because it relies on periodic heartbeats. Tune
timeouts generously to avoid false suspicions. -->
<pbcast.GMS
print_local_addr="true"
join_timeout="10000"
leave_timeout="10000"
merge_timeout="10000"
num_prev_mbrs="200"
view_ack_collection_timeout="10000"/>
<VERIFY_SUSPECT timeout="10000" num_msgs="5"/>
<FD_ALL timeout="60000" interval="20000"/>
<FD_HOST interval="10000" timeout="35000"/>
Splitting the loading process into smaller data portions may also help solve the issue.
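For instance, rather than committing the whole initial load in one datastore transaction, it can be committed in several smaller ones. The sketch below assumes the datastore transaction API (startTransaction, addAll, commitTransaction); the store name and the recordChunks batching are placeholders.
// Sketch: commit the load in smaller transactions so each commit triggers a smaller burst of work.
// "MyStore" and recordChunks are placeholders; adapt them to your own schema and batching.
final ITransactionManager transactionManager = datastore.getTransactionManager();
for (final List<Object[]> chunk : recordChunks) {
transactionManager.startTransaction("MyStore");
transactionManager.addAll("MyStore", chunk);
transactionManager.commitTransaction();
}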
Delay Messenger Startup in Data Nodes
As explained in the Communication flow topic, the most important messages in an Atoti cluster are the discovery messages sent from the data nodes to the query nodes. These messages can be large, take time to transmit, and consume a significant amount of memory when they are serialized.
For these reasons, ActiveViam optimized the communication between a query node and its associated data nodes to regulate the flow of discovery messages.
To leverage this optimization, we highly recommend starting each data node's messenger only after the initial load. To do this, deactivate the messenger's autostart feature and start the messenger manually by calling:
((IMultiVersionDataActivePivot) multiVersionActivePivot).startDistribution()
Below is a code snippet where we deactivate the autostart feature:
.asDataCube()
.withClusterDefinition()
.withClusterId("myClusterId")
.withMessengerDefinition()
.withKey(IMessengerDefinition.NETTY_PLUGIN_KEY)
.withProperty(IMessengerDefinition.AUTO_START, "false")
.end()
.withProtocolPath("jgroups-protocols/protocol-tcp.xml")
.end()
.withApplicationId("myApplicationId")
.withAllHierarchies()
.withAllMeasures()
.end()
After the data loading has completed, start the messengers:
for (final IMultiVersionActivePivot multiVersionActivePivot :
activePivotManager.getActivePivots().values()) {
if (multiVersionActivePivot instanceof IMultiVersionDataActivePivot) {
((IMultiVersionDataActivePivot) multiVersionActivePivot).startDistribution();
}
}