
Protocol Configuration

Atoti uses JGroups for its cluster-coordination layer in its distributed architecture. Cluster stability is tightly coupled to how JGroups is configured, particularly the protocols that detect node failures and install membership views. This page covers the non-security protocols and how they interact with Atoti's workloads.

note

Atoti's distributed architecture uses two transports.

  • JGroups carries the cluster control plane: membership views, discovery, heartbeats, view-install messages, suspect/verify exchanges.
  • Netty carries the application data plane: query requests and results, hierarchy payloads during initial discovery, commit broadcasts, data loads, and any message the application code sends to another node.

This page covers JGroups configuration only. Whenever it discusses message sizes, buffer tuning, or network bandwidth, the scope is the JGroups control plane. Application-scale traffic travels over Netty and is documented separately.

For cluster authentication and traffic encryption (AUTH, SYM_ENCRYPT, ASYM_ENCRYPT), see the Cluster Security page.

There is no universal JGroups stack. The right configuration depends on the dataset, workload, infrastructure, and network. Production deployments must be stress-tested with real-life conditions before going live.

Whichever stack is chosen, every member of the cluster must share the same protocol list in the same order. Attribute values that participate in inter-member negotiation (AUTH credentials, ENCRYPT keystores, transport address and port range) must also match exactly. Local-only settings such as timeouts and thread-pool sizes can differ per member without breaking the cluster. The simplest way to keep the shared parts in sync is to ship a single protocol.xml with the application and reference it from every member's configuration.

Protocol stack basics

JGroups XML files are written bottom-up: the first <protocol> element (typically the transport) sits at the bottom of the stack, and each following element sits higher up, with the last element at the top. Whenever this page mentions a protocol as "below" another in the stack, it means the protocol appears earlier in the XML file: the two orderings are reversed.
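As an illustration of this bottom-up ordering, a skeleton stack (protocol names only, all attributes omitted) might look like the sketch below. The ordering loosely follows the stock udp.xml that ships with JGroups, and the `urn:org:jgroups` namespace is the one used by those stock files:

```xml
<config xmlns="urn:org:jgroups">
    <UDP/>             <!-- transport: bottom of the stack, declared first -->
    <PING/>            <!-- discovery -->
    <MERGE3/>          <!-- partition recovery -->
    <FD_SOCK2/>        <!-- failure detection (socket-based) -->
    <FD_ALL3/>         <!-- failure detection (heartbeat-based) -->
    <VERIFY_SUSPECT/>  <!-- double-checks suspicions before GMS acts -->
    <pbcast.NAKACK2/>  <!-- reliable multicast -->
    <UNICAST3/>        <!-- reliable unicast -->
    <pbcast.STABLE/>   <!-- message stability -->
    <pbcast.GMS/>      <!-- membership: near the top, declared near the end -->
    <FRAG2/>           <!-- fragmentation -->
</config>
```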

Protocol categories

A JGroups stack is built from small, single-purpose protocols, each solving one problem in group communication. An Atoti configuration typically uses one or two protocols from each of the following categories:

  • Transport: puts bytes on the wire between members. Exactly one transport per cluster. See the Transport protocols section.
  • Discovery: finds existing cluster members at startup so a new node knows how to join.
  • Partition recovery: handles cluster splits and the subsequent re-merges. Network partitions are unavoidable; these protocols handle recovery once connectivity returns.
  • Reliable delivery and message ordering: turns a best-effort transport into a reliable one, using per-sender sequence numbers and retransmission.
  • Message stability: ensures the cleanup of the retransmission buffers held by the reliable-delivery protocols (previous category) once a message has been acknowledged by every member.
  • Fragmentation: splits large messages into reasonably sized fragments and reassembles them on the receiver.
  • Flow control: bounds the volume of in-flight messages a fast sender can push at slow receivers.
  • Failure detection: probes cluster members and raises SUSPECT events when one stops responding. See the Failure detection section.
  • Membership: installs views (the current list of live members) and drives join, leave, and merge events.
  • Security: cluster authentication and traffic encryption. Covered on the specific Cluster Security page.
warning

Protocol order within the stack matters: several protocols have strict placement rules. The dedicated section for each category describes the placement rules it cares about.

Transport protocols

The transport protocol sits at the bottom of the JGroups stack and is responsible for sending bytes on the network. JGroups ships several transports; UDP and TCP are the common choices for Atoti deployments. (TCP_NIO2 is a non-blocking variant of TCP, and TUNNEL routes through an external GossipRouter; neither is typical for Atoti deployments.)

A JGroups cluster uses exactly one transport, and every member must use the same one.

UDP

UDP sends packets over the network without any prior setup: there is no connection between sender and receiver, and the sender gets no confirmation that a packet arrived. Each packet can be lost, duplicated, reordered, or silently dropped upon error, and the transport layer does nothing to recover.

Messages addressed to all cluster members are sent using UDP multicast: a network-level feature where the sender emits a single packet and the network (routers and switches) delivers a copy to every host that has joined the multicast group, avoiding the cost of one unicast packet and connection per recipient. Node-to-node messages use plain UDP unicast.

This makes UDP the traditional choice for clusters that run within a single subnet with IP multicast enabled.
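A minimal UDP transport element might look like the sketch below. The multicast address and port are placeholders to be chosen per cluster, and SITE_LOCAL is one of the interface-matching bind_addr tokens discussed later on this page:

```xml
<UDP bind_addr="SITE_LOCAL"
     mcast_addr="228.8.8.8"
     mcast_port="45588"/>
```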

warning

Even though JGroups traffic on an Atoti cluster is limited to small cluster-control messages (views, heartbeats, discovery, suspect/verify exchanges), those messages must still be delivered reliably and in FIFO order per sender. A lost or out-of-order view-install from pbcast.GMS, for example, can cause members to disagree about cluster membership, momentarily causing queries to throw (if the query node is able to detect that some data nodes are missing), before triggering a costly FullReset of the cluster.

A UDP-based stack is only viable because the reliability-layer protocols described below build those guarantees on top of the transport. Configuring them correctly is not optional.

The following protocols for reliable transmission are effectively mandatory on any UDP-based Atoti cluster:

  • pbcast.NAKACK2 and pbcast.STABLE together, for reliable multicast. The first adds lossless, ordered delivery of multicast messages: it numbers every outgoing multicast message with a per-sender sequence number, detects gaps on the receiver side, and requests retransmission from the sender. To do so, each cluster member keeps a buffer of sent messages. The second protocol is responsible for cleaning up these buffers. Using NAKACK2 without STABLE leads to an OutOfMemoryError over the lifecycle of the application.
  • UNICAST3 for reliable unicast. Same reliable, FIFO semantics, for point-to-point messages.
  • FRAG2, FRAG3, or FRAG4, for message fragmentation. It splits messages that are larger than what the network can carry in a single packet (the path MTU, typically around 1500 bytes) into smaller pieces and reassembles them on the receiver. JGroups control messages on an Atoti cluster are usually well below the MTU, but FRAG* is still required as insurance: any occasional oversized message (a pbcast.GMS view in a large cluster, for example) would otherwise be silently dropped by the OS, breaking cluster coordination.
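Put together, the reliability layer of a UDP-based stack might look like the fragment below (attributes omitted; ordering loosely follows the stock udp.xml shipped with JGroups):

```xml
<UDP/>
<PING/>
<pbcast.NAKACK2/>  <!-- reliable, FIFO multicast: per-sender seqnos, NAK-based retransmission -->
<UNICAST3/>        <!-- reliable, FIFO unicast -->
<pbcast.STABLE/>   <!-- purges NAKACK2's retransmission buffers; never omit with NAKACK2 -->
<pbcast.GMS/>
<FRAG2/>           <!-- insurance against occasional oversized control messages -->
```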

Multicast support also depends on the underlying network. Within a single subnet, switches forward multicast packets automatically. Between subnets, routers do not forward multicast traffic by default: they silently drop it, with no error and no log line. Crossing subnets therefore requires the network team to enable it explicitly (typically PIM on the routers and IGMP snooping on the switches). Without that, a cluster that works perfectly within one VLAN can silently fracture into independent sub-clusters as soon as a member is moved to a different subnet, and members simply never see each other's broadcasts.

TCP

TCP opens a direct TCP connection between each pair of cluster members. There is no multicast: every broadcast is turned into a fan-out on the sender's side, sending the same message once per peer-to-peer connection. TCP is the common choice for environments where IP multicast is disabled, unavailable, or unreliable: Kubernetes, multi-region VPCs, strict datacenter network policies...

Compared to UDP, TCP brings a lot for free. Each connection is ordered, lossless, and flow-controlled at the transport level, so most of the reliability concerns that dominate a UDP configuration are handled directly by the transport. TCP also exposes connection state to JGroups: when a peer's socket closes abnormally, the transport can raise a failure suspicion directly. See the failure detection section.

TCP does not make the reliability-layer protocols redundant, though. A broadcast still travels over N separate TCP connections whose writes can complete in different orders, so higher protocols still handle the ordering and concurrency properties that TCP alone cannot:

  • pbcast.NAKACK2 and pbcast.STABLE for ordering of multicast messages across the fan-out, plus cleanup of the retransmission buffers.
  • UNICAST3 stays in the stack even though TCP already provides reliable unicast: it also enforces that messages from the same sender are delivered one at a time. Removing it allows concurrent delivery, which may work for Atoti but has not been thoroughly tested (Atoti implements its own sequencing for messages at the application level).

TCP requires every member to be able to open an outgoing TCP connection to every other and to accept incoming connections in return. In practice, the configured bind_port (plus any port_range the member falls back to) must be reachable in both directions between every pair of nodes, and container networking must preserve direct pod-to-pod routability. Unlike UDP multicast, there is no router-level configuration to enable: if a TCP connection can be established between two IP addresses, the transport works.

The trade-off is connection count. A cluster of N members maintains up to N × (N−1) directional TCP connections (one per ordered pair of members). This is not a concern for Atoti deployments with a low node count, but very large clusters may require additional network setup.
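A TCP transport element might look like the sketch below. The port values are placeholders; bind_port plus the port_range fallback window must be reachable in both directions between every pair of members:

```xml
<TCP bind_addr="SITE_LOCAL"
     bind_port="7800"
     port_range="5"/>
```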

DNS and address resolution

Any address attribute on the JGroups stack (the transport's bind_addr, external_addr, TCPPING's initial_hosts, FD_SOCK2's bind_addr, and so on) accepts a hostname in place of a literal IP. No flag is needed to enable this: hostnames are resolved via standard Java DNS at stack startup, and JGroups operates on the resulting IPs for the rest of the JVM's lifetime. Every member must therefore be able to resolve every other member's hostname when the stack comes up.

warning

Long-running JVMs cache positive DNS lookups. In a default OpenJDK configuration with a security manager installed, the positive-cache TTL is effectively infinite: once an Atoti process resolves data-node-3.cluster.local to 10.0.42.17 at startup, it keeps using that IP for the lifetime of the JVM, even if DNS is later updated to point the hostname at a different address. In environments where member IPs change routinely (Kubernetes pod restarts, cloud auto-scaling), this silently breaks the cluster. Cap the positive-cache TTL by setting networkaddress.cache.ttl in $JAVA_HOME/conf/security/java.security to a reasonable value (30 seconds is typical), or pass -Dsun.net.inetaddr.ttl=30 on the JVM command line.

Reverse DNS is occasionally triggered by JGroups for logging and diagnostics (for example, when print_local_addr="true" is set on pbcast.GMS). If reverse DNS is slow or broken in the environment, it can insert latency into otherwise fast code paths. Setting the negative-cache TTL similarly (via networkaddress.cache.negative.ttl or -Dsun.net.inetaddr.negative.ttl) helps avoid repeatedly retrying broken lookups.

warning

JGroups has a second cache of its own, in org.jgroups.util.Util, that holds the host's local network interfaces and addresses. It is populated the first time JGroups resolves an interface-matching bind_addr token (SITE_LOCAL, GLOBAL, NON_LOOPBACK, match-interface, and so on). Unlike the JVM DNS cache above, this one has no TTL and no configuration knob: once populated, it is held until the JVM exits or Util.resetCachedAddresses(true, true) is called programmatically.

The assumption is that a running Atoti JVM does not see its own network interfaces change until it is restarted. If a specific use case requires such a change, a custom endpoint must be added to reset this cache; Atoti only handles this automatically upon a CRaC checkpoint.

Discovery

Discovery runs at startup to find the existing cluster members. It produces a list of candidate addresses that the new node contacts with a JOIN_REQUEST; once joined, the node relies on pbcast.GMS for all subsequent membership changes, and discovery falls idle.

JGroups ships a wide range of discovery protocols. The most common ones on Atoti clusters are:

  • PING: IP multicast discovery. UDP transports only.
  • TCPPING: static list of host:port entries set via the initial_hosts attribute. The natural fit for small TCP clusters with stable addresses.
  • TCPGOSSIP: delegates discovery to an external GossipRouter daemon. Useful when addresses are dynamic.
  • JDBC_PING / FILE_PING: each member registers its address in a shared database table or filesystem; discovery reads it back. Convenient when the cluster already has a shared store.

Cloud and container environments ship purpose-built variants, typically via jgroups-extras:

  • NATIVE_S3_PING for AWS.
  • AZURE_PING for Azure.
  • KUBE_PING for Kubernetes, discovering pods via the Kubernetes API.
  • DNS_PING also fits Kubernetes well when paired with a headless service.
  • JDBC_PING is a viable and stable solution for all cloud environments, worth a second mention here.

Despite the *PING naming convention, these protocols do not continuously probe peers for liveness. They run at startup to find cluster members, and may re-run when pbcast.GMS needs to rebuild the roster (for example, during a merge). Ongoing liveness checking ("is this peer still alive?") is entirely the job of the failure detectors.

See the JGroups manual for the full catalog and each protocol's configuration attributes.

Partition recovery

A network partition is a situation where the cluster temporarily splits into disconnected subgroups that cannot see each other. Partitions are unavoidable in distributed systems: a switch fails, a node is overwhelmed and stops answering, a cloud region loses connectivity for a few seconds. When connectivity is restored, the cluster must detect that there are multiple subgroups and reconcile them into a single view.

Not every partition JGroups reports is caused by the network itself. A data node that is fully occupied loading a large dataset, or paused by a long garbage collection, may stop responding to heartbeats long enough for the failure detector to conclude it has crashed. The cluster installs a view without that member, and when the node recovers, reconciling the two views triggers the same merge protocol as a real network split. From the cluster's perspective, the two cases are indistinguishable.

The protocol MERGE3 drives reconciliation. Every member periodically sends a small INFO message carrying its address, logical name, and current ViewId. Every coordinator, in turn, periodically inspects the INFO messages it has received and looks for a ViewId that disagrees with its own. When inconsistencies are found, one member takes the role of merge leader, asks each subgroup's coordinator for its full view, and passes a MERGE event up the stack to the pbcast.GMS protocol, which handles the actual merge and installs a single MergeView containing the union of all surviving members.

Note that MERGE3 reconciles cluster membership only: merging the application state accumulated by each subgroup during the split is the application's responsibility (for Atoti, the FullReset described below).

Typical configuration uses conservative timing: infrequent enough not to overlap with transient state changes but frequent enough to pick up genuine partitions within a reasonable window.

<MERGE3 min_interval="a value in ms" max_interval="another value in ms" check_interval="yet another value in ms" />

Atoti handles every Merge event by asking for a FullReset of the cluster. Every member of the merged view is treated as a brand-new member, and the entire initial discovery process runs again: each data node re-serializes all of its hierarchies and re-sends them to the query node over Netty.

This is extremely expensive at the application layer. In clusters with tens of data nodes and hierarchies of several hundred megabytes each, a single Merge can trigger hundreds of megabytes of serialization and deserialization in the JVM, transient memory allocation, and Netty-side network transfers, all at the same time and on every node. The JGroups side of the event is cheap (the view change itself is a handful of kilobytes of coordination), but the Atoti-driven recovery that follows drives CPU, memory, and network near capacity. The cluster remains functional during recovery, but throughput drops sharply until the process completes.

warning

Merge events that coincide with high CPU usage in a data node are strong indicators of a poorly configured failure-detection JGroups stack, not a real network partition. The data node was too busy to respond to heartbeats or socket checks, the failure detector concluded it was gone, a view was installed without it, and when it recovered, a merge was needed. Fix the protocol stack by raising the timeouts (see Tuning timeouts for Atoti workloads) and verifying the protocol order rather than concluding the cluster encountered a network problem.

Use MergeView as a diagnostic keyword when searching the logs.

Failure detection

A failure-detection protocol probes cluster members and, when one stops responding, emits a SUSPECT event that travels up the protocol stack. Exclusion itself is the job of pbcast.GMS: the detector only raises the alarm.

VERIFY_SUSPECT may sit between the detector and pbcast.GMS to double-check before a suspicion becomes a view change.

JGroups 5 offers several detectors, each targeting a different failure mode. Picking the right combination for an Atoti workload matters: an overly eager stack ejects healthy nodes during transient CPU or GC pressure, and every false ejection can cascade into a costly Merge.

The detectors are grouped into families by detection mechanism. Each family has been refined over JGroups releases; when choosing one, prefer the most recent variant unless there is a specific reason to pick an older one.

FD_SOCK family (socket-based)

FD_SOCK and its modern rewrite FD_SOCK2 detect crashed peers by watching dedicated TCP sockets between pairs of neighboring members. When a peer crashes, its operating system closes the socket, and the neighbor sees the close event immediately and emits a SUSPECT.

Whether the main transport is TCP or UDP, FD_SOCK runs its own set of TCP connections: its detection relies on the OS signaling an abnormal socket close, which only TCP produces. On a TCP main transport, these connections are additional to the full-mesh TCP connections the transport itself maintains: FD_SOCK adds O(N) extra sockets across the cluster, on different ports from the main transport, which is much cheaper than the main transport's O(N²) mesh but still consumes extra ports per member. On a UDP main transport, the same TCP sockets run alongside the UDP traffic. This is worth knowing for environments where UDP was chosen specifically to avoid TCP constraints, since FD_SOCK still needs TCP reachability between neighboring members.

Prefer FD_SOCK2 in new configurations: it has a cleaner implementation based on a non-blocking NIO server, and drops the neighbor-ring cache that the original FD_SOCK had to maintain.

note

These protocols are immune to CPU pressure and Garbage Collector pressure: a busy or paused peer keeps its socket open, so no false positive comes from a JVM stall. This is paramount for an Atoti application in which a very intense query or transaction may cause a GC stop-the-world event, for instance.
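A minimal FD_SOCK2 element might look like the sketch below; the offset attribute is assumed here to shift the protocol's own listening port away from the transport's port, and its value is a placeholder:

```xml
<FD_SOCK2 offset="50"/>  <!-- listens on the transport's port + 50 (placeholder) -->
```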

FD_ALL family (heartbeat-based)

FD_ALL, FD_ALL2, and FD_ALL3 detect unresponsive peers using periodic heartbeat messages. Every member broadcasts a heartbeat on a configured interval, and each receiver tracks when it last heard from each peer. If a peer goes silent for longer than the configured timeout, it is suspected. Heartbeats travel over the same channel as the regular JGroups traffic, which makes this protocol family a natural choice in NAT, Kubernetes hostPort / NodePort, or firewall-restricted environments.

note

These protocols are sensitive to CPU pressure and Garbage Collector pressure: a paused JVM stops sending heartbeats even when the node is still alive. See Sources of thread pressure for how each pressure type interacts with JGroups timing. Generous timeouts (depending on production workloads) are needed to absorb typical GC pauses. These timeouts can be tuned together with the GC configuration, if customized. See Tuning timeouts for Atoti workloads for concrete guidance.

Prefer FD_ALL3 in new configurations: it tracks per-member heartbeats with better timing precision and lower memory overhead than FD_ALL or FD_ALL2.
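An FD_ALL3 element might look like the sketch below. Both values are placeholders to be validated against the workload: the timeout should span several heartbeat intervals and exceed the longest observed GC pause by a healthy margin:

```xml
<!-- Placeholders: heartbeat every 8 s, suspect after 40 s of silence. -->
<FD_ALL3 interval="8000"
         timeout="40000"/>
```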

FD_HOST (host-level reachability)

FD_HOST detects the loss of an entire physical host (power cut, kernel panic, OS shutdown, ...) by probing it with InetAddress.isReachable() (or a configurable external command such as ping). It is not a peer-level detector: it cannot tell that a single JVM is down, only that the whole host has become unreachable.

Relevant only when multiple JGroups members share a host. If every Atoti node has its own machine or its own container with its own IP, FD_HOST adds no signal a peer-level detector would not already catch.

Transport-level suspect events (TCP / TCP_NIO2)

Not a protocol family of its own, but a property of the TCP and TCP_NIO2 transports: setting enable_suspect_events="true" on the transport element causes the transport itself to emit a SUSPECT event when a peer's TCP connection closes abnormally. The attribute defaults to false and must be set explicitly on the TCP or TCP_NIO2 element in the protocol stack.

This provides peer-level crash detection directly at the transport layer, without the extra per-neighbor socket ring of FD_SOCK2. The JGroups manual explicitly recommends using it in place of FD_SOCK / FD_SOCK2 on TCP-based stacks, paired with FD_ALL3 as a second line of defense for hangs and kernel panics.
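A TCP stack following this recommendation might look like the sketch below (intermediate protocols elided, all values placeholders):

```xml
<TCP bind_port="7800"
     enable_suspect_events="true"/>  <!-- abnormal socket closes raise SUSPECT -->
<!-- discovery, MERGE3, and reliability protocols declared here -->
<FD_ALL3 interval="8000" timeout="40000"/>  <!-- second line of defense -->
<VERIFY_SUSPECT timeout="2000"/>
<pbcast.GMS/>
```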

VERIFY_SUSPECT family

VERIFY_SUSPECT and its simpler refactor VERIFY_SUSPECT2 are not detectors; they are a verification step that sits between any detector and pbcast.GMS. When a SUSPECT event travels up from below, VERIFY_SUSPECT pings the suspected member one more time and waits a short timeout for a response. If the member responds, the SUSPECT is dropped. If it does not, the event passes up to pbcast.GMS, which installs a view change.

  • Protects against transient false positives. A SUSPECT raised during a GC pause that ends just before the verification round completes is correctly dropped.
  • Should always be in the stack. Every Atoti failure-detection configuration should include VERIFY_SUSPECT (or VERIFY_SUSPECT2); it is the single most important guard against cluster churn from GC pauses and brief CPU spikes.
  • Placement is load-bearing. VERIFY_SUSPECT must sit above the failure detector and below pbcast.GMS.

Sample failure detection stacks for Atoti

Atoti workloads stress JGroups timing: parallel queries, transaction commits, and initial data loads routinely saturate CPU for seconds at a time. A detection stack that is fine in a quiescent cluster can still raise false positives under a production load. Below are three practical setups, from simplest to most redundant.

In all three,

  • add FD_HOST only if multiple JGroups members share a physical host.
  • add VERIFY_SUSPECT no matter what.

Option 1: Transport-level detection + FD_ALL3 (recommended for TCP clusters). When the transport is TCP or TCP_NIO2 with enable_suspect_events="true" (see Transport-level suspect events), abnormal socket closes emit SUSPECT events at the transport layer, and FD_ALL3 serves as a second line of defense for hangs and failures that do not close sockets. This matches the JGroups manual's explicit recommendation: "a stack that has failure detection in TCP / TCP_NIO2 enabled does not necessarily need FD_SOCK / FD_SOCK2; the recommendation in this case is to remove FD_SOCK / FD_SOCK2 and only leave FD_ALL3 in the stack, as a second line of defense." This option minimizes the number of detectors that need explicit tuning.

Option 2: Single peer-level detector. Pick exactly one of FD_SOCK2 or FD_ALL3 and tune its timeout generously:

  • FD_SOCK2 when direct TCP connectivity between peers is stable. Event-driven, immune to GC and CPU pressure.
  • FD_ALL3 when peer-to-peer sockets are constrained (NAT, Kubernetes hostPort/NodePort, strict firewalls).

One detector means one source of suspicion, so tuning is straightforward and false-positive risk is low. This is the safest default for Atoti deployments that may not have the time or telemetry to tune a combined stack precisely.

Option 3: FD_SOCK2 + FD_ALL3 combined. The JGroups manual documents this as the classic belt-and-suspenders pattern: FD_SOCK2 catches immediate crashes, FD_ALL3 (with a higher timeout) catches hangs and kernel panics that do not close sockets. The combination doubles the sources of suspicion, so both detectors must be tuned with the workload in mind.
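The combined stack might look like the sketch below. All values are placeholders to be stress-tested, with FD_ALL3's timeout deliberately much higher than FD_SOCK2's near-immediate reaction time:

```xml
<FD_SOCK2/>                                  <!-- immediate crash detection -->
<FD_ALL3 interval="10000" timeout="60000"/>  <!-- hangs and kernel panics (placeholders) -->
<VERIFY_SUSPECT timeout="5000"/>             <!-- double-check before GMS acts -->
```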

warning

Combining FD_SOCK2 and FD_ALL3 with default timeouts is a common cause of false node ejections and excessive MergeViews in Atoti clusters. The thread pressure caused by parallel queries, large commits, and initial discovery routinely exceeds default FD_ALL3 timeouts. If this combination is selected, set FD_ALL3's timeout to at least the longest observed GC pause plus a healthy margin. See Tuning timeouts for Atoti workloads for the full picture.

Reliable message transmission

The transport alone does not guarantee that every message reaches every recipient in order: UDP cannot, and TCP can only guarantee it between a single pair of endpoints. Atoti relies on two properties that are built above the transport, by a tightly coupled group of protocols:

  1. Reliable delivery: every message a sender produces is eventually delivered to every intended recipient. No silent drops. No partially received messages (see Fragmentation section).
  2. FIFO order per sender: messages from a given sender are delivered in the order they were sent. (Messages from different senders can interleave arbitrarily.)

Three protocols provide these properties: pbcast.NAKACK2 for multicast messages, UNICAST3 for point-to-point messages, and pbcast.STABLE to keep the retransmission buffers they produce from growing without bound.

NAKACK2 (reliable multicast)

pbcast.NAKACK2 covers every message addressed to all cluster members. It assigns a per-sender sequence number to each outgoing multicast message, and each receiver tracks, for every other sender in the cluster, the highest contiguous sequence number it has delivered to the application.

When a receiver notices a gap (seqno 7 arrives but seqnos 5 and 6 are missing) it queues the later messages and sends a NAK (negative acknowledgement) to the sender asking for the missing ones. The sender retransmits from its retransmission buffer, the receiver fills the gap, and the queued messages are then delivered in order.

The protocol is purely NAK-based: the sender never receives a positive "I got message N" from any peer. This avoids the ack-storm problem (N receivers acknowledging every multicast individually), but it means the sender has no way, on its own, to tell when a given message has been received by everyone in the cluster. That is the problem pbcast.STABLE solves, which is why the two protocols are always paired on any NAKACK2 stack.

UNICAST3 (reliable unicast)

UNICAST3 covers point-to-point messages between one sender and one receiver. It uses the same per-sender sequence numbering and gap-based retransmission model as NAKACK2, but because a unicast channel has exactly one receiver, it can also exchange positive acks without the ack-storm problem. The receiver periodically sends a cumulative ack to the sender ("I have everything up to seqno N from you") and the sender can then drop everything at or below N from its retransmission buffer on the spot, without involving the rest of the cluster.

The practical consequence: UNICAST3 handles its own buffer cleanup per sender-receiver pair, and does not rely on pbcast.STABLE. A UNICAST3-only stack can safely omit STABLE; a stack that includes NAKACK2 cannot.

Even on TCP, where the transport itself already guarantees reliable, ordered byte-stream delivery between a pair of members, UNICAST3 stays in the stack, because it also enforces that messages from the same sender are delivered to the application one at a time (per-sender serialization), which TCP alone does not do. Removing it allows concurrent delivery of messages from the same sender, which is untested with an Atoti application.

pbcast.STABLE (retransmission-buffer garbage collection)

Every cluster member holds messages in a retransmission buffer so it can answer NAKs from lagging receivers. Without cleanup, that buffer grows forever. pbcast.STABLE is the protocol that decides when messages are safe to forget. STABLE runs on two triggers, both configurable:

  • desired_avg_gossip: time-based. The average interval between checks.
  • max_bytes: volume-based. When a member has received more than max_bytes since the last check, it initiates a new one.

On an Atoti cluster, where JGroups traffic is limited to cluster-control messages, the volume-based trigger rarely fires in practice; desired_avg_gossip is what effectively drives stability rounds. The max_bytes setting is worth keeping as a safety net, but is not a useful tuning lever for Atoti's workload.
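A pbcast.STABLE element tuned along these lines might look like the sketch below; both values are placeholders (the gossip interval in milliseconds, the byte threshold kept only as a safety net):

```xml
<pbcast.STABLE desired_avg_gossip="50000"
               max_bytes="4000000"/>
```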

warning

A stack that includes pbcast.NAKACK2 without pbcast.STABLE will leak memory. NAKACK2 has no other mechanism for dropping messages from its retransmission buffer, so every multicast message ever sent stays in memory on every member until the process restarts. Under any realistic traffic, this ends in an OutOfMemoryError.

UNICAST3 does not have this problem (it drains its own buffer from peer acks), so a UNICAST3-only stack can safely omit STABLE. But any NAKACK2 in the stack requires STABLE.

Modern alternative: pbcast.NAKACK4

pbcast.NAKACK4 is a newer protocol that replaces the NAKACK2 + STABLE pair with a single element, using positive acknowledgements instead of the NAK-based scheme of NAKACK2. Modern JGroups applications can use it directly. It is functionally equivalent for Atoti's purposes, simpler to reason about, and is the direction the JGroups project is moving.

Group membership (pbcast.GMS)

pbcast.GMS (Group Membership Service) owns the cluster's view: the authoritative, versioned list of members currently participating in the cluster. Every change to cluster membership flows through GMS, which produces a new view and broadcasts it to every member.

A single member, the coordinator, drives all view changes. By default, it is the oldest member in the cluster. Every other member is a passive follower: it acknowledges incoming views but does not propose them.

GMS responds to four kinds of events:

  • Join: a new node sends a JOIN_REQUEST to the coordinator. AUTH (see the Cluster Security page) inspects the request on its way in. If accepted, the coordinator adds the new node to the view and broadcasts the updated view.
  • Leave: a node sends a LEAVE_REQUEST during graceful shutdown. The coordinator removes it from the view and broadcasts a new one. Graceful leaves do not involve any failure detector.
  • Crash or unresponsiveness: a failure detector raises SUSPECT, VERIFY_SUSPECT confirms, and the coordinator installs a view without the suspected member. See Failure detection.
  • Merge: MERGE3 detects divergent views across sub-clusters after a partition, the merge leader collects full views from subgroup coordinators, and GMS installs a single MergeView containing the union. See Partition recovery.

pbcast.GMS sits near the top of the protocol stack, above the reliability and failure-detection layers. AUTH sits directly below it (so join requests can be filtered before GMS processes them), with VERIFY_SUSPECT and the failure detector below AUTH.

In an XML file, which is written bottom-up, <pbcast.GMS> is therefore one of the last elements declared, above <AUTH>, <VERIFY_SUSPECT>, and the failure-detection protocol.
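In XML, that ordering reads as follows (the protocol names are real; attributes are omitted here for clarity, and a usable stack must set them):

```xml
<!-- ... transport, discovery, and reliability protocols appear
     earlier in the file (lower in the stack) ... -->
<FD_ALL3/>            <!-- failure detector: raises SUSPECT -->
<VERIFY_SUSPECT2/>    <!-- double-checks the suspicion -->
<AUTH/>               <!-- filters join requests before GMS sees them -->
<pbcast.GMS/>         <!-- installs views; near the top of the stack -->
```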

Every view change has a real cost at the Atoti level:

  • A view that loses a data node can cause the query node to fail in-flight queries with an exception or to return partial results, depending on how the query node reacts to missing data nodes.
  • A MergeView triggers the FullReset described in Partition recovery, which is extremely expensive at the application layer (megabytes of serialization and Netty-side transfers on every node).
  • Frequent view changes under load keep the cluster in perpetual recovery.

Tight GMS timeouts amplify every one of these costs: any transient CPU burst, GC pause, or network jitter that exceeds a configured timeout causes a spurious view change. The rule of thumb is to set GMS timeouts comfortably above the longest realistic GC pause and the slowest coordinator round-trip under load, rather than aiming for the fastest possible membership response. See Tuning timeouts for Atoti workloads for concrete guidance on setting these on an Atoti cluster.
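For instance, assuming a hypothetical worst-case GC pause of about 20 seconds under load, the rule of thumb translates into something like this (attribute names are real pbcast.GMS attributes; the numbers are purely illustrative):

```xml
<!-- view_ack_collection_timeout must outlast the slowest member's
     GC pause, or installing a view times out spuriously -->
<pbcast.GMS join_timeout="10000"
            view_ack_collection_timeout="30000"/>
```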

Fragmentation

Fragmentation splits a single outgoing Message whose payload exceeds a configured frag_size into several smaller Message objects at the JGroups level. Each fragment travels independently down the rest of the stack, and the fragmentation protocol reassembles them into the original message on the receiver side before it reaches the application.

JGroups 5 ships three interchangeable variants, FRAG2, FRAG3, and FRAG4, each later one more memory-efficient than the previous. The main attribute is frag_size: messages larger than this are split; smaller ones are sent as-is. The JGroups default is 60K.

On UDP, fragmentation is mandatory. A UDP datagram has a hard size limit, and a message that exceeds what the network can carry as a single datagram cannot be delivered: without a FRAG* protocol, it is silently dropped. Every UDP-based stack must include one.

On TCP, fragmentation is optional for delivery (the stream transport carries messages of any size), and keeping a FRAG* protocol costs essentially nothing.
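A minimal fragment (frag_size is the protocol's real attribute; the value shown is simply the JGroups default):

```xml
<!-- FRAG4: splits messages larger than frag_size into fragments,
     reassembled on the receiver before delivery to the application -->
<FRAG4 frag_size="60000"/>
```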

Tuning timeouts for Atoti workloads

The default JGroups timeouts are tuned for typical enterprise applications and are too aggressive for Atoti workloads. Atoti applications favor a stable cluster, in which view changes rarely occur outside intentional startup and shutdown. The most expensive event that can hit an Atoti cluster is recovering from a node that left the cluster when it should not have: if that node's lifecycle (data transactions, simulation creations, and so on) continues during the transient false ejection, recovery can prove difficult or impossible, and the node is then asked to reset its state and re-send its entire initial discovery.

This scenario should be avoided.

Sources of thread pressure

Two categories of thread pressure on an Atoti node are likely to interact with JGroups timing, and the failure-detection stack responds to each differently.

CPU saturation from application work. Atoti queries and data loads are highly parallel and can saturate every CPU core for seconds at a time. A large query produces many tasks executed in parallel, and data loads drive commits that occupy similar thread pools. Two Atoti events in particular are intense enough to matter:

  • Initial transactions, unload/load data cycles.
  • Initial discovery and hierarchy serialization. When a data node first joins the cluster, it serializes and sends every hierarchy's content to the query node over Netty. The transfer itself does not touch the JGroups stack, but the CPU and memory pressure it creates in the JVM does: the data node's JVM is saturated by serialization, and the query node's JVM is saturated by deserialization and merging. See Delay Messenger Startup in Data Nodes for a technique that separates the initial load from the join sequence and avoids overwhelming the query node, by having the data nodes join one at a time.

Garbage-collection pauses. A stop-the-world GC pause freezes every thread in the JVM, including JGroups timer threads. During the pause, the node cannot send heartbeats, cannot process incoming heartbeats, and cannot respond to VERIFY_SUSPECT. A 30-second G1 pause is 30 seconds of total JGroups silence as seen by every peer. This is why GC tuning, JGroups timeout tuning, and the choice of failure-detection protocol all matter for an Atoti application.

How JGroups responds to each

CPU saturation from application work usually does not cause false suspicions. JGroups timer threads, transport I/O threads, and message-dispatch threads are ordinary Java threads, mapped 1:1 to OS threads. The OS scheduler does not know that one thread belongs to ForkJoinPool and another belongs to JGroups. Under Linux's scheduler, all runnable threads get a proportional share of CPU time, and the kernel preempts running threads at short time slice boundaries (a few milliseconds). Even with 32 hot query workers on 8 cores, the JGroups timer thread is scheduled within a few milliseconds of becoming runnable, and its heartbeats go out roughly on time.

The only way application work truly starves JGroups is by exhausting JGroups' own message-dispatch pool (for example, by blocking those threads on large deserialization), not by competing for CPU. Because Atoti sends application messages over Netty, this failure mode does not apply to Atoti.

GC pauses are the main cause of false suspicions in Atoti clusters. There is no trick the OS can play: a paused JVM cannot send or process any JGroups message until the pause ends. This is why every tuning recommendation in this page boils down to raising timeouts past the longest observed GC pause on the JVM.
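As an illustration, assuming a hypothetical longest observed GC pause of around 20 seconds, an FD_ALL3 configuration following this rule might look like (attribute names are those of FD_ALL3; the values are hypothetical, not recommendations):

```xml
<!-- timeout sits comfortably above the longest observed GC pause;
     interval is small enough that several heartbeats fit within one timeout -->
<FD_ALL3 timeout="60000" interval="10000"/>
```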

Example stack fragments

warning

The examples below are not exhaustive. Consult the list of available attributes for each protocol.

The following fragments show the detection-chain section of a JGroups XML stack. In both cases, the order is pbcast.GMS → VERIFY_SUSPECT2 → failure detector, which in the XML means pbcast.GMS appears last (highest in the stack), then VERIFY_SUSPECT2, then the failure detector, each one earlier in the file than the one above it.

FD_SOCK2 variant: favor this when direct TCP connectivity between cluster members is stable:

<!-- FD_SOCK2: socket-based failure detector.
- offset: port offset from the main transport's bind_port
- port_range: number of ports to scan from the offset for a free one. Must be greater than the number of cluster members.
- connect_timeout: max wait for an outgoing FD_SOCK2 connect (ms)
- suspect_msg_interval: how often an unresolved SUSPECT is rebroadcast (ms) -->
<FD_SOCK2 offset=""
          port_range=""
          connect_timeout=""
          suspect_msg_interval=""/>

<!-- VERIFY_SUSPECT2: verifies a SUSPECT before it reaches pbcast.GMS
- timeout: max wait for the suspect to respond (ms)
- num_msgs: number of verification ping attempts -->
<VERIFY_SUSPECT2 timeout="" num_msgs=""/>

<!-- pbcast.GMS: Group Membership Service.
- join_timeout: wait for the coordinator to ack a JOIN_REQUEST (ms)
- leave_timeout: wait for a graceful leave to complete (ms)
- merge_timeout: wait for the merge protocol to collect subgroup views (ms)
- max_join_attempts: number of failed join attempts after which the member gives up and forms a standalone cluster; set to 0 to retry forever, so this cannot happen
- num_prev_mbrs: how many past members to remember
- view_ack_collection_timeout: wait for every member to ack an installed view (ms) -->
<pbcast.GMS join_timeout=""
            max_join_attempts="..."
            leave_timeout=""
            merge_timeout=""
            num_prev_mbrs=""
            view_ack_collection_timeout=""/>

FD_ALL3 variant: prefer this when node-to-node sockets are constrained (NAT, container port mapping, strict firewalls). Detection is timeout-based and sensitive to GC pauses; tune timeouts generously:

<!-- FD_ALL3: heartbeat-based failure detector
- timeout: silence threshold before a peer is suspected (ms)
- interval: heartbeat broadcast interval (ms) -->
<FD_ALL3 timeout="" interval=""/>
<!-- See above -->
<VERIFY_SUSPECT2 timeout="" num_msgs=""/>
<pbcast.GMS max_join_attempts="..."
            join_timeout=""
            leave_timeout=""
            merge_timeout=""
            num_prev_mbrs=""
            view_ack_collection_timeout=""/>

These fragments show only the detection chain. A complete protocol file also needs a transport, a discovery protocol, pbcast.NAKACK4 (which covers both reliable delivery and stability), UNICAST3, MERGE3, BARRIER, and the security protocols (AUTH, SYM_ENCRYPT or ASYM_ENCRYPT). For the security protocols, see the Cluster Security page.

Deployment environments

All members within a single subnet or VLAN

The simplest case. Switches forward both multicast and unicast freely within a VLAN, there is typically no firewall between members, and member IP addresses are stable. UDP multicast, TCP, and FD_SOCK* all work out of the box. Any combination of transport and failure detector is viable. This is Atoti's happy-path networking scenario.

Behind NAT

FD_ALL3 is preferred because it travels over the main transport's existing channel and does not require per-member port-forwarding rules at the NAT device. FD_SOCK* is workable but requires static port-forwarding on the NAT, which is painful to maintain in dynamic environments.

Kubernetes

Kubernetes has several overlapping networking concerns that interact with JGroups:

  • Pod-to-pod networking. By default, every pod in a Kubernetes cluster can reach every other pod by pod IP, on any port. If all Atoti members run as pods in the same cluster and nothing restricts pod-to-pod traffic, FD_SOCK2 and the main transport work without special configuration: point members at each other's pod IPs and keep the relevant bind_port range open.
  • NetworkPolicy. A Kubernetes-native firewall at the pod level. A NetworkPolicy object defines which pods may receive traffic from which other pods, selected by label. A cluster with no NetworkPolicy allows all pod-to-pod traffic; adding a policy turns traffic into opt-in. For an Atoti deployment, the relevant rule is "allow inbound and outbound connections on the JGroups transport port, and on any FD_SOCK* port range if used, between Atoti pods selected by label". Without that rule, the cluster will not form.
  • hostPort / NodePort. These expose a pod's port on the node's network interface (hostPort) or on every node of the cluster through kube-proxy (NodePort). They are needed only when something outside the Kubernetes pod network must reach the pods (for example, when mixing Kubernetes-hosted and VM-hosted Atoti members, or when two Kubernetes clusters need to talk to each other). They are not needed for pod-to-pod communication within a single cluster.
  • Relationship between NetworkPolicy and hostPort / NodePort. NetworkPolicy governs traffic inside the cluster pod network; hostPort / NodePort expose pods outside it. They solve different layers of the problem, and a deployment can use one, both, or neither. For a pure-Kubernetes Atoti deployment, only NetworkPolicy is usually relevant; hostPort / NodePort become a concern only when crossing the Kubernetes boundary.

FD_ALL3 remains the safer default for Kubernetes environments because it does not require opening additional ports at any of the layers above.

Side note: NetworkPolicies must also be configured to allow communication through Netty, which is documented separately.

Corporate firewalls

"Corporate firewall" is a catch-all for the security devices between VLANs, between environment tiers, or at the corporate perimeter. Unlike a home NAT, corporate firewalls typically work by explicit allow-rules: traffic that does not match an allow-rule is dropped.

For a JGroups cluster spanning a firewall boundary (members on different VLANs or across tiers), the firewall team needs to allow:

  • The main transport port (TCP or UDP, depending on the transport in use), between every pair of cluster members, in both directions.
  • For UDP multicast, the multicast group address and port, plus the cross-subnet multicast routing described in the UDP subsection.
  • Any discovery port (for example, the TCPGOSSIP GossipRouter port, if the deployment uses one).
  • Any failure-detection ports, depending on which detector is chosen:
    • FD_SOCK: one ephemeral port per member, picked at runtime. Not firewall-friendly, since the port is not known in advance.
    • FD_SOCK2: a predictable range starting at bind_port + offset, with port_range fallback slots (up to 1 + port_range ports per member, five by default). The range is static and easy to pre-declare.
    • FD_ALL3: no extra ports at all. Heartbeats travel over the main transport port, which the firewall already allows. The natural choice when firewall-rule changes are slow or expensive.
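The FD_SOCK2 port range can be read straight off the configuration. Assuming a transport bound at port 7800 (all values here are illustrative):

```xml
<!-- Transport bound at 7800. FD_SOCK2 then listens on a port in
     bind_port + offset .. bind_port + offset + port_range,
     here 8800-8805: a static range the firewall team can pre-declare. -->
<TCP bind_port="7800"/>
<!-- ... rest of the stack ... -->
<FD_SOCK2 offset="1000" port_range="5"/>
```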

Security layer ports (AUTH, ENCRYPT) do not require separate firewall rules: they run inside the transport's port.

Members on opposite sides of a firewall should be provisioned with stable IPs (so allow-rules do not rot as members are recycled) and a documented port list. Using DNS indirection in JGroups config does not remove this requirement: corporate firewalls filter by IP, not by hostname, so the stability concern simply moves from the application config to the DNS layer. Expect to tune JGroups timeouts slightly higher than for a single-subnet cluster, since corporate firewalls can introduce extra latency from packet inspection.

If all cluster members sit on the same VLAN and never cross the firewall, the corporate firewall is not in the path at all, and the single-subnet story applies.