
Data Overlap

In a horizontal distribution setup, it is sometimes useful to replicate all or part of the data across multiple data nodes. This can be achieved using the data overlap feature.
In a distributed setup, each member of the distributing level represents a partition of the data.
The query node automatically splits queries across data nodes without duplicating computations on replicated partitions.

When to use Data Overlap?

Data overlap is mainly used in three scenarios:

  • Failover support: Each data node can be fully replicated so that if one node goes offline, another can take over.

  • DirectQuery hybrid setup: A DirectQuery data node can hold all the data, while a Datastore data node holds only the most recent partitions. The Datastore node acts as a query accelerator for those partitions; if it goes offline, the DirectQuery node can serve the queries. In this setup, data overlap facilitates the data roll-over process.

  • Load balancing: Frequently accessed partitions can be replicated across multiple data nodes to distribute query load. When multiple nodes have the same priority, queries are randomly distributed across them, providing automatic load balancing for replicated partitions.

Constraints and limitations

  • The application must have at least one distributing level.
  • Replicated partitions must be identical across all data nodes. Improper synchronization will lead to inconsistent results.
> **Danger:** Atoti does not enforce correct data replication. It is the user’s responsibility to keep replicated partitions synchronized at all times.

  • Data overlap is defined at the query node level. All applications in the distributed setup must comply with these rules.

Query dispatching with data overlap

The query node maintains a map of which data node contains which partitions of the distributing level. It knows when a partition is replicated. When executing a query:

  • Non-replicated partitions are processed as usual.
  • For replicated partitions, the query node selects one data node based on priority.

Priority is represented by an integer; lower values indicate higher priority. You can define the priority using `IDataClusterDefinition#DATA_NODE_PRIORITY`. For details, see the priority configuration section below.

If no priority is defined by the user:

  • The dispatching algorithm selects the node with the fewest distributing level members.

If at least one priority is defined:

  • Nodes without a priority default to `Integer.MAX_VALUE` (the lowest priority).
  • If nodes have the same priority, the dispatching algorithm randomly selects one at query time.
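The selection rules above can be sketched in plain Java. This is an illustrative model of the described behavior, not Atoti's actual dispatching code; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of the priority-based selection rule for a
// replicated partition. Not Atoti's implementation.
public class ReplicaSelection {

    /**
     * Picks one data node among the candidates holding a replicated partition.
     * Lower value = higher priority; nodes without an explicit priority default
     * to Integer.MAX_VALUE; ties are broken randomly at query time.
     */
    public static String pickNode(List<String> candidates,
                                  Map<String, Integer> priorities,
                                  Random random) {
        int best = Integer.MAX_VALUE;
        final List<String> bestNodes = new ArrayList<>();
        for (final String node : candidates) {
            final int p = priorities.getOrDefault(node, Integer.MAX_VALUE);
            if (p < best || bestNodes.isEmpty()) {
                best = p;
                bestNodes.clear();
                bestNodes.add(node);
            } else if (p == best) {
                bestNodes.add(node);
            }
        }
        // Random tie-break among equally prioritized nodes.
        return bestNodes.get(random.nextInt(bestNodes.size()));
    }

    public static void main(String[] args) {
        // NodeA (priority 1) wins over NodeB (priority 8) for the replicated partition.
        System.out.println(pickNode(
            List.of("NodeA", "NodeB"), Map.of("NodeA", 1, "NodeB", 8), new Random()));
    }
}
```

Note that a node with no priority property behaves exactly as if it had been assigned `Integer.MAX_VALUE`, so explicitly prioritized nodes always win the selection.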

Example

This example illustrates how query dispatching works in a distributed setup with overlapping data across two nodes.

Setup

Distributing level: Country

Data node priority: Node A has higher priority than Node B

Data distribution:

| Country | Node A | Node B |
| ------- | ------ | ------ |
| France  | ✓      |        |
| Germany | ✓      | ✓      |
| Italy   | ✓      | ✓      |
| Spain   |        | ✓      |

Overlapping partitions: Germany, Italy (present in both nodes)

Query examples

Query with point coordinate Country = "Germany"

  • Germany exists in both Node A and Node B.
    • Node A has higher priority → Query sent to Node A only.

Result:

| Country | Retrieved from |
| ------- | -------------- |
| Germany | Node A         |

Query with wildcard coordinate Country = "*"

  • Query sent to both Node A and Node B.
    • For overlapping partitions (Germany, Italy): retrieved from Node A (higher priority)
    • For unique partitions (France, Spain): each is retrieved from the only node that holds it

Result:

| Country | Retrieved from |
| ------- | -------------- |
| France  | Node A         |
| Germany | Node A         |
| Italy   | Node A         |
| Spain   | Node B         |

Query with list coordinate Country IN ["Italy", "Spain"]

  • Italy exists in both nodes → retrieved from Node A (higher priority)
  • Spain exists only in Node B → retrieved from Node B

Result:

| Country | Retrieved from |
| ------- | -------------- |
| Italy   | Node A         |
| Spain   | Node B         |
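The whole example can be replayed as a small deterministic model: each Country member maps to the set of nodes holding it, and overlaps are resolved with the priority rule (lower value wins). This is an illustrative sketch with hypothetical names, not Atoti code; the priority values 1 and 8 are arbitrary, chosen only so that Node A outranks Node B.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Replays the dispatching decisions of the example above. Not Atoti code.
public class DispatchExample {

    // Lower value = higher priority: Node A outranks Node B.
    static final Map<String, Integer> PRIORITIES = Map.of("Node A", 1, "Node B", 8);

    // Which data nodes hold each Country partition.
    static final Map<String, List<String>> PARTITIONS = new LinkedHashMap<>();
    static {
        PARTITIONS.put("France", List.of("Node A"));
        PARTITIONS.put("Germany", List.of("Node A", "Node B"));
        PARTITIONS.put("Italy", List.of("Node A", "Node B"));
        PARTITIONS.put("Spain", List.of("Node B"));
    }

    /** Resolves one Country member to the data node that will serve it. */
    static String resolve(String country) {
        return PARTITIONS.get(country).stream()
            .min(Comparator.comparingInt(
                (String n) -> PRIORITIES.getOrDefault(n, Integer.MAX_VALUE)))
            .orElseThrow();
    }

    public static void main(String[] args) {
        // Wildcard query Country = "*": France, Germany, Italy -> Node A; Spain -> Node B.
        PARTITIONS.keySet().forEach(c -> System.out.println(c + " -> " + resolve(c)));
    }
}
```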

How to

Enable data overlap

Activate the feature at startup:

```java
StartBuilding.cube("MyQueryCube")
    .asQueryCube()
    .withClusterDefinition()
        .withClusterId("MyCluster")
        .withMessengerDefinition()
            .withLocalMessenger()
            .withNoProperty()
        .end()
        .withApplication("MyApplication")
            .withDistributingLevels(distributingLevels)
            .withProperty(
                IQueryClusterDefinition.HORIZONTAL_DATA_DUPLICATION_PROPERTY,
                Boolean.toString(true))
        .end()
    .build();
```

This activation applies to all distributed applications. All applications must respect the constraints defined above.

Configure data node priority

Set the priority at startup:

```java
StartBuilding.cube("MyDataCube")
    .withDimensions(dimensionsAdder)
    .asDataCube()
    .withClusterDefinition()
        .withClusterId("MyCluster")
        .withMessengerDefinition()
            .withLocalMessenger()
            .withNoProperty()
        .end()
        .withApplicationId("MyApplication")
            .withAllHierarchies()
            .withAllMeasures()
            .withProperty(IDataClusterDefinition.DATA_NODE_PRIORITY, String.valueOf(8))
        .end()
    .build();
```

If priority is not defined:

Nodes without the property default to `Integer.MAX_VALUE`. If no node defines a priority, the system prioritizes nodes with fewer distributing level members.

Monitoring

You can retrieve data partition mapping and node priority from the query node.

Using the Java API

Use the IDistributedActivePivotVersion API:

```java
final IDistributedActivePivotVersion version = distributedCube.getHead();
final IDistributionInformation distributionInfo = version.getDistributionInformation();
// Get data node priorities
final Map<String, Integer> priorities = distributionInfo.getPriorityPerDataNode();
// Get member mapping across data nodes
final MemberMapping memberMapping = distributionInfo.getMemberMapping();
```
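Once the priorities map is retrieved, it can be post-processed with plain Java collections. The snippet below is a hypothetical helper, not part of the Atoti API: it orders data nodes from highest priority (lowest value) to lowest, which is convenient for dashboards or logs.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for inspecting a priorities map such as the one
// returned by getPriorityPerDataNode(). Not part of the Atoti API.
public class PriorityReport {

    /** Orders data nodes from highest priority (lowest value) to lowest. */
    public static List<String> orderByPriority(Map<String, Integer> priorities) {
        return priorities.entrySet().stream()
            .sorted(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Node A (priority 1) comes before Node B (priority 8).
        System.out.println(orderByPriority(Map.of("Node A", 1, "Node B", 8)));
    }
}
```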

Using JMX

Access the same information through JMX using the `getDistributingLevelMemberMapping` operation on the MBean `com.activeviam:node0=ActivePivotManager,node1=<SchemaName>,node2=<CubeName>`.
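From code running inside the JVM hosting the query node, the same operation can be invoked with the standard `javax.management` API. This is a sketch: the schema and cube names (`MySchema`, `MyQueryCube`) are placeholders, and the operation name is the one quoted above.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Sketch of invoking the mapping operation over JMX from within the same JVM.
public class JmxMonitoring {

    /**
     * Invokes getDistributingLevelMemberMapping on the ActivePivot MBean,
     * or returns null if no such MBean is registered in this JVM.
     */
    public static Object readMemberMapping(String schema, String cube) {
        try {
            final MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            final ObjectName name = new ObjectName(
                "com.activeviam:node0=ActivePivotManager,node1=" + schema + ",node2=" + cube);
            if (!server.isRegistered(name)) {
                return null; // no ActivePivot MBean in this JVM
            }
            return server.invoke(name, "getDistributingLevelMemberMapping",
                new Object[0], new String[0]);
        } catch (final Exception e) {
            throw new RuntimeException("JMX access failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readMemberMapping("MySchema", "MyQueryCube"));
    }
}
```

For remote access, the same `ObjectName` and operation can be used through a `JMXConnector` or a tool such as JConsole.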