Data Overlap
In a horizontal distribution setup, it is sometimes useful to replicate all or part of the data across multiple data nodes.
This can be achieved using the data overlap feature.
In a distributed setup, each member of the distributing level represents a partition of the data.
The query node automatically splits queries across data nodes without duplicating computations on replicated partitions.
When to use data overlap?
Data overlap is mainly used in three scenarios:
- Failover support: Each data node can be fully replicated so that if one node goes offline, another can take over.
- DirectQuery hybrid setup: A DirectQuery data node can hold all the data, while a Datastore data node contains only the most recent partitions. The Datastore node acts as a query accelerator for those partitions; if it goes offline, the DirectQuery node can serve the queries. In this setup, data overlap simplifies the data rollover process.
- Load balancing: Frequently accessed partitions can be replicated across multiple data nodes to distribute query load. When multiple nodes have the same priority, queries are randomly distributed across them, providing automatic load balancing for replicated partitions.
Constraints and limitations
- The application must have at least one distributing level.
- Replicated partitions must be identical across all data nodes; improper synchronization leads to inconsistent results. Atoti does not enforce correct data replication: it is the user's responsibility to keep replicated partitions synchronized at all times.
- Data overlap is defined at the query node level. All applications in the distributed setup must comply with these rules.
Query dispatching with data overlap
The query node maintains a map of which data node contains which partitions of the distributing level. It knows when a partition is replicated. When executing a query:
- Non-replicated partitions are processed as usual.
- For replicated partitions, the query node selects a single data node based on priority. Priority is represented by an integer; lower values indicate higher priority. You can define priority using IDataClusterDefinition#DATA_NODE_PRIORITY. For details, see the Configure data node priority section below.
If no priority is defined by the user:

- The dispatching algorithm selects the node with the fewest distributing level members.

If at least one priority is defined:

- Nodes without a priority default to Integer.MAX_VALUE (lowest priority).
- If several nodes share the same priority, the dispatching algorithm randomly selects one at query time.
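The selection rule described above can be sketched as follows. This is an illustrative helper, not Atoti's actual dispatching code; the class and method names are made up, but the logic mirrors the documented behavior: lowest integer wins, missing priorities default to Integer.MAX_VALUE, and ties are broken randomly.

```java
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;

public final class ReplicaSelection {

    /**
     * Picks the data node that should serve a replicated partition.
     * Nodes absent from the priority map default to Integer.MAX_VALUE
     * (lowest priority); ties are broken randomly at query time.
     */
    static String selectNode(List<String> replicas, Map<String, Integer> priorities, Random random) {
        // Find the best (lowest) priority value among the candidate nodes.
        final int best = replicas.stream()
                .mapToInt(node -> priorities.getOrDefault(node, Integer.MAX_VALUE))
                .min()
                .orElseThrow();
        // Keep only the nodes that reach that priority, then pick one at random.
        final List<String> candidates = replicas.stream()
                .filter(node -> priorities.getOrDefault(node, Integer.MAX_VALUE) == best)
                .collect(Collectors.toList());
        return candidates.get(random.nextInt(candidates.size()));
    }
}
```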
Example
This example illustrates how query dispatching works in a distributed setup with overlapping data across two nodes.
Setup
Distributing level: Country
Data node priority: Node A has higher priority than Node B
Data distribution:
| Country | Node A | Node B |
|---|---|---|
| France | ✅ | ❌ |
| Germany | ✅ | ✅ |
| Italy | ✅ | ✅ |
| Spain | ❌ | ✅ |
Overlapping partitions: Germany, Italy (present in both nodes)
Query examples
Query with point coordinate Country = "Germany"
- Germany exists in both Node A and Node B.
- Node A has higher priority → Query sent to Node A only.
Result:
| Country | Retrieved from |
|---|---|
| Germany | Node A |
Query with wildcard coordinate Country = "*"
- Query sent to both Node A and Node B.
- For the overlapping partitions (Germany, Italy): retrieved from Node A (higher priority).
- For the unique partitions (France, Spain): retrieved from their respective nodes.
Result:
| Country | Retrieved from |
|---|---|
| France | Node A |
| Germany | Node A |
| Italy | Node A |
| Spain | Node B |
Query with list coordinate Country IN ["Italy", "Spain"]
- Italy exists in both nodes → retrieved from Node A (higher priority)
- Spain exists only in Node B → retrieved from Node B
Result:
| Country | Retrieved from |
|---|---|
| Italy | Node A |
| Spain | Node B |
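The routing shown in the three query examples can be reproduced with a small sketch. The class and helper names are hypothetical and the priority values (1 for Node A, 2 for Node B) are illustrative; only the partition table and the "lower value wins" rule come from the example above.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class DispatchExample {

    // Which nodes hold each Country partition (from the example table).
    static final Map<String, List<String>> PARTITIONS = Map.of(
            "France", List.of("Node A"),
            "Germany", List.of("Node A", "Node B"),
            "Italy", List.of("Node A", "Node B"),
            "Spain", List.of("Node B"));

    // Lower value = higher priority; Node A outranks Node B.
    static final Map<String, Integer> PRIORITIES = Map.of("Node A", 1, "Node B", 2);

    /** Returns, for each requested country, the node the query is routed to. */
    static Map<String, String> route(List<String> countries) {
        final Map<String, String> result = new LinkedHashMap<>();
        for (final String country : countries) {
            // Among the nodes holding this partition, pick the highest-priority one.
            result.put(country, PARTITIONS.get(country).stream()
                    .min(Comparator.comparingInt(PRIORITIES::get))
                    .orElseThrow());
        }
        return result;
    }
}
```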
How to
Enable data overlap
Activate the feature at startup:
```java
StartBuilding.cube("MyQueryCube")
    .asQueryCube()
    .withClusterDefinition()
    .withClusterId("MyCluster")
    .withMessengerDefinition()
    .withLocalMessenger()
    .withNoProperty()
    .end()
    .withApplication("MyApplication")
    .withDistributingLevels(distributingLevels)
    .withProperty(
        IQueryClusterDefinition.HORIZONTAL_DATA_DUPLICATION_PROPERTY, Boolean.toString(true))
    .end()
    .build();
```
This activation applies to every application in the distributed setup, and all of them must respect the constraints defined above.
Configure data node priority
Set the priority at startup:
```java
StartBuilding.cube("MyDataCube")
    .withDimensions(dimensionsAdder)
    .asDataCube()
    .withClusterDefinition()
    .withClusterId("MyCluster")
    .withMessengerDefinition()
    .withLocalMessenger()
    .withNoProperty()
    .end()
    .withApplicationId("MyApplication")
    .withAllHierarchies()
    .withAllMeasures()
    .withProperty(IDataClusterDefinition.DATA_NODE_PRIORITY, String.valueOf(8))
    .end()
    .build();
```
If priority is not defined:

- Nodes without the property default to Integer.MAX_VALUE.
- If no node has a priority, the dispatching algorithm favors nodes with fewer distributing level members.
Monitoring
You can retrieve the data partition mapping and node priorities from the query node.
Using the Java API
Use the IDistributedActivePivotVersion API:
```java
final IDistributedActivePivotVersion version = distributedCube.getHead();
final IDistributionInformation distributionInfo = version.getDistributionInformation();

// Get data node priorities
final Map<String, Integer> priorities = distributionInfo.getPriorityPerDataNode();

// Get member mapping across data nodes
final MemberMapping memberMapping = distributionInfo.getMemberMapping();
```
Using JMX
Access the same information through JMX using the `getDistributingLevelMemberMapping` operation in the MBean `com.activeviam:node0=ActivePivotManager,node1=<SchemaName>,node2=<CubeName>`.