What is data overlap?
Data overlap is a feature in a horizontal distribution setup that allows replicating all or part of the data across multiple data nodes.
It helps improve availability, performance, and resilience in distributed environments.
In a distributed setup, each member of the distributing level represents a partition of the data.
The query node automatically splits queries across data nodes without duplicating computations on replicated partitions.
When to use Data Overlap?
Data overlap is mainly used in three scenarios:
-
Failover support:
Each data node can be fully replicated so that if one node goes offline, another can take over. -
DirectQuery hybrid setup:
A DirectQuery node holds all data, while a Datastore node contains recent partitions for faster queries. If the Datastore node goes offline, the DirectQuery node serves queries. In this setup, data overlap facilitates the data roll over process. -
Load balancing:
Frequently accessed partitions can be replicated across multiple nodes to distribute query load. When nodes have the same priority, queries are randomly distributed for automatic load balancing.
Does data overlap have constraints or limitations?
- The application must have at least one distributing level.
- Replicated partitions must be identical across all data nodes. Improper synchronization will lead to inconsistent results.
- Data overlap is defined at the query node level. All applications in the distributed setup must comply with these rules.
Atoti does not enforce correct data replication. It is the user’s responsibility to keep replicated partitions synchronized at all times.
How query dispatching works with data overlap
The query node maintains a map of which data node contains which partitions of the distributing level. It knows when a partition is replicated. When executing a query:
- Non-replicated partitions are processed normally.
- For replicated partitions:
- The node with the highest priority is selected.
- Priority is an integer; lower values mean higher priority.
- Define priority using
IDataClusterDefinition#DATA_NODE_PRIORITY. For details, see priority configuration section.
If no priority is defined by the user:
- The dispatching algorithm selects the node with the fewest distributing level members.
If at least one priority is defined:
- Nodes without priority default to
Integer.MAX_VALUE(lowest priority). - If nodes have the same priority, the dispatching algorithm randomly selects one at query time.
Example of query dispatching
This example illustrates how query dispatching works in a distributed setup with overlapping data across two nodes.
Setup
Distributing level: Country
Data node priority: Node A has higher priority than Node B
Data distribution:
| Country | Node A | Node B |
|---|---|---|
| France | ✅ | ❌ |
| Germany | ✅ | ✅ |
| Italy | ✅ | ✅ |
| Spain | ❌ | ✅ |
Overlapping partitions: Germany, Italy (present in node A and node B)
Query examples
Query with point coordinate Country = "Germany"
- Point coordinate: Country = Germany
- List coordinate: Country IN [Italy, Spain]
- Wildcard coordinate: Country = *
Results:
| Query | Type of Query | Country | Retrieved From |
|---|---|---|---|
| 1 | point coordinate | Germany | node A (priority node) |
| 2 | list coordinate | Italy, Spain | node A for Italy 'priority node) node B for Spain (unique partition) |
| 3 | wildcard coordinate | * | node A for Germany and Italy (priority node) node A for France, node B for Spain (unique partitions) |
How to enable data overlap
Activate the feature at startup:
StartBuilding.cube("MyQueryCube")
.asQueryCube()
.withClusterDefinition()
.withClusterId("MyCluster")
.withMessengerDefinition()
.withLocalMessenger()
.withNoProperty()
.end()
.withApplication("MyApplication")
.withDistributingLevels(distributingLevels)
.withProperty(
IQueryClusterDefinition.HORIZONTAL_DATA_DUPLICATION_PROPERTY, Boolean.toString(true))
.end()
.build();
This activation applies to all distributed applications. All applications must respect the constraints defined above.
How to configure data node priority
Set the priority at startup:
StartBuilding.cube("MyDataCube")
.withDimensions(dimensionsAdder)
.asDataCube()
.withClusterDefinition()
.withClusterId("MyCluster")
.withMessengerDefinition()
.withLocalMessenger()
.withNoProperty()
.end()
.withApplicationId("MyApplication")
.withAllHierarchies()
.withAllMeasures()
.withProperty(IDataClusterDefinition.DATA_NODE_PRIORITY, String.valueOf(8))
.end()
.build();
Nodes without the property default to Integer.MAX_VALUE .
If no nodes have priority, the system prioritizes nodes with fewer distributing level members.
How to monitor data overlap
Retrieve data partition mapping and node priority from the query node in two ways:
Use the Java API
Use the IDistributedActivePivotVersion API:
final IDistributedActivePivotVersion version = distributedCube.getHead();
final IDistributionInformation distributionInfo = version.getDistributionInformation();
// Get data node priorities
final Map<String, Integer> priorities = distributionInfo.getPriorityPerDataNode();
// Get member mapping across data nodes
final MemberMapping memberMapping = distributionInfo.getMemberMapping();
Use JMX
Access the same information through JMX using the getDistributingLevelMemberMapping operation in the MBean com.activeviam:node0=ActivePivotManager,node1=<SchemaName>,node2=<CubeName>.