> ## Documentation Index
> Fetch the complete documentation index at: https://docs.activeviam.com/llms.txt
> Use this file to discover all available pages before exploring further.

# What is data overlap?

Data overlap is a feature in a [horizontal distribution setup](distributed_intro#applications) that allows replicating all or part of the data across multiple data nodes.
It helps improve availability, performance, and resilience in distributed environments.<br />
In a distributed setup, each member of the distributing level represents a partition of the data.<br />
The query node automatically splits queries across data nodes without duplicating computations on replicated partitions.

## When to use Data Overlap?

Data overlap is mainly used in three scenarios:

* **Failover support**:<br />
  Each data node can be fully replicated so that if one node goes offline, another can take over.

* **DirectQuery hybrid setup**:<br />
  A DirectQuery node holds all data, while a Datastore node contains recent partitions for faster queries.
  If the Datastore node goes offline, the DirectQuery node serves queries.
  In this setup, data overlap facilitates the [data roll over process](./distributed_data_roll_over).

* **Load balancing**:<br />
  Frequently accessed partitions can be replicated across multiple nodes to distribute query load.
  When nodes have the same [priority](#how-query-dispatching-works-with-data-overlap), queries are randomly distributed for automatic load balancing. Note that this is true random selection — there is no guarantee of even distribution across nodes.

## Does data overlap have constraints or limitations?

* The application must have at least one distributing level.
* Replicated partitions must be identical across all data nodes. Improper synchronization will lead to inconsistent results.
* Data overlap is defined at the query node level. All applications in the distributed setup must comply with these rules.

<Danger>
  Atoti does not enforce correct data replication. It is the user’s responsibility to keep replicated partitions synchronized at all times.
</Danger>

## How query dispatching works with data overlap

The query node maintains a map of which data node contains which partitions of the distributing level. It knows when a partition is replicated.
When executing a query:

* Non-replicated partitions are processed normally.
* For replicated partitions:
  * The node with the highest priority is selected.
  * Priority is a strictly positive integer; lower values mean higher priority. Zero and negative values are not allowed.
  * Define priority using `IDataClusterDefinition#DATA_NODE_PRIORITY`. For details, see [priority configuration section](#how-to-configure-data-node-priority).<br />

If no priority is defined by the user:

* The dispatching algorithm selects the node with the fewest distributing level members.

If at least one priority is defined:

* Nodes without priority default to `Integer.MAX_VALUE` (lowest priority).
* If nodes have the same priority, the dispatching algorithm randomly selects one at query time. This is true random selection: there is no explicit sequencing. There is no user session affinity — a user is not tied to a specific data node.

## Example of query dispatching

This example illustrates how query dispatching works in a distributed setup with overlapping data across two nodes.

### Setup

**Distributing level**: `Country`

**Data node priority**: Node A has higher priority than Node B

**Data distribution**:

| Country | Node A | Node B |
| ------- | ------ | ------ |
| France  | ✅      | ❌      |
| Germany | ✅      | ✅      |
| Italy   | ✅      | ✅      |
| Spain   | ❌      | ✅      |

**Overlapping partitions**: Germany, Italy (present in node A and node B)

### Query examples

#### Query with point coordinate `Country = "Germany"`

1. Point coordinate: Country = Germany
2. List coordinate: Country IN \[Italy, Spain]
3. Wildcard coordinate: Country = \*

Results:

| Query | Type of Query       | Country      | Retrieved From                                                                                             |
| ----- | ------------------- | ------------ | ---------------------------------------------------------------------------------------------------------- |
| 1     | point coordinate    | Germany      | node A (priority node)                                                                                     |
| 2     | list coordinate     | Italy, Spain | node A for Italy (priority node)<br />node B for Spain (unique partition)                                  |
| 3     | wildcard coordinate | \*           | node A for Germany and Italy (priority node)<br /> node A for France, node B for Spain (unique partitions) |

## How to enable data overlap

Activate the feature at startup:

```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
StartBuilding.cube("MyQueryCube")
    .asQueryCube()
    .withClusterDefinition()
    .withClusterId("MyCluster")
    .withMessengerDefinition()
    .withLocalMessenger()
    .withNoProperty()
    .end()
    .withApplication("MyApplication")
    .withDistributingLevels(distributingLevels)
    .withProperty(
        IQueryClusterDefinition.HORIZONTAL_DATA_DUPLICATION_PROPERTY, Boolean.toString(true))
    .end()
    .build();
```

This activation applies to all distributed applications. All applications must respect the constraints defined [above](#does-data-overlap-have-constraints-or-limitations).

## How to configure data node priority

Set the priority at startup:

```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
StartBuilding.cube("MyDataCube")
    .withDimensions(dimensionsAdder)
    .asDataCube()
    .withClusterDefinition()
    .withClusterId("MyCluster")
    .withMessengerDefinition()
    .withLocalMessenger()
    .withNoProperty()
    .end()
    .withApplicationId("MyApplication")
    .withAllHierarchies()
    .withAllMeasures()
    .withProperty(IDataClusterDefinition.DATA_NODE_PRIORITY, String.valueOf(8))
    .end()
    .build();
```

Priority values must be strictly positive integers. Zero and negative values are not allowed.
Nodes without the property default to `Integer.MAX_VALUE`.
If no nodes have priority, the system prioritizes nodes with fewer distributing level members.

## How to monitor data overlap

Retrieve data partition mapping and node priority from the query node in two ways:

### Use the Java API

Use the `IDistributedActivePivotVersion` API:

```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
final IDistributedActivePivotVersion version = distributedCube.getHead();
final IDistributionInformation distributionInfo = version.getDistributionInformation();
// Get data node priorities
final Map<String, Integer> priorities = distributionInfo.getPriorityPerDataNode();
// Get member mapping across data nodes
final MemberMapping memberMapping = distributionInfo.getMemberMapping();
```

### Use JMX

Access the same information through JMX using the  `getDistributingLevelMemberMapping` operation in the MBean `com.activeviam:node0=ActivePivotManager,node1=<SchemaName>,node2=<CubeName>`.
