Distributed Post-Processors
A post-processor is a function that computes a measure at query time based on other measures in the cube. Thus, retrieving a post-processed measure involves retrieving the measures upon which it is based, potentially cascading through a dependency chain until Atoti finds primitive measures at the given location.
Prefetchers are responsible for computing underlying dependencies, and are declared within each post-processor.
A major feature of a distributed Atoti cluster is that post-processors can have distributed prefetchers. A post-processor can ask for any underlying measure, be it pre-aggregated or post-processed, to be evaluated on a given set of locations spread across all the remote cubes.
This allows the user to pre-compute some aggregates on the local instances before sending them to the query cube to finish the calculation. This way, the computation can benefit from each local instance's aggregates cache. Each data cube stores the results of the most popular queries, so that the pre-computed results are already available the next time similar queries hit the node.
For efficient query planning, post-processors declare a list of prefetchers. This often allows Atoti to retrieve all underlying measures in a single pass before evaluating the post-processors. In a distributed environment, this is of paramount importance, as a single pass also means a single round trip on the network. Do not ignore prefetcher warnings while working with a distributed cluster, as bad prefetching or a lack of prefetching can lead to incorrect results.
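To make the cost of prefetching concrete, the following sketch contrasts per-measure retrieval with a single batched prefetch. It is a self-contained illustration only: the MeasureStore interface and its fetch method are hypothetical stand-ins for the engine's retrieval machinery, not part of the Atoti API.

import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the engine's measure retrieval; NOT the Atoti API.
interface MeasureStore {
    // In a distributed setup, each call costs one network round trip.
    Map<String, Double> fetch(List<String> measures, String location);
}

final class CurrentStockRetrieval {

    // Without prefetching: one round trip per underlying measure.
    static double naive(MeasureStore store, String location) {
        double in = store.fetch(List.of("incoming_flux"), location).get("incoming_flux");
        double out = store.fetch(List.of("leaving_flux"), location).get("leaving_flux");
        return in - out; // two round trips
    }

    // With declared prefetchers: both underlying measures are requested
    // together, so the data is gathered in a single pass.
    static double prefetched(MeasureStore store, String location) {
        Map<String, Double> values = store.fetch(List.of("incoming_flux", "leaving_flux"), location);
        return values.get("incoming_flux") - values.get("leaving_flux"); // one round trip
    }
}

In a real post-processor, the same effect is obtained by declaring all underlying measures in its prefetchers, so that the engine can plan one retrieval pass across the remote cubes before evaluation.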
Post-processors can be defined on both query cubes and data cubes. In what follows, we explain the different post-processor settings when using distribution.
Post-Processor Definition in the Query Cube
Declaring post-processors directly in the query cube allows performing advanced calculations on measures coming from data cubes of different applications that do not share the same topology.
In the following example, we consider one query cube linked to two data cubes from two different applications of a supply chain use case.
One data cube contains all the inbound flux, the other data cube contains all the outbound flux. To calculate the current stock for a product in a warehouse, the query cube needs to get all the inbound flux entries for that specific product/warehouse location, and needs to subtract all the outbound flux entries.
StartBuilding.cube("QueryCube")
.withUpdateTimestamp()
.withinFolder("Native_measures")
.withAlias("Distributed_Timestamp")
.withFormatter("DATE[HH:mm:ss]")
.withPostProcessor("CURRENT_STOCK")
// CURRENT_STOCK = incoming - leaving
.withPluginKey("DiffPostProcessor")
.withUnderlyingMeasures("incoming_flux", "leaving_flux")
.asQueryCube()
.withClusterDefinition()
.withClusterId("SupplyChainCubeCluster")
.withMessengerDefinition()
.withProtocolPath("jgroups-protocols/protocol-tcp.xml")
.end()
.withApplication("app-incoming-flux")
.withoutDistributingFields()
.withApplication("app-leaving-flux")
.withoutDistributingFields()
.end()
.build();
It is also possible to define measures in the query cube using the Copper API.
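As an illustration, the CURRENT_STOCK measure above could be written with Copper roughly as follows. This is a minimal sketch assuming the Copper arithmetic API of recent Atoti Server versions (Copper.measure, minus, as, publish); adapt it to the API of your version.

// Sketch only: Copper method names assumed, imports and context wiring omitted.
Copper.measure("incoming_flux")
    .minus(Copper.measure("leaving_flux"))
    .as("CURRENT_STOCK")
    .publish(context); // context: the ICopperContext used to define the query cube's measures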
Post-Processor Definition in the Data Cube
In order to leverage the remote data cube instances, Atoti Server seeks to distribute the computation of post-processors defined in the data cube. The partial results are then reduced within the query cube using a given aggregation function (SUM by default). For example, if two data cubes compute partial values of 120 and 80 for a distributed post-processor, the query cube reduces them to 200 with the default SUM. However, distributing a post-processor is not always possible, so an algorithm traverses the dependency chain of post-processors to determine which ones can be distributed.
A distributed post-processor cannot depend on a non-distributed post-processor. Consequently, an ancestor of a non-distributed post-processor is always computed in the query cube, and the descendants of a distributed post-processor must all be distributed as well. For example, in a chain where measure A depends on B and B depends on C, if B cannot be distributed then A is computed in the query cube, even if A's own logic would otherwise allow distribution. The rest of the decision process depends on the queried location:
Distribution when the distributing field is not expressed in the queried location
In this case, a post-processor must fulfill two requirements to be distributed:
- Requirement 1: The distributing field must be included in the partitioning of the post-processor and in the partitioning of all its descendants.
- Requirement 2: At least one ancestor of the post-processor must define a reduction logic. This is the case if and only if the ancestor implements either IPartitionedPostProcessor or IDistributedPostProcessor.
Distribution when the distributing field is expressed in the queried location
In this case, the reduction logic always consists simply of concatenating the partial results: because the distributing field is expressed in the queried location, each data cube contributes disjoint locations, so there is nothing to aggregate across cubes. Therefore, there is no need for an ancestor to implement a reduction logic, and a post-processor is distributed if and only if it meets the first requirement listed in the previous case: the distributing field must be included in its partitioning and in the partitioning of all its descendants.
To achieve a good distribution of the computations, it is therefore recommended to include the distributing field in the partitioning of the post-processors whenever possible. The top-level post-processors should also define a reduction logic by implementing either IPartitionedPostProcessor or IDistributedPostProcessor.
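To illustrate the shape of such a reduction logic, the sketch below merges partial results computed independently by each data cube. The Reducer interface is a hypothetical illustration of the contract that IPartitionedPostProcessor and IDistributedPostProcessor formalize; it is not the actual Atoti interface.

import java.util.List;

// Hypothetical reduction contract; the real IPartitionedPostProcessor /
// IDistributedPostProcessor interfaces differ.
interface Reducer<T> {
    // Merge the partial results returned by the data cubes.
    T reduce(List<T> partialResults);
}

// An additive measure such as a stock quantity can simply be summed
// in the query cube (matching the default SUM reduction).
final class SumReducer implements Reducer<Double> {
    @Override
    public Double reduce(List<Double> partialResults) {
        return partialResults.stream().mapToDouble(Double::doubleValue).sum();
    }
}

For a non-additive measure such as a ratio, no such simple merge exists, which is why a post-processor without an explicit reduction logic in its ancestry cannot be distributed when the distributing field is absent from the queried location.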