Aggregate Tables
DirectQuery uses the external database to feed aggregate providers and to answer queries. This can result in large queries that put a heavy load on the database, and if the same cube is started multiple times, the same queries are repeated.
Aggregate Tables are a way to cache aggregated data in the external database, making some queries faster and cheaper.
How does it work
Aggregate Tables are tables added to the external database that contain pre-aggregated data.
For instance, consider a Sales table containing sales records such as these:
Id | Date | Product | Buyer | Quantity | Price |
---|---|---|---|---|---|
1 | 2023-01-01 | P1 | Buyer_A | 100.0 | 15.5 |
2 | 2023-01-01 | P1 | Buyer_B | 100.0 | 16.5 |
3 | 2023-01-01 | P2 | Buyer_A | 100.0 | 30 |
4 | 2023-01-01 | P2 | Buyer_B | 100.0 | 31 |
5 | 2023-01-02 | P1 | Buyer_A | 100.0 | 15.5 |
6 | 2023-01-02 | P1 | Buyer_B | 100.0 | 15.5 |
7 | 2023-01-02 | P1 | Buyer_A | 100.0 | 18 |
8 | 2023-01-03 | P1 | Buyer_B | 100.0 | 18 |
9 | 2023-01-03 | P2 | Buyer_A | 100.0 | 35 |
10 | 2023-01-03 | P2 | Buyer_B | 100.0 | 40 |
It is possible to aggregate the sum of the Quantity per Date and Product in the external database into another table, agg_sales:
date | product | sum_of_quantity | count_of_rows |
---|---|---|---|
2023-01-01 | P1 | 200.0 | 2 |
2023-01-01 | P2 | 200.0 | 2 |
2023-01-02 | P1 | 300.0 | 3 |
2023-01-03 | P1 | 100.0 | 1 |
2023-01-03 | P2 | 200.0 | 2 |
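Such a table could be produced with a plain GROUP BY query. The sketch below uses generic SQL and the table and column names from the example above; adapt both to your actual database:

```sql
-- Build agg_sales by pre-aggregating Sales per Date and Product.
-- Generic SQL sketch; names come from the example above.
CREATE TABLE agg_sales AS
SELECT
    Date AS date,
    Product AS product,
    SUM(Quantity) AS sum_of_quantity,
    COUNT(*) AS count_of_rows
FROM Sales
GROUP BY Date, Product;
```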
If this table is added as an Aggregate Table in DirectQuery, it will be used whenever possible instead of the original Sales table.
For instance, when asking for the SUM(Quantity) per Product in the Sales table, DirectQuery will instead retrieve the SUM(sum_of_quantity) per product from agg_sales.
As agg_sales is smaller and already aggregated, the result is faster to compute, and a potentially huge scan of the database is avoided, saving both time and compute costs.
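Conceptually, the query rewrite looks like this (generic SQL sketch using the example tables above):

```sql
-- Query as expressed against the original table:
SELECT Product, SUM(Quantity)
FROM Sales
GROUP BY Product;

-- Equivalent query DirectQuery can run against the Aggregate Table:
SELECT product, SUM(sum_of_quantity)
FROM agg_sales
GROUP BY product;
```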
Use cases
There are a few classic use cases for Aggregate Tables:
- Feeding an aggregate provider faster by pre-aggregating over the provider fields
- Having an "external database aggregate provider", i.e. being able to answer some queries from the pre-aggregated data in the external database without having to store additional data in memory.
- Feeding some hierarchies faster
Supported aggregations
The following aggregation functions can be used in an Aggregate Table:
- Sum
- Min
- Max
- Average
- Sum product
- Count
Custom Aggregations
If you have a custom aggregation function, you must implement an equivalent ISqlReAggregationFunction
plugin value and register it in the Registry under the same key as your aggregation function.
Examples
Create a simple Aggregate Table to feed an aggregate provider
Let's create a new table that will pre-aggregate the quantity (from the tutorial sales table) per Product:
- ClickHouse
- Snowflake
CREATE TABLE
TUTORIAL.sales_by_product
(
`product` String NOT NULL,
`quantity_sum` Float64 NOT NULL,
`contributors_count` UInt64 NOT NULL
)
ENGINE = MergeTree()
ORDER BY (product);
CREATE OR REPLACE TABLE
TUTORIAL.SALES_BY_PRODUCT
(
PRODUCT STRING NOT NULL,
QUANTITY_SUM FLOAT NOT NULL,
CONTRIBUTORS_COUNT INTEGER NOT NULL
)
We feed it with data from the sales table:
- ClickHouse
- Snowflake
INSERT INTO TUTORIAL.sales_by_product
SELECT
T0.PRODUCT AS `product`,
SUM(T0.QUANTITY) AS `quantity_sum`,
COUNT(*) AS `contributors_count`
FROM
TUTORIAL.SALES AS T0
GROUP BY
T0.PRODUCT;
INSERT INTO TUTORIAL.SALES_BY_PRODUCT
SELECT
T0.PRODUCT AS PRODUCT,
SUM(T0.QUANTITY) AS QUANTITY_SUM,
COUNT(*) AS CONTRIBUTORS_COUNT
FROM
TUTORIAL.SALES AS T0
GROUP BY
T0.PRODUCT;
The pre-aggregated data looks like this:
PRODUCT | QUANTITY_SUM | CONTRIBUTORS_COUNT |
---|---|---|
BED_2 | 6 | 4 |
HOO_5 | 3 | 3 |
TAB_0 | 4 | 4 |
TAB_1 | 4 | 3 |
TSH_3 | 3 | 3 |
TSH_4 | 5 | 3 |
Let's first discover the SQL tables: the original sales table and the table backing the Aggregate Table:
- ClickHouse
- Snowflake
final Table salesTable = session.discoverTable("TUTORIAL", "SALES");
final Table table = session.discoverTable("TUTORIAL", "sales_by_product");
final Table salesTable =
session.discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "SALES"));
final Table table = session
.discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "SALES_BY_PRODUCT"));
We will map the PRODUCT field of the sales table to the product field of the Aggregate Table:
- ClickHouse
- Snowflake
final Map<FieldPath, String> fieldMapping = Map.of(FieldPath.of("PRODUCT"), "product");
final Map<FieldPath, String> fieldMapping = Map.of(FieldPath.of("PRODUCT"), "PRODUCT");
We will also map the sum of the sales quantity column into a column called quantity_sum:
- ClickHouse
- Snowflake
final List<MeasureMapping> measureMapping =
List.of(MeasureMapping.sum(FieldPath.of("QUANTITY"), "quantity_sum"));
final List<MeasureMapping> measureMapping =
List.of(MeasureMapping.sum(FieldPath.of("QUANTITY"), "QUANTITY_SUM"));
We can now build the AggregateTable:
- ClickHouse
- Snowflake
AggregateTable aggregateTable = AggregateTable.builder()
.fromTable(table)
.asAggregationOf(salesTable)
.withGroupByFieldsMapping(fieldMapping)
.withMeasuresMappings(measureMapping)
.build();
AggregateTable aggregateTable = AggregateTable.builder()
.fromTable(table)
.asAggregationOf(salesTable)
.withGroupByFieldsMapping(fieldMapping)
.withMeasuresMappings(measureMapping)
.build();
And add it to the schema:
- ClickHouse
- Snowflake
final Schema schema = Schema.builder()
.withExternalTables(List.of(salesTable), List.of())
.withExternalAggregateTables(List.of(aggregateTable))
.build();
final Schema schema = Schema.builder()
.withExternalTables(List.of(salesTable), List.of())
.withExternalAggregateTables(List.of(aggregateTable))
.build();
Let's now define a bitmap aggregate provider on the sum of quantity per Product. The in-memory provider will then be fed directly from the Aggregate Table:
- ClickHouse
- Snowflake
final IActivePivotInstanceDescription cubeDescription = StartBuilding.cube("MyCube")
.withMeasures(
superMeasureBuilder -> superMeasureBuilder
.withCalculations(c -> Copper.sum("QUANTITY").as("Quantity.SUM").publish(c)))
.withDimensions(builder -> builder.withSingleLevelDimensions(List.of("DATE", "PRODUCT", "SHOP")))
.withAggregateProvider()
.withPartialProvider()
.withName("MyProvider")
.bitmap()
.includingOnlyHierarchies(HierarchyIdentifier.simple("PRODUCT"))
.includingOnlyMeasures("Quantity.SUM")
.build();
final ISelectionDescription selection =
StartBuilding.selection(schema).fromBaseStore("SALES").withAllReachableFields().build();
final IActivePivotManagerDescription managerDescription = StartBuilding.managerDescription()
.withName("MyManager")
.withSchema()
.withSelection(selection)
.withCube(cubeDescription)
.build();
final Application app =
session.applicationBuilder().schema(schema).managerDescription(managerDescription).build();
final IActivePivotInstanceDescription cubeDescription = StartBuilding.cube("MyCube")
.withMeasures(
superMeasureBuilder -> superMeasureBuilder
.withCalculations(c -> Copper.sum("QUANTITY").as("Quantity.SUM").publish(c)))
.withDimensions(builder -> builder.withSingleLevelDimensions(List.of("DATE", "PRODUCT", "SHOP")))
.withAggregateProvider()
.withPartialProvider()
.withName("MyProvider")
.bitmap()
.includingOnlyHierarchies(HierarchyIdentifier.simple("PRODUCT"))
.includingOnlyMeasures("Quantity.SUM")
.build();
final ISelectionDescription selection =
StartBuilding.selection(schema).fromBaseStore("SALES").withAllReachableFields().build();
final IActivePivotManagerDescription managerDescription = StartBuilding.managerDescription()
.withName("MyManager")
.withSchema()
.withSelection(selection)
.withCube(cubeDescription)
.build();
final Application app =
session.applicationBuilder().schema(schema).managerDescription(managerDescription).build();
Note that it is possible to ensure that a provider will be fed by a given Aggregate Table by adding the provider's coordinates to that Aggregate Table. This simply triggers a check at application startup to verify that they are compatible.
- ClickHouse
- Snowflake
aggregateTable = AggregateTable.builder()
.fromTable(table)
.asAggregationOf(salesTable)
.withGroupByFieldsMapping(fieldMapping)
.withMeasuresMappings(measureMapping)
.forProviders(List.of(ProviderCoordinate.at("MyCube", "MyProvider")))
.build();
aggregateTable = AggregateTable.builder()
.fromTable(table)
.asAggregationOf(salesTable)
.withGroupByFieldsMapping(fieldMapping)
.withMeasuresMappings(measureMapping)
.forProviders(List.of(ProviderCoordinate.at("MyCube", "MyProvider")))
.build();
More complex aggregations such as average
In the previous example, SUM was based on a single input column and pre-aggregated into a single column, but there are more complex cases. For instance:
- count has no input column and produces 1 pre-aggregated column
- sumProduct can take N input columns and produces 1 pre-aggregated column
- average takes a single input column but requires 2 pre-aggregated columns in order to be able to re-aggregate: one column for the sum and one for the count
- ClickHouse
- Snowflake
final AggregateTable aggregateTable = AggregateTable.builder()
.fromTable(table)
.asAggregationOf(salesTable)
.withGroupByFieldsMapping(Map.of(FieldPath.of("PRODUCT"), "product"))
.withMeasuresMappings(
List.of(
MeasureMapping.sum(FieldPath.of("QUANTITY"), "quantity_sum"),
MeasureMapping.avg(FieldPath.of("QUANTITY"), "quantity_sum", "contributors_count"),
MeasureMapping.count("contributors_count")))
.build();
final AggregateTable aggregateTable = AggregateTable.builder()
.fromTable(table)
.asAggregationOf(salesTable)
.withGroupByFieldsMapping(Map.of(FieldPath.of("PRODUCT"), "PRODUCT"))
.withMeasuresMappings(
List.of(
MeasureMapping.sum(FieldPath.of("QUANTITY"), "QUANTITY_SUM"),
MeasureMapping.avg(FieldPath.of("QUANTITY"), "QUANTITY_SUM", "CONTRIBUTORS_COUNT"),
MeasureMapping.count("CONTRIBUTORS_COUNT")))
.build();
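As an illustration of why average needs two pre-aggregated columns: averages cannot simply be averaged again, but the stored sums and counts can each be re-summed and then divided. A generic SQL sketch over the agg_sales example from the beginning of this page:

```sql
-- Re-aggregate the average of Quantity per product from pre-aggregated data:
-- re-sum the sums and the counts, then divide.
SELECT
    product,
    SUM(sum_of_quantity) / SUM(count_of_rows) AS quantity_avg
FROM agg_sales
GROUP BY product;
```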
Aggregate tables with vectors
Arrays can be aggregated in an Aggregate Table, but they need a specific format that depends on the type of the original array being aggregated:
Original Array | Aggregate Table Array |
---|---|
Native | Native |
Multi Column Array | Multi Column Array |
Multi Row Array | Native |
See the DirectQuery vector documentation for more information about array types.