How does it work
Aggregate Tables are tables added in the external database as a table fed from aggregated data. These tables typically denormalize fields from joined tables into group-by columns, but can also preserve joins to keep the aggregate table smaller while still accessing fields from other tables at query time. Whenever possible, DirectQuery will query the Aggregate Table instead of the base Schema for a more optimized query: For instance, consider a Sales table containing sales such as this one:| Id | Date | Product | Buyer | Quantity | Price |
|---|---|---|---|---|---|
| 1 | 2023-01-01 | P1 | Buyer_A | 100.0 | 15.5 |
| 2 | 2023-01-01 | P1 | Buyer_B | 100.0 | 16.5 |
| 3 | 2023-01-01 | P2 | Buyer_A | 100.0 | 30 |
| 4 | 2023-01-01 | P2 | Buyer_B | 100.0 | 31 |
| 5 | 2023-01-02 | P1 | Buyer_A | 100.0 | 15.5 |
| 6 | 2023-01-02 | P1 | Buyer_B | 100.0 | 15.5 |
| 7 | 2023-01-02 | P1 | Buyer_A | 100.0 | 18 |
| 8 | 2023-01-03 | P1 | Buyer_B | 100.0 | 18 |
| 9 | 2023-01-03 | P2 | Buyer_A | 100.0 | 35 |
| 10 | 2023-01-03 | P2 | Buyer_B | 100.0 | 40 |
| date | product | sum_of_quantity | count_of_rows |
|---|---|---|---|
| 2023-01-01 | P1 | 200.0 | 2 |
| 2023-01-01 | P2 | 200.0 | 2 |
| 2023-01-02 | P1 | 300.0 | 3 |
| 2023-01-03 | P1 | 100.0 | 1 |
| 2023-01-03 | P2 | 200.0 | 2 |
SUM(Quantity) per Product in the Sales table, DirectQuery will retrieve the SUM(sum_of_quantity) per product in agg_sales.
As the agg_sales is smaller and already aggregated the result will be faster to compute and will potentially avoid a huge scan of the database, saving both time and compute costs.
Use cases
There are a few classic use cases for using an Aggregate Table:- Feeding an aggregate provider faster by pre-aggregating over the provider fields
- Having an “external database aggregate provider”, i.e. being able to answer some queries from the pre-aggregated data in the external database but without having to store additional data in-memory.
- Feeding some hierarchies faster
Supported aggregations
The following aggregation function can be used in an Aggregate Table:- Sum
- Min
- Max
- Average
- Sum product
- Count
- Long
- Short
- Gross
- Square sum
Custom Aggregations
To use a custom aggregation in an aggregate table, you need to define the corresponding user defined aggregation.You can also register its corresponding re-aggregation function.
This re-aggregation function will be used to compute the aggregates at a coarser granularity than the one stored in the aggregate table.
Examples
Create a simple Aggregate Table to feed an aggregate provider
Let’s create a new table that will pre-aggregate the quantity (in the tutorial sales table) per Date and Product:- BigQuery
- ClickHouse
- Databricks
- MsSql
- Redshift
- Snowflake
- Synapse
| PRODUCT | QUANTITY_SUM | CONTRIBUTORS_COUNT |
|---|---|---|
| BED_2 | 6 | 4 |
| HOO_5 | 3 | 3 |
| TAB_0 | 4 | 4 |
| TAB_1 | 4 | 3 |
| TSH_3 | 3 | 3 |
| TSH_4 | 5 | 3 |
- BigQuery
- ClickHouse
- Databricks
- MsSql
- Redshift
- Snowflake
- Synapse
PRODUCT field of the sales table to the product field of the Aggregate Table:
- BigQuery
- ClickHouse
- Databricks
- MsSql
- Redshift
- Snowflake
- Synapse
quantity column into a column called quantity_sum:
- BigQuery
- ClickHouse
- Databricks
- MsSql
- Redshift
- Snowflake
- Synapse
AggregateTable:
- BigQuery
- ClickHouse
- Databricks
- MsSql
- Redshift
- Snowflake
- Synapse
- BigQuery
- ClickHouse
- Databricks
- MsSql
- Redshift
- Snowflake
- Synapse
Product, the feeding of the in-memory provider will be done directly from the Aggregate Table:
- BigQuery
- ClickHouse
- Databricks
- MsSql
- Redshift
- Snowflake
- Synapse
- BigQuery
- ClickHouse
- Databricks
- MsSql
- Redshift
- Snowflake
- Synapse
More complex aggregations such as average
In the previous example, SUM was based on a single input and pre-aggregated into a single column, but there are some more complex cases. For instance:counthas no input and produces 1 pre-aggregated columnsumProductcan takes N input columns and produces 1 pre-aggregated columnaveragetakes a single input column but requires 2 pre-aggregated column in order to be able to re-aggregate: one column for the sum and one for the count
- BigQuery
- ClickHouse
- Databricks
- MsSql
- Redshift
- Snowflake
- Synapse
Create a simple Aggregate Table to feed a hierarchy
For an aggregate table to be used to feed a hierarchy, it needs to be defined with all the levels of the hierarchy asGroupByFieldsMapping and the Count as MeasuresMappings.
In the external database, the table for a hierarchy with the levels (Country, City) would look like:
| Country | City | Count |
|---|---|---|
| France | Paris | 812 |
| Germany | Berlin | 485 |
| … |
Count
The count is measure automatically added to any query sent to the database. When adding an aggregate table, a check is therefore performed to ensure it contains a column with a count. (otherwise it would never be used).Slicing hierarchies
Slicing hierarchies are always expressed in the queries sent to the database. When a field expressed in a query is not part of the Aggregate Table, the query will not use the Aggregate Table. Therefore it is recommended to include the slicing hierarchies in Aggregate Table definition.Aggregate tables with vectors
Arrays can be aggregated in an aggregate table but they need to have a specific format depending on the type of original array being aggregated:| Original Array | Aggregate Table Array |
|---|---|
| Native | Native |
| Multi Column Array | Multi Column Array |
| Multi Row Array | Native |
Generate SQL for Aggregate Table matching an Aggregate Provider
Aggregate Tables are a good tool to feed Aggregate Providers quickly because it can use the aggregated data instead of running a new aggregation. DirectQuery provides a bootstrapper to generate the SQL to create and feed an Aggregate Table matching an Aggregate Provider. Let’s imagine we have the following Aggregate Provider defined:- BigQuery
- ClickHouse
- Databricks
- MsSql
- Redshift
- Snowflake
- BigQuery
- ClickHouse
- Databricks
- MsSql
- Redshift
- Snowflake
- BigQuery
- ClickHouse
- Databricks
- MsSql
- Redshift
- Snowflake
- BigQuery
- ClickHouse
- Databricks
- MsSql
- Redshift
- Snowflake
Preserving joins with joinedTableNames
By default, the bootstrapper denormalizes all fields from joined tables into the aggregate table as group-by columns. However, when dealing with dimension tables with many columns, this denormalization can make the aggregate table unnecessarily large. ThejoinToKeepNormalized option allows specifying tables whose fields should remain accessible through joins at query time instead of being denormalized. This keeps the aggregate table smaller while still allowing queries to reference fields from those tables.
When using joinToKeepNormalized:
- Fields accessible through the specified join are not included as group-by columns in the aggregate table
- Join key fields (source fields of the join leading to those tables) are automatically added as group-by columns
- At query time, the aggregate table can still answer queries involving fields from joined tables by performing joins
- This reduces the number of denormalized columns while maintaining query compatibility
joinToKeepNormalized method on the bootstrapper builder with the source table and the join name:
- BigQuery
- ClickHouse
- Databricks
- MsSql
- Redshift
- Snowflake
SALES_TO_PRODUCTS join (starting from SALES) will not be denormalized into the aggregate table. Instead, queries referencing these fields will use joins to access them.
Note that preserving joins comes with a trade-off: queries referencing fields from joined tables will require the join to be performed at query time, which most of the time increases compute costs and result in slower responses.
Join key fields and join chains
Consider the following schema:STORE_ID from Sales as a group-by field in the aggregate table, queries can access fields from Stores and, by transitivity, from Regions as well — without denormalizing those fields into the aggregate table.- Dimension tables have many columns that would bloat the aggregate table
- Reducing storage footprint and improving aggregate table creation time is a priority
Database specific features
Creating a view for an Aggregate Table is not a solution as the view must be recomputed for each query. Creating a table requires to update it when the data changes. Some databases offers native implementations to automatically create and materialize the content of a view (so it is fast to access) while automatically updating it (so it is always up-to-date).- Snowflake
In Snowflake there is a powerful feature called Dynamic tables.
As these Dynamic tables also support time travel it is the recommended way to implement an Aggregate Table in Snowflake.
Is my query using the Aggregate Table?
The simplest solution to check whether the query is using an aggregate table is to look at the query in the external database. If that is the case, then the aggregate table name should appear in the SQL query as the base table (the first table in the FROM clause). If it using an aggregate table, the aggregate table name should appears as the base table (the first table in FROM clause) in the SQL query. Additionally, for databases supporting tags such as Snowflake, the tagaggregate_table_name will be added to the query with the name of the aggregate table used.
While building a project, to have more information about why a query is compatible or not with aggregate tables, it is possible to add additional logs by setting the logger atoti.server.directquery.query_resolution.aggregate_table (see Logger naming) to FINE.
The additional logs look like this: