Skip to main content

Clustering fields in DirectQuery

Feeding aggregate providers from an external database can result in very large queries to be run on this database. Some databases can have trouble processing such large queries. Clustering fields are a way to break down those queries in multiple parts when feeding ActivePivot Aggregate providers.

Clustering fields are set on a table on DirectQuery side. When feeding an Aggregate Provider, if the computed database query needs to access this table, the query will be split up according to the table clustering fields. If multiple tables are accessed by the query, the clustering fields from all tables are taken in account to split the query.

A query clustered by too many fields or by fields with a high cardinality can be divided in lots of subqueries To avoid creating too many queries on the external database, DirectQuery limits the number of subqueries to 500.
This limit can be set through the property ActiveViamProperty.DIRECT_QUERY_SUB_QUERY_LIMIT, but it is recommended to not increase it too much.

Examples

Example with one table

Consider the following table SALES.

SALE_IDDATESHOPPRODUCTQUANTITYUNIT_PRICE
S00102022-01-31shop_2BED_23150
..................

With this table clusterized by dates, the query sum quantities by product is split up in queries of the form sum quantities by product where date equals ....

graph LR; %% Nodes definition Query["Sum quantities by product"]; subgraph Where date equals Q1("2022-01-31") Q2("2022-01-30") Qn("...") QLast("2022-01-01") end; subgraph Aggregate Provider P1[(Partition 1)] P2[(Partition 2)] Pn[(...)] PLast[(Partition N)] end; %% Links definition Query ---> Q1 Query ---> Q2 Query ---> Qn Query ---> QLast Q1 -.-> P1 Q1 -.-> P2 Q2 -.-> P2 Q2 -.-> PLast QLast -.-> PLast

Example with multiple tables

Now consider this additional table PRODUCTS, storing all the products referenced in the SALES table.

PRODUCT_IDCATEGORYSUB_CATEGORYSIZEPURCHASE_PRICECOLORBRAND
BED_2FurnitureBedSingle127redMega
.....................

This table is clusterized by CATEGORY. With a query computing the sum of the margins by product, both the SALES and PRODUCTS table will be accessed, so the two clustering fields DATE and CATEGORY will be used to split the query up.

graph LR; %% Nodes definition Query["Sum of margins by product"]; subgraph "(date, category)" Q1("2022-01-31, furniture") Q2("2022-01-31, cloth") Q3("2022-01-31, furniture") Q4("2022-01-31, cloth") Qn("...") QLast("2022-01-01, cloth") end; subgraph Aggregate Provider P1[(Partition 1)] P2[(Partition 2)] Pn[(...)] PLast[(Partition N)] end; %% Links definition Query ---> Q1 Query ---> Q2 Query ---> Q3 Query ---> Q4 Query ---> Qn Query ---> QLast Q1 -.-> P1 Q1 -.-> P2 Q2 -.-> P2 Q2 -.-> PLast Q3 -.-> P1 Q4 -.-> PLast QLast -.-> PLast

Considerations

Clustering fields and external database clustering

DirectQuery clustering fields can be chosen independently of any database clusterization.
Still it can be interesting to match DirectQuery tables clustering fields with the potential clustering already implemented in your external database. This results in better performances for the sub-queries generated.

Clustering fields and Aggregate Provider partitioning

DirectQuery clustering fields can be chosen independently of any configured partitioning for Aggregate Providers. DirectQuery will repartition and reaggregate on the fly if needed.
Still it can be interesting to match DirectQuery tables clustering fields with the partitioning of the existing Aggregate Providers. This coherence results in better performances to avoid the cost of repartitioning.