> ## Documentation Index
> Fetch the complete documentation index at: https://docs.activeviam.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Clustering fields in DirectQuery

Feeding aggregate providers from an external database can result in very large queries to be run on this database.
Some databases can have trouble processing such large queries.
Clustering fields are a way to break down those queries in multiple parts when feeding Atoti Aggregate providers.
You can extends the usage of the clustering fields beyond the feeding with the `UseClusteringFields.ALWAYS` database settings.

Clustering fields are [set on a table](./how_to/customize_table#set-the-clustering-fields) on DirectQuery side.
When feeding an Aggregate Provider, if the computed database query needs to access this table, the query will be split up according to the table clustering fields.
If multiple tables are accessed by the query, the clustering fields from all tables are taken in account to split the query.

A query clustered by too many fields or by fields with a high cardinality can be divided in lots of sub queries.
To avoid creating too many queries on the external database, DirectQuery limits the number of sub queries to 500.<br />
This limit can be changed thanks to the parameter `IDatabaseSettings#maxSubQueriesAllowedInMultiStepsQuery()`,
which can be set for your specific database in `XXXDatabaseSettings`, e.g.:

* `BigqueryDatabaseSettings`
* `ClickhouseDatabaseSettings`
* `DatabricksDatabaseSettings`
* `SnowflakeDatabaseSettings`
* ...

<Tabs>
  <Tab title="BigQuery">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final BigqueryDatabaseSettings databaseSettings =
        BigqueryDatabaseSettings.builder().maxSubQueriesAllowedInMultiStepQuery(10).build();
    ```
  </Tab>

  <Tab title="ClickHouse">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final ClickhouseDatabaseSettings databaseSettings =
        ClickhouseDatabaseSettings.builder().maxSubQueriesAllowedInMultiStepQuery(10).build();
    ```
  </Tab>

  <Tab title="Databricks">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final DatabricksDatabaseSettings databaseSettings =
        DatabricksDatabaseSettings.builder().maxSubQueriesAllowedInMultiStepQuery(10).build();
    ```
  </Tab>

  <Tab title="Generic JDBC">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final GenericJdbcDatabaseSettings databaseSettings =
        GenericJdbcDatabaseSettings.builder().maxSubQueriesAllowedInMultiStepQuery(10).build();
    ```
  </Tab>

  <Tab title="MS SQL">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final MsSqlDatabaseSettings databaseSettings =
        MsSqlDatabaseSettings.builder().maxSubQueriesAllowedInMultiStepQuery(10).build();
    ```
  </Tab>

  <Tab title="Redshift">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final RedshiftDatabaseSettings databaseSettings =
        RedshiftDatabaseSettings.builder().maxSubQueriesAllowedInMultiStepQuery(10).build();
    ```
  </Tab>

  <Tab title="Snowflake">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final SnowflakeDatabaseSettings databaseSettings =
        SnowflakeDatabaseSettings.builder().maxSubQueriesAllowedInMultiStepQuery(10).build();
    ```
  </Tab>

  <Tab title="Synapse">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final SynapseDatabaseSettings databaseSettings =
        SynapseDatabaseSettings.builder().maxSubQueriesAllowedInMultiStepQuery(10).build();
    ```
  </Tab>
</Tabs>

```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
final Application app =
    Application.builder(connector)
        .schema(schema)
        .managerDescription(managerDescription)
        .databaseSettings(databaseSettings)
        .build();
```

It is recommended to not increase it too much, as it can lead to contention on the external database.

If the cardinality of the clustering field (or the set of clustering fields) is lower than the limit, the query will be split up in as many queries as the cardinality of the clustering fields.
If the cardinality of the clustering field (or the set of clustering fields) is higher than the limit, the same split will be performed but some queries will be batched together to stay within the limit.

By default, the clustering fields are only used for feeding queries as these are the ones using the most resources.
This limits the number of queries sent to the external database as "normal" queries are not split up.

However it is possible to change this behavior in the database settings:

<Tabs>
  <Tab title="BigQuery">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final BigqueryDatabaseSettings databaseSettings =
        BigqueryDatabaseSettings.builder().useClusteringFields(UseClusteringFields.ALWAYS).build();
    ```
  </Tab>

  <Tab title="ClickHouse">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final ClickhouseDatabaseSettings databaseSettings =
        ClickhouseDatabaseSettings.builder()
            .useClusteringFields(UseClusteringFields.ALWAYS)
            .build();
    ```
  </Tab>

  <Tab title="Databricks">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final DatabricksDatabaseSettings databaseSettings =
        DatabricksDatabaseSettings.builder()
            .useClusteringFields(UseClusteringFields.ALWAYS)
            .build();
    ```
  </Tab>

  <Tab title="Generic JDBC">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final GenericJdbcDatabaseSettings databaseSettings =
        GenericJdbcDatabaseSettings.builder()
            .useClusteringFields(UseClusteringFields.ALWAYS)
            .build();
    ```
  </Tab>

  <Tab title="MS SQL">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final MsSqlDatabaseSettings databaseSettings =
        MsSqlDatabaseSettings.builder().useClusteringFields(UseClusteringFields.ALWAYS).build();
    ```
  </Tab>

  <Tab title="Redshift">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final RedshiftDatabaseSettings databaseSettings =
        RedshiftDatabaseSettings.builder().useClusteringFields(UseClusteringFields.ALWAYS).build();
    ```
  </Tab>

  <Tab title="Snowflake">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final SnowflakeDatabaseSettings databaseSettings =
        SnowflakeDatabaseSettings.builder().useClusteringFields(UseClusteringFields.ALWAYS).build();
    ```
  </Tab>

  <Tab title="Synapse">
    ```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
    final SynapseDatabaseSettings databaseSettings =
        SynapseDatabaseSettings.builder().useClusteringFields(UseClusteringFields.ALWAYS).build();
    ```
  </Tab>
</Tabs>

```java theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
final Application app =
    Application.builder(connector)
        .schema(schema)
        .managerDescription(managerDescription)
        .databaseSettings(databaseSettings)
        .build();
```

## Examples

### Example with one table

Consider the following table `SALES`.

| SALE\_ID | DATE       | SHOP    | PRODUCT | QUANTITY | UNIT\_PRICE |
| -------- | ---------- | ------- | ------- | -------- | ----------- |
| S0010    | 2022-01-31 | shop\_2 | BED\_2  | 3        | 150         |
| ...      | ...        | ...     | ...     | ...      | ...         |

With this table clusterized by dates,
the query `sum quantities by product` is split up in queries of the form `sum quantities by product where date equals ...`.

```mermaid theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
graph LR;
    %% Nodes definition
    Query["Sum quantities by product"];
    subgraph Where date equals
    Q1("2022-01-31")
    Q2("2022-01-30")
    Qn("...")
    QLast("2022-01-01")
    end;
    subgraph Aggregate Provider
    P1[(Partition 1)]
    P2[(Partition 2)]
    Pn[(...)]
    PLast[(Partition N)]
    end;
    %% Links definition 
    Query ---> Q1
    Query ---> Q2
    Query ---> Qn
    Query ---> QLast
    Q1 -.-> P1
    Q1 -.-> P2
    Q2 -.-> P2
    Q2 -.-> PLast
    QLast -.-> PLast
```

### Example with multiple tables

Now consider this additional table `PRODUCTS`, storing all the products referenced in the `SALES` table.

| PRODUCT\_ID | CATEGORY  | SUB\_CATEGORY | SIZE   | PURCHASE\_PRICE | COLOR | BRAND |
| ----------- | --------- | ------------- | ------ | --------------- | ----- | ----- |
| BED\_2      | Furniture | Bed           | Single | 127             | red   | Mega  |
| ...         | ...       | ...           | ...    | ...             | ...   | ...   |

This table is clusterized by `CATEGORY`.
With a query computing the sum of the margins by product, both the `SALES` and `PRODUCTS` table will be accessed,
so the two clustering fields `DATE` and `CATEGORY` will be used to split the query up.

```mermaid theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
graph LR;
    %% Nodes definition
    Query["Sum of margins by product"];
    subgraph "(date, category)"
    Q1("2022-01-31, furniture")
    Q2("2022-01-31, cloth")
    Q3("2022-01-31, furniture")
    Q4("2022-01-31, cloth")
    Qn("...")
    QLast("2022-01-01, cloth")
    end;
    subgraph Aggregate Provider
    P1[(Partition 1)]
    P2[(Partition 2)]
    Pn[(...)]
    PLast[(Partition N)]
    end;
    %% Links definition 
    Query ---> Q1
    Query ---> Q2
    Query ---> Q3
    Query ---> Q4
    Query ---> Qn
    Query ---> QLast
    Q1 -.-> P1
    Q1 -.-> P2
    Q2 -.-> P2
    Q2 -.-> PLast
    Q3 -.-> P1
    Q4 -.-> PLast
    QLast -.-> PLast
```

### Example with one table and clustering field cardinality above the limit

Consider the following table `SALES`.

| SALE\_ID | DATE       | SHOP    | PRODUCT | QUANTITY | UNIT\_PRICE |
| -------- | ---------- | ------- | ------- | -------- | ----------- |
| S0010    | 2022-01-31 | shop\_2 | BED\_2  | 3        | 150         |
| ...      | ...        | ...     | ...     | ...      | ...         |

With this table clusterized by dates,
the query `sum quantities by product` is split up in queries of the form `sum quantities by product where date equals ...`.

However, as the cardinality of the clustering field `DATE` is above the limit,
each subqueries will be performed with a condition on several dates, in order to stay within the limit of the number of subqueries.

```mermaid theme={"languages":{"custom":["/engine/python-sdk/0.9/languages/pycon.tmLanguage.json"]}}
graph LR;
    %% Nodes definition
    Query["Sum quantities by product"];
    subgraph Where date equals
    Q1("2022-01-31 OR 2022-01-30")
    Qn("...")
    QLast("2022-01-02 OR 2022-01-01")
    end;
    subgraph Aggregate Provider
    P1[(Partition 1)]
    P2[(Partition 2)]
    Pn[(...)]
    PLast[(Partition N)]
    end;
    %% Links definition
    Query ---> Q1
    Query ---> Qn
    Query ---> QLast
    Q1 -.-> P1
    Q1 -.-> P2
    QLast -.-> PLast
```

## Considerations

### Clustering fields and external database clustering

DirectQuery clustering fields can be chosen independently of any database clustering. <br />
Still it can be interesting to match DirectQuery tables clustering fields with the potential clustering already implemented in your external database.
This results in better performances for the sub-queries generated.

### Clustering fields and Aggregate Provider partitioning

DirectQuery clustering fields can be chosen independently of any configured partitioning for Aggregate Providers.
DirectQuery will repartition and reaggregate on the fly if needed.<br />
Still it can be interesting to match DirectQuery tables clustering fields with the partitioning of the existing Aggregate Providers.
This coherence results in better performances to avoid the cost of repartitioning.

## When are clustering fields used

The usage of clustering fields depends on the `UseClusteringFields` database setting and the type of query.

The table below summarizes the different possibilities:

| Query type                             | UseClusteringFields.ALWAYS | UseClusteringFields.FEEDING |
| -------------------------------------- | -------------------------- | --------------------------- |
| discovery                              | no                         | no                          |
| aggregate provider feeding             | yes                        | yes                         |
| Just In Time cube query                | yes                        | no                          |
| hierarchy feeding                      | no                         | no                          |
| ad hoc list/distinct query on database | no                         | no                          |
