Skip to main content

Snowflake

Time-travel support

Snowflake supports time-travel, which means that querying the past is possible and queries stay consistent if the data in Snowflake is updated.
When creating new epochs for the Database, DirectQuery automatically snapshots the current timestamp and use it.

Querying past snapshots is done by using the AT TIMESTAMP SQL syntax. As such, when the historical data becomes unavailable, this is reflected in DirectQuery and queries against those old versions will fail.

Vector supports

Only multi-rows vectors and multi-columns vectors are supported by Snowflake.

The array type available in Snowflake does not offer powerful aggregation functions to work with them.

Gotchas

Nullable fields

Snowflake can define fields with nullable types. DirectQuery is capable of detecting this and defines its internal model accordingly.
While this is not a problem in itself, it can conflict with the rule in Atoti cubes that all levels must be based on non-nullable values. This does not create an issue of any sort as the local model is updated behind the scene, assigning a default value based on the type.
It has a side effect on the query performance, as DirectQuery must convert on-the-fly null values to their assigned default values, as well as adding extra conditions during joins to handle null values on both sides.

Feeding warehouse

On modern cloud distributed databases such as Snowflake, a key design point is the separation of compute and storage. This allows to use different computing resources to process queries and brings flexibility and the capability to scale up and down the resources (and the associated bill !). On Snowflake, the compute resources are called warehouses. A bigger warehouse will process queries faster but will also cost more.

The load on the external database of a DirectQuery application is quite particular:

  • At the application startup and refresh, many queries are performed on the database to initialize the cube and cache data in it (the hierarchies and aggregate providers).
  • After the initial feeding, if the aggregate providers have been chosen wisely, most user queries will hit the cache data and actually never been run on the external database. Additionally, queries hitting the external database should most of the time be drill-down with a very reduced scope and should be easier to handle by Snowflake.

In order to have faster startup time, it is recommended to use bigger computational resources during that time. Afterward, to save on cost, the computational resources can be scaled down. This can be done in Snowflake UI, but DirectQuery provides a way to define two warehouses to use for your application:

  • one powerful warehouse, used during initial feeding and refresh, to guarantee faster startup and refresh times
  • a regular warehouse, to serve the queries after the initial feeding in a cost-efficient manner

After the initial feeding, the large warehouse will become idle and be automatically shut down if you configured it as such.

This feeding warehouse can be defined as such:

final SnowflakeProperties properties =
SnowflakeProperties.builder()
.connectionString(connectionString)
.warehouse("SMALL_AND_CHEAP_WAREHOUSE_NAME")
.feedingWarehouse("BIG_AND_EXPENSIVE_WAREHOUSE_NAME")
.additionalOption(SFSessionProperty.PASSWORD, "your-plain-password")
.build();

You can then pass it as argument to your session:

final DirectQueryConnector<?> connector =
SnowflakeConnectorFactory.INSTANCE.createConnector(clientSettings);

If it is not defined, the regular warehouse will be used for the initial feeding.

Faster array aggregation on multi-rows vectors

To improve performance on multi-rows aggregation, DirectQuery provides a javascript function wrapper to improve performance of the Snowflake built-in ARRAY_AGG function.

You must define this function on Snowflake, thanks to a SQL query:

CREATE OR REPLACE FUNCTION EXAMPLE_DATABASE.EXAMPLE_SCHEMA.ARRAY_AGG_WRAPPER("ZIP_ARRAY" ARRAY)
RETURNS ARRAY
LANGUAGE JAVASCRIPT
AS '
// This function returns an array of values from an unsorted array of (index, value)

// If any index is null return null
if (ZIP_ARRAY.map(e => e[0]).some(index => index == null)) {
return undefined
}

// If all element are null return null
if (ZIP_ARRAY.map(e => e[1]).every(element => element == null)) {
// NULL element are converted to undefined
// element == null should catch both null and undefined
return undefined // This will be converted to SQL NULL
}
return ZIP_ARRAY
.sort((x, y) => x[0] < y[0] ? -1 : 1)
.map(e => e[1])
';

This function must be accessible to the role running the queries.

Then, in your DirectQuery application, you must pass this method name to:

final SnowflakeDirectQuerySettings settings =
SnowflakeDirectQuerySettings.builder()
.arrayAggWrapperFunctionName("ARRAY_AGG_WRAPPER")
.build();

You can then pass it as argument to your session:

final DirectQueryConnector<SnowflakeDatabaseSettings> connector =
SnowflakeConnectorFactory.INSTANCE.createConnector(clientSettings);

If not defined, the slower built-in Snowflake ARRAY_AGG function will be used.