Skip to main content

Databricks

Time-travel support

Time-travel is supported by DirectQuery with Databricks, but is limited to tables (and not views, as it is not supported by Databricks). This means that updating applications based on this Database are subject to data desynchronization in case of changes in a view of the database.

To choose the time-travel behavior, DatabricksDirectQuerySettings#timeTravelPolicy can be configured with the following values:

  • DatabricksDirectQuerySettings.TimeTravelPolicy#DISABLED: time-travel will not be used.
  • DatabricksDirectQuerySettings.TimeTravelPolicy#STRICT: time-travel will be used if there are only tables and no views, and otherwise will throw an error. This is the default mode.
  • DatabricksDirectQuerySettings.TimeTravelPolicy#LIGHT: time-travel will be used only for tables and views will be queried without it.

Read this page to learn more about the visible effects of this desynchronization.

Vectors support

Databricks handles vectors natively, via the Array type in the model and thanks to special functions at query time. However, Databricks does not natively support the aggregation function used by Atoti Server such as a vector sum.

Emulated vectors (multi-row and multi-column) are also supported on Databricks.

Databricks SQL Warehouse

Native vectors with DirectQuery are not supported on SQL warehouse, as UDAFs are not supported on Databricks SQL Warehouse. Only emulated vector are supported.

Databricks Cluster

In order to support (native vector aggregation)[../vectors.md#native-support], DirectQuery provides UDAFs (User Defined Aggregation Functions) as precompiled jars. The user must register those on the Databricks cluster.

The jar can be downloaded from ActiveViam's JFrog Artifactory. The artifact is under the name com.activeviam.database.databricks-udafs.

Upload those into Databricks: at the top of a notebook, select the option "Open DBFS File Browser" and then "Upload". Note that by default, the UI does not enable this file browser.

Then create the functions with the following sql, replacing ATOTI_VERSION by your version:

DROP FUNCTION IF EXISTS grossVector;
DROP FUNCTION IF EXISTS longVector;
DROP FUNCTION IF EXISTS shortVector;
DROP FUNCTION IF EXISTS sumVector;
DROP FUNCTION IF EXISTS sumProductVector;
CREATE FUNCTION grossVector AS 'com.activeviam.database.databricksudafs.internal.GenericUdafGrossVector' USING JAR 'dbfs:/FileStore/jars/databricks_udafs_ATOTI_VERSION.jar';
CREATE FUNCTION longVector AS 'com.activeviam.database.databricksudafs.internal.GenericUdafLongVector' USING JAR 'dbfs:/FileStore/jars/databricks_udafs_ATOTI_VERSION.jar';
CREATE FUNCTION shortVector AS 'com.activeviam.database.databricksudafs.internal.GenericUdafShortVector' USING JAR 'dbfs:/FileStore/jars/databricks_udafs_ATOTI_VERSION.jar';
CREATE FUNCTION sumVector AS 'com.activeviam.database.databricksudafs.internal.GenericUdafSumVector' USING JAR 'dbfs:/FileStore/jars/databricks_udafs_ATOTI_VERSION.jar';
CREATE FUNCTION sumProductVector AS 'com.activeviam.database.databricksudafs.internal.SumProductUdafVector' USING JAR 'dbfs:/FileStore/jars/databricks_udafs_ATOTI_VERSION.jar';

You can then use these new UDAFs as aggregation functions.

Run the following query to test whether the UDAFs were correctly registered:

SELECT group, sumVector(arr) FROM (SELECT * from (values ('a', ARRAY(1, 2, 3)), ('a', ARRAY(10, 10, 10))) AS t(group, arr)) GROUP BY group;

The query should run without error, and return a single row: ('a', [11, 12, 13]).

Databricks cluster also support emulated vectors but it is recommended to use the native ones for performance.

Gotchas

Nullable fields

Databricks can define fields with nullable types. DirectQuery is capable of detecting this and defines its internal model accordingly.
While this is not a problem in itself, it can conflict with the rule in Atoti cubes that all levels must be based on non-nullable values. This does not create an issue of any sort as the local model is updated behind the scene, assigning a default value based on the type.
It has a side effect on the query performance, as DirectQuery must convert on-the-fly null values to their assigned default values, as well as adding extra conditions during joins to handle null values on both sides.

Feeding connection string

On modern cloud distributed databases such as Databricks, a key design point is the separation of compute and storage. This allows to use different computing resources to process queries and brings flexibility and the capability to scale up and down the resources (and the associated bill !). On Databricks, the compute resources can be clusters or SQL warehouses. A bigger compute will process queries faster but will also cost more.

The load on the external database of a DirectQuery application is quite particular:

  • At the application startup and refresh, many queries are performed on the database to initialize the cube and cache data in it (the hierarchies and aggregate providers).
  • After the initial feeding, if the aggregate providers have been chosen wisely, most user queries will hit the cache data and actually never been run on the external database. Additionally, queries hitting the external database should most of the time be drill-down with a very reduced scope and should be easier for Databricks to handle.

In order to have faster startup time, it is recommended to use bigger computational resources during that time. Afterward, to save on cost, the computational resources can be scaled down. DirectQuery provides a way to define two compute resources to use for your application:

  • one powerful cluster or SQL warehouse, used during initial feeding and refresh, to guarantee faster startup and refresh times
  • a regular cluster or SQL warehouse, to serve the queries after the initial feeding in a cost-efficient manner

After the initial feeding, the large compute resource will become idle and be automatically shut down if you configured it as such.

This feeding connection string can be defined as such:

final DatabricksConnectionProperties properties =
DatabricksConnectionProperties.builder()
.connectionString("CONNECTION_STRING_TO_SMALL_AND_CHEAP_CLUSTER")
.feedingConnectionString("CONNECTION_STRING_TO_LARGE_AND_EXPENSIVE_CLUSTER")
.additionalOption(
DatabricksConnectionProperties.PASSWORD_PROPERTY_KEY, "your-plain-password")
.build();

You can then pass it as argument to your session:

final DatabricksClientSettings clientSettings =
DatabricksClientSettings.builder().connectionProperties(properties).build();
final DirectQueryConnector<DatabricksDatabaseSettings> connector =
DatabricksConnectorFactory.INSTANCE.createConnector(clientSettings);

If it is not defined, the regular compute resource will be used for the feeding.

Note that both compute resources need to be of the same type (either cluster or SQL warehouse), to run the same runtime version, and to share the same password.