Databricks

Time-travel support

Time-travel is supported by DirectQuery with Databricks, but only for tables (not views, as Databricks does not support time-travel on views). This means that applications based on this database are subject to data desynchronization when a view of the database changes.

To choose the time-travel behavior, DatabricksDirectQuerySettings#timeTravelPolicy can be configured with the following values:

  • DatabricksDirectQuerySettings.TimeTravelPolicy#DISABLED: time-travel is never used.
  • DatabricksDirectQuerySettings.TimeTravelPolicy#STRICT: time-travel is used if the model contains only tables and no views; otherwise an error is thrown. This is the default mode.
  • DatabricksDirectQuerySettings.TimeTravelPolicy#LIGHT: time-travel is used only for tables; views are queried without it.
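
To make the three modes concrete, here is a minimal, self-contained Java sketch of the decision logic they describe. The enum, class, and method names below are hypothetical illustrations, not the actual Atoti Server API.

```java
import java.util.List;

// Hypothetical sketch of the documented time-travel policy semantics.
// These names are illustrative and are NOT the real Atoti Server API.
public class TimeTravelPolicyDemo {
    public enum Policy { DISABLED, STRICT, LIGHT }
    public enum Relation { TABLE, VIEW }

    // Decide whether a given relation is queried with time-travel.
    public static boolean useTimeTravel(Policy policy, Relation relation, List<Relation> model) {
        switch (policy) {
            case DISABLED:
                return false; // time-travel is never used
            case STRICT:
                if (model.contains(Relation.VIEW)) {
                    throw new IllegalStateException("STRICT mode: the model contains a view");
                }
                return true; // tables only, so time-travel applies everywhere
            case LIGHT:
                return relation == Relation.TABLE; // views are queried without time-travel
            default:
                throw new AssertionError(policy);
        }
    }

    public static void main(String[] args) {
        List<Relation> mixed = List.of(Relation.TABLE, Relation.VIEW);
        System.out.println(useTimeTravel(Policy.LIGHT, Relation.TABLE, mixed)); // true
        System.out.println(useTimeTravel(Policy.LIGHT, Relation.VIEW, mixed));  // false
    }
}
```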

Read this page to learn more about the visible effects of this desynchronization.

Vectors support

Databricks handles vectors natively via the Array type in the model and special functions at query time. However, Databricks does not natively support the aggregation functions used by Atoti Server, such as vector sum.

Emulated vectors (multi-row and multi-column) are also supported on Databricks.
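
As an illustration of the multi-row emulation, the sketch below reassembles vectors whose elements are stored one per row alongside an index column. The (key, index, value) layout is an assumption made for this example, not the exact format DirectQuery uses.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of multi-row vector emulation: each vector
// element sits on its own row, and the client reassembles rows into arrays.
public class EmulatedVectorDemo {
    public record Row(String key, int index, double value) {}

    // Group rows by key and place each value at its index in the result array.
    public static Map<String, double[]> reassemble(List<Row> rows) {
        Map<String, Integer> sizes = new HashMap<>();
        for (Row r : rows) {
            sizes.merge(r.key(), r.index() + 1, Math::max);
        }
        Map<String, double[]> vectors = new HashMap<>();
        for (Row r : rows) {
            vectors.computeIfAbsent(r.key(), k -> new double[sizes.get(k)])[r.index()] = r.value();
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("pnl", 0, 1.5), new Row("pnl", 1, -2.0), new Row("pnl", 2, 0.5));
        System.out.println(Arrays.toString(reassemble(rows).get("pnl"))); // [1.5, -2.0, 0.5]
    }
}
```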

Databricks SQL Warehouse

Native vectors with DirectQuery are not supported on Databricks SQL Warehouse, as it does not support UDAFs. Only emulated vectors are supported.

Databricks Cluster

To support native vector aggregation, DirectQuery provides UDAFs (User-Defined Aggregation Functions) as precompiled jars, which must be registered on the Databricks cluster.

The jar can be downloaded from ActiveViam's JFrog Artifactory, under the artifact name com.activeviam.database.databricks-udafs.

Upload the jar into Databricks: at the top of a notebook, select "Open DBFS File Browser", then "Upload". Note that the DBFS file browser is not enabled in the UI by default.

Then create the functions with the following SQL, replacing ATOTI_VERSION with your version:

CREATE OR REPLACE FUNCTION grossVector AS 'com.activeviam.database.databricksudafs.internal.GenericUdafGrossVector' USING JAR 'dbfs:/FileStore/jars/databricks_udafs_ATOTI_VERSION.jar';
CREATE OR REPLACE FUNCTION longVector AS 'com.activeviam.database.databricksudafs.internal.GenericUdafLongVector' USING JAR 'dbfs:/FileStore/jars/databricks_udafs_ATOTI_VERSION.jar';
CREATE OR REPLACE FUNCTION shortVector AS 'com.activeviam.database.databricksudafs.internal.GenericUdafShortVector' USING JAR 'dbfs:/FileStore/jars/databricks_udafs_ATOTI_VERSION.jar';
CREATE OR REPLACE FUNCTION sumVector AS 'com.activeviam.database.databricksudafs.internal.GenericUdafSumVector' USING JAR 'dbfs:/FileStore/jars/databricks_udafs_ATOTI_VERSION.jar';
CREATE OR REPLACE FUNCTION sumProductVector AS 'com.activeviam.database.databricksudafs.internal.SumProductUdafVector' USING JAR 'dbfs:/FileStore/jars/databricks_udafs_ATOTI_VERSION.jar';

After uploading a new jar, do not forget to restart the cluster. You can then use these UDAFs as aggregation functions.

Databricks clusters also support emulated vectors, but native vectors are recommended for better performance.

Gotchas

Nullable fields

Databricks can define fields with nullable types. DirectQuery detects this and defines its internal model accordingly.
While this is not a problem in itself, it can conflict with the rule in Atoti cubes that all levels must be based on non-nullable values. This does not create any issue, as the local model is updated behind the scenes, assigning a default value based on the type.
However, it has a side effect on query performance: DirectQuery must convert null values to their assigned default values on the fly, and add extra conditions during joins to handle null values on both sides.
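
The two costs described above can be sketched as follows. This is not the actual DirectQuery implementation; the per-type default values chosen here (and the "N/A" string) are illustrative assumptions.

```java
import java.util.Objects;

// Sketch of the two runtime costs of nullable fields: converting nulls to
// per-type defaults on the fly, and null-safe equality during joins.
// Default values are illustrative assumptions, not DirectQuery's actual choices.
public class NullHandlingDemo {
    // Replace a nullable value with a default based on its type.
    public static Object withDefault(Object value, Class<?> type) {
        if (value != null) {
            return value;
        }
        if (type == Integer.class || type == Long.class) {
            return 0L;
        }
        if (type == Double.class) {
            return 0.0d;
        }
        if (type == String.class) {
            return "N/A";
        }
        throw new IllegalArgumentException("No default for " + type);
    }

    // Null-safe join condition: two nulls match each other, mirroring the
    // extra conditions added to joins to handle nulls on both sides.
    public static boolean joinKeysMatch(Object left, Object right) {
        return Objects.equals(left, right);
    }

    public static void main(String[] args) {
        System.out.println(withDefault(null, String.class)); // N/A
        System.out.println(joinKeysMatch(null, null));       // true
    }
}
```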