Time-Travel Emulation
Emulated time-travel is a mechanism to solve de-synchronization issues, as explained in the versioning documentation. However, unlike native time-travel, it requires some additional set-up on the external database itself. It also makes the queries more complex and slower (typically by about 10-30%).
This feature is experimental and requires the following option to be enabled: `-Dactiveviam.feature.experimental.emulated_time_travel.enabled=true`.
External database set-up
The general idea is to add one or two columns to each table requiring time-travel; these columns contain metadata about the validity of each row.
The data type of these columns is left to the user's choice, but it needs to support comparison operators in SQL; natural choices are a timestamp or an integer.
The data type needs to be the same across all the tables in the database.
Table definition
General case
For each table requiring time-travel, the user needs to add 2 columns, let's call them `valid_from` and `valid_to`.
These columns hold the range of versions for which the row is valid, where `valid_from` is inclusive and `valid_to` exclusive.
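As a sketch, assuming an integer version type and a hypothetical `trades` table, the definition could look like this (adapt the types and names to your database):

```sql
-- Hypothetical table using an integer version type.
-- valid_from / valid_to delimit the half-open version range
-- [valid_from, valid_to) for which the row is valid.
CREATE TABLE trades (
    id          INTEGER,
    description VARCHAR(255),
    valid_from  INTEGER NOT NULL,  -- first version for which the row is valid (inclusive)
    valid_to    INTEGER NOT NULL   -- first version for which the row is no longer valid (exclusive)
);
```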
Append-only case
For append-only tables, i.e. tables where no updates or deletes are allowed, the user only needs to add 1 column, let's call it `valid_from`.
This column states from which version the row is valid, where `valid_from` is inclusive.
If your table already contains an auto-increment column, then you can use it as `valid_from`.
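As a sketch, the append-only variant of the hypothetical table above would carry the single column:

```sql
-- Hypothetical append-only table: rows are never updated or deleted,
-- so only valid_from is needed.
CREATE TABLE trade_events (
    id          INTEGER,
    description VARCHAR(255),
    valid_from  INTEGER NOT NULL  -- first version for which the row is valid (inclusive)
);
```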
Data management
Current version
This is the version to be used for a given transaction; it must increase from one transaction to the next.
If the data type is a timestamp, then the current version is the timestamp of the transaction.
If you use a timestamp type with an external database spread across several servers, clock synchronization might be an issue for frequent transactions, as the versions might not be increasing: for instance, a transaction handled by a server with a lagging clock might get a timestamp earlier than that of a previous transaction handled by a different server.
If the data type is an integer, then a counter needs to be maintained: it provides the current version and is incremented after each transaction.
Note that for integers, the current version can be different across different tables.
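One possible way to keep such a counter, sketched here with a hypothetical single-row `current_version` table, is to read it for each transaction and increment it afterwards:

```sql
-- Hypothetical single-row table holding the version counter.
CREATE TABLE current_version (version INTEGER NOT NULL);
INSERT INTO current_version VALUES (1);

-- Read the version to use for the current transaction.
SELECT version FROM current_version;

-- After the transaction, move on to the next version.
UPDATE current_version SET version = version + 1;
```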
Insertion
During insertion, the `valid_from` column is set to the current version and the `valid_to` column is set to the default value for valid rows.
This default value can be a maximum value (e.g. `2099-12-31T23:59:59`) or `null`.
The first choice might be less readable but is more efficient for queries, as it avoids the need to check for `null` values.
For example, let's consider a table with the following rows:
Id | Description | valid_from | valid_to |
---|---|---|---|
1 | First row | 1 | 2,147,483,647 |
2 | Second row | 1 | 2,147,483,647 |
The current version is 2, so after adding a new row, the table is:
Id | Description | valid_from | valid_to |
---|---|---|---|
1 | First row | 1 | 2,147,483,647 |
2 | Second row | 1 | 2,147,483,647 |
3 | New row | 2 | 2,147,483,647 |
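In SQL, this insertion could be sketched as follows (reusing the hypothetical `trades` table, with the integer maximum 2,147,483,647 as the default `valid_to` value for valid rows):

```sql
-- Insert a new row at current version 2; valid_to gets the "still valid" default.
INSERT INTO trades (id, description, valid_from, valid_to)
VALUES (3, 'New row', 2, 2147483647);
```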
For an append-only table, everything is the same, except there is no `valid_to` column. The table below summarizes which record operations each method supports:
Time-travel emulation method | Record insertion | Record update | Record deletion |
---|---|---|---|
`valid_from` | ✓ | | |
`valid_from` and `valid_to` | ✓ | ✓ | ✓ |
When running a query at version 1, the first 2 rows are valid and returned (the last one is filtered out); when running a query at version 2, all rows are valid and returned.
With this mechanism, the equivalent data at version 1 is:
Id | Description |
---|---|
1 | First row |
2 | Second row |
And at version 2:
Id | Description |
---|---|
1 | First row |
2 | Second row |
3 | New row |
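The filtering described above boils down to a range check on the version columns; a query at version 1 could be sketched as:

```sql
-- Return only the rows valid at version 1:
-- valid_from is inclusive, valid_to is exclusive.
SELECT id, description
FROM trades
WHERE valid_from <= 1
  AND valid_to > 1;
```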
Deletion
When a row is deleted, the `valid_to` column of the deleted row is set to the current version.
For example, let's consider a table with the following rows:
Id | Description | valid_from | valid_to |
---|---|---|---|
1 | First row | 1 | 2,147,483,647 |
2 | Second row | 1 | 2,147,483,647 |
The current version is 2, so after deleting the second row, the table is:
Id | Description | valid_from | valid_to |
---|---|---|---|
1 | First row | 1 | 2,147,483,647 |
2 | Second row | 1 | 2 |
When running a query at version 1, both rows are valid and returned; when running a query at version 2, only the first row is valid and returned, and the second row is filtered out.
With this mechanism, the equivalent data at version 1 is:
Id | Description |
---|---|
1 | First row |
2 | Second row |
And at version 2:
Id | Description |
---|---|
1 | First row |
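In SQL, such a deletion is a plain UPDATE closing the validity range, for example (sketch on the hypothetical `trades` table):

```sql
-- "Delete" the second row at current version 2 by ending its validity range.
UPDATE trades
SET valid_to = 2
WHERE id = 2
  AND valid_to = 2147483647;  -- only touch the currently valid row
```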
Update
When a row is updated, the `valid_to` column of the old row is set to the current version, and a new row is inserted with the `valid_from` column set to the current version.
For example, let's consider a table with the following rows:
Id | Description | Value | valid_from | valid_to |
---|---|---|---|---|
1 | First row | 100.0 | 1 | 2,147,483,647 |
2 | Second row | 200.0 | 1 | 2,147,483,647 |
The current version is 2, so after updating the first row, the table is:
Id | Description | Value | valid_from | valid_to |
---|---|---|---|---|
1 | First row | 100.0 | 1 | 2 |
2 | Second row | 200.0 | 1 | 2,147,483,647 |
1 | First row | 150.0 | 2 | 2,147,483,647 |
When running a query at version 1, the first 2 rows are valid and returned (the last row is filtered out); when running a query at version 2, only the last 2 rows are valid and returned (the first row is filtered out).
With this mechanism, the equivalent data at version 1 is:
Id | Description | Value |
---|---|---|
1 | First row | 100.0 |
2 | Second row | 200.0 |
And at version 2:
Id | Description | Value |
---|---|---|
1 | First row | 150.0 |
2 | Second row | 200.0 |
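In SQL, an update therefore translates into two statements, sketched below (assuming the hypothetical `trades` table also has a `value` column, as in the example above):

```sql
-- Close the validity range of the old row at current version 2...
UPDATE trades
SET valid_to = 2
WHERE id = 1
  AND valid_to = 2147483647;

-- ...and insert the new state of the row, valid from version 2 onwards.
INSERT INTO trades (id, description, value, valid_from, valid_to)
VALUES (1, 'First row', 150.0, 2, 2147483647);
```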
Data cleaning
For tables with many updates or deletions, there can be many rows that are no longer valid.
To keep the table size from growing indefinitely, it is recommended to clean the data from time to time, by dropping all the rows with `valid_to` less than a given value (e.g. a timestamp one week in the past).
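Such a clean-up can be a simple delete of the rows whose validity ended before the chosen cut-off, for example (sketch, with an integer version type and a cut-off of 100):

```sql
-- Physically remove rows that stopped being valid before version 100.
DELETE FROM trades
WHERE valid_to < 100;
```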
Current version discovery
Similar to versioning for databases with native time-travel, the current version for each table is queried from the external database (by taking the maximum over the `valid_from` and `valid_to` columns).
This version is then used to filter out invalid rows, which ensures the result of a given query is unchanged, even after the table is updated.
This discovery runs at the creation of each new Atoti version.
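For an integer version type, such a discovery query could be sketched as follows (the exact form depends on the database; the sentinel value used for valid rows has to be ignored):

```sql
-- The current version is the largest version ever written to either column,
-- ignoring the "still valid" sentinel value.
SELECT MAX(version) AS current_version
FROM (
    SELECT valid_from AS version FROM trades
    UNION ALL
    SELECT valid_to FROM trades WHERE valid_to < 2147483647
) AS versions;
```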
Atoti set-up
The examples below are shown for ClickHouse, but the code is identical for other databases.
An object `EmulatedTimeTravelDescription` needs to be created as follows:
```java
final EmulatedTimeTravelDescription timeTravelDescription =
    EmulatedTimeTravelDescription.builder()
        .table(new ClickhouseTableId(MY_DB, "table1"), "valid_from", "valid_to")
        .appendOnlyTable(new ClickhouseTableId(MY_DB, "table2"), "valid_from")
        .emptyTableVersion(0)
        .validRowToColumnValue(Integer.MAX_VALUE)
        .build();
```
All the tables with emulated time-travel should be added to the builder, along with the names (in the external database) of the columns. Some additional parameters can also be set:
- `emptyTableVersion` (required): the version to be inferred for empty tables. A discovery query on an empty table is unable to give any information, while a version is required to filter out rows added later on. Anything less than the current version is acceptable (e.g. `-1` for integers, or `1970-01-01T00:00:00`).
- `validRowToColumnValue` (optional, default: `null`): the value of the `valid_to` column for valid rows. It can be `null` or a maximum value (e.g. `2099-12-31T23:59:59`). A maximum value is more efficient for queries, as it avoids the need to check for `null` values.
Then this object should be added to the database settings, along with the optional parameter `timeTravelDiscoveryQueryTimeout` (which defaults to 1 minute):
```java
final ClickhouseDatabaseSettings databaseSettings =
    ClickhouseDatabaseSettings.builder()
        .emulatedTimeTravelDescription(timeTravelDescription)
        .timeTravelDiscoveryQueryTimeout(Duration.ofMinutes(10))
        .build();
```