Checkpoint and Restore an Atoti application
Thanks to CRaC, you can take a snapshot of a running Atoti application and later restore it almost instantly instead of starting from scratch.
What is CRaC?
CRaC (Coordinated Restore at Checkpoint) is a Java OpenJDK feature designed to significantly reduce the startup time of Java applications. It allows an application to be paused ("checkpointed") at a fully initialized, ready-to-serve state, and later resumed ("restored") from that point — instead of restarting from scratch.
CRaC operates by taking a snapshot of the running JVM, including memory, file descriptors, and threads. When restored, the application resumes from the same state it was in at the time of the checkpoint.
Constraints
- Operating system: CRaC currently supports only Linux on x64 and ARM64 architectures.
- JDK requirement: CRaC is not available in standard OpenJDK distributions. You need a supported distribution such as Azul Zulu with CRaC in version 21.
- Filesystem and environment consistency: Restoring a checkpoint assumes that the environment (e.g. network, file paths) remains consistent. This means that the architecture, OS version, JDK, and application JAR file in the restore environment must be identical to those in the checkpoint environment. However, the two environments can differ in RAM, processors, and storage. For instance, any swapped vector files must exist at the same path in the restore environment.
- Spring Boot: In Atoti, CRaC integration is primarily managed by the Spring Boot framework, which supports CRaC starting from version 3.2. More detailed information can be found in the official documentation.
What it enables in a non-distributed setup
Imagine starting your Atoti application (a Spring Boot service) and waiting for it to load data, warm up caches, and initialize internal services. Instead of repeating this process every time the application starts, CRaC allows you to:
- Start the application normally
- Once it is fully initialized and ready to serve queries, checkpoint it
- Later, instead of starting from scratch, restore it from the checkpoint — skipping all the startup and warmup time
A simple story
Let’s say you start Atoti in the morning, and it takes 20 minutes to load and prepare. You checkpoint it after this point. Now, every time you need to redeploy, restart, or scale, you can restore from the checkpoint and be ready almost instantly.
Why use CRaC?
The primary motivation is fast startup and instant readiness. This is especially valuable when:
- you want to speed up development iterations
- you need to quickly recover from a crash or restart
- you are working in environments where cold start times are too long
One of the main use cases is stopping a costly application while it is not in use and restoring it almost instantly when it is needed again. Because CRaC avoids the application's startup time, you can set a shorter inactivity delay before stopping the application.
What happens during checkpoint and restore?
- Application pauses: During checkpointing, the application temporarily freezes while its state is being captured.
You can choose whether to stop the application after checkpointing.
Checkpoint duration highly depends on the storage throughput.
- Queries during checkpoint: During checkpointing, the process is paused, so it cannot be queried for a brief moment.
Queries that are running when a checkpoint starts are likely to be interrupted.
- Queries during restore: Queries cannot be executed until the application is fully restored. Once restored, the state is identical to the moment of checkpoint, so any off-heap data or in-memory structures are still intact.
Although the application is restored almost instantly, not all the data is immediately re-mapped into memory, so the first queries can take around 10 to 20% more time as they trigger the memory mapping of the queried data.
- Network sessions: Any open connections (like websockets or REST endpoints) will not survive the checkpoint and will be re-established during restore.
- File watcher: The CSV source watcher service keeps working after restore.
Guidelines
Follow these guidelines closely, as any failure during a checkpoint will crash the application.
When to checkpoint?
It is highly recommended to checkpoint when the application is in a fully initialized and stable state, meaning that Spring initialization is over and the system is ready to serve queries.
Avoid checkpointing if:
- there are ongoing queries, as they are likely to be interrupted
- there are ongoing transactions, as any open file that a source might read during the checkpoint will cause an exception and crash the application
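The conditions above can be expressed as a small readiness gate that is consulted before a checkpoint is requested. This is only a sketch: `CheckpointGate` and its methods are hypothetical names, not part of Atoti, Spring Boot, or CRaC, and the application is assumed to call the hooks itself when queries and transactions start and finish.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: tracks whether the application is in a state that is safe to checkpoint. */
final class CheckpointGate {
    private final AtomicBoolean initialized = new AtomicBoolean(false);
    private final AtomicInteger runningQueries = new AtomicInteger();
    private final AtomicInteger openTransactions = new AtomicInteger();

    /** Call once Spring initialization is over and the application can serve queries. */
    void markInitialized() { initialized.set(true); }

    void queryStarted() { runningQueries.incrementAndGet(); }
    void queryFinished() { runningQueries.decrementAndGet(); }

    void transactionStarted() { openTransactions.incrementAndGet(); }
    void transactionFinished() { openTransactions.decrementAndGet(); }

    /** True only when initialized, with no running query and no open transaction. */
    boolean canCheckpoint() {
        return initialized.get()
                && runningQueries.get() == 0
                && openTransactions.get() == 0;
    }
}
```

The check is best effort: a query or transaction could still start right after `canCheckpoint()` returns true, so a real deployment would probably also stop accepting new work before requesting the checkpoint.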
An application cannot be checkpointed with CRaC if there is any open resource, like a file or a socket.
While most resources are already properly handled by Atoti and Spring Boot, you must manage the closing and reopening of your own resources to be able to use CRaC, as described in the official documentation.
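As an illustration of closing and reopening your own resources, here is a minimal sketch of a class that holds a file open. `AuditLog` is a hypothetical example class. In a real application it would implement the `org.crac.Resource` interface and be registered with `Core.getGlobalContext().register(...)`; the two hooks below mirror that interface's `beforeCheckpoint` and `afterRestore` callbacks but omit the `Context` parameter, so that the sketch compiles without the CRaC library.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical application resource that keeps a file handle open while running. */
final class AuditLog {
    private final Path file;
    private BufferedWriter writer;

    AuditLog(Path file) throws IOException {
        this.file = file;
        open();
    }

    /** Opens the file in append mode, creating it if needed. */
    private void open() throws IOException {
        writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    void write(String line) throws IOException {
        writer.write(line);
        writer.newLine();
        writer.flush();
    }

    /** Hook to run before the checkpoint: release the file handle. */
    void beforeCheckpoint() throws IOException {
        writer.close();
        writer = null;
    }

    /** Hook to run after the restore: reopen the file. */
    void afterRestore() throws IOException {
        open();
    }

    boolean isOpen() {
        return writer != null;
    }
}
```

Opening in append mode means lines written before the checkpoint are preserved when the file is reopened after restore.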
Check how to checkpoint an application with CRaC.
How to restore?
It is highly recommended to restore an application in a clean environment to facilitate and optimize memory mapping.
As mentioned in the Constraints section, the environment must remain consistent with the one where the application was checkpointed.
The license is reset at checkpoint, then reloaded and checked at restore.
Check how to restore an application with CRaC.
CRaC VM options
Creating and restoring a checkpoint is performed by a component called the engine.
CRaC provides different engines and several Java command-line options that control its behavior.
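To make the most common options concrete, the sketch below assembles the three command lines typically involved: starting the JVM with `-XX:CRaCCheckpointTo`, requesting a checkpoint with `jcmd <pid> JDK.checkpoint`, and restoring with `-XX:CRaCRestoreFrom`. These options come from the CRaC project; `CracCommands`, the image directory, and the JAR name are hypothetical, and the exact set of supported options should be checked against your JDK distribution's documentation.

```java
import java.util.List;

/** Hypothetical helper that assembles CRaC command lines; paths and JAR names are placeholders. */
final class CracCommands {
    private CracCommands() {}

    /** Starts the JVM so that a later checkpoint request writes its image to imageDir. */
    static List<String> startForCheckpoint(String imageDir, String jar) {
        return List.of("java", "-XX:CRaCCheckpointTo=" + imageDir, "-jar", jar);
    }

    /** Asks the running JVM with the given process id to checkpoint itself. */
    static List<String> requestCheckpoint(long pid) {
        return List.of("jcmd", Long.toString(pid), "JDK.checkpoint");
    }

    /** Restores the application from the image stored in imageDir. */
    static List<String> restore(String imageDir) {
        return List.of("java", "-XX:CRaCRestoreFrom=" + imageDir);
    }
}
```

Each list can be passed to `new ProcessBuilder(command).inheritIO().start()` or joined with spaces to obtain the equivalent shell command.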
What it enables in a distributed setup
Both query and data nodes can be checkpointed and restored, in any order.
A checkpointed node behaves as if it were leaving the cluster, meaning that when a data node is checkpointed, its members are removed from the query node. The node then rejoins the cluster after restore.
Limitations
Currently, CRaC does not support creating a second checkpoint after a restore — the application must be restarted from scratch before a new checkpoint can be created.
Off-heap memory might not be correctly re-mapped on the right NUMA nodes after restore. Thus, it is not recommended to have a NUMA architecture when using CRaC.