Introduction
A vector in Atoti is equivalent to a fixed-size array that has been strongly typed for performance. It comes with read and write capabilities, internal operations for statistical use (topKIndices, variance…) and cross-vector operations (plus, minus, scale…). Vectors are used in Atoti as values of a field in the datastore. Such a field is declared when building a store with the StoreDescriptionBuilder, using withVectorField(String name, String type).
Storing Design Overview
While this section mentions Off-Heap memory to explain the design choices behind the vector architecture, the design is the same for On-Heap vectors.
In most Java applications, there is a typical distribution for the lifetime of created objects, where the vast majority of objects die young. Most Garbage Collectors are built around this empirical observation. However, Atoti Server, and especially Atoti’s datastore, acts as a database and does not follow this rule: all the data is long-lived. The Garbage Collector must keep track of these objects, and even move them when necessary. Off-Heap memory provides a way to hide these objects from the Garbage Collector, allowing it to focus on objects created and deleted within the application’s life cycle. The Java NIO API gives access to a DirectByteBuffer to read from and write to the Off-Heap memory.
However, these buffers do not provide the performance needed for Atoti:
- Instances of DirectByteBuffer are much bigger than standard Java arrays.
- Java poorly handles millions of such buffers.
- The creation of each buffer induces a call to the system’s malloc, a single-threaded memory allocator that adds another memory overhead for its own tracking system. This goes against Atoti’s efforts for multithreading, notably through partitioning.
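To make the per-buffer costs listed above concrete, here is a minimal sketch using only the standard Java NIO API. It packs several fixed-size vectors into a single direct buffer, so that one native allocation serves many vectors. The class OffHeapBlockSketch and its methods are hypothetical names chosen for illustration; this is not Atoti's allocator, which this page does not describe at the code level.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sketch: one off-heap block shared by several fixed-size vectors of doubles. */
final class OffHeapBlockSketch {
    private final ByteBuffer block;  // a single direct buffer, hence a single native allocation
    private final int vectorCount;
    private final int vectorLength;  // number of doubles per vector

    OffHeapBlockSketch(int vectorCount, int vectorLength) {
        this.vectorCount = vectorCount;
        this.vectorLength = vectorLength;
        // allocateDirect returns a zero-filled buffer that lives outside the Java heap
        this.block = ByteBuffer
                .allocateDirect(vectorCount * vectorLength * Double.BYTES)
                .order(ByteOrder.nativeOrder());
    }

    void write(int vectorIndex, int position, double value) {
        block.putDouble(byteOffset(vectorIndex, position), value);
    }

    double read(int vectorIndex, int position) {
        return block.getDouble(byteOffset(vectorIndex, position));
    }

    boolean isOffHeap() {
        return block.isDirect();
    }

    /** Byte offset of an element, after checking that it belongs to an existing vector. */
    private int byteOffset(int vectorIndex, int position) {
        if (vectorIndex < 0 || vectorIndex >= vectorCount
                || position < 0 || position >= vectorLength) {
            throw new IndexOutOfBoundsException("vector " + vectorIndex + ", position " + position);
        }
        return (vectorIndex * vectorLength + position) * Double.BYTES;
    }
}
```

With this layout, the buffer object overhead and the native allocation call are paid once per block rather than once per vector.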
Allocation
In the datastore, it is possible to define the size of the vector and the vector’s block for each field. Fields with different vector block sizes use different vector allocators. However, vectors with the same vector block size rely on the same allocator, and thus belong to the same blocks, even if they have different vector sizes. For instance, if field A has vectors of size 10, and field B has vectors of size 5, while both have a block size of 50, a block may, at a given time, contain vectors of both fields. If the allocator then needs to allocate a vector for field A and the current block does not have enough room left, a new block is created.
Off-heap storage can be disabled. In this case, Atoti relies on either arrays or heap buffers to manage the storage.
The class VectorUtils offers entry points to create a range of transient vectors that can be used for optimization purposes. Atoti does not, by default, optimize vectors to rely on these entry points. For instance, if a vector contains a single value repeated n times, VectorUtils offers a way to create an ISameValueVector, but there is no internal mechanism that reads each vector and replaces it, where possible, with an ISameValueVector.
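The sharing of blocks between fields can be pictured with a small bump allocator. The sketch below is purely illustrative: BlockAllocatorSketch and Slot are hypothetical names, sizes are counted in elements, and plain double arrays stand in for the real blocks. It only shows the rule described above, namely that vectors of any length go into the current block until one no longer fits, at which point a new block is opened.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bump allocator: vectors of any length share blocks of one fixed size. */
final class BlockAllocatorSketch {

    /** Where a vector lives: which block, and at which offset inside that block. */
    static final class Slot {
        final int blockIndex;
        final int offset;
        final int length;

        Slot(int blockIndex, int offset, int length) {
            this.blockIndex = blockIndex;
            this.offset = offset;
            this.length = length;
        }
    }

    private final int blockSize;
    private final List<double[]> blocks = new ArrayList<>();
    private int usedInLastBlock;

    BlockAllocatorSketch(int blockSize) {
        this.blockSize = blockSize;
    }

    Slot allocate(int vectorLength) {
        if (vectorLength <= 0 || vectorLength > blockSize) {
            throw new IllegalArgumentException("A vector must fit in a single block");
        }
        // Open a new block when there is none yet, or when the current one lacks room.
        if (blocks.isEmpty() || usedInLastBlock + vectorLength > blockSize) {
            blocks.add(new double[blockSize]);
            usedInLastBlock = 0;
        }
        Slot slot = new Slot(blocks.size() - 1, usedInLastBlock, vectorLength);
        usedInLastBlock += vectorLength;
        return slot;
    }

    int blockCount() {
        return blocks.size();
    }
}
```

As one possible scenario with a block size of 50: after three vectors of size 10 and three of size 5 have been allocated, 45 elements are in use, so the next vector of size 10 does not fit and lands at offset 0 of a second block.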
Compaction
Atoti Server relies on a copying Garbage Collection algorithm. Whenever a version is discarded, Atoti Server iterates through all vectors in the datastore and the aggregate store, and checks if they belong to a block that is mostly garbage. If they do, the vectors are transferred to another block.
Configuration
Each vector field can configure its own size and block size. The default block size is determined by the ActiveViam property activeviam.vectors.defaultBlockSize.
By default, this value is automatically set to 16K, 32K, or 64K, depending on the memory available to the JVM.
The ActiveViam property activeviam.vectors.garbageCollectionFactor (value between 0 and 1) controls how soon a block is considered to need compaction.
It represents a trade-off between transaction performance and memory footprint.
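The trade-off can be illustrated with a toy model of copying compaction. The sketch below makes a loud assumption that this page does not confirm: a block is taken to need compaction once its fraction of garbage vectors reaches the factor. The class CompactionSketch, its methods, and the use of null entries to mark garbage are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of threshold-driven copying compaction. */
final class CompactionSketch {

    /** ASSUMED rule: compact once the garbage fraction of the block reaches the factor. */
    static boolean needsCompaction(int garbageVectors, int totalVectors, double factor) {
        if (factor < 0.0 || factor > 1.0) {
            throw new IllegalArgumentException("factor must be between 0 and 1");
        }
        return garbageVectors > 0 && garbageVectors >= factor * totalVectors;
    }

    /** Copies the vectors that are still alive into a fresh block; null entries model garbage. */
    static List<double[]> copyLiveVectors(List<double[]> block) {
        List<double[]> target = new ArrayList<>();
        for (double[] vector : block) {
            if (vector != null) {
                target.add(vector.clone());
            }
        }
        return target;
    }
}
```

Under this assumed rule, a lower factor triggers compaction earlier, which spends more time copying vectors but releases memory sooner, while a higher factor does the opposite.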
Selecting the size of the vector blocks is considered an advanced feature.
By default, the size is set automatically, depending on the amount of memory available to the JVM and the size of the vectors.
Selecting a vector block size for a field requires great care. The chosen size must:
- minimize external memory loss (have the block size as close as possible to a multiple of the page cache size)
- minimize internal memory loss (not wasting memory in the last block to store the last vectors of the chunk)
- keep a high enough number of blocks to ensure good compaction
- keep their number low enough to minimize the number of allocations (those system calls are very expensive, especially in a multithreaded environment).
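The arithmetic behind these constraints can be sketched as follows. The example assumes that block sizes and vector sizes are both counted in elements and that a block holds vectors of a single length; BlockSizingSketch and its methods are hypothetical helpers, not part of the Atoti API.

```java
/** Hypothetical helpers to reason about a candidate block size (all sizes in elements). */
final class BlockSizingSketch {

    /** Number of whole vectors that fit in one block. */
    static int vectorsPerBlock(int blockSize, int vectorLength) {
        if (vectorLength <= 0 || vectorLength > blockSize) {
            throw new IllegalArgumentException("A vector must fit in a single block");
        }
        return blockSize / vectorLength;
    }

    /** Elements at the end of each block that can never hold a whole vector. */
    static int unusedElementsPerBlock(int blockSize, int vectorLength) {
        vectorsPerBlock(blockSize, vectorLength); // validates the arguments
        return blockSize % vectorLength;
    }

    /** Number of blocks, and therefore of allocations, needed to store all the vectors. */
    static long blocksNeeded(long vectorCount, int blockSize, int vectorLength) {
        int perBlock = vectorsPerBlock(blockSize, vectorLength);
        return (vectorCount + perBlock - 1) / perBlock; // ceiling division
    }

    /** Vector slots left empty in the last, partially filled block. */
    static long emptySlotsInLastBlock(long vectorCount, int blockSize, int vectorLength) {
        int perBlock = vectorsPerBlock(blockSize, vectorLength);
        return blocksNeeded(vectorCount, blockSize, vectorLength) * perBlock - vectorCount;
    }
}
```

For instance, with a block size of 16384 and vectors of 1000 elements, each block holds 16 vectors and leaves 384 elements unused. Storing 1,000,000 vectors takes 62,500 blocks, while storing 1,000,001 vectors takes one more block in which 15 of the 16 slots stay empty.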
Swapping
This storage design makes it possible to seamlessly add efficient vector swapping capabilities to Atoti. The Operating System uses a cache where it stores the most frequently accessed portions of files in memory: the page cache. When requesting some data within a file, the OS first checks the cache: if the data is found, a copy is given to the user. Otherwise, the file is loaded into the cache, and a copy of the data is given to the user. Copying takes CPU time, hurts CPU caches, and wastes RAM with duplicated data. To minimize copies, Atoti Server relies on MappedByteBuffer (also from the Java NIO API).
This class relies internally on the mmap system call to grant access to the page cache.
When swapping vectors, blocks are stored in MappedByteBuffers, and Atoti delegates the
responsibility of writing the underlying file to the OS.
By default, files are written into the default OS temporary directory.
Because this directory is often limited in size, it is best to provide a swap directory when defining a swapped vector field.
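The memory-mapping mechanism itself can be tried out with the standard Java NIO API alone. The sketch below maps a file from a temporary directory, writes a value through the mapping and reads it back, without ever issuing an explicit write to the file. MappedBlockSketch, the directory prefix, and the file name block-0.swap are hypothetical; how Atoti names and lays out its swap files is not described here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch of a block backed by a memory-mapped file. */
final class MappedBlockSketch {

    /** Maps the first sizeInBytes bytes of the file; the OS decides when to flush them to disk. */
    static MappedByteBuffer mapBlock(Path file, int sizeInBytes) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // The mapping remains valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_WRITE, 0, sizeInBytes);
        }
    }

    /** Writes a value through the mapping and reads it back. */
    static double writeThenRead(double value) {
        try {
            Path swapDirectory = Files.createTempDirectory("swap-sketch");
            Path file = swapDirectory.resolve("block-0.swap");
            // Best-effort cleanup at JVM exit; registered first so it is deleted last.
            swapDirectory.toFile().deleteOnExit();
            MappedByteBuffer block = mapBlock(file, 4096);
            file.toFile().deleteOnExit();
            block.putDouble(0, value);
            return block.getDouble(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reads and writes go straight to the pages shared with the page cache, which is what avoids the extra copy described above.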
Swapping Advanced tuning considerations
As the amount of swapped data increases, the default settings might not provide sufficient performance. For advanced tuning, it is recommended to read up on how the page cache works on Linux, and in particular on vm.swappiness, vm.dirty_background_bytes and vm.dirty_bytes.
Atoti recommends setting these three properties to 0, a low value, and a high value, respectively.
The ActiveViam Property activeviam.vectors.swap.directory.numberFiles, set by default to 10k, controls
the maximum number of swap files created in one directory.
On Linux, the default number of available mappings, set through vm.max_map_count, is about 2^16 (65530). Large projects are very likely to hit this limit, which results in an OutOfMemoryError("Map failed").
This can be avoided by raising this kernel parameter or by increasing the block size.
The JVM might core dump if the disk or the swap directory is full, instead of throwing an OutOfMemoryError.
Transparent Huge Pages, which act as an adapter for simple use of Huge Pages, MUST be disabled, as the default allocator (SLAB) in Atoti natively supports Huge Pages.
Cleaning swapped vectors
Atoti does not collect old unused blocks from the disk, meaning that the disk usage can grow indefinitely. Thanks to the way memory-mapped files work, this is easily fixable: these files can be deleted without impacting Atoti, as the OS only permanently removes a file when the last reference to it is released. This means that one can call rm on a swapped file, and it will be effectively deleted once the last vector in the corresponding block is marked as garbage.
Working with vector fields
With Copper
Working with vectors within Copper is a seamless experience. Standard operators between vectors, or between scalars and vectors, behave naturally. One can also access the IVector-specific API by using the map function and casting its argument to IVector.
With Post Processors
Within Atoti Server, custom aggregation functions can be written by extending the dedicated AVectorAggregationFunction.
Atoti Server does not provide specialized Post-Processors to handle vectors.
Creating one’s own Post-Processors should prove easy.
For instance, a Post-Processor can calculate the 5% expected shortfall (the value that is lost on average with a 5% probability).
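Independently of the Post-Processor API, the computation itself can be sketched on a plain array of values standing in for the vector. The sketch below assumes one common convention, which may differ from the one used in a given project: the expected shortfall is the average of the lowest values that make up the given fraction of the vector. ExpectedShortfallSketch is a hypothetical class, and it only reads its input, so no copy is required.

```java
import java.util.Collections;
import java.util.PriorityQueue;

/** Hypothetical sketch: expected shortfall over a plain array standing in for a vector. */
final class ExpectedShortfallSketch {

    /** Average of the lowest values making up the given fraction (e.g. 0.05) of the input. */
    static double expectedShortfall(double[] values, double fraction) {
        if (values.length == 0 || fraction <= 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("Need a non-empty input and a fraction in (0, 1]");
        }
        int worstCount = Math.max(1, (int) Math.floor(fraction * values.length));

        // Max-heap holding the worstCount smallest values seen so far; the input is only read.
        PriorityQueue<Double> worst = new PriorityQueue<>(Collections.reverseOrder());
        for (double value : values) {
            if (worst.size() < worstCount) {
                worst.add(value);
            } else if (value < worst.peek()) {
                worst.poll();
                worst.add(value);
            }
        }

        double sum = 0.0;
        for (double value : worst) {
            sum += value;
        }
        return sum / worstCount;
    }
}
```

For a vector holding the values 1 to 100 in any order, the 5% expected shortfall under this convention is the average of 1, 2, 3, 4 and 5, that is 3.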
Warning: Note that this Post Processor does not copy the vector before using it. One must pay
attention to the operation’s impact on the vectors given as arguments.
In the following example, it is necessary to copy the vector before applying the operation, as the
plus operation modifies the vector in place.
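The hazard of in-place operations can be reproduced with plain arrays standing in for vectors. In this sketch, InPlacePlusSketch and its two methods are hypothetical; the first mimics an operation that modifies its target, the second shows the defensive copy that protects the arguments.

```java
/** Hypothetical sketch: an in-place addition and its copy-first counterpart. */
final class InPlacePlusSketch {

    /** Adds other into target element by element; target is modified. */
    static void plusInPlace(double[] target, double[] other) {
        if (target.length != other.length) {
            throw new IllegalArgumentException("Both vectors must have the same length");
        }
        for (int i = 0; i < target.length; i++) {
            target[i] += other[i];
        }
    }

    /** Returns the element-wise sum while leaving both arguments untouched. */
    static double[] plusOnCopy(double[] left, double[] right) {
        double[] result = left.clone(); // defensive copy taken before the in-place operation
        plusInPlace(result, right);
        return result;
    }
}
```

Calling plusInPlace directly on a vector received as an argument would change the stored values seen by every other reader, which is why the copy comes first.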
IVector as output type class parameter.
Customized vector
Warning: Implementing a custom vector is highly technical and prone to errors.
We strongly discourage this type of customization. This API is likely to be deprecated in future versions.
Please contact support to request a specific implementation, providing details about your use case.
It’s possible to build a custom implementation by deriving from one of two convenience abstract classes:
- AReadOnlyVector for read-only vectors.
- AVector for mutable vectors.
AllocationType must return AllocationType.DIRECT, or else the custom implementation will be replaced by a native one when stored in the datastore.