How to understand the data journey

This page introduces the data journey in Atoti, including how data is extracted, transformed, and loaded into the datastore, and how parallelization and NUMA awareness help deliver high performance at scale.

Prerequisites: A basic understanding of Atoti stores, references, and sessions is recommended before reading this page. Familiarity with Java concepts is also helpful when referring to Java-specific features such as thread-based parsing, column calculators, and tuple publishers.

Why the data journey matters

The data journey determines how efficiently Atoti ingests, prepares, and serves data for analysis. Each stage directly affects performance and scalability:

Extraction defines how quickly data enters the system
Transformation ensures records are clean and analytically relevant before storage
Loading establishes the indexes and structures used by the aggregation engine
Partitioning and NUMA policies define how the workload scales across available cores and memory nodes

What happens during data extraction?

The first step of the data journey is to extract data from external sources and convert it into an in-memory format suitable for the datastore. Atoti supports a range of source types, including CSV files, Parquet files, JDBC databases, and cloud storage providers such as Amazon S3, Microsoft Azure Blob Storage, and Google Cloud Storage. In the Atoti Java SDK, extraction is handled by dedicated threads that parse incoming data. For CSV files specifically, Atoti uses a built-in CSV parser to read and interpret records during this phase.
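
The idea of parsing sources on dedicated threads can be pictured with a small, generic sketch. It does not use the Atoti API or its built-in CSV parser: the in-memory sources, the `parse_source` and `extract` functions, and the column names are all hypothetical, and only the Python standard library is involved.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "files"; a real source would be a path, a bucket, or a JDBC query.
SOURCES = {
    "trades_1.csv": "id,desk,amount\n1,FX,100.5\n2,Rates,250.0\n",
    "trades_2.csv": "id,desk,amount\n3,FX,75.25\n",
}

def parse_source(name, text):
    """Parse one CSV source into typed tuples (id, desk, amount)."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(row["id"]), row["desk"], float(row["amount"])) for row in reader]

def extract(sources, workers=2):
    """Parse every source on a worker thread and collect the typed records."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(parse_source, name, text) for name, text in sources.items()]
        records = [record for future in futures for record in future.result()]
    return sorted(records)  # sort by id so the result does not depend on thread timing

records = extract(SOURCES)
```

Each source is parsed independently, so adding sources adds parsing tasks without changing the conversion logic. The conversion from text to typed tuples stands in for the move to an in-memory format.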

What happens during data transformation?

After extraction, the data is transformed before it is loaded into the datastore. Transformation ensures that records are clean, consistent, and enriched with any additional context required for analysis. The Atoti Python SDK does not include methods for data transformation. If transformation is required, it is handled with other Python tools and libraries before loading. In the Atoti Java SDK, two mechanisms handle data transformation:

Column calculators
Tuple publishers

Column calculators modify or enrich data during ingestion. Built-in calculators handle operations such as inserting constant values, tracking line indexes, attaching file metadata, or inserting empty values. Custom calculators can also be implemented to address specific business requirements. Tuple publishers complement column calculators by controlling how transformed records are submitted to the datastore. They allow records to be filtered and to be submitted either in batches or row by row.
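
The division of labor between the two mechanisms can be sketched in plain Python. This is an illustration of the concepts only, not the Java SDK interfaces: `constant`, `line_index`, `file_name`, `transform`, and `publish` are hypothetical names, and the sink is a simple list.

```python
# Hypothetical column calculators: each takes (row, context) and returns one extra value.
def constant(value):
    return lambda row, ctx: value

def line_index(row, ctx):
    return ctx["line"]

def file_name(row, ctx):
    return ctx["file"]

def transform(rows, file, calculators):
    """Append one calculated column per calculator to every parsed row."""
    for line, row in enumerate(rows):
        ctx = {"line": line, "file": file}
        yield tuple(row) + tuple(calc(row, ctx) for calc in calculators)

def publish(tuples, sink, keep=lambda t: True, batch_size=2):
    """Hypothetical tuple publisher: filter records, then submit them in batches."""
    batch = []
    for t in tuples:
        if not keep(t):
            continue
        batch.append(t)
        if len(batch) == batch_size:
            sink(batch)
            batch = []
    if batch:
        sink(batch)  # flush the last, possibly smaller, batch

rows = [(1, "FX", 100.5), (2, "Rates", -5.0), (3, "FX", 75.25)]
batches = []
publish(
    transform(rows, "trades.csv", [constant("EUR"), line_index, file_name]),
    sink=batches.append,
    keep=lambda t: t[2] > 0,  # drop records with a non-positive amount
)
```

Setting `batch_size=1` in this sketch corresponds to row-by-row submission. Calculators only add columns, while the publisher decides which records reach the store and in what grouping.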

What happens during data loading?

Transformed data is then loaded into Atoti’s in-memory datastore. In the Atoti Java SDK, tuple publishers govern the flow of data during this phase, translating each record into the internal structures of a store. As data is inserted, Atoti builds indexes to enable fast lookups and analytical queries, and applies duplicate handlers to ensure that key constraints are respected and store integrity is maintained.

Reliable data loading depends on how records are processed and how changes are committed to the datastore. Atoti uses transactions to group loading operations so queries always see a consistent state. This consistency holds even when loading happens in parallel. Transactions also enable performance optimizations during the initial load of an application.

Partitions are created during data loading, and the datastore routes each record to the correct partition based on the store’s partitioning configuration.

For high-cardinality hierarchies, loading time can be further reduced using virtual hierarchies. Instead of populating hierarchy members during loading, a virtual hierarchy defers member retrieval to query time, reducing both load time and memory usage.
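
The interplay of key indexes, duplicate handling, and transactions described in this section can be pictured with a toy keyed store. This is a simplified, single-threaded sketch and not how the Atoti datastore is implemented: the `Store` class, its methods, and the `on_duplicate` parameter are hypothetical.

```python
class Store:
    """Toy keyed store: a primary-key index plus transactional commits."""

    def __init__(self, key_index=0):
        self.key_index = key_index   # position of the key field in each tuple
        self.committed = {}          # key -> record, the state that queries read
        self._pending = None         # staged changes of the open transaction

    def start_transaction(self):
        self._pending = {}

    def add(self, record, on_duplicate="replace"):
        key = record[self.key_index]
        seen = key in self._pending or key in self.committed
        if seen and on_duplicate == "reject":
            raise KeyError(f"duplicate key {key!r}")
        self._pending[key] = record  # when replacing, the last write wins

    def commit(self):
        self.committed.update(self._pending)  # publish all staged records together
        self._pending = None

    def rollback(self):
        self._pending = None  # discard staged records, committed state is untouched

    def get(self, key):
        return self.committed.get(key)  # never sees uncommitted data

store = Store()
store.start_transaction()
store.add((1, "FX", 100.5))
store.add((1, "FX", 120.0))          # same key: handled by the duplicate policy
visible_before_commit = store.get(1)
store.commit()
visible_after_commit = store.get(1)
```

In this sketch a query issued before `commit()` finds nothing for key 1, and a query issued afterwards finds the replaced record, which mirrors the consistent-state guarantee described above.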

How does Atoti use parallelization and NUMA?

Atoti is designed to take full advantage of modern multi-core and multi-processor hardware. Most operations within a partition are single-threaded, but different partitions execute in parallel across multiple CPU cores, enabling efficient use of available processing resources. On Linux servers, Atoti additionally supports Non-Uniform Memory Access (NUMA), an architecture in which each processor or group of processors has its own memory bank. By aligning partitions with memory nodes through NUMA node selectors, Atoti reduces memory access latency and maximizes data locality, keeping data close to the cores that operate on it.
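
The execution model, single-threaded work inside a partition and parallel work across partitions, can be sketched as follows. This is a structural illustration only: the functions and sample data are hypothetical, Python threads do not provide real CPU parallelism for this kind of work, and NUMA placement is not modeled at all.

```python
from concurrent.futures import ThreadPoolExecutor

def aggregate_partition(records):
    """Single-threaded work on one partition: sum amounts per desk."""
    totals = {}
    for desk, amount in records:
        totals[desk] = totals.get(desk, 0.0) + amount
    return totals

def aggregate(partitions, workers=4):
    """Run one task per partition on a thread pool, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(aggregate_partition, partitions.values()))
    merged = {}
    for partial in partials:
        for desk, total in partial.items():
            merged[desk] = merged.get(desk, 0.0) + total
    return merged

# Hypothetical partitions: partition id -> (desk, amount) records
partitions = {
    0: [("FX", 100.0), ("Rates", 50.0)],
    1: [("FX", 25.0)],
    2: [("Rates", 10.0), ("Credit", 5.0)],
}
totals = aggregate(partitions)
```

Because no task touches another partition's records, the per-partition step needs no locking. Only the final merge combines results.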

How does partitioning enable parallelization?

Partitioning distributes records across multiple partitions, each of which is processed independently by a separate thread. Partitions are not predefined: they are created dynamically as data is loaded, so the order of insertion can influence their assignment and NUMA placement. A well-designed partitioning strategy creates balanced partitions, avoids resource contention, and reduces cross-node memory access. This makes partitioning a critical design decision that directly affects both loading and query performance.
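
A small sketch can make the effect of the partitioning key concrete. It assumes a simple modulo rule on a numeric key and creates partitions on first use. The `route` and `imbalance` functions and the `DESK_CODE` mapping are hypothetical and are not Atoti's partitioning configuration.

```python
def route(records, partition_key, modulo):
    """Assign each record to a partition by key value, creating partitions on first use."""
    partitions = {}  # partition id -> records, kept in creation order
    for record in records:
        pid = partition_key(record) % modulo
        partitions.setdefault(pid, []).append(record)
    return partitions

def imbalance(partitions):
    """Ratio of the largest partition to the mean size (1.0 is perfectly balanced)."""
    sizes = [len(p) for p in partitions.values()]
    return max(sizes) / (sum(sizes) / len(sizes))

DESK_CODE = {"Rates": 0, "FX": 1}  # hypothetical numeric code per desk

# Eight records: ids 0 and 4 belong to "Rates", the other six to "FX"
records = [(i, "FX" if i % 4 else "Rates") for i in range(8)]

by_id = route(records, lambda r: r[0], modulo=4)               # evenly spread key
by_desk = route(records, lambda r: DESK_CODE[r[1]], modulo=4)  # skewed, low-cardinality key
reversed_order = route(reversed(records), lambda r: r[0], modulo=4)
```

Partitioning by `id` yields four partitions of two records each, while partitioning by desk yields only two partitions, one of which holds six of the eight records. Loading the same records in reverse order creates the same partitions in a different order, which is the kind of order dependence noted above.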
