What is the ETL framework in Atoti?

The ETL framework is part of the Atoti Java SDK. It provides a built-in mechanism for extracting data from external sources, transforming it, and loading it into the in-memory datastore, using components such as sources, channels, and tuple publishers. The Atoti Python SDK does not include an ETL framework: there, data is extracted and transformed with Python tools such as pandas before being loaded into an Atoti session.

How does the ETL pipeline work in Atoti?

The ETL pipeline in the Atoti Java SDK has three steps:
  • Extract: Data is retrieved from various sources such as files, databases, or external APIs.
  • Transform: Business logic, enrichment, and data cleaning are applied.
  • Load: Transformed data is inserted into Atoti’s datastore for fast analytical queries.
This pipeline supports real-time updates and ensures consistency across analytical views.
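
The three steps above can be sketched in a few lines of Python, in the spirit of the Python-side workflow where extraction and transformation happen before data reaches Atoti. This is a minimal illustration using only the standard library, not Atoti code: the sample data, the function names, and the plain list standing in for the datastore are all assumptions made for the example.

```python
import csv
import io

# Hypothetical input; in practice this would come from a file, a database
# query, or an external API.
RAW_CSV = """trade_id,desk,notional
T1,rates,1000.50
T2,fx,
T3,rates,250
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows, convert types, normalize values."""
    cleaned = []
    for row in rows:
        if not row["notional"]:
            continue  # data cleaning: skip records with a missing notional
        cleaned.append({
            "trade_id": row["trade_id"],
            "desk": row["desk"].upper(),         # normalization
            "notional": float(row["notional"]),  # type conversion
        })
    return cleaned


def load(rows, store):
    """Load: append the transformed rows to the target store."""
    store.extend(rows)
    return len(rows)


store = []  # stand-in for the datastore; not an Atoti object
loaded = load(transform(extract(RAW_CSV)), store)
print(loaded, store)
```

Keeping each step as a separate function makes it possible to test the cleaning rules without touching the source or the target.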

What is the extraction step?

Extraction involves retrieving data from external sources and converting it into an in-memory format suitable for loading into the datastore.

Supported source types

Atoti supports the following data sources:
  • CSV files: Parsed using Atoti’s built-in CSV parser.
  • Parquet files: Parsed using Atoti’s columnar data parser.
  • JDBC databases: Data is extracted using a JDBC driver and query.
  • Cloud storage: CSV and Parquet files can be extracted from:
    • Amazon S3
    • Microsoft Azure Blob Storage
    • Google Cloud Storage
The Cloud Source API provides a unified interface for accessing remote files, including authentication and access logic.

In-memory sources

Some sources can bypass the extraction step and interact directly with the transaction manager:
  • Message brokers (e.g. Kafka)
  • In-memory objects (e.g. Arrow table)
These sources are already structured and do not require parsing or transformation before loading.

Extraction components

Atoti models extraction using:
  • Topics: Represent a path to a specific collection of data (e.g., file, directory, or database query).
  • Sources: Manage how data is loaded, either as a one-time operation or as a continuous stream.
  • Channels: Route data from sources to specific stores in the datastore.
Figure: Datastore ETL components
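
To make the relationship between topics, sources, and channels concrete, here is a conceptual sketch in plain Python. It does not use or reproduce the Atoti Java SDK API: the `Topic`, `Source`, and `Channel` classes, their fields, and the dictionary standing in for the datastore are hypothetical and exist only to show how the three roles fit together in a one-time load.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Names one collection of data; here the content is inline CSV text."""
    name: str
    content: str


@dataclass
class Channel:
    """Routes rows from one topic to one named store."""
    topic_name: str
    store_name: str


@dataclass
class Source:
    """Holds registered topics and loads them once through the channels."""
    topics: dict = field(default_factory=dict)

    def add_topic(self, topic):
        self.topics[topic.name] = topic

    def fetch(self, channels, datastore):
        for channel in channels:
            topic = self.topics[channel.topic_name]
            rows = list(csv.DictReader(io.StringIO(topic.content)))
            datastore.setdefault(channel.store_name, []).extend(rows)


source = Source()
source.add_topic(Topic("trades", "id,desk\nT1,rates\nT2,fx\n"))
source.add_topic(Topic("desks", "desk,region\nrates,EMEA\n"))

datastore = {}  # store name -> rows; a stand-in, not an Atoti datastore
source.fetch([Channel("trades", "Trades"), Channel("desks", "Desks")], datastore)
print({name: len(rows) for name, rows in datastore.items()})
```

The point of the separation is that a topic says what to read, a source says how and when to read it, and a channel says where the rows go.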

What is the transformation step?

Transformation modifies or enriches data before it is loaded into the datastore. This ensures the data is clean, consistent, and ready for analysis.

Transformation mechanisms

Atoti provides two main mechanisms:
  • Tuple publishers: Manage how data is processed before loading. They can:
    • Filter records
    • Stream data in batches or row-by-row
    • Control transaction behavior
  • Column calculators: Modify or enrich data during ingestion.
    Built-in calculators include:
    • Constant value insertion
    • Line index tracking
    • File metadata (e.g., file name, path)
    • Empty value insertion
Custom calculators can be written for specific needs, for example deriving the month name from a numerical date, so that 12/12/29 becomes December. Calculators also help annotate data with useful metadata or generate unique identifiers.

See the Atoti Java SDK documentation to find out how to work with tuple publishers and column calculators. For the Atoti Python SDK, data extraction and transformation are performed using Python tools.
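
The ideas behind column calculators and tuple publishers can be illustrated with a short, self-contained Python sketch. None of the names below come from the Atoti SDK: `constant`, `line_index`, `file_name`, `month_name`, and `publish` are hypothetical helpers. The date format is also an assumption, since 12/12/29 reads the same as month/day/year or day/month/year; the sketch uses month/day/year.

```python
from datetime import datetime

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def constant(value):
    """Calculator that inserts the same value on every row."""
    return lambda row, ctx: value


def line_index(row, ctx):
    """Calculator that records the row's position in its file."""
    return ctx["line"]


def file_name(row, ctx):
    """Calculator that records the name of the file the row came from."""
    return ctx["file"]


def month_name(column, fmt="%m/%d/%y"):
    """Custom calculator: month name from a numerical date (format assumed)."""
    def calc(row, ctx):
        return MONTHS[datetime.strptime(row[column], fmt).month - 1]
    return calc


def publish(rows, file, calculators, keep, batch_size):
    """Filter rows, add calculated columns, and yield them in batches."""
    batch = []
    for line, row in enumerate(rows):
        if not keep(row):
            continue  # filtered records never reach the store
        ctx = {"line": line, "file": file}
        enriched = dict(row)
        for column, calc in calculators.items():
            enriched[column] = calc(row, ctx)
        batch.append(enriched)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the last, possibly partial, batch


rows = [
    {"id": "T1", "date": "12/12/29", "status": "live"},
    {"id": "T2", "date": "01/15/30", "status": "cancelled"},
    {"id": "T3", "date": "03/02/30", "status": "live"},
]
calculators = {
    "source": constant("csv"),
    "line": line_index,
    "file": file_name,
    "month": month_name("date"),
}
batches = list(publish(rows, "trades.csv", calculators,
                       keep=lambda r: r["status"] == "live", batch_size=1))
print(batches)
```

An empty value could be inserted the same way with `constant(None)`. A batch size of 1 corresponds to row-by-row streaming; a larger value groups rows before they are handed on.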