ETL for SA Overview

As part of the data loading process, the ETL layer handles the data manipulation required to map the input file format onto the internal datastore structure. Depending on the file, this can include:

  • Vectorization
  • Interpolation
  • Normalization

These processes are performed through Column Calculator, Tuple Publisher and Publisher objects.

This section provides an overview of the processes, calculators and publishers used in ETL for the Standardised Approach, starting with dataflow diagrams for SBM, DRC, Trade Attributes and Other SA CSV files.

  For information on the specific ETL processes applied to individual CSV file types, see the sections that follow.


Dataflow Diagrams

Dataflow diagrams are provided for SBM ETL, DRC ETL, Trade Attributes ETL and Other SA files ETL.

Vectorization

Vectorization is fairly common in ActiveViam projects. It takes single values from multiple rows of the CSV file and combines them into a vector to save space in the datastore. Combined with native vector aggregation, this improves memory usage and aggregation speed and significantly reduces data duplication.
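As a rough illustration of the idea only (not the Accelerator's actual classes; the row and field names below are hypothetical), scalar sensitivity rows sharing a risk-factor key can be grouped into a single vector-valued row:

```java
import java.util.*;
import java.util.stream.*;

/** Illustrative sketch only: groups scalar sensitivity rows into one vector per risk factor. */
public class VectorizationSketch {

    /** A simplified input row: one scalar sensitivity at one vertex. */
    record SensitivityRow(String riskFactorId, String vertex, double value) {}

    /** Collects all rows with the same risk-factor id into a single vector of values. */
    static Map<String, double[]> vectorize(List<SensitivityRow> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                SensitivityRow::riskFactorId,
                LinkedHashMap::new,
                Collectors.collectingAndThen(
                        Collectors.toList(),
                        grouped -> grouped.stream().mapToDouble(SensitivityRow::value).toArray())));
    }

    public static void main(String[] args) {
        List<SensitivityRow> rows = List.of(
                new SensitivityRow("GIRR_EUR_OIS", "1Y", 100.0),
                new SensitivityRow("GIRR_EUR_OIS", "5Y", 250.0),
                new SensitivityRow("GIRR_EUR_OIS", "10Y", -40.0));
        // One datastore row holding [100.0, 250.0, -40.0] instead of three scalar rows.
        System.out.println(Arrays.toString(vectorize(rows).get("GIRR_EUR_OIS")));
    }
}
```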


Vectorization is performed in a number of places, as detailed below:

| Sensitivity input file | Indices of the vectors |
| --- | --- |
| Delta sensitivities (except FX and Equity) | Vertices (see SA ETL for SBM Delta) |
| Vega sensitivities (for GIRR Vega the vector is two-dimensional) | Option maturities (and, for GIRR, also the residual underlying maturities); see SA ETL for SBM Vega |
| Curvature shocked prices | Risk weights; see SA ETL for SBM Curvature |


note

Vectorization

  • Both vectorized and non-vectorized input data (and even a combination of the two) are supported. All values, whether individual or sub-vectors, are taken as input for the final vector representation of a sensitivity. An exception is made for vectors considered “pre-interpolated” (for details, see Pre-interpolated vectors).
  • Vectorization is limited to a single file, and the CSV source is configured to load an entire file into memory before attempting to vectorize. The CSV source cannot be configured to perform vectorization across files.

Interpolation

For Delta and Vega sensitivities there is a standard set of vertices used in the calculations. When the vectors of sensitivities are created, they are interpolated onto this standard set of vertices. The interpolation can be configured; the default is linear interpolation for values between FRTB vertices.
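As an illustrative sketch only (the vertex set, flat extrapolation and method choice here are assumptions for the example, not the Accelerator's configurable implementation), a linear re-interpolation onto a standard set of tenors could look like this:

```java
import java.util.Arrays;

/** Illustrative sketch only: linear re-interpolation of sensitivities onto standard vertices. */
public class InterpolationSketch {

    /**
     * Linearly interpolates values known at inputTenors (in years, ascending)
     * onto standardTenors. Values outside the input range are flat-extrapolated
     * here purely for the sake of the example.
     */
    static double[] interpolate(double[] inputTenors, double[] values, double[] standardTenors) {
        double[] out = new double[standardTenors.length];
        for (int i = 0; i < standardTenors.length; i++) {
            double t = standardTenors[i];
            if (t <= inputTenors[0]) {
                out[i] = values[0];
            } else if (t >= inputTenors[inputTenors.length - 1]) {
                out[i] = values[values.length - 1];
            } else {
                int j = 1;
                while (inputTenors[j] < t) j++;
                double w = (t - inputTenors[j - 1]) / (inputTenors[j] - inputTenors[j - 1]);
                out[i] = (1 - w) * values[j - 1] + w * values[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] inputTenors = {2.0, 7.0};      // vertices present in the input file
        double[] values      = {200.0, 700.0};  // sensitivities at those vertices
        double[] frtbTenors  = {0.25, 0.5, 1, 2, 3, 5, 10, 15, 20, 30}; // standard GIRR Delta tenors (years)
        System.out.println(Arrays.toString(interpolate(inputTenors, values, frtbTenors)));
    }
}
```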

Pre-interpolated vectors

If a row in a file contains one or more values but no corresponding set of input vertices (which would otherwise match the values one for one), we assume the row is “pre-interpolated” onto the FRTB vertices. Such a row bypasses both vectorization and interpolation, and its input values are committed to the datastore without modification. This speeds up data loading by removing redundant operations.
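A minimal sketch of this decision rule, using hypothetical field names rather than the Accelerator's actual ones:

```java
/** Illustrative sketch only: deciding whether a row can bypass vectorization and interpolation. */
public class PreInterpolationCheck {

    /**
     * A row is treated as "pre-interpolated" when it carries sensitivity values
     * but no input vertices; the values are then assumed to already sit on the
     * standard FRTB vertices and are committed unchanged.
     */
    static boolean isPreInterpolated(double[] values, double[] inputVertices) {
        return values != null && values.length > 0
                && (inputVertices == null || inputVertices.length == 0);
    }

    public static void main(String[] args) {
        System.out.println(isPreInterpolated(new double[] {1.0, 2.0}, new double[0]));          // true
        System.out.println(isPreInterpolated(new double[] {1.0, 2.0}, new double[] {1.0, 5.0})); // false
    }
}
```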

note

Interpolation

This behaviour will change going forward. Because the set of vertices can vary by jurisdiction, the interpolation will eventually need to move to query time.

Normalization

Database normalization, or simply normalization, is the process of restructuring a relational database in accordance with a series of so-called normal forms to reduce data redundancy and improve data integrity.

In the FRTB Accelerator, the Delta, Vega, Curvature, and Trade files are flat.  However, the datastore schema is normalized. Hence the ETL has to normalize the data as it is loaded.

This is mainly required for the RiskFactorDescription and UnderlyingDescription fields included within the Delta, Vega and Curvature files.

For example, when a row of a Delta sensitivities file is processed, (approximately) the first half of the row goes into the delta sensitivities store while (approximately) the second half of the row goes into the RiskFactorDescription and UnderlyingDescription stores.  There is a one-to-one correspondence (modulo vectorization) between rows in the file and rows in the sensitivities store.  However, multiple rows in the file will include the same data for the RiskFactorDescription and UnderlyingDescription (i.e. the description of the risk factor). 

It is the many-to-one references from the Delta sensitivities store to the RiskFactorDescription and UnderlyingDescription stores that provide the normalization.
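As a rough illustration (the store layouts and field names here are hypothetical, not the Accelerator's actual schema), normalizing a flat Delta row splits it in two: the sensitivity part is appended for every row, while the description part is written only once per risk factor and referenced by key:

```java
import java.util.*;

/** Illustrative sketch only: normalizing a flat Delta row into two linked store records. */
public class NormalizationSketch {

    /** Simplified flat file row: sensitivity fields plus repeated risk-factor description fields. */
    record FlatDeltaRow(String tradeId, String riskFactorId, double[] sensitivities,
                        String riskClass, String currency, String curveName) {}

    /** Target "stores": descriptions are written once and referenced many times. */
    static final List<Object[]> deltaStore = new ArrayList<>();
    static final Map<String, Object[]> riskFactorDescriptionStore = new LinkedHashMap<>();

    static void publish(FlatDeltaRow row) {
        // Sensitivity part: one store row per file row (modulo vectorization).
        deltaStore.add(new Object[] {row.tradeId(), row.riskFactorId(), row.sensitivities()});
        // Description part: de-duplicated, written only the first time the risk factor is seen.
        riskFactorDescriptionStore.putIfAbsent(row.riskFactorId(),
                new Object[] {row.riskFactorId(), row.riskClass(), row.currency(), row.curveName()});
    }

    public static void main(String[] args) {
        publish(new FlatDeltaRow("T1", "GIRR_EUR_OIS", new double[] {100}, "GIRR", "EUR", "OIS"));
        publish(new FlatDeltaRow("T2", "GIRR_EUR_OIS", new double[] {-50}, "GIRR", "EUR", "OIS"));
        // Two sensitivity rows, but only one shared description row.
        System.out.println(deltaStore.size() + " delta rows, "
                + riskFactorDescriptionStore.size() + " description row(s)");
    }
}
```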

Column Calculators

Column calculators are used to fill in fields that don’t have values in the store, and to define the available fields in a row loaded from a file without depending on the datastore setup.
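The sketch below illustrates the general idea with a hypothetical calculator contract; it is not the actual ActiveViam interface. A calculator names the column it produces and derives its value from the other fields of the parsed row:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only: a column calculator supplying a value the file does not provide. */
public class ColumnCalculatorSketch {

    /** Hypothetical calculator contract: name the column it produces and compute it from the row. */
    interface ColumnCalculator {
        String columnName();
        Object compute(Map<String, Object> row);
    }

    /** Example: default the vertices column when a row carries none (e.g. a "pre-interpolated" row). */
    static final ColumnCalculator DEFAULT_VERTICES = new ColumnCalculator() {
        @Override public String columnName() { return "Vertices"; }
        @Override public Object compute(Map<String, Object> row) {
            Object vertices = row.get("Vertices");
            return vertices != null ? vertices : new double[0]; // empty marks "pre-interpolated"
        }
    };

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>(Map.of("Sensitivities", new double[] {1.0, 2.0}));
        row.put(DEFAULT_VERTICES.columnName(), DEFAULT_VERTICES.compute(row));
        System.out.println(((double[]) row.get("Vertices")).length); // 0
    }
}
```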

Tuple Publishers

The tuple publishers used here link the collections of rows to the base store and to publishers handling other stores, sending only the relevant fields from each row (see Normalization above).

The publishers perform the vectorization and interpolation required by individual risk classes and measures, using the PublisherUtils utility class.
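The following is an illustrative sketch only; the interfaces and store/field names are hypothetical, and it does not show the actual PublisherUtils API. It shows the routing idea: each incoming tuple is split, and only the relevant fields are forwarded to the writer for each store:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Illustrative sketch only: a tuple publisher routing fields of each row to different stores. */
public class TuplePublisherSketch {

    private final Consumer<Map<String, Object>> baseStoreWriter;
    private final Consumer<Map<String, Object>> descriptionStoreWriter;

    TuplePublisherSketch(Consumer<Map<String, Object>> baseStoreWriter,
                         Consumer<Map<String, Object>> descriptionStoreWriter) {
        this.baseStoreWriter = baseStoreWriter;
        this.descriptionStoreWriter = descriptionStoreWriter;
    }

    /** Sends only the relevant subset of each tuple's fields to each downstream store. */
    void publish(List<Map<String, Object>> tuples) {
        for (Map<String, Object> tuple : tuples) {
            baseStoreWriter.accept(Map.of(
                    "TradeId", tuple.get("TradeId"),
                    "RiskFactorId", tuple.get("RiskFactorId"),
                    "Sensitivities", tuple.get("Sensitivities")));
            descriptionStoreWriter.accept(Map.of(
                    "RiskFactorId", tuple.get("RiskFactorId"),
                    "RiskClass", tuple.get("RiskClass")));
        }
    }

    public static void main(String[] args) {
        TuplePublisherSketch publisher = new TuplePublisherSketch(
                row -> System.out.println("base store        <- " + row),
                row -> System.out.println("description store <- " + row));
        publisher.publish(List.of(Map.of(
                "TradeId", "T1", "RiskFactorId", "GIRR_EUR_OIS",
                "Sensitivities", new double[] {100.0}, "RiskClass", "GIRR")));
    }
}
```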