How to use vectors

note

There are many ways to define vectors depending on the database, read the vectors documentation to know what is supported by your database.

Table with native vectors

If the table has native vectors, you can directly use them in a table without any transformation.

ClickHouse

In ClickHouse, native vector are called Array. Let's create an example table with a few of them:

create table if not exists
  TUTORIAL.NATIVE_VECTORS
(
  `PRODUCT` String NOT NULL,
  `QUANTITY` Array(Float64) NOT NULL
)
ENGINE = MergeTree()
ORDER BY (PRODUCT)
;
insert into TUTORIAL.NATIVE_VECTORS values
('product_1', [10, 20, 15, 25, 10]),
('product_2', [50, 65, 55, 30, 80]);

This table can directly be used:

final Table table = session.discoverTable("TUTORIAL", "NATIVE_VECTORS");

Table with one column per vector element

Let's create a table with 5 columns, each containing one element of a Quantity vector for a given product.

Snowflake
BigQuery
Synapse
Redshift

CREATE TRANSIENT TABLE IF NOT EXISTS MULTI_COLUMN_QUANTITY (
  PRODUCT STRING NOT NULL,
  QUANTITY_0 FLOAT NOT NULL,
  QUANTITY_1 FLOAT NOT NULL,
  QUANTITY_2 FLOAT NOT NULL,
  QUANTITY_3 FLOAT NOT NULL,
  QUANTITY_4 FLOAT NOT NULL);

INSERT INTO MULTI_COLUMN_QUANTITY SELECT 'product_1',10,20,15,25,10;
INSERT INTO MULTI_COLUMN_QUANTITY SELECT 'product_2',50,65,55,30,80;

CREATE TABLE IF NOT EXISTS tutorial.MULTI_COLUMN_QUANTITY (
  PRODUCT STRING NOT NULL,
  QUANTITY_0 FLOAT64 NOT NULL,
  QUANTITY_1 FLOAT64 NOT NULL,
  QUANTITY_2 FLOAT64 NOT NULL,
  QUANTITY_3 FLOAT64 NOT NULL,
  QUANTITY_4 FLOAT64 NOT NULL);

INSERT INTO tutorial.MULTI_COLUMN_QUANTITY VALUES
  ('product_1',10,20,15,25,10),
  ('product_2',50,65,55,30,80);

Load multi_column_quantity.csv into a Table:

IF NOT EXISTS (SELECT * from sys.external_tables WHERE name='MULTI_COLUMN_QUANTITY')
  CREATE EXTERNAL TABLE test_resources.TUTORIAL.MULTI_COLUMN_QUANTITY (
    [PRODUCT] VARCHAR(50),
    [QUANTITY_0] FLOAT,
    [QUANTITY_1] FLOAT,
    [QUANTITY_2] FLOAT,
    [QUANTITY_3] FLOAT,
    [QUANTITY_4] FLOAT
  )
  WITH (
    LOCATION = 'your-container/path/to/multi_column_quantity.csv',
    DATA_SOURCE = [YourDataSource],
    FILE_FORMAT = [CsvFormatWithHeader]
  )
GO

SET enable_case_sensitive_identifier TO true;
CREATE TABLE dev.tutorial."MULTI_COLUMN_QUANTITY" (
  "PRODUCT" VARCHAR NOT NULL,
  "QUANTITY_0" FLOAT8 NOT NULL,
  "QUANTITY_1" FLOAT8 NOT NULL,
  "QUANTITY_2" FLOAT8 NOT NULL,
  "QUANTITY_3" FLOAT8 NOT NULL,
  "QUANTITY_4" FLOAT8 NOT NULL);

INSERT INTO dev.tutorial."MULTI_COLUMN_QUANTITY" VALUES
  ('product_1',10,20,15,25,10),
  ('product_2',50,65,55,30,80);

The table looks like this in the database:

PRODUCT	QUANTITY_0	QUANTITY_1	QUANTITY_2	QUANTITY_3	QUANTITY_4
product_1	10	20	15	25	10
product_2	50	65	55	30	80

When a table has at least 2 columns with names following the pattern prefix + delimiter + number, all these columns can be converted into a single vector column (numbers must be consecutive from 0 (or 1) to N).

You simply need to create a Table on this and call vectorize on it to transform this table with one column per vector element to a table with vectors. The vectorize method accepts either a list of prefixes or a delimiter to convert columns into a single vector.

Snowflake
BigQuery
Synapse
Redshift

final Table tableWithoutVector = session
        .discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a list of prefixes
final Table tableWithVector = tableWithoutVector.vectorizeTable(List.of("QUANTITY"));

final Table tableWithoutVector = session
        .discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a delimiter
final Table tableWithVector = tableWithoutVector.vectorizeTable("_");

final Table tableWithoutVector =
        session.discoverTable(TableId.of(PROJECT_ID, "tutorial", "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a list of prefixes
final Table tableWithVector = tableWithoutVector.vectorizeTable(List.of("QUANTITY"));

final Table tableWithoutVector =
        session.discoverTable(TableId.of(PROJECT_ID, "tutorial", "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a delimiter
final Table tableWithVector = tableWithoutVector.vectorizeTable("_");

final Table tableWithoutVector =
        session.discoverTable(new SqlTableId(DATABASE_NAME, "TUTORIAL", "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a list of prefixes
final Table tableWithVector = tableWithoutVector.vectorizeTable(List.of("QUANTITY"));

final Table tableWithoutVector =
        session.discoverTable(new SqlTableId(DATABASE_NAME, "TUTORIAL", "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a delimiter
final Table tableWithVector = tableWithoutVector.vectorizeTable("_");

final Table tableWithoutVector = session
        .discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a list of prefixes
final Table tableWithVector = tableWithoutVector.vectorizeTable(List.of("QUANTITY"));

final Table tableWithoutVector = session
        .discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a delimiter
final Table tableWithVector = tableWithoutVector.vectorizeTable("_");

In this example, both code snippets produce the same result.

PRODUCT	QUANTITY
product_1	[10,20,15,25,10]
product_2	[50,65,55,30,80]

It is also possible to have this conversion done automatically. It can be enabled for the whole project by setting the ActiveViam property ActiveViamProperty.ENABLE_DIRECTQUERY_AUTOVECTORIZER to true.

With this property enabled, the auto-vectorizer will detect columns sharing the same prefix and will convert them into a vector.

The behavior of the auto-vectorization can be controlled by two more properties:

ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_DELIMITER: sets the delimiter as defined above.
There is no delimiter by default.
ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_THRESHOLD: sets minimum number of columns with the same prefix to trigger the auto-vectorization.
The default value is 50.

Read the properties documentation for more details about ActiveViam properties.

With the ActiveViam properties set to enable the autovectorizer, there is nothing more to do after the discovery.

Snowflake
BigQuery
Synapse
Redshift

PropertyRegistry.set(ActiveViamProperty.ENABLE_DIRECTQUERY_AUTOVECTORIZER, true);
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_DELIMITER, "_");
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_THRESHOLD, 5);

final Table table = session
        .discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_COLUMN_QUANTITY"));

PropertyRegistry.set(ActiveViamProperty.ENABLE_DIRECTQUERY_AUTOVECTORIZER, true);
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_DELIMITER, "_");
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_THRESHOLD, 5);

final Table table = session.discoverTable(TableId.of(PROJECT_ID, "tutorial", "MULTI_COLUMN_QUANTITY"));

PropertyRegistry.set(ActiveViamProperty.ENABLE_DIRECTQUERY_AUTOVECTORIZER, true);
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_DELIMITER, "_");
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_THRESHOLD, 5);

final Table table =
        session.discoverTable(new SqlTableId(DATABASE_NAME, "TUTORIAL", "MULTI_COLUMN_QUANTITY"));

PropertyRegistry.set(ActiveViamProperty.ENABLE_DIRECTQUERY_AUTOVECTORIZER, true);
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_DELIMITER, "_");
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_THRESHOLD, 5);

final Table table = session
        .discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_COLUMN_QUANTITY"));

Table with one index and value per row

In this example we will use the following table:

PRODUCT	INDEX	QUANTITY
product_1	0	10
product_1	1	20
product_1	2	15
product_1	3	25
product_1	4	10
product_2	0	50
product_2	1	65
product_2	2	55
product_2	3	30
product_2	4	80

This table represents 2 vectors split over multiple lines.

Let's create the table and feed it.

Snowflake
BigQuery
Synapse
Redshift

CREATE TRANSIENT TABLE IF NOT EXISTS MULTI_ROW_QUANTITY ( PRODUCT STRING NOT NULL, INDEX INTEGER NOT NULL, QUANTITY FLOAT NOT NULL);

INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_1',0,10;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_1',1,20;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_1',2,15;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_1',3,25;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_1',4,10;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_2',0,50;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_2',1,65;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_2',2,55;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_2',3,30;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_2',4,80;

CREATE TABLE IF NOT EXISTS tutorial.MULTI_ROW_QUANTITY (
  PRODUCT STRING NOT NULL,
  INDEX INT64 NOT NULL,
  QUANTITY FLOAT64 NOT NULL);

INSERT INTO tutorial.MULTI_ROW_QUANTITY VALUES
  ('product_1',0,10),
  ('product_1',1,20),
  ('product_1',2,15),
  ('product_1',3,25),
  ('product_1',4,10),
  ('product_2',0,50),
  ('product_2',1,65),
  ('product_2',2,55),
  ('product_2',3,30),
  ('product_2',4,80);

IF NOT EXISTS (SELECT * from sys.external_tables WHERE name='MULTI_ROW_QUANTITY')
  CREATE EXTERNAL TABLE test_resources.TUTORIAL.MULTI_ROW_QUANTITY (
    [PRODUCT] VARCHAR(50),
    [INDEX] INTEGER,
    [QUANTITY] FLOAT
  )
  WITH (
    LOCATION = 'tckstore/tutorial/vectors/multi_row_quantity.csv',
    DATA_SOURCE = "TckOnDemand",
    FILE_FORMAT = "TCKDelimitedTextFormat"
  )
GO;

SET enable_case_sensitive_identifier TO true;
CREATE TABLE dev.tutorial."MULTI_ROW_QUANTITY" (
  "PRODUCT" VARCHAR NOT NULL,
  "INDEX" INT8 NOT NULL,
  "QUANTITY" FLOAT8 NOT NULL);

INSERT INTO dev.tutorial."MULTI_ROW_QUANTITY" VALUES
  ('product_1',0,10),
  ('product_1',1,20),
  ('product_1',2,15),
  ('product_1',3,25),
  ('product_1',4,10),
  ('product_2',0,50),
  ('product_2',1,65),
  ('product_2',2,55),
  ('product_2',3,30),
  ('product_2',4,80);

You simply need to create a Table on this and call vectorize on it to transform this table with row by row values to a table with vectors.

Snowflake
BigQuery
Synapse
Redshift

final Table multiRowTable = session
        .discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_ROW_QUANTITY"));
final Table tableWithVector = multiRowTable.vectorizeTable("INDEX", List.of("QUANTITY"));

final Table multiRowTable =
        session.discoverTable(TableId.of(PROJECT_ID, "tutorial", "MULTI_ROW_QUANTITY"));
final Table tableWithVector = multiRowTable.vectorizeTable("INDEX", List.of("QUANTITY"));

final Table multiRowTable =
        session.discoverTable(new SqlTableId(DATABASE_NAME, "TUTORIAL", "MULTI_ROW_QUANTITY"));
final Table tableWithVector = multiRowTable.vectorizeTable("INDEX", List.of("QUANTITY"));

final Table multiRowTable = session
        .discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_ROW_QUANTITY"));
final Table tableWithVector = multiRowTable.vectorizeTable("INDEX", List.of("QUANTITY"));

You now have a table with a Vector column which will be equivalent to the following table:

PRODUCT	QUANTITY
product_1	[10,20,15,25,10]
product_2	[50,65,55,30,80]

note

Table with native vectors​

Table with one column per vector element​

Table with one index and value per row​

Table with native vectors

Table with one column per vector element

Table with one index and value per row