How to use vectors
note
There are many ways to define vectors depending on the database, read the vectors documentation to know what is supported by your database.
Table with native vectors
If the table has native vectors, you can directly use them in a table without any transformation.
- ClickHouse
In ClickHouse, native vector are called Array
.
Let's create an example table with a few of them:
create table if not exists
TUTORIAL.NATIVE_VECTORS
(
`PRODUCT` String NOT NULL,
`QUANTITY` Array(Float64) NOT NULL
)
ENGINE = MergeTree()
ORDER BY (PRODUCT)
;
insert into TUTORIAL.NATIVE_VECTORS values
('product_1', [10, 20, 15, 25, 10]),
('product_2', [50, 65, 55, 30, 80]);
This table can directly be used:
final Table table = session.discoverTable("TUTORIAL", "NATIVE_VECTORS");
Table with one column per vector element
Let's create a table with 5 columns, each containing one element of a Quantity vector for a given product.
- Snowflake
- BigQuery
- Synapse
- Redshift
CREATE TRANSIENT TABLE IF NOT EXISTS MULTI_COLUMN_QUANTITY (
PRODUCT STRING NOT NULL,
QUANTITY_0 FLOAT NOT NULL,
QUANTITY_1 FLOAT NOT NULL,
QUANTITY_2 FLOAT NOT NULL,
QUANTITY_3 FLOAT NOT NULL,
QUANTITY_4 FLOAT NOT NULL);
INSERT INTO MULTI_COLUMN_QUANTITY SELECT 'product_1',10,20,15,25,10;
INSERT INTO MULTI_COLUMN_QUANTITY SELECT 'product_2',50,65,55,30,80;
CREATE TABLE IF NOT EXISTS tutorial.MULTI_COLUMN_QUANTITY (
PRODUCT STRING NOT NULL,
QUANTITY_0 FLOAT64 NOT NULL,
QUANTITY_1 FLOAT64 NOT NULL,
QUANTITY_2 FLOAT64 NOT NULL,
QUANTITY_3 FLOAT64 NOT NULL,
QUANTITY_4 FLOAT64 NOT NULL);
INSERT INTO tutorial.MULTI_COLUMN_QUANTITY VALUES
('product_1',10,20,15,25,10),
('product_2',50,65,55,30,80);
Load multi_column_quantity.csv into a Table:
IF NOT EXISTS (SELECT * from sys.external_tables WHERE name='MULTI_COLUMN_QUANTITY')
CREATE EXTERNAL TABLE test_resources.TUTORIAL.MULTI_COLUMN_QUANTITY (
[PRODUCT] VARCHAR(50),
[QUANTITY_0] FLOAT,
[QUANTITY_1] FLOAT,
[QUANTITY_2] FLOAT,
[QUANTITY_3] FLOAT,
[QUANTITY_4] FLOAT
)
WITH (
LOCATION = 'your-container/path/to/multi_column_quantity.csv',
DATA_SOURCE = [YourDataSource],
FILE_FORMAT = [CsvFormatWithHeader]
)
GO
SET enable_case_sensitive_identifier TO true;
CREATE TABLE dev.tutorial."MULTI_COLUMN_QUANTITY" (
"PRODUCT" VARCHAR NOT NULL,
"QUANTITY_0" FLOAT8 NOT NULL,
"QUANTITY_1" FLOAT8 NOT NULL,
"QUANTITY_2" FLOAT8 NOT NULL,
"QUANTITY_3" FLOAT8 NOT NULL,
"QUANTITY_4" FLOAT8 NOT NULL);
INSERT INTO dev.tutorial."MULTI_COLUMN_QUANTITY" VALUES
('product_1',10,20,15,25,10),
('product_2',50,65,55,30,80);
The table looks like this in the database:
PRODUCT | QUANTITY_0 | QUANTITY_1 | QUANTITY_2 | QUANTITY_3 | QUANTITY_4 |
---|---|---|---|---|---|
product_1 | 10 | 20 | 15 | 25 | 10 |
product_2 | 50 | 65 | 55 | 30 | 80 |
When a table has at least 2 columns with names following the pattern prefix + delimiter + number
,
all these columns can be converted into a single vector column (numbers must be consecutive from 0 (or 1) to N).
You simply need to create a Table on this and call vectorize
on it to transform this table with one column per vector element to a table with vectors.
The vectorize
method accepts either a list of prefixes or a delimiter to convert columns into a single vector.
- Snowflake
- BigQuery
- Synapse
- Redshift
final Table tableWithoutVector = session
.discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a list of prefixes
final Table tableWithVector = tableWithoutVector.vectorizeTable(List.of("QUANTITY"));
final Table tableWithoutVector = session
.discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a delimiter
final Table tableWithVector = tableWithoutVector.vectorizeTable("_");
final Table tableWithoutVector =
session.discoverTable(TableId.of(PROJECT_ID, "tutorial", "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a list of prefixes
final Table tableWithVector = tableWithoutVector.vectorizeTable(List.of("QUANTITY"));
final Table tableWithoutVector =
session.discoverTable(TableId.of(PROJECT_ID, "tutorial", "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a delimiter
final Table tableWithVector = tableWithoutVector.vectorizeTable("_");
final Table tableWithoutVector =
session.discoverTable(new SqlTableId(DATABASE_NAME, "TUTORIAL", "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a list of prefixes
final Table tableWithVector = tableWithoutVector.vectorizeTable(List.of("QUANTITY"));
final Table tableWithoutVector =
session.discoverTable(new SqlTableId(DATABASE_NAME, "TUTORIAL", "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a delimiter
final Table tableWithVector = tableWithoutVector.vectorizeTable("_");
final Table tableWithoutVector = session
.discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a list of prefixes
final Table tableWithVector = tableWithoutVector.vectorizeTable(List.of("QUANTITY"));
final Table tableWithoutVector = session
.discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_COLUMN_QUANTITY"));
// Manually vectorize the table after discovery with a delimiter
final Table tableWithVector = tableWithoutVector.vectorizeTable("_");
In this example, both code snippets produce the same result.
PRODUCT | QUANTITY |
---|---|
product_1 | [10,20,15,25,10] |
product_2 | [50,65,55,30,80] |
It is also possible to have this conversion done automatically.
It can be enabled for the whole project by setting the ActiveViam property ActiveViamProperty.ENABLE_DIRECTQUERY_AUTOVECTORIZER
to true
.
With this property enabled, the auto-vectorizer will detect columns sharing the same prefix and will convert them into a vector.
The behavior of the auto-vectorization can be controlled by two more properties:
ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_DELIMITER
: sets the delimiter as defined above.
There is no delimiter by default.ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_THRESHOLD
: sets minimum number of columns with the same prefix to trigger the auto-vectorization.
The default value is 50.
Read the properties documentation for more details about ActiveViam properties.
With the ActiveViam properties set to enable the autovectorizer, there is nothing more to do after the discovery.
- Snowflake
- BigQuery
- Synapse
- Redshift
PropertyRegistry.set(ActiveViamProperty.ENABLE_DIRECTQUERY_AUTOVECTORIZER, true);
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_DELIMITER, "_");
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_THRESHOLD, 5);
final Table table = session
.discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_COLUMN_QUANTITY"));
PropertyRegistry.set(ActiveViamProperty.ENABLE_DIRECTQUERY_AUTOVECTORIZER, true);
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_DELIMITER, "_");
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_THRESHOLD, 5);
final Table table = session.discoverTable(TableId.of(PROJECT_ID, "tutorial", "MULTI_COLUMN_QUANTITY"));
PropertyRegistry.set(ActiveViamProperty.ENABLE_DIRECTQUERY_AUTOVECTORIZER, true);
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_DELIMITER, "_");
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_THRESHOLD, 5);
final Table table =
session.discoverTable(new SqlTableId(DATABASE_NAME, "TUTORIAL", "MULTI_COLUMN_QUANTITY"));
PropertyRegistry.set(ActiveViamProperty.ENABLE_DIRECTQUERY_AUTOVECTORIZER, true);
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_DELIMITER, "_");
PropertyRegistry.set(ActiveViamProperty.DIRECTQUERY_AUTOVECTORIZER_THRESHOLD, 5);
final Table table = session
.discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_COLUMN_QUANTITY"));
Table with one index and value per row
In this example we will use the following table:
PRODUCT | INDEX | QUANTITY |
---|---|---|
product_1 | 0 | 10 |
product_1 | 1 | 20 |
product_1 | 2 | 15 |
product_1 | 3 | 25 |
product_1 | 4 | 10 |
product_2 | 0 | 50 |
product_2 | 1 | 65 |
product_2 | 2 | 55 |
product_2 | 3 | 30 |
product_2 | 4 | 80 |
This table represents 2 vectors split over multiple lines.
Let's create the table and feed it.
- Snowflake
- BigQuery
- Synapse
- Redshift
CREATE TRANSIENT TABLE IF NOT EXISTS MULTI_ROW_QUANTITY ( PRODUCT STRING NOT NULL, INDEX INTEGER NOT NULL, QUANTITY FLOAT NOT NULL);
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_1',0,10;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_1',1,20;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_1',2,15;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_1',3,25;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_1',4,10;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_2',0,50;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_2',1,65;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_2',2,55;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_2',3,30;
INSERT INTO MULTI_ROW_QUANTITY SELECT 'product_2',4,80;
CREATE TABLE IF NOT EXISTS tutorial.MULTI_ROW_QUANTITY (
PRODUCT STRING NOT NULL,
INDEX INT64 NOT NULL,
QUANTITY FLOAT64 NOT NULL);
INSERT INTO tutorial.MULTI_ROW_QUANTITY VALUES
('product_1',0,10),
('product_1',1,20),
('product_1',2,15),
('product_1',3,25),
('product_1',4,10),
('product_2',0,50),
('product_2',1,65),
('product_2',2,55),
('product_2',3,30),
('product_2',4,80);
IF NOT EXISTS (SELECT * from sys.external_tables WHERE name='MULTI_ROW_QUANTITY')
CREATE EXTERNAL TABLE test_resources.TUTORIAL.MULTI_ROW_QUANTITY (
[PRODUCT] VARCHAR(50),
[INDEX] INTEGER,
[QUANTITY] FLOAT
)
WITH (
LOCATION = 'tckstore/tutorial/vectors/multi_row_quantity.csv',
DATA_SOURCE = "TckOnDemand",
FILE_FORMAT = "TCKDelimitedTextFormat"
)
GO;
SET enable_case_sensitive_identifier TO true;
CREATE TABLE dev.tutorial."MULTI_ROW_QUANTITY" (
"PRODUCT" VARCHAR NOT NULL,
"INDEX" INT8 NOT NULL,
"QUANTITY" FLOAT8 NOT NULL);
INSERT INTO dev.tutorial."MULTI_ROW_QUANTITY" VALUES
('product_1',0,10),
('product_1',1,20),
('product_1',2,15),
('product_1',3,25),
('product_1',4,10),
('product_2',0,50),
('product_2',1,65),
('product_2',2,55),
('product_2',3,30),
('product_2',4,80);
You simply need to create a Table on this and call vectorize
on it to transform this table with row by row values to a table with vectors.
- Snowflake
- BigQuery
- Synapse
- Redshift
final Table multiRowTable = session
.discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_ROW_QUANTITY"));
final Table tableWithVector = multiRowTable.vectorizeTable("INDEX", List.of("QUANTITY"));
final Table multiRowTable =
session.discoverTable(TableId.of(PROJECT_ID, "tutorial", "MULTI_ROW_QUANTITY"));
final Table tableWithVector = multiRowTable.vectorizeTable("INDEX", List.of("QUANTITY"));
final Table multiRowTable =
session.discoverTable(new SqlTableId(DATABASE_NAME, "TUTORIAL", "MULTI_ROW_QUANTITY"));
final Table tableWithVector = multiRowTable.vectorizeTable("INDEX", List.of("QUANTITY"));
final Table multiRowTable = session
.discoverTable(new SqlTableId(TEST_DATABASE_NAME, TUTORIAL_SCHEMA_NAME, "MULTI_ROW_QUANTITY"));
final Table tableWithVector = multiRowTable.vectorizeTable("INDEX", List.of("QUANTITY"));
You now have a table with a Vector column which will be equivalent to the following table:
PRODUCT | QUANTITY |
---|---|
product_1 | [10,20,15,25,10] |
product_2 | [50,65,55,30,80] |