The API
The entry point of the CoPPer API is the BuildingContext. This object contains all the methods needed to declare your calculations. You can retrieve it when building your cube via the cube description builder and its withDescriptionPostProcessor method. You can thus write:
.withCube("MyCube", c -> c
// Define your dimensions, hierarchies and levels here (if you don't this snippet won't compile)
.withDescriptionPostProcessor(StartBuilding.copperCalculations().withDefinition(buildingContext -> {
// Here you can describe your CoPPer measures with the building context.
}).build()
)
)
As the above snippet shows, all the CoPPer measures are defined in a single lambda; from now on, all examples will only show the inside of this lambda for brevity.
So, now that we have access to a BuildingContext instance, let's explore it a little more. This instance plays the same role as Spark's SparkContext: we use it to create the datasets that represent the calculations we apply to the data.
Datasets
Each dataset corresponds to a table with columns and rows, and provides an API to apply transformations to this table, each producing a new dataset (datasets are immutable). Like Spark's, CoPPer datasets are lazy data structures: they do not apply the calculations as soon as you write them, they just remember which transformations were declared and apply them only when a query is performed.
The facts dataset
The first dataset that you will want to use in your calculations is buildingContext.createDatasetFromFacts(). This dataset represents the facts that are sent by the datastore to your cube via the selection. Its columns are all the fields of the selection, so their names are the selection field names (not their expressions), and it has one line per fact sent by the datastore to ActivePivot. This dataset thus contains data from all the stores used in the selection, not just the base store.
The store datasets
In addition to the facts dataset, you can create a dataset based on any store of the datastore via buildingContext.createDatasetFromStore(String storeName). This dataset represents the records stored in the given store, plus any reference starting from this store. Its columns are therefore all the fields of the given store (the column name is the field name), plus all the fields of the referenced stores (the column name is the full path to the field from the store represented by the dataset).
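For example, a minimal sketch (the "Trades" store name here is hypothetical):
// One line per fact sent through the selection; the columns are the selection fields
buildingContext.createDatasetFromFacts();
// One line per record of the "Trades" store; the columns are the store's own fields
// plus the fields reached through its references
buildingContext.createDatasetFromStore("Trades");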
Dataset Operators
Now that we can create some datasets, let's see the main transformations you can apply to them. These transformations take one or multiple datasets as input and create a dataset as output. They are very similar to Spark's dataset operators and provide a fluent API so calculations can be chained easily.
Aggregate
The aggregate operator performs an aggregation over one or multiple columns of the dataset, creating a dataset that contains only a single line. Inside the operator one can specify multiple aggregation functions available in the Columns class, like Columns.sum, Columns.min, Columns.max, or even your own aggregation function with Columns.customAgg. These methods take as argument the name of the column to aggregate and create the corresponding aggregated column. You can name the created column using .as(), otherwise the output column will have a generated name. So for example:
The input
c1 | c2 |
---|---|
3 | 1 |
4 | 1 |
5 | 9 |
Transformed via
dataset.agg(
    Columns.sum("c1").as("c1.SUM"),
    Columns.max("c2").as("cm"))
Produces
c1.SUM | cm |
---|---|
12 | 9 |
GroupBy/Aggregate
The CoPPer API mimics Spark's API because it is a well-thought-out API that matches the calculations ActivePivot can perform. Spark's API is in turn very close to the SQL language and offers similar keywords and syntax, so that the SQL version and the Spark version of the same computation are almost identical. If you have already written some advanced SQL queries you may be familiar with the API; if not, you might be surprised by some APIs that would have been written completely differently had we not tried to stay close to SQL. GroupBy/Aggregate is one example: calling groupBy() on a dataset forces the user to call .agg() just after, whereas a regular Java API would have put the two calls in a single method instead of introducing an intermediate step.
Before calling aggregate, one can call .groupBy(groups). This behaves exactly like in Spark or SQL: the aggregation stops at the groups and produces one line per value of the groups.
The input
country | city | popk |
---|---|---|
France | Paris | 2206 |
France | Lyon | 506 |
France | Marseille | 855 |
Germany | Berlin | 3711 |
Germany | Hamburg | 1805 |
Transformed via
.groupBy("country")
.agg(Columns.sum("popk").as("popk"))
Produces
country | popk |
---|---|
France | 3567 |
Germany | 5516 |
As you probably noticed, the city column is not in the resulting dataset. Aggregating with groups only keeps the columns that were used as groups and the aggregated columns. Similarly, the aggregate operator that we saw earlier keeps only the aggregated columns in its result.
You can groupBy multiple columns at once; this keeps all the existing combinations of values of these columns.
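For instance, a minimal sketch reusing the population dataset above:
.groupBy("country", "city")
.agg(Columns.sum("popk").as("popk"))
This keeps one line per (country, city) combination present in the input.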
Select
The select operator is similar to its SQL counterpart. Its role is to keep only some columns of the dataset and discard the others. This is useful to remove intermediate columns that we don't want to turn into measures in the cube. It can also be used to perform calculations on the columns, like in SQL.
The input
a | b |
---|---|
3 | 1 |
4 | 1 |
5 | 9 |
Transformed via
.select(
    Columns.col("a").plus(Columns.col("b")).as("a+b"),
    Columns.col("a").plus(2).as("a+2"),
    Columns.col("a").map((Integer l) -> Math.pow(l.doubleValue(), 2)).as("a²")
)
Produces
a+b | a+2 | a² |
---|---|---|
4 | 5 | 9 |
5 | 6 | 16 |
14 | 7 | 25 |
We just showed above multiple ways of performing calculations on columns.
Note that you can often achieve the same computation either by using the column API directly with methods like .plus, .multiply, .divide, .equalTo, .gt, etc., or by using a lambda in a map method. You should always try to use the column API to perform your calculation when possible: because CoPPer cannot understand the code of your lambda, it will not be able to infer types, optimize your calculation, or in some cases even understand it. So, use lambdas only when you are sure there is no API equivalent.
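For example, both of the following columns compute a + 2, but only the first expression is transparent to CoPPer (a minimal sketch reusing the column a from the example above):
// Preferred: CoPPer can analyze and optimize this expression
Columns.col("a").plus(2).as("a+2")
// Works, but the content of the lambda is opaque to CoPPer
Columns.col("a").map((Integer a) -> a + 2).as("a+2")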
The .withColumn transformation can also be used instead of .select. The difference is that it keeps all the existing columns of the dataset and just adds the ones you provide as arguments. So in the above example the produced dataset would still have the columns a and b.
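For instance, a minimal sketch of the same calculation using .withColumn:
.withColumn(Columns.col("a").plus(Columns.col("b")).as("a+b"))
The result keeps the columns a and b, plus the new column a+b.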
Filter
The filter operator is, like in SQL, used to remove lines from the dataset based on a condition.
The input
a | b |
---|---|
3 | 1 |
4 | 1 |
5 | 9 |
Transformed via
.filter(Columns.col("a").gt(4))
Produces
a | b |
---|---|
5 | 9 |
The .gt operator used here is strict, the greater-or-equal operator being .geq. Operators such as equalTo, notEqualTo, lt, gt, leq, geq and isin are available, as well as the boolean operators and and or, but one can also implement custom logic by using the .mapToBoolean() method provided in the API.
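For example, to keep only the rows where a is odd, a condition that the comparison operators cannot express (a sketch; the exact lambda signature of .mapToBoolean() is an assumption here):
.filter(Columns.col("a").mapToBoolean((Integer a) -> a % 2 == 1))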
Join
The join operator is again very close to its SQL counterpart. It is used to join two datasets together based on the values of their columns. Although SQL and Spark allow almost any join condition, CoPPer restricts joins to equi-joins, meaning that rows from the two datasets are only joined if they have exactly the same value in the columns used as join columns. In addition, the current implementation behaves like an inner join.
When performing a join with CoPPer, you specify a ColumnMapping. This mapping gives the names of the columns on the left and right datasets that must hold equal values for two rows to be joined.
The input left
pnl | currency |
---|---|
2 | EUR |
3 | USD |
with the input right
currency_from | currency_to | rate |
---|---|---|
EUR | USD | 1.2 |
EUR | GBP | 0.9 |
USD | GBP | 0.7 |
Transformed via
left.join(right, ColumnMapping.mapping("currency").to("currency_from"))
Produces
pnl | currency | currency_from | currency_to | rate |
---|---|---|---|---|
2 | EUR | EUR | USD | 1.2 |
2 | EUR | EUR | GBP | 0.9 |
3 | USD | USD | GBP | 0.7 |
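The joined dataset can then be transformed like any other dataset. For instance, a sketch converting the pnl with the rate brought in by the join (assuming .multiply accepts a column argument, like .plus does):
left.join(right, ColumnMapping.mapping("currency").to("currency_from"))
    .withColumn(Columns.col("pnl").multiply(Columns.col("rate")).as("pnl.converted"))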
Column API
We briefly saw, among the dataset operators, that there are many methods that can be called on a Column object. This column API contains many operators beyond .as, .plus, .map, etc.; they are called column operators and will be presented directly through examples in the Examples section below.
Window functions
Window functions are again a concept coming from the SQL world, integrated into Spark and into CoPPer as well. They are explained in detail here, but in short: they provide a way to generate a value for a row of a dataset that depends on the values of the surrounding rows.
CoPPer uses the same API as Spark for window functions, so every column has a .over(WindowSpec) method. The WindowSpec parameter is a window specification that contains the partitions and the order. The window partitions are used to limit the set of rows that are used to compute a value for a given row: only the rows of the same partition will be used. The window order gives the order in which the surrounding rows are read to produce the value.
This specification is created with com.activeviam.copper.operator.column.window.Window.partitionBy(String...), followed by com.activeviam.copper.columns.WindowSpecWithoutOrder.orderBy(String...). Both of these methods take column names as arguments, and either list can be empty. They will be described in depth in the Examples section below.
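For example, a sketch of a running sum of pnl per country, ordered by date (the dataset columns are illustrative, and using Columns.sum as a windowed column is an assumption here):
.withColumn(
    Columns.sum("pnl")
        .over(Window.partitionBy("country").orderBy("date"))
        .as("pnl.cumulative"))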
Context Values
CoPPer supports the usage of context values in its calculations.
They are made accessible to the user via context.contextValue(Class<? extends IContextValue> ContextValueInterface), which creates a column that can be used in any calculation.
CoPPer needs to be aware of the list of context values that will be used in calculations, so you need to pass all of your context value interfaces as vararg parameters to the constructor of CopperActivePivotDescriptionPostProcessor.
We will see how they can be used in calculations in the examples below.
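As a preview, a hypothetical sketch (ScalingFactor is an invented context value interface, and using .multiply with a context value column is an assumption here):
.withColumn(
    Columns.col("pnl")
        .multiply(context.contextValue(ScalingFactor.class))
        .as("pnl.scaled"))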
Combining Columns
We saw with the select (or .withColumn) operator that we can easily perform operations between multiple columns using .plus, .map, etc. The column API provides many ways to perform operations on one or multiple columns, and the .map operator lets you apply your own lambda to the content of a column. But none of these tools lets you apply a lambda to multiple columns at once. This is where we need to combine columns.
When we combine columns, we create an intermediate object of type MultipleColumns on which we can call mapping methods that take multiple arguments at once. For instance:
Columns.col("A")
.combineWith(Columns.col("B"))
.map(array -> f(array.readInt(0), array.readDouble(1)))
Here we call f(a, b) for each pair of values a and b of the columns A and B. f is an external function that takes an int and a double as input and whose implementation is not important here.
The important part is the call to .map: this method takes a lambda as parameter, and this lambda takes as input an IArrayReader. This array reader represents one row containing only the combined columns, in the order you combined them, with indexes starting at 0. So in this example the first item of the row is the content of column A, and the second one is the content of column B. This array representation allows us to provide the readInt, readLong and readDouble methods to prevent boxing in your lambda, which is why we don't give you an Object array as argument.
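A combined column is a regular column, so it can be plugged into operators such as .withColumn. A sketch with hypothetical price and quantity columns (and assuming the combined column can be named with .as like any other column):
.withColumn(
    Columns.col("price")
        .combineWith(Columns.col("quantity"))
        // readDouble(0) is the price, readLong(1) the quantity, in combination order
        .map(row -> row.readDouble(0) * row.readLong(1))
        .as("notional"))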
The reason why we have this column-combining mechanism instead of the dataset-level .map(lambda) found in Spark is that .map(lambda) makes it impossible for Spark to know which columns are actually read inside the lambda. This is a major issue for CoPPer: since we need to translate the calculation into a multidimensional cube, we need to know which dimensions are used and which are not. Combining columns makes you explicitly tell us which columns you read, so that we can translate your calculation efficiently. So, don't combine columns that you don't use in your lambda: this could make your calculations a lot slower than expected.
When your calculations deal with context value columns, you often need to combine these columns with other columns in this way; see the Examples section below.