Skip to main content

Joining isolated stores in Copper

In databases, data is typically stored in different tables, and it is often necessary to combine data from these tables. In traditional SQL databases, this is done using the JOIN operator. In Atoti, this can be done in two ways: Datastore References and the Copper API. In this article we will explain how to use the Copper API and the JOIN operation to combine data from different stores.

We advise you to read the following article of the Copper documentation before proceeding as it contains important concepts relative to the Copper API.

Join Operation in a Nutshell

In this section we will explain the concept of JOIN operation in traditional SQL databases. If you are already familiar with this concept, you can skip to the next section.

Suppose our database has two tables: Customers and Orders.

Customers:

customer_idcustomer_name
1John Doe
2Jane Smith
3Alice Brown

Orders:

order_idcustomer_idorder_dateorder_total
10112024-08-01300.0
10232024-08-03250.0
10312024-08-071488.0

If we perform a JOIN operation between Customers and Orders:

SELECT * FROM Customers 
JOIN Orders ON Customers.customer_id = Orders.customer_id;

Result:

customer_idcustomer_nameorder_idcustomer_idorder_dateorder_total
1John Doe10112024-08-01300.0
1John Doe10312024-08-071488.0
3Alice Brown10232024-08-03250.0

In this example, the JOIN operation uses customer_id to find matching records from Orders and from Customers, where customer_id matches. It then combines them into one result set.

Let's break down another example: if we want to know which orders were placed by customer "John Doe", we need to join data from both tables using the customer_id field.

In SQL, this is done as follows:

SELECT * FROM Customers
JOIN Orders ON Customers.customer_id = Orders.customer_id
WHERE Customers.customer_name = 'John Doe';

Result:

customer_idcustomer_nameorder_idcustomer_idorder_dateorder_total
1John Doe10112024-08-01300.0
1John Doe10312024-08-071488.0

Explanation:

  • SELECT * FROM Customers: Selects all available fields from the Customers table.
  • JOIN Orders ON Customers.customer_id = Orders.customer_id: Joins the Customers and Orders tables on the customer_id field.
  • WHERE Customers.customer_name = 'John Doe': Filters results to include only those where customer_name is ' John Doe'.

Types of JOIN Operations

SQL supports several types of JOIN operations. The examples we saw in the previous section, used an INNER JOIN, which returns records that have matching values in both tables. This is the most common type of JOIN.

There are several different types of JOIN in SQL. They are INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN.

The LEFT JOIN operation produces the same result set as INNER JOIN plus all of the records from first table (or the left table). For the records that have no matching records from the second table (or right table), NULL values are returned.

Example of LEFT JOIN:

SELECT * FROM Customers
LEFT JOIN Orders ON Customers.customer_id = Orders.customer_id;

Result:

customer_idcustomer_nameorder_idcustomer_idorder_dateorder_total
1John Doe10112024-08-01300.0
1John Doe10312024-08-071488.0
2Jane SmithNULLNULLNULLNULL
3Alice Brown10232024-08-03250.0

As we can see, LEFT JOIN returns all records from the left table and the corresponding records from the right table. If there are no corresponding records in the right table (as for Jane Smith), NULL is returned.

RIGHT JOIN and FULL JOIN work similarly but return all records from the right table and all records from both tables, respectively.

JOIN in Atoti

In Atoti, there are two main ways to join data from different stores: using datastore references and the Copper API.

Datastore references allow joining data from different stores by matching fields. This method is fast but has significant limitations:

  • Only INNER JOIN is available.
  • Only one-to-one and many-to-one relationships are supported: each record from the left table must match at most one record from the right table. This restriction is enforced at the datastore level during data loading.
  • Datastore references cannot be used with DirectQuery tables. Instead, DirectQuery joins must be used.

Joining tables to a cube using Copper

The Copper API provides a more flexible way of joining data from different stores. This allows more complex joins and relationships, such as one-to-many or many-to-many, without the restrictions of datastore references.

Unlike datastore references, which are designed for joining two tables, Copper Join allows a table to be joined to a cube.

When building a cube in Atoti, some fields from the base table become cube levels, while others become measures. With Copper Join, fields from the new table can be used to:

  • Specify the conditions that determine how the table fields are related (or joined) to existing cube levels (this is a mapping).
  • Create new measures based on the joined data.
  • Create new hierarchies in the cube, enriching the cube's structure and functionality for analysis.

Let's consider this simple case:

  • The cube contains the single-level user hierarchy.
  • The datastore contains a hobbies store, which is isolated from the selection and contains hobby related data (the primary key is userName):
user
ALL
Marion
Nathan
Roger
Lukas
Hannah
userName (K)hobbybudgetcategory
Mariondance1.0art
Rogertennis2.0sport
Nathantennis2.0sport
Lukaspiano4.0art
Romainpiano4.0art

Let's assume we want to create a measure that returns the budget a user has for their hobbies. To achieve this, we need to join the cube with the datastore and then create a measure using the data from the datastore.

First, we create a join:

final CopperStore store =
Copper.store("hobbies") // [1]
.joinToCube(JoinType.INNER) // [2]
.withMapping("userName", Copper.level("user", "user", "user")); // [3]

In line [1], we select the table we want to join with the cube.
In line [2], we specify the use of INNER JOIN. If you want to use LEFT JOIN, you can specify JoinType.LEFT. If no join type is specified, INNER JOIN is used by default.
In line [3], we define that we want to join the table on the userName field to the cube's user level, similar to the ON clause in SQL.

The resulting store object can now be used to create measures and hierarchies based on the joined datastore. Let's create a measure that returns the user's budget for hobbies:

Copper.newLookupMeasure(store.field("budget")).withName("budget").publish(context);

Now, we can retrieve the users' hobby budget by using the budget measure, for instance, with the following MDX query:

SELECT 
{[Measures].[budget]} ON COLUMNS,
NonEmpty([user].[user].[user].Members, [Measures].[budget]) ON ROWS
FROM [C]

This query will return the hobby budget for each user, using the budget measure created from the joined table data:

budget
Lukas4.0
Marion1.0
Nathan2.0
Roger2.0

We can see that the user Hannah is not included in the results because she does not have any hobbies listed in the datastore.

Note the function NonEmpty, which is used to filter out level members that have no corresponding measure values. By default, the results would include all cube levels, even those without any measure data. This happens due to the way the MDX engine in Atoti works: it first generates all possible combinations of levels on the axes (ignoring Copper Joins) and then fills in measure values. By using NonEmpty, you can ensure that only meaningful data is returned.

JOIN types in Copper

In the Copper API, two types of JOIN are supported: INNER JOIN and LEFT JOIN. In the previous example, we used an INNER JOIN. Now, let's see what changes if we use LEFT JOIN:

final CopperStore store =
Copper.store("hobbies")
.joinToCube(JoinType.LEFT)
.withMapping("userName", Copper.level("user", "user", "user"));

Now, the result will include all users, even those who do not have hobbies in the datastore:

budget
Hannah0.0
Lukas4.0
Marion1.0
Nathan2.0
Roger2.0

In addition, we can specify a default value for cases where there are no matching records in the datastore. This is done using the withDefaultValue() method:

final CopperStore store =
Copper.store("hobbies")
.joinToCube(JoinType.LEFT)
.withDefaultValue(FieldPath.of("budget"), -666.0)
.withMapping("userName", Copper.level("user", "user", "user"));

And result will be the following:

budget
Hannah-666.0
Lukas4.0
Marion1.0
Nathan2.0
Roger2.0

Use case: Many-To-Many Relationship

Let's consider a more complex scenario where a user can have multiple hobbies. First, we need to modify the table schema by making the hobby field a key field. Then, we will add a new record to the table:

userName (K)hobby (K)budgetcategory
Mariondance1.0art
Rogertennis2.0sport
Nathantennis2.0sport
Lukaspiano4.0art
Lukassoccer3.0sport
Romainpiano4.0art

Now, Lukas has two hobbies: piano and soccer. This is a many-to-many relationship:

  • One user can have many hobbies.
  • One hobby can be shared by many users.
  • Or in other words, many users can have many hobbies.

For now, we can keep the same Java code as in the previous example:

final CopperStore store =
Copper.store("hobbies") // [1]
.joinToCube(JoinType.INNER) // [2]
.withMapping("userName", Copper.level("user", "user", "user")); // [3]

Copper.newLookupMeasure(store.field("budget")).withName("budget").publish(context);

But if we keep the same MDX query,

SELECT 
{[Measures].[budget]} ON COLUMNS,
NonEmpty([user].[user].[user].Members, [Measures].[budget]) ON ROWS
FROM [C]

the result will be empty:

budget

What's happening here? The problem is that we have changed the structure of our data model with the addition of the many-to-many relationship. The table structure is no longer aligned with the cube structure.

To solve this, Copper automatically creates hierarchies based on key fields in the datastore that aren't included in the mapping. In our case, Copper will generate a hobby hierarchy containing all the hobbies from the datastore. To retrieve the data, we need to use this hierarchy:

SELECT
{[Measures].[budget]} ON COLUMNS,
NonEmpty(
[user].[user].[user].Members * [hobby].[hobby].[hobby].Members,
[Measures].[budget]) ON ROWS
FROM [C]

Result:

budget
Lukaspiano4.0
Lukassoccer3.0
Mariondance1.0
Nathantennis2.0
Rogertennis2.0

Now, we can see that Lukas has two hobbies, and we can retrieve the budget for each of them. But what if we want to get the total budget for all hobbies for each user? To do this, we need to create an aggregated measure:

Copper.sum(store.field("budget")).withName("budget").publish(context);

Now, we can retrieve the total budget for each user:

budget
Lukas7.0
Marion1.0
Nathan2.0
Roger2.0

Copper includes several built-in aggregations: sum, avg, min, and max. You can also create your own custom aggregation by passing a plugin key to the Copper.agg() method.

If we want to configure the hobby hierarchy, we can do so by explicitly publishing it:

Copper.newHierarchy("hobby")
.fromField(store.field("hobby"))
.withLevelName("CUSTOM_NAME_FOR_HOBBY_LEVEL")
.publish(context);

Hierarchies can be also created manually. For example, let's add a category hierarchy:

Copper.newHierarchy("category").fromField(store.field("category")).publish(context);

Now, we can use the category hierarchy in MDX queries:

SELECT
[category].[category].[category].Members * {[Measures].[budget]} ON COLUMNS,
NonEmpty([user].[user].[user].Members, [Measures].[budget]) ON ROWS
FROM [C]

Result:

art/budgetsport/budget
Lukas4.03.0
Marion1.0
Nathan2.0
Roger2.0

A field does not have to be a key field to be used in a hierarchy. However, all key fields that are not included in the mapping will be automatically added as hierarchies.

This new hierarchy is an analysis hierarchy. Its members match the records of the category field in the hobbies store.

The created hierarchy can be configured similarly to a single-level hierarchy. For more information, see here.

Advanced topics

Joining without mapping

In some cases, you may want to join a store to a cube without mapping any field. This is possible by calling joinToCube() without any call to withMapping().

final CopperStore store = Copper.store("hobbies").joinToCube();

The defined operation is sometimes referred to as CROSS JOIN. As a result, hierarchies corresponding to all key fields in the datastore will be created. If the joined store is an append-only store (a store without a primary key), all the fields are considered as part of the effective primary key of the store.

Multiple join

With the Copper API it is possible to join the same store of the datastore to the same cube with multiple join operations, as long as the join operations, defined by the CopperStore object are different.

This means the same stores can be joined differently inside the same cube, for example by using different fields in the mapping, or simply by using a different join type :

final CopperStore STORE_BUY =
Copper.store("rates")
.joinToCube(UnlinkedCopperStore.JoinType.INNER)
.withMapping("currency", Copper.level("buy_currency", "buy_currency", "buy_currency"));
final CopperStore STORE_SELL =
Copper.store("rates")
.joinToCube(UnlinkedCopperStore.JoinType.INNER)
.withMapping(
"currency", Copper.level("sell_currency", "sell_currency", "sell_currency"));
final CopperStore STORE_BUY_LEFT =
Copper.store("rates")
.joinToCube(UnlinkedCopperStore.JoinType.LEFT)
.withDefaultValue(FieldPath.of("rate"), 1.0)
.withDefaultValue(FieldPath.of("bank_change_commission"), 0.1)
.withMapping("currency", Copper.level("buy_currency", "buy_currency", "buy_currency"));
final CopperStore STORE_SELL_LEFT =
Copper.store("rates")
.joinToCube(UnlinkedCopperStore.JoinType.LEFT)
.withDefaultValue(FieldPath.of("rate"), 1.0)
.withDefaultValue(FieldPath.of("bank_change_commission"), 0.05)
.withMapping(
"currency", Copper.level("sell_currency", "sell_currency", "sell_currency"));

The previous block showcases four different join operations between a FOREX "rates" store and a cube (Different join types and different mappings ). Each of the CopperStore instances created can be used independently to define its own measures and analysis hierarchies.

Note that the computations will also be independently created and performed. Therefore, querying measures requiring the execution of multiple joins will impact the join computation time with linear cost.

Using multiple stores with Datastore references

It is possible to access data from multiple isolated stores if they are linked together with datastore references:

Let's say that we separate the hobbies store into two stores: hobbies and hobbyDetails.

hobbies:

userName (K)hobby (K)
Mariondance
Rogertennis
Nathantennis
Lukaspiano
Lukassoccer
Romainpiano

hobbyDetails:

hobby (K)budgetcategory
dance1.0art
tennis2.0sport
piano4.0art
soccer3.0sport

If the datastore description declares a reference between two isolated stores, as represented below:

StartBuilding.reference()
.fromStore("hobbies")
.toStore("hobbyDetails")
.withName("hobbiesToHobbyDetails")
.withMapping("hobby", "hobby")
.build();

It will be then possible to use the fields of the referenced store when defining join operations, by using the field's expression rather than using its name:

Copper.sum(store.field(FieldPath.of("hobbiesToHobbyDetails", "budget")))
.withName("budget")
.publish(context);
Copper.newHierarchy("category")
.fromField(store.field(FieldPath.of("hobbiesToHobbyDetails", "category")))
.publish(context);

Performance considerations about join operations

Copper joins allow the user to seamlessly define a many-to-many relationship between the cube facts and the records of a store isolated from the selection.

The user should be aware that querying a measure using data from the joined store will require the computation of the join operation at query time, as the many-to-many relationship must be resolved to return the corresponding data, since it is not materialized in the stores.

This means that using a join-related measure will have an impact on the performance of the query. That impact corresponds to solving the many-to-many relationship during the query runtime, and is proportional to the amount of queried locations in the largest partitions of the cube's aggregate provider, multiplied by the size of the largest partition of the isolated store.

In a scenario where both the aggregate provider and the isolated store's partitions are large and of equivalent size, the time spent resolving the join may be large enough to significantly impact the overall query computation time.

A way to alleviate this cost is to improve the granularity of the common partitioning between the aggregate provider and the partitioning of the joined store, as the resolution of the join is performed asynchronously on each common partition between the cube and the isolated store.

If the query performance remains an issue after these optimizations, the "flattening" of the many-to-many joined data (i.e. the full materialization of the existing many-to-many combinations inside the base store schema) is necessary, in which case the join operation is no longer necessary as the joined store was merged to the base store, but this will induce a large increase of the datastore's size, and therefore will increase the RAM requirement of the application.