Examples
Now that the main API and the representation of CoPPer calculations are clearer, we can give some examples of calculations that you can use in your projects. All these examples are based on the twitter dataset given above.
In these examples we use the
dataset.toCellSet(String)
method to look at the content of a dataset. In your project, once you have written your calculation and want to add it to your cube so that end users can query it in MDX, you should use dataset.publish()
instead.
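For instance, a simple aggregation could be published instead of being queried. Below is a minimal sketch, assuming the cube and the context variable are set up as in the examples of this page:
context
    .createDatasetFromFacts()
    .agg(Columns.sum("likes").as("likes.SUM"))
    .publish();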
Aggregate
The aggregation is the simplest calculation and is the one given just above, so let's add a more complex example using multiple aggregations.
The code
context
.createDatasetFromFacts()
.agg(
Columns.min("likes"),
Columns.count().as("Count"),
Columns.longSum("likes"),
Columns.max("year").as("lastYear")
).toCellSet("SELECT [Measures].Members ON ROWS FROM [tweets]")
Produces
Measures | Values |
---|---|
contributors.COUNT | 5 |
lastYear | 2018 |
likes.LONG | 1024 |
likes.MIN | 0 |
update.TIMESTAMP | |
You will notice that for some aggregations the name of the column wasn't customized with .as()
afterwards, so the column produced by the aggregation gets an automatic name: the name of the aggregated column, followed by a dot and the aggregation function plugin key. The count()
aggregation creates a column named contributors.COUNT
. The update.TIMESTAMP
measure is empty because the cube we create to perform the query has a JUST_IN_TIME aggregate provider, which doesn't maintain this technical measure.
You can also use Columns.customAgg(String column, String pluginKey)
to aggregate a column with your own aggregation function via its plugin key.
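For instance, assuming you registered an aggregation function under the (hypothetical) plugin key "SQ_SUM", you could write something like:
context
    .createDatasetFromFacts()
    .agg(Columns.customAgg("likes", "SQ_SUM").as("likes.SQ_SUM"))
    .toCellSet("SELECT [Measures].[likes.SQ_SUM] ON COLUMNS FROM [tweets]")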
Leaf Aggregate
The leaf aggregation is an aggregation that stops at a given level, so that computations can be made at this level and then aggregated further above it with the same or a different aggregation function.
For example the code
context
.createDatasetFromFacts()
.groupBy(Columns.col("year"))
.agg(Columns.sum("likes").as("sum likes per year"))
.agg(Columns.avg("sum likes per year").as("avg likes per year"))
.toCellSet("SELECT [sender_id].[sender_id].[sender_id].Members "
+ "ON ROWS FROM [tweets] WHERE [Measures].[avg likes per year]")
Produces
sender_id | avg likes per year |
---|---|
0 | 12.5 |
1 | 0.0 |
2 | 999.0 |
The first call to .agg()
is the aggregation up to the year level because it is preceded by .groupBy()
, and the second agg()
aggregates up to the top.
Here we see that the likes of the same year were summed together, and then, for each account, we took the average of these sums over all the years where it has data.
We can also use the information from the year level after having used it as a group. Let's say we want to weight our likes to give less importance to old ones. An easy way to achieve this is to divide the likes by the distance to the current year plus one (to have non-null weights).
The code
context
.createDatasetFromFacts()
.groupBy(Columns.col("year"))
.agg(Columns.sum("likes").as("likes.per.year"))
.withColumn("weightedLikes",
Columns.col("likes.per.year").divide(Columns.literal(2019).minus(Columns.col("year"))))
.agg(Columns.sum("weightedLikes").as("weightedLikesSum"))
.toCellSet("SELECT [Measures].[weightedLikesSum] ON COLUMNS FROM [tweets]")
Produces
weightedLikesSum |
---|
1011.0 |
Currently, this calculation has the same limitations as the Aggregate example (CoPPer can understand it only if the aggregated columns correspond to fields from the selection). On top of that, the columns used in the
groupBy()
must also be fields from the selection, and these fields must be used as levels in your cube description. If you group by more complex columns, such as calculated ones, you are probably closer to the Bucketing example.
Bucketing
We call bucketing the process of creating a new hierarchy in the cube whose members are calculated based on members of other hierarchies, measures or context values. These new hierarchies are often used to create a coarser grouping of the facts into buckets, hence the name. For instance, our current data has a granularity down to the month. If we want to see aggregated values per quarter instead, we can do the following:
The code
context
.createDatasetFromFacts()
// Create the quarters with names Q1...Q4
.withColumn("quarter", Columns.col("month").map((Integer m) -> "Q" + ((m / 4) + 1))
.asHierarchy("QuarterHierarchy")
.inDimension("Time"))
.agg(Columns.sum("likes").as("likes.SUM"))
.toCellSet("SELECT {[Measures].[likes.SUM], [Measures].[contributors.COUNT]} ON COLUMNS, [Time].[QuarterHierarchy].[quarter] ON ROWS FROM [tweets]")
Produces
likes.SUM | contributors.COUNT | |
---|---|---|
Q1 | 999 | 3 |
Q3 | 23 | 1 |
Q4 | 2 | 1 |
There are a few things to note here:
- We used a lambda to create the quarter based on the month, because there is no way to perform this with the column API. We gave a type to the input of this lambda (the month), and this is mandatory since we don't know at compile time what the type of each column is.
- In the MDX query we use the level [Time].[QuarterHierarchy].[quarter]. CoPPer creates a single-level hierarchy representing the quarter column because we called asHierarchy() or asHierarchy(String) on it to indicate that we want this column to be exported as such. This hierarchy is automatically created with a single level whose name is the name of the column; the name of the hierarchy is the name of the column if not specified (with asHierarchy(String)), and the name of the dimension is the name of the column if not specified (with inDimension(String)).
- Q2 doesn't appear in the query result even though we didn't use the NON EMPTY keyword on the ROWS axis. This happens because CoPPer automatically computes the list of members of the hierarchy, and no month in our whole data corresponds to the Q2 bucket. If you want to explicitly see Q2 even if no data matches this value, you have to use .withMemberList(), which is explained below.
- The quarters bucketing was applied to all measures, even contributors.COUNT. This is different from the 5.6 bucketing, where only one measure would benefit from it.
The example above is one of the simplest bucketings one can make: it groups aggregates by a column calculated from cube levels, but we can also create buckets using underlying measures and context values.
Let's say we want to capture a user's interest in tweets, and that we created a dedicated context value representing this interest. We include its full code below for thoroughness:
/**
* Represents the interests of a user in the content of the tweets.
*/
protected interface UserInterests extends IContextValue {
/**
* Indicates how much the user represented by this context value is
* interested in tweets sent by a given account.
*
* @param senderId The sender of the tweet.
* @param numberOfLikes The number of likes the tweets of this sender accumulated.
*
* @return The estimated interest in the tweet.
*/
String getUserSenderIdInterest(long senderId, long numberOfLikes);
@Override
default Class<? extends IContextValue> getContextInterface() {
return UserInterests.class;
}
}
/**
* An implementation liking users in a white list or having at least a given amount of likes in total.
*/
protected static class MinLikesAndWhiteListInterests implements UserInterests {
private static final long serialVersionUID = 1L;
/** The other users liked by the current user. */
protected final Set<Long> likedSenders;
/** The minimum accumulated number of likes a user needs to have to be appreciated by the current user. */
protected final long minTotalLikes;
/**
* Fully defining constructor.
*
* @param minTotalLikes The minimum accumulated number of likes a user needs to have to be appreciated by the current user.
* @param likedSenders The list of senders the current user likes anyway.
*/
public MinLikesAndWhiteListInterests(long minTotalLikes, Long ...likedSenders) {
this.minTotalLikes = minTotalLikes;
this.likedSenders = new HashSet<Long>();
this.likedSenders.addAll(Arrays.<Long>asList(likedSenders));
}
/**
* Copy constructor.
*
* @param minTotalLikes See {@link #minTotalLikes}.
* @param likedSenders See {@link #likedSenders}.
*/
protected MinLikesAndWhiteListInterests(long minTotalLikes, Set<Long> likedSenders) {
this.minTotalLikes = minTotalLikes;
this.likedSenders = likedSenders;
}
@Override
public String getUserSenderIdInterest(long senderId, long numberOfLikes) {
if (likedSenders.contains(senderId) || numberOfLikes > this.minTotalLikes) {
return "Interested";
} else {
return "Not interested";
}
}
@Override
public MinLikesAndWhiteListInterests clone() {
return new MinLikesAndWhiteListInterests(minTotalLikes, likedSenders);
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ((likedSenders == null) ? 0 : likedSenders.hashCode());
result = prime * result + (int) (minTotalLikes ^ (minTotalLikes >>> 32));
return result;
}
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
MinLikesAndWhiteListInterests other = (MinLikesAndWhiteListInterests) obj;
if (likedSenders == null) {
if (other.likedSenders != null)
return false;
} else if (!likedSenders.equals(other.likedSenders))
return false;
if (minTotalLikes != other.minTotalLikes)
return false;
return true;
}
}
Now with this context value we can bucket with this calculation:
/** Creates calculation bucketing by a context value. */
protected Dataset contextValueBucketing() {
return context
.createDatasetFromFacts()
.agg(Columns.sum("likes").as("numberOfLikes"))
.withColumn("interest", context.contextValue(UserInterests.class)
.combineWith(Columns.col("sender_id"), Columns.col("numberOfLikes"))
.map(a -> ((UserInterests) a.read(0)).getUserSenderIdInterest(a.readLong(1), a.readLong(2)))
.asHierarchy()
.withMemberList("Interested", "Not interested"));
}
We can then test the produced bucketing with an MDX query using different context values.
The first one uses a threshold of 200 and has sender 1 as a favorite:
UserInterests userInterests = new MinLikesAndWhiteListInterests(200L, 1L);
contextValueBucketing()
.toCellSet("SELECT "
+ "[interest].[interest].[interest] ON ROWS "
+ "FROM [tweets] WHERE [Measures].[contributors.COUNT]", userInterests)
Produces
interest | contributors.COUNT |
---|---|
Interested | 2 |
Not interested | 3 |
The second one uses a threshold of 10 and has sender 2 as a favorite:
UserInterests userInterests = new MinLikesAndWhiteListInterests(10L, 2L);
contextValueBucketing()
.toCellSet("SELECT "
+ "[interest].[interest].[interest] ON ROWS "
+ "FROM [tweets] WHERE [Measures].[contributors.COUNT]", userInterests)
Produces
interest | contributors.COUNT |
---|---|
Interested | 4 |
Not interested | 1 |
There are multiple things to explain in this example:
- We retrieve the value of the context value via context.contextValue(UserInterests.class), as explained in the API section.
- We combine the columns UserInterests.class, sender_id and numberOfLikes to produce the interest column (see Columns combining above). We can safely cast the first object we receive in our lambda to the context value class since the first column is the context value column, and we can then directly call a method of the context value in our lambda with the content of the sender_id and numberOfLikes columns as parameters.
- We use a new column operator: .withMemberList(). This operator is needed when CoPPer detects that a column (interest here) will be converted to a cube level AND it can't automatically figure out its list of members, which is required by ActivePivot to work properly. Here the calculation uses a context value, so it is impossible for CoPPer to know all the different instances this interface can have, let alone all the values the .getUserSenderIdInterest() function can return. This is why it asks the user to provide the actual list of distinct values this function can return. You will notice that this method wasn't needed in the previous example, even though that example also created a new level: there, CoPPer could automatically find the list of the members of the Quarter level. So you don't always need to call this method. You should write your calculation without calling it, and if CoPPer detects that it can't find the member list automatically, it throws an exception with an explicit error message that should guide you on where to add .withMemberList(). Another situation where you may want to use .withMemberList() is the quarter example above: since there is not a lot of data in the example, CoPPer doesn't detect that there can be a Q2 quarter in the quarters level, and you don't see it in the query result even though you didn't use the NON EMPTY keyword. You can fix that by adding .withMemberList("Q1", "Q2", "Q3", "Q4") after .map((Integer m) -> "Q" + ((m / 4) + 1)).asHierarchy(), as illustrated just below.
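As an illustration, here is what the quarter bucketing could look like with an explicit member list, a sketch derived from the example above (only the withMemberList() call is new):
context
    .createDatasetFromFacts()
    .withColumn("quarter", Columns.col("month").map((Integer m) -> "Q" + ((m / 4) + 1))
        .asHierarchy("QuarterHierarchy")
        .withMemberList("Q1", "Q2", "Q3", "Q4")
        .inDimension("Time"))
    .agg(Columns.sum("likes").as("likes.SUM"))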
StreamToPos
The StreamToPos calculation aggregates values along the members of a level. For instance, in our tweets example, we may want to see the cumulative number of likes up to any date.
We can write the following calculation:
context.createDatasetFromFacts()
.agg(Columns.sum("likes").as("likes.SUM"))
.select(
Columns.col("likes.SUM"),
Columns.sum("likes.SUM")
.over(Window.orderBy("day")).as("cumulatedLikes"))
.doNotAggregateAbove()
.toCellSet("SELECT [time].[day].Members ON ROWS, "
+ "{[Measures].[likes.SUM], [Measures].[cumulatedLikes]} ON COLUMNS FROM [tweets]")
Produces
likes.SUM | cumulatedLikes | |
---|---|---|
2017/11/2 | 23 | 23 |
2017/12/14 | 2 | 25 |
2018/1/4 | 0 | 25 |
2018/2/9 | 0 | 25 |
2018/3/14 | 999 | 1024 |
We added the likes.SUM measure just to better understand what is going on in cumulatedLikes.
In order to achieve this calculation, we used a window: .over(Window.orderBy("day")). We applied this window to the Columns.sum("likes.SUM") calculation to get a cumulated sum over the values of all days up to and including the current day. You can use aggregations other than sum in this situation: CoPPer supports sum, min, max, avg, count and distinctCount here.
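For instance, a running average of the daily likes could be written with the same pattern. This is an illustrative sketch we add here; the measure name averageLikesToDate is ours:
context.createDatasetFromFacts()
    .agg(Columns.sum("likes").as("likes.SUM"))
    .select(
        Columns.col("likes.SUM"),
        Columns.avg("likes.SUM")
            .over(Window.orderBy("day")).as("averageLikesToDate"))
    .doNotAggregateAbove()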
The cumulatedLikes measure above cumulates the likes along all dimensions. If we want to see a separate cumulated value per sender_id, for instance, we need to use Window.partitionBy.
The code
context.createDatasetFromFacts()
.agg(Columns.sum("likes").as("likes"))
.select(
Columns.col("likes"),
Columns.sum("likes")
.over(Window.partitionBy("sender_id").orderBy("day")).as("cumulatedLikes"))
.doNotAggregateAbove()
Produces
0 | 0 | 1 | 1 | 2 | 2 | |
---|---|---|---|---|---|---|
likes | cumulatedLikes | likes | cumulatedLikes | likes | cumulatedLikes | |
2017/11/2 | 23 | 23 | | 0 | | 0 |
2017/12/14 | 2 | 25 | | 0 | | 0 |
2018/1/4 | 0 | 25 | | 0 | | 0 |
2018/2/9 | | 25 | 0 | 0 | | 0 |
2018/3/14 | | 25 | | 0 | 999 | 999 |
The StreamToPos name is a reference to the previous mechanism responsible for this feature before CoPPer was introduced.
The columns you partitionBy and orderBy should be columns corresponding directly to cube levels.
Lead, Lag, First and Last
Here is the second category of window operators: lead, lag, first and last. These four operators give to a row the value present in another row relative to it in a window specification.
- The first and last operators give the value of the first and last row respectively contained in the same window.
- The lag and lead operators give the value of the previous and next row respectively contained in the same window. They also take an int parameter indicating by how many rows we move. Let's see all these operators in action:
The code
context.createDatasetFromFacts()
.agg(Columns.sum("likes").as("likes"))
.select(
Columns.col("likes"),
Columns.first("likes").over(Window.orderBy("day")).as("likes FirstDay"),
Columns.last("likes").over(Window.orderBy("day")).as("likes LastDay"),
Columns.lag("likes", 1).over(Window.orderBy("day")).as("likes Yesterday"),
Columns.lead("likes", 1).over(Window.orderBy("day")).as("likes Tomorrow"),
Columns.lead("likes", 2).over(Window.orderBy("day")).as("likesDayAfterTomorrow"))
.doNotAggregateAbove()
.toCellSet("SELECT [time].[day].Members ON ROWS, "
+ "Filter([Measures].Members, Left([Measures].CurrentMember.MEMBER_CAPTION, 5) = \"likes\") ON COLUMNS FROM [tweets]")
(We select only the measures starting with the word "likes").
Produces
likes | likes FirstDay | likes LastDay | likes Yesterday | likes Tomorrow | likesDayAfterTomorrow | |
---|---|---|---|---|---|---|
2017/11/2 | 23 | 23 | 999 | | 2 | 0 |
2017/12/14 | 2 | 23 | 999 | 23 | 0 | 0 |
2018/1/4 | 0 | 23 | 999 | 2 | 0 | 999 |
2018/2/9 | 0 | 23 | 999 | 0 | 999 | |
2018/3/14 | 999 | 23 | 999 | 0 |
You can also use partitions to restrict the scope within which neighbour values are looked up. For instance, we can partition by year to keep the lookups within the current year. Let's see that with the previous example, just adding a partitioning by year:
The code
context.createDatasetFromFacts()
.agg(Columns.sum("likes").as("likes"))
.select(
Columns.col("likes"),
Columns.first("likes").over(Window.partitionBy("year").orderBy("day")).as("likes FirstDay"),
Columns.last("likes").over(Window.partitionBy("year").orderBy("day")).as("likes LastDay"),
Columns.lag("likes", 1).over(Window.partitionBy("year").orderBy("day")).as("likes Yesterday"),
Columns.lead("likes", 1).over(Window.partitionBy("year").orderBy("day")).as("likes Tomorrow"),
Columns.lead("likes", 2).over(Window.partitionBy("year").orderBy("day")).as("likesDayAfterTomorrow"))
.doNotAggregateAbove()
.toCellSet("SELECT [time].[day].Members ON ROWS, "
+ "Filter([Measures].Members, Left([Measures].CurrentMember.MEMBER_CAPTION, 5) = \"likes\") ON COLUMNS FROM [tweets]")
Produces
likes | likes FirstDay | likes LastDay | likes Yesterday | likes Tomorrow | likesDayAfterTomorrow | |
---|---|---|---|---|---|---|
2017/11/2 | 23 | 23 | 2 | | 2 | |
2017/12/14 | 2 | 23 | 2 | 23 | ||
2018/1/4 | 0 | 0 | 999 | | 0 | 999 |
2018/2/9 | 0 | 0 | 999 | 0 | 999 | |
2018/3/14 | 999 | 0 | 999 | 0 |
Limitations:
- The columns you partitionBy and orderBy should be columns corresponding to cube levels directly.
- Only one column can be used in the orderBy parameter.
- The column used as orderBy should correspond to the last level of its hierarchy.
Shift
The lead, lag, first and last operators give you an easy way to access neighbour values, but they don't offer much flexibility, such as random access. Random access can be achieved via the at() operator. If we want to compare the amount of likes with that of the previous month, we can do:
context.createDatasetFromFacts()
.agg(Columns.sum("likes").as("likes"))
.select(Columns.col("likes").at("month", (Integer m) -> m - 1).as("likesPrevMonth"))
.doNotAggregateAbove()
.toCellSet("SELECT [time].[month].Members ON ROWS, "
+ "{[Measures].[likes], [Measures].[likesPrevMonth]} ON COLUMNS FROM [tweets]")
To produce
likes | likesPrevMonth | |
---|---|---|
2017/11 | 23 | |
2017/12 | 2 | 23 |
2018/1 | 0 | |
2018/2 | 0 | 0 |
2018/3 | 999 | 0 |
This usage of at()
might look similar to the usage of lag()
but it is very different, for multiple reasons:
- We explicitly used m -> m - 1 in our lambda to achieve something close to lag, but we could have used m -> m / 2 or any other transformation of the month value, which is not possible with lag (see the sketch after this list).
- More importantly, the shift performed by at() does not take the actual values of the underlying measure into account. If on the current row you are at month 8 and the underlying measure (here the measure likes) has no member in its result whose value is 7, then the calculation produced by the shift returns null, whereas lag() would in this case find the first member in the underlying measure result just before 8, which could be 6 or 5 or even less if your dataset is quite sparse.
- The second important thing to notice is that at() doesn't rely on the order of the members in the hierarchy. Here the value of likesPrevMonth for 2018/1 is not equal to the value of likes for 2017/12. The lag operator in the example above, on the other hand, could move to the previous month when reaching the first day of a month, because the lag() operator relies on the member order of the hierarchies given in the orderBy() call. For this same reason, the lead, lag, first and last operators can only use the last level of a hierarchy in their orderBy() parameter. The at() operator doesn't rely on any order: you need to compute exactly the value of the members you want to use.
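For example, the month transformation could be replaced by a shift on the year level to compare each month with the same month of the previous year. This is an illustrative variant we add here, following the same pattern as the example above (the measure name likesSameMonthLastYear is ours):
context.createDatasetFromFacts()
    .agg(Columns.sum("likes").as("likes"))
    .select(Columns.col("likes").at("year", (Integer y) -> y - 1).as("likesSameMonthLastYear"))
    .doNotAggregateAbove()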
Going back to likesPrevMonth, in order to handle year changes one can do instead:
context.createDatasetFromFacts()
.agg(Columns.sum("likes").as("likes"))
.select(Columns.col("likes").at("month", "year", a -> {
if (a.readInt(0) == 1) {
// Roll year
a.writeInt(0, 12);
a.writeInt(1, a.readInt(1) - 1);
} else {
a.writeInt(0, a.readInt(0) - 1);
}
}).as("likesPrevMonth"))
.doNotAggregateAbove()
.toCellSet("SELECT [time].[month].Members ON ROWS, "
+ "{[Measures].[likes], [Measures].[likesPrevMonth]} ON COLUMNS FROM [tweets]")
Which produces
likes | likesPrevMonth | |
---|---|---|
2017/11 | 23 | |
2017/12 | 2 | 23 |
2018/1 | 0 | 2 |
2018/2 | 0 | 0 |
2018/3 | 999 | 0 |
Here we fixed the value for 2018/1 by manually rolling the year. To do that we need both the year and the month information; this is why we use a version of at()
that takes a lambda with an array argument, exactly like combined columns. The difference here is that the input array is an IWritableArray
. Your lambda can read the values of the columns from it and also write to it to tell CoPPer to which row to shift. Like combined columns, we use this array interface to prevent boxing, so we recommend using the dedicated primitive-type methods like readInt()
and writeInt()
instead of read() and write()
.
Condition Shift
The shift APIs shown above all use lambdas to decide at which row to shift. This is quite simple for the user, but again CoPPer can't read the code inside a lambda, so it has no idea how the shift will happen and can't optimize it. If you are shifting to rows that do not depend on where you currently are, you can use the condition shift API, which has much better performance.
We can get the difference between the likes of each user and the first one this way:
context.createDatasetFromFacts()
.agg(Columns.sum("likes").as("likes"))
.select(Columns.col("likes").minus(
Columns.col("likes").at(Columns.col("sender_id").equalTo(0L))).as("likesDiffWithFirstSender"))
.doNotAggregateAbove()
.toCellSet("SELECT [sender_id].[sender_id].[sender_id].Members "
+ "ON ROWS FROM [tweets] WHERE [Measures].[likesDiffWithFirstSender]")
Which produces
sender_id | likesDiffWithFirstSender |
---|---|
0 | 0 |
1 | -25 |
2 | 974 |
This variant of the at()
operator is thus recommended as it produces both readable and fast calculations, so you should prefer it to the lambda API if you shift to constant rows.
Limitations
- The columns used in the at() condition or as input to the lambda method should be used as cube levels.
- When using a condition, we support only a condition of the form column.equalTo(value) since otherwise multiple rows could match and we need to have only one corresponding row.
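For instance, assuming the year values are plain integers, one could compare the likes of every member with those of 2017 using a constant condition. This is an illustrative sketch we add here (the measure name likesDiffWith2017 is ours):
context.createDatasetFromFacts()
    .agg(Columns.sum("likes").as("likes"))
    .select(Columns.col("likes").minus(
        Columns.col("likes").at(Columns.col("year").equalTo(2017))).as("likesDiffWith2017"))
    .doNotAggregateAbove()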
DrillUp
The drillUp calculation is relevant when working with hierarchies: it creates a measure whose value is the value of another measure on a parent member. We can for instance compare a value to its parent via a percent-of-parent ratio.
The code
context
.createDatasetFromFacts()
.agg(Columns.count().as("count"))
.select(Columns.col("count").cast("double")
.divide(Columns.col("count").drillUp("day"))
.withFormatter("DOUBLE[0.00%]").as("percentOfParent"))
.toCellSet("SELECT Descendants([time].[ALL].[AllMember], 2, SELF_AND_BEFORE) ON ROWS, "
+ "{[Measures].[contributors.COUNT], [Measures].[percentOfParent]} ON COLUMNS FROM [tweets]")
Produces
contributors.COUNT | percentOfParent | |
---|---|---|
AllMember | 5 | |
2017 | 2 | 40.00% |
2018 | 3 | 60.00% |
2017/11 | 1 | 50.00% |
2017/12 | 1 | 50.00% |
2018/1 | 1 | 33.33% |
2018/2 | 1 | 33.33% |
2018/3 | 1 | 33.33% |
There are multiple things to note about this example:
- The MDX we use in this example is a little more complex because we want to showcase the measure values for the AllMember, the years and the months. Note that the AllMember has a null drillUp value, which creates a division by null when computing percentOfParent, hence its empty cell.
- We need to cast() our count column to a double column: it is of type long, and the .divide() operator would use the division between long numbers, which does not produce fractional results, so we would only see 0s and 1s in the query result if we didn't cast.
- We used a formatter for the percentOfParent column to have a nice display of this fraction. The withFormatter() call takes the same formatter string as when declaring your cubes in Java and XML previously, i.e. a formatter plugin key and optionally a pattern inside brackets if this formatter handles patterns.
- We gave only "day" as a parameter to the drillUp() operator, but even the months were affected by the drillUp. This is because the drillUp operator works on a whole hierarchy, and the columns you give it as arguments are only used to enable it or not on the hierarchies containing the levels they correspond to. One can also drillUp on multiple columns at once, so as to perform the drillUp on multiple hierarchies at once (see the sketch below).
Limitations: The columns used in the drillUp operators should correspond to cube levels.
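As mentioned above, drillUp() can also take several columns in order to drill up several hierarchies at once. A minimal sketch, assuming the operator accepts multiple columns (the measure name is ours):
context
    .createDatasetFromFacts()
    .agg(Columns.count().as("count"))
    .select(Columns.col("count").cast("double")
        .divide(Columns.col("count").drillUp("day", "sender_id"))
        .withFormatter("DOUBLE[0.00%]").as("percentOfParentOnBothHierarchies"))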
Store Query
We will now see how our calculations can integrate data from other stores that do not belong to the cube selection. Such stores are called isolated stores.
Let's say we have an isolated store login_events with this content:
account_id | timestamp | attempts_before_success |
---|---|---|
0 | 1450000000L | 3 |
0 | 1460000000L | 1 |
0 | 1480000000L | 1 |
0 | 1500000000L | 2 |
1 | 1470000000L | 1 |
1 | 1490000000L | 2 |
2 | 1480000000L | 1 |
2 | 1500000000L | 1 |
The key of this store is (account_id
, timestamp
). This store could have a reference to the base store but not the other way around; this is why it can't be part of the selection of the ActivePivot schema containing our cube. We will thus need to perform datastore queries to read its content from an ActivePivot query.
Let's say we want to compute the difference in days between a login and a tweet (we don't go down to second precision for simplicity).
The following code:
StoreDataset loginEvents = context.createDatasetFromStore("login_events");
Dataset daysBetween = context
.createDatasetFromFacts()
.join(loginEvents, ColumnMapping.mapping("sender_id").to("account_id"))
.withColumn("daysBetweenLoginAndTweet", Columns.col("timestamp").combineWith(
Columns.col("year"),
Columns.col("month"),
Columns.col("day")).map(a -> {
LocalDate tweetDate = LocalDate.of(a.readInt(1), a.readInt(2), a.readInt(3));
LocalDate loginDate = LocalDateTime.ofEpochSecond(a.readLong(0), 0, ZoneOffset.UTC)
.toLocalDate();
return ChronoUnit.DAYS.between(loginDate, tweetDate);
}))
.agg(Columns.max("daysBetweenLoginAndTweet").as("maxDaysBetweenLoginAndTweet"));
daysBetween
.toCellSet("SELECT [sender_id].[sender_id].[sender_id].Members ON ROWS, "
+ "[Measures].[maxDaysBetweenLoginAndTweet] ON COLUMNS FROM [tweets]")
Produces
maxDaysBetweenLoginAndTweet | |
---|---|
0 | 753 |
1 | 558 |
2 | 475 |
There are multiple things to note here:
- We used createDatasetFromStore to retrieve the content of the isolated store.
- We stored the corresponding dataset in a variable. Datasets are like any other Java variable and can be stored, passed to functions and so forth; the previous examples did not use this aspect because they were simpler, but it is definitely possible.
- When we joined, we defined the column mapping indicating which column of which dataset is used for the equi-join between the datasets. Here the sender_id column from the selection dataset matches the account_id column from the isolated store.
- We used combineWith again to perform a complex calculation involving multiple columns in a single lambda. We then aggregated with a MAX to get the maximal date difference when aggregating. The paragraph below gives more details about this aggregation.
Store Queries and Hierarchies
The interaction between store queries and hierarchies can be summarized like this:
When joining with an isolated store, CoPPer automatically creates a hierarchy for every field of the isolated store that is part of the store key and whose value cannot be determined from the calculation (via a join or a filter).
The reason for this is explained below; it is quite complex and is not required to use CoPPer correctly.
Reason:
One of the jobs of CoPPer is to decide which levels of the cube are needed for each calculation. In the example above, CoPPer has to decide which levels to use when calling the lambda producing the day difference. It needs to choose the levels so that, when it performs the calculation at these levels, it has a single year, month, day and timestamp to give to the lambda. For year, month and day this is easy: just go to the day level, which is the lowest in the time hierarchy, so it automatically provides a value for year and month.
For timestamp, things are a little more complex since it comes from an isolated store. Even if we try to execute the calculation at the lowest level of all the cube hierarchies (sender_id
and day here), there can be multiple rows in login_events
joined to a row representing a fact, since there are multiple login_events
per account_id. For instance at (sender_id
= 0L, day = 2017/11/2) we have 3 rows in the login_events
store matching our sender_id
. CoPPer thus needs a way to dive deeper into the data to separate these 3 rows and process them separately. The ActivePivot way of doing this is to add a new hierarchy to the cube with a level that has a distinct member for each row. How does CoPPer decide which level to create? It just needs to look at all the fields of login_events
that need a condition on their value to produce a single row in the datastore query on login_events
. These are actually the key fields of the store. Here the key fields of login_events
are account_id and timestamp. CoPPer can create a condition on the account_id field easily since it is joined to sender_id
, which is a cube level, so it just needs to create a factless hierarchy for the remaining field, timestamp, with a single level, timestamp. The calculation is then made at the sender_id
and timestamp levels (in addition to the day level); there CoPPer is guaranteed to have only one row in the login_events
store because it has an equality condition on all its key fields, and it is able to extract a single timestamp to call the user lambda with.
If CoPPer cannot give a value to multiple key fields, it creates one hierarchy for each of the fields needing one.
The consequence is that in this example CoPPer created a hierarchy for the timestamp field of the login_events
store.
So we can query this dataset with this hierarchy:
daysBetween.toCellSet("SELECT [sender_id].[sender_id].[sender_id].Members ON ROWS, "
+ "[Measures].[maxDaysBetweenLoginAndTweet] ON COLUMNS FROM [tweets] "
+ "WHERE [timestamp].[timestamp].[timestamp].[1470000000]")
maxDaysBetweenLoginAndTweet | |
---|---|
0 | |
1 | 558 |
2 | |
We see here again that this hierarchy is created in the timestamp
dimension with a single level named timestamp
. The names used are the name of the field.
Filtering
To filter the data from isolated stores, we can use the filter()
operator. This operator must be used on the dataset after the join
operator, because the join operator only accepts datasets that have not received any calculation. Imagine we want to keep only the rows of login_events
where the user managed to log in after only one attempt; we can write:
context
.createDatasetFromFacts()
.join(loginEvents, ColumnMapping.mapping("sender_id").to("account_id"))
.filter(Columns.col("attempts_before_success").equalTo(1))
The rest of the code stays identical to the previous example, and produces the table
maxDaysBetweenLoginAndTweet | |
---|---|
0 | 637 |
1 | 558 |
2 | 475 |
If we filter on the timestamp field instead with an equality:
context
.createDatasetFromFacts()
.join(loginEvents, ColumnMapping.mapping("sender_id").to("account_id"))
.filter(Columns.col("timestamp").equalTo(1470000000L))
Then we obtain the same query result as if we had performed this filtering in the MDX query.