Flat and Multi-Dimensional Representations
All the operator descriptions on the previous page use the flat representation of datasets. This is the representation you see in Spark with dataset.show() or dataset.collectAsList().
This is because, in Spark, datasets are tightly coupled with the data: one dataset corresponds to one and only one table of values.
The goal of datasets in CoPPer is different: a dataset describes the business logic to apply to the data but the user can still navigate in the data and change it afterwards.
CoPPer translates the datasets into ActivePivot measures so that users benefit from all the power and capabilities of ActivePivot (multi-dimensional analysis, monitoring, what-if analysis, real-time updates). This is why the main way to explore a CoPPer dataset is an MDX query instead of a show() method.
In short: the flat representation is best for writing and understanding the calculations made on the data, while the multi-dimensional representation is better suited to exploring the result and getting the most out of it.
This is also why CoPPer datasets only exist in the context of an ActivePivot cube. They add measures and dimensions to this cube but need a scaffolding. This scaffolding consists of:
- the dimensions of the cube along which the user wants to explore the data
- the description of the datastore this cube is built on top of
- the description of the selection feeding the cube
For the next examples let's say we have:
A datastore with a single store called tweets containing
id | text | sender_id | likes | year | month | day |
---|---|---|---|---|---|---|
0 | Hello World | 0 | 23 | 2017 | 11 | 2 |
1 | Lol | 0 | 2 | 2017 | 12 | 14 |
2 | Foo | 0 | 0 | 2018 | 1 | 4 |
3 | Test | 1 | 0 | 2018 | 2 | 9 |
4 | Hola | 2 | 999 | 2018 | 3 | 14 |
A selection and a cube on top of this datastore:
StartBuilding.selection(datastoreDescription)
.fromBaseStore("tweets")
.withAllReachableFields()
.build();
StartBuilding.cube("tweets")
.withSingleLevelDimensions("sender_id")
.withDimension("time")
.withHierarchy("time")
.withLevel("year")
.withLevel("month")
.withLevel("day")
Now let's write the simplest calculation with CoPPer and its corresponding cellSet:
Dataset likesSum = context
.createDatasetFromFacts()
.agg(Columns.sum("likes").as("likes.SUM"));
likesSum
.toCellSet("SELECT [Measures].[likes.SUM] ON COLUMNS FROM [tweets]")
Produces
likes.SUM |
---|
1024 |
There are multiple things to notice:
- The name of the measure to use in the MDX query is the name of the column in your dataset
- The cube name to use in the query is the actual cube name from your description. CoPPer creates temporary cubes for the tests with a test setup (no query time limit, a Just In Time aggregate provider), but because the name is unchanged you can reuse queries from your bookmarks or from ActiveUI as they are.
We can actually try other queries and see their corresponding cellSets:
The code
likesSum.toCellSet("SELECT "
+ "[Measures].[likes.SUM] ON COLUMNS, "
+ "[sender_id].[sender_id].[sender_id] ON ROWS FROM [tweets]")
Produces
sender_id | likes.SUM |
---|---|
0 | 25 |
1 | 0 |
2 | 999 |
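The two cellSets so far can be cross-checked by hand. The following is a plain-Java sketch, not CoPPer or ActivePivot API: the class LikesBySender and its Tweet helper are hypothetical names introduced for illustration, and the rows are copied from the tweets table above. It recomputes the grand total and the per-sender totals from the same facts.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper class: recomputes likes.SUM without any ActivePivot dependency.
final class LikesBySender {

    /** One row of the tweets store, reduced to the two columns used here. */
    static final class Tweet {
        final int senderId;
        final int likes;

        Tweet(int senderId, int likes) {
            this.senderId = senderId;
            this.likes = likes;
        }
    }

    /** The five sample rows of the tweets table (sender_id, likes). */
    static final List<Tweet> TWEETS = Arrays.asList(
            new Tweet(0, 23),
            new Tweet(0, 2),
            new Tweet(0, 0),
            new Tweet(1, 0),
            new Tweet(2, 999));

    /** Grand total: the single cell returned when no dimension is on an axis. */
    static int totalLikes() {
        return TWEETS.stream().mapToInt(t -> t.likes).sum();
    }

    /** Same aggregation broken down by sender_id, sorted by sender. */
    static Map<Integer, Integer> likesBySender() {
        return TWEETS.stream().collect(Collectors.groupingBy(
                t -> t.senderId, TreeMap::new, Collectors.summingInt(t -> t.likes)));
    }

    public static void main(String[] args) {
        System.out.println(totalLikes());    // prints 1024
        System.out.println(likesBySender()); // prints {0=25, 1=0, 2=999}
    }
}
```

The point of the comparison is that the dataset definition did not change between the two queries: only the MDX axes did.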
The code
likesSum.toCellSet("SELECT "
+ "[time].[time].[year] ON COLUMNS, "
+ "[sender_id].[sender_id].[sender_id] ON ROWS "
+ "FROM [tweets] WHERE [Measures].[likes.SUM]")
Produces
sender_id | 2017 | 2018 |
---|---|---|
0 | 25 | 0 |
1 | | 0 |
2 | | 999 |
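To see which cells of this cross-tab are empty and which hold a genuine zero, the sketch below rebuilds it in plain Java. This is an illustration only, not CoPPer API: LikesBySenderAndYear is a hypothetical class name, and the rows are the sample tweets reduced to sender_id, likes and year. A missing key stands for an empty cell (no contributing fact), which is different from a cell whose sum is 0.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class: a sender_id x year cross-tab of summed likes.
final class LikesBySenderAndYear {

    // Sample rows from the tweets table: {sender_id, likes, year}.
    static final int[][] TWEETS = {
        {0, 23, 2017}, {0, 2, 2017}, {0, 0, 2018}, {1, 0, 2018}, {2, 999, 2018}
    };

    /** Builds sender_id -> (year -> summed likes); absent keys are empty cells. */
    static Map<Integer, Map<Integer, Integer>> crossTab() {
        Map<Integer, Map<Integer, Integer>> cells = new TreeMap<>();
        for (int[] row : TWEETS) {
            cells.computeIfAbsent(row[0], k -> new TreeMap<>())
                 .merge(row[2], row[1], Integer::sum);
        }
        return cells;
    }

    public static void main(String[] args) {
        // prints {0={2017=25, 2018=0}, 1={2018=0}, 2={2018=999}}
        System.out.println(crossTab());
    }
}
```

Sender 1 has a 2018 cell equal to 0 because one tweet with no likes exists, while its 2017 cell is absent because no tweet exists at all.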
The code
likesSum.toCellSet("SELECT" +
" NON EMPTY [Measures].[likes.SUM] ON COLUMNS" +
" FROM (" +
" SELECT" +
" TopCount(" +
" Filter(" +
" [sender_id].[sender_id].[sender_id].Members," +
" NOT IsEmpty(" +
" [Measures].[likes.SUM]" +
" )" +
" )," +
" 1," +
" [Measures].[likes.SUM]" +
" ) ON COLUMNS" +
" FROM [tweets]" +
" )")
Produces
likes.SUM |
---|
999 |
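The sub-select in this last query first restricts the cube to the top sender by likes.SUM, then the outer query aggregates over what is left. A rough plain-Java equivalent of those two steps is sketched below. It is not how ActivePivot evaluates MDX, and TopSenderLikes with its methods is a hypothetical name used only for this illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper class: mimics TopCount(Filter(...), n, measure) followed by a total.
final class TopSenderLikes {

    // Sample rows from the tweets table: {sender_id, likes}.
    static final int[][] TWEETS = {
        {0, 23}, {0, 2}, {0, 0}, {1, 0}, {2, 999}
    };

    /** Summed likes per sender; senders without any fact simply do not appear. */
    static Map<Integer, Integer> likesBySender() {
        Map<Integer, Integer> sums = new TreeMap<>();
        for (int[] row : TWEETS) {
            sums.merge(row[0], row[1], Integer::sum);
        }
        return sums;
    }

    /** Inner step: the n senders with the highest summed likes. */
    static List<Integer> topSenders(int n) {
        return likesBySender().entrySet().stream()
                .sorted(Map.Entry.<Integer, Integer>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** Outer step: total likes over the facts that belong to the kept senders. */
    static int totalFor(List<Integer> senders) {
        int total = 0;
        for (int[] row : TWEETS) {
            if (senders.contains(row[0])) {
                total += row[1];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> top = topSenders(1);
        System.out.println(top);           // prints [2]
        System.out.println(totalFor(top)); // prints 999
    }
}
```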
This is the main difference between datasets created via Spark and via CoPPer. In both cases, a dataset only represents the calculations you apply to your data, and those calculations are applied lazily. But:
- in Spark, the calculation is tied to the data and can only produce a single result
- in CoPPer, the calculations are published in an ActivePivot cube, can be queried with any MDX query in multiple UIs, and benefit from all the features of ActivePivot such as real-time updates, what-if simulations and monitoring with ActiveMonitor.
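The contrast can be made concrete with a toy model. The sketch below is plain Java and only an analogy, with LazyMeasureSketch and its members being hypothetical names: the "measure" is defined once as a method, nothing is computed until a query supplies a slice of the facts, and a later change to the facts is reflected in the next query without redefining anything.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical toy model: one calculation, evaluated on demand for any slice.
final class LazyMeasureSketch {

    /** Mutable fact store; each row is {sender_id, likes, year}. */
    final List<int[]> facts = new ArrayList<>();

    /** The calculation: sums likes over whichever rows the query selects. */
    int likesSum(Predicate<int[]> slice) {
        int total = 0;
        for (int[] row : facts) {
            if (slice.test(row)) {
                total += row[1];
            }
        }
        return total;
    }

    /** Loads the five sample rows of the tweets table. */
    static LazyMeasureSketch withSampleTweets() {
        LazyMeasureSketch sketch = new LazyMeasureSketch();
        sketch.facts.add(new int[] {0, 23, 2017});
        sketch.facts.add(new int[] {0, 2, 2017});
        sketch.facts.add(new int[] {0, 0, 2018});
        sketch.facts.add(new int[] {1, 0, 2018});
        sketch.facts.add(new int[] {2, 999, 2018});
        return sketch;
    }

    public static void main(String[] args) {
        LazyMeasureSketch sketch = withSampleTweets();
        System.out.println(sketch.likesSum(row -> true));           // prints 1024
        System.out.println(sketch.likesSum(row -> row[2] == 2017)); // prints 25
        // A new fact arrives; the same calculation now yields a new result.
        sketch.facts.add(new int[] {1, 10, 2018});
        System.out.println(sketch.likesSum(row -> true));           // prints 1034
    }
}
```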