Flat and Multi-Dimensional Representations

All the operator descriptions on the previous page use the flat representation of datasets. This is the representation you see in Spark with dataset.show() or dataset.collectAsList().

This is because, in Spark, datasets are tightly coupled with the data: one dataset corresponds to one and only one table of values.

The goal of datasets in CoPPer is different: a dataset describes the business logic to apply to the data, but the user can still navigate the data and change it afterwards.

CoPPer translates the datasets into ActivePivot measures so that users benefit from the full capabilities of ActivePivot (multi-dimensional analysis, monitoring, what-if analysis, real-time updates). This is why the main way to explore a CoPPer dataset is an MDX query rather than a show() method.

In short: the flat representation is best for writing and understanding calculations on the data, while the multi-dimensional representation is better suited to exploring the result and getting the most out of it.

This is also why CoPPer datasets only exist in the context of an ActivePivot cube. They add measures and dimensions to this cube but need a scaffolding, which consists of:

the dimensions of the cube along which users want to explore their data
the description of the datastore this cube is built on top of
the description of the selection feeding the cube

For the next examples let's say we have:

A datastore with a single store called tweets, containing:

id	text	sender_id	likes	year	month	day
0	Hello World	0	23	2017	11	2
1	Lol	0	2	2017	12	14
2	Foo	0	0	2018	1	4
3	Test	1	0	2018	2	9
4	Hola	2	999	2018	3	14

A selection and a cube on top of this datastore:

StartBuilding.selection(datastoreDescription)
        .fromBaseStore("tweets")
        .withAllReachableFields()
        .build();

StartBuilding.cube("tweets")
        .withSingleLevelDimensions("sender_id")
        .withDimension("time")
            .withHierarchy("time")
                .withLevel("year")
                .withLevel("month")
                .withLevel("day")

Now let's write the simplest calculation with CoPPer and its corresponding cellSet:

Dataset likesSum = context
        .createDatasetFromFacts()
        .agg(Columns.sum("likes").as("likes.SUM"));

likesSum
        .toCellSet("SELECT [Measures].[likes.SUM] ON ROWS FROM [tweets]")

Produces

likes.SUM
1024

Two things are worth noting:

The name of the measure to use in the MDX query is the name of the column in your dataset.
The cube name to use in the query is the actual cube name from your description. Although CoPPer creates temporary cubes for the tests, with a test setup (no query time limit, a Just In Time aggregate provider), the cube name stays the same, so queries taken from your bookmarks or from ActiveUI can be reused as they are.

We can actually try other queries and see their corresponding cellSets:

The code

likesSum.toCellSet("SELECT "
        + "[Measures].[likes.SUM] ON COLUMNS, "
        + "[sender_id].[sender_id].[sender_id] ON ROWS FROM [tweets]")

Produces

	likes.SUM
0	25
1	0
2	999
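The figures in the last two cellSets can be cross-checked by hand against the five rows of the tweets store. The following is a minimal plain-Java sketch of that check. It does not use the CoPPer or ActivePivot API, and the class and method names (LikesBySender, grandTotal, bySender) are hypothetical, chosen only for this illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration only; none of these names belong to CoPPer or ActivePivot.
final class LikesBySender {

    // Only the two fields these queries need: sender_id and likes.
    record Tweet(int senderId, int likes) {}

    // The five rows of the tweets store.
    static final List<Tweet> TWEETS = List.of(
            new Tweet(0, 23), new Tweet(0, 2), new Tweet(0, 0),
            new Tweet(1, 0), new Tweet(2, 999));

    // Counterpart of likes.SUM queried without any dimension on the axes.
    static int grandTotal() {
        return TWEETS.stream().mapToInt(Tweet::likes).sum();
    }

    // Counterpart of likes.SUM with the sender_id level on rows: one sum per member.
    static Map<Integer, Integer> bySender() {
        return TWEETS.stream().collect(Collectors.groupingBy(
                Tweet::senderId, TreeMap::new, Collectors.summingInt(Tweet::likes)));
    }

    public static void main(String[] args) {
        System.out.println(grandTotal()); // 1024
        System.out.println(bySender());   // {0=25, 1=0, 2=999}
    }
}
```

The same aggregation function (a sum of likes) yields both results; only the grouping requested by the query differs.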

The code

likesSum.toCellSet("SELECT "
        + "[time].[time].[year] ON COLUMNS, "
        + "[sender_id].[sender_id].[sender_id] ON ROWS "
        + "FROM [tweets] WHERE [Measures].[likes.SUM]")

Produces

	2017	2018
0	25	0
1		0
2		999
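Note that this cellSet distinguishes an empty cell (sender 1 has no tweet in 2017) from a cell whose value is 0 (sender 0 has one tweet in 2018, with no likes). A minimal plain-Java sketch of the same cross-tabulation makes the distinction explicit. It is an illustration only, not CoPPer API, and the names LikesBySenderAndYear and crossTab are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names belong to CoPPer or ActivePivot.
final class LikesBySenderAndYear {

    record Tweet(int senderId, int likes, int year) {}

    // The five rows of the tweets store, reduced to the fields used here.
    static final List<Tweet> TWEETS = List.of(
            new Tweet(0, 23, 2017), new Tweet(0, 2, 2017), new Tweet(0, 0, 2018),
            new Tweet(1, 0, 2018), new Tweet(2, 999, 2018));

    // sender_id -> (year -> sum of likes).
    // A missing inner key plays the role of an empty cell: no fact contributes to it.
    static Map<Integer, Map<Integer, Integer>> crossTab() {
        Map<Integer, Map<Integer, Integer>> table = new TreeMap<>();
        for (Tweet t : TWEETS) {
            table.computeIfAbsent(t.senderId(), k -> new TreeMap<>())
                 .merge(t.year(), t.likes(), Integer::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        // {0={2017=25, 2018=0}, 1={2018=0}, 2={2018=999}}
        System.out.println(crossTab());
    }
}
```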

The code

likesSum.toCellSet("SELECT" +
        "  NON EMPTY [Measures].[likes.SUM] ON COLUMNS" +
        "  FROM (" +
        "    SELECT" +
        "    TopCount(" +
        "      Filter(" +
        "        [sender_id].[sender_id].[sender_id].Members," +
        "        NOT IsEmpty(" +
        "          [Measures].[likes.SUM]" +
        "        )" +
        "      )," +
        "      1," +
        "      [Measures].[likes.SUM]" +
        "    ) ON COLUMNS" +
        "    FROM [tweets]" +
        "  )")

Produces

likes.SUM
999
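In this last query, the subselect keeps the senders for which likes.SUM is not empty, retains the single best one according to likes.SUM (TopCount with a count of 1), and the outer query then evaluates likes.SUM on that restricted cube. The steps can be sketched in plain Java as follows. This is an illustration under the assumption that ties do not matter, not CoPPer API, and the names TopSender and topSender are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Plain-Java illustration only; none of these names belong to CoPPer or ActivePivot.
final class TopSender {

    record Tweet(int senderId, int likes) {}

    // The five rows of the tweets store, reduced to the fields used here.
    static final List<Tweet> TWEETS = List.of(
            new Tweet(0, 23), new Tweet(0, 2), new Tweet(0, 0),
            new Tweet(1, 0), new Tweet(2, 999));

    static Optional<Map.Entry<Integer, Integer>> topSender() {
        // Filter(..., NOT IsEmpty(...)): only senders with at least one fact get an entry.
        Map<Integer, Integer> bySender = TWEETS.stream().collect(
                Collectors.groupingBy(Tweet::senderId, Collectors.summingInt(Tweet::likes)));
        // TopCount(..., 1, measure): keep the entry with the highest sum.
        return bySender.entrySet().stream().max(Map.Entry.comparingByValue());
    }

    public static void main(String[] args) {
        Map.Entry<Integer, Integer> top = topSender().orElseThrow();
        System.out.println(top.getKey());   // 2
        System.out.println(top.getValue()); // 999
    }
}
```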

This is the main difference between datasets created with Spark and with CoPPer. In both cases a dataset only describes the calculations to apply to the data, and those calculations are evaluated lazily. But:

in Spark, the calculation is tied to the data and can only produce a single result
in CoPPer, the calculations are published in an ActivePivot cube and can be queried with any MDX query, from multiple UIs, while benefiting from all the features of ActivePivot such as real-time updates, what-if simulations, and monitoring with ActiveMonitor.
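The idea of one calculation answering many queries can be sketched in plain Java: the calculation is defined once, and each query only chooses the location at which it is evaluated. This is a conceptual illustration, not the way CoPPer or ActivePivot is implemented, and the names OneCalculationManyQueries, Measure, LIKES_SUM and query are hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;

// Conceptual illustration only; none of these names belong to CoPPer or ActivePivot.
final class OneCalculationManyQueries {

    record Tweet(int senderId, int likes, int year) {}

    // The five rows of the tweets store, reduced to the fields used here.
    static final List<Tweet> TWEETS = List.of(
            new Tweet(0, 23, 2017), new Tweet(0, 2, 2017), new Tweet(0, 0, 2018),
            new Tweet(1, 0, 2018), new Tweet(2, 999, 2018));

    // A "measure" is a calculation that can be evaluated on any subset of the facts.
    interface Measure {
        int evaluate(List<Tweet> facts);
    }

    // Defined once, independently of any query.
    static final Measure LIKES_SUM = facts -> facts.stream().mapToInt(Tweet::likes).sum();

    // A "query" selects the facts at a location and evaluates the measure there.
    static int query(Measure measure, Predicate<Tweet> location) {
        return measure.evaluate(TWEETS.stream().filter(location).toList());
    }

    public static void main(String[] args) {
        System.out.println(query(LIKES_SUM, t -> true));                                 // 1024
        System.out.println(query(LIKES_SUM, t -> t.senderId() == 0));                    // 25
        System.out.println(query(LIKES_SUM, t -> t.year() == 2018));                     // 999
        System.out.println(query(LIKES_SUM, t -> t.senderId() == 0 && t.year() == 2017)); // 25
    }
}
```

One simplification to keep in mind: this sketch returns 0 for a location with no facts, whereas the cellSets above show such a location as an empty cell.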
