Class ReadOnlySparseVector
- java.lang.Object
-
- com.qfs.vector.impl.AVector
-
- com.qfs.vector.impl.AReadOnlyVector
-
- com.qfs.vector.impl.ADelegateReadOnlyVector
-
- com.activeviam.risk.core.vector.impl.ReadOnlySparseVector
-
- All Implemented Interfaces:
com.qfs.chunk.IArray
,com.qfs.chunk.IArrayReader
,com.qfs.chunk.IArrayWriter
,com.qfs.chunk.IWritableArray
,com.qfs.monitoring.memory.IMemoryMonitored
,com.qfs.vector.IDelegateVector
,com.qfs.vector.IReleasableVector
,com.qfs.vector.IVector
,Serializable
- Direct Known Subclasses:
ReadOnlySparseVector.SparseRandomAccessVector
public class ReadOnlySparseVector extends com.qfs.vector.impl.ADelegateReadOnlyVector
Read-onlyIVector
implementation of a view of an underlyingIVector
who contains a special, compact, representation of an allegedly-sparse vector.A "sparse" vector is a vector where a given single value (so called "implicit value") appears many time in the vector. The more times such value appears the more sparse the vector is considered. The other values in the vector are called the "explicit values".
From an external (caller) code perspective, this read-only vector implementation provides natural access to the original vector values w/o requiring any knowledge of the internal compact representation of those values.
The point of using such implementation as opposed to a naive "dense" representation of the original vector is to spare memory, since the internal compact representation gives us a memory footprint that is very close to the footprint of the explicit values only as long as the original vector size is greater than 100.
The overhead of the compact representation is so small that there very little chance that we end up consuming more memory than if using a naive representation unless the original vector is very dense. E.g., for original vector size = 100, we will consume more only if explicit values account for more than 96% of all values. For size = 10, this will happen if explicit values account for more than 70% of all values.
A good formula to estimate memory consumption by the internal representation (S values) vs. the naive representation (N values), considering a "density ratio" = # explicit values / # values, is:
S/N = density ratio + 0.016 + 3/N
.One must use
generateSparseVectorInternalRepresentation(double[], double)
to create an adequate compact representation, based on a given vanilla (expanded) representation of any vector (allegedly sparse). Alternatively, one may usegenerateSparseVectorInternalRepresentation(int, double, int[], double[])
to generate such compact representation. The compact representation can then be given to ActivePivot for transfer into an off-heap vector (custom code shall not create the off-heap vector directly because it doesn't know on which NUMA node to locate it). Lastly, once ActivePivot has created the proper off-heap vector, one must use custom code to wrap that vector into an instance ofReadOnlySparseVector
. This is what guarantees, ultimately, that other components can access the sparse vector content transparently (as per usual vector API), w/o having to know anything about the specifics of the internal compact representation at work.Hence, a typical pattern for loading sparse vectors in ActivePivot would be:
- Get vanilla (expanded) vector representation from source.
- Feed to Datastore (in a vector field) the internal (compact) representation of that vector obtained via
generateSparseVectorInternalRepresentation(double[], double)
(... or, alternatively, feed the vanilla representation and use insertion-time update-where procedure to do the conversion later on). - Use a commit-time update-where procedure to retrieve the
vector
automatically created by the core product (which now wraps thedouble[]
that contains the internal compact representation) in the vector field value, and then wrap it into aReadOnlySparseVector
instance and set this as the new the vector field value.
The internal, compact, representation consists in several
double
values, where:- The first
double
value must beinterpreted
as along
value that represents the packing of twoint
values, and where:- The first
int
value is the size of the original vector ('vanilla' representation). - The second
int
value is the size (in 'words') of the bitmap that tracks implicit values.
- The first
- The second
double
value is the implicit value used to consider sparsity of the original vector. - The B next
double
values must beinterpreted
aslong
values that constitute the bitmap words to track the positioning of implicit values in the original vector. - The remaining
double
values are the explicit values from the original sparse vector.
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
ReadOnlySparseVector.ReadOnlySparseVectorBinding
Custom binding to use more efficient custom ReadOnlySparseVector methods for aggregation methods: plus, minus and copyFrom.protected static class
ReadOnlySparseVector.SparseRandomAccessVector
Random access version of aReadOnlySparseVector
.
-
Constructor Summary
Constructors Constructor Description ReadOnlySparseVector(com.qfs.vector.IVector underlying)
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description double
average()
Mean of a vector contains the arithmetic average of its components.com.qfs.util.IPrimitiveIterator
bottomK(int k)
Returns aniterator
of a collection composed of the k smallest elements of the vector.int[]
bottomKIndices(int kArg)
Return the k indices of the k smallest elements of the vector sorted in ascending order.void
copyTo(double[] dst)
static double[]
generateSparseVectorInternalRepresentation(double[] vector, double implicitValue)
Generate an internal compact representation of a (allegedly) sparse vector.static double[]
generateSparseVectorInternalRepresentation(int vectorSize, double implicitValue, int[] explicitValueIndexes, double[] explicitValues)
Generate an internal compact representation of a (allegedly) sparse vector.int
getBindingId()
int
getComponentType()
double[]
getDoubleArray()
Will return the full double array containing whole SparseVector.protected int
nearestRank(int length, double r)
Convenient method for quantile using the Nearest Rank definition of quantiledouble
quantileDouble(double r)
Compute the quantile of orderr
using the Nearest Rank definition of quantile with rounding.int
quantileIndex(double r)
Compute the quantile index of orderr
using the Nearest Rank definition of quantile with rounding.Object
read(int position)
double
readDouble(int position)
int
size()
com.qfs.vector.ITransientVector
sort()
com.qfs.vector.IVector
subVector(int start, int length)
com.qfs.util.IPrimitiveIterator
topK(int k)
Returns aniterator
of a collection composed of the k biggest elements of the vector.int[]
topKIndices(int kArg)
Return the k indices of the k biggest elements of the vector sorted in ascending order.com.qfs.vector.IRandomAccessVector
toRandomAccessVector()
double
variance()
Variance of a vector contains the arithmetic variance of its components.-
Methods inherited from class com.qfs.vector.impl.ADelegateReadOnlyVector
acquireReference, cloneOnHeap, collect, getMemoryStatistic, getUnderlying, hashCode, releaseReference
-
Methods inherited from class com.qfs.vector.impl.AReadOnlyVector
addDouble, addFloat, addInt, addLong, copyFrom, divide, divide, minus, minusNegativeValues, minusPositiveValues, plus, plusNegativeValues, plusPositiveValues, scale, scale, scale, scale, translate, translate, translate, translate, write, writeBoolean, writeDouble, writeFloat, writeInt, writeLong
-
Methods inherited from class com.qfs.vector.impl.AVector
applyAsDouble, applyAsFloat, applyAsInt, applyAsLong, checkIndex, checkIndex, copyFrom, copyFrom, copyFrom, copyFrom, copyFrom, copyTo, copyTo, copyTo, copyTo, equals, isNull, quantileFloat, quantileInt, quantileLong, readBoolean, readFloat, readInt, readLong, sumDouble, sumFloat, sumInt, sumLong, toArray, toDoubleArray, toFloatArray, toIntArray, toLongArray, toString, toString
-
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface com.qfs.chunk.IArrayReader
isNull, readBoolean, readFloat, readInt, readLong
-
Methods inherited from interface com.qfs.chunk.IArrayWriter
addDouble, addFloat, addInt, addLong, writeBoolean, writeDouble, writeFloat, writeInt, writeLong
-
Methods inherited from interface com.qfs.vector.IVector
applyAsDouble, applyAsFloat, applyAsInt, applyAsLong, copyFrom, copyFrom, copyFrom, copyFrom, copyFrom, copyFrom, copyTo, copyTo, copyTo, copyTo, divide, divide, equals, minus, minusNegativeValues, minusPositiveValues, plus, plusNegativeValues, plusPositiveValues, quantileFloat, quantileInt, quantileLong, scale, scale, scale, scale, sumDouble, sumFloat, sumInt, sumLong, toArray, toDoubleArray, toFloatArray, toIntArray, toLongArray, translate, translate, translate, translate, write
-
-
-
-
Method Detail
-
generateSparseVectorInternalRepresentation
public static double[] generateSparseVectorInternalRepresentation(double[] vector, double implicitValue)
Generate an internal compact representation of a (allegedly) sparse vector.We use
Double.compare(double, double)
to judge whether or not a given vector value is equal to the specifiedimplicitValue
.What is returned by this utility method is expected to be wrapped into a
vector
ofdouble
elements, which itself must be wrapped into aReadOnlySparseVector
.Refer to
ReadOnlySparseVector
class Javadoc for more details about the internal representation format and a typical design pattern for loading such data in ActivePivot.- Parameters:
vector
- The vanilla (expanded) representation of the vector.implicitValue
- The value to be considered as implicit when looking at vector sparsity.- Returns:
- Internal compact representation of the (allegedly) sparse vector.
-
generateSparseVectorInternalRepresentation
public static double[] generateSparseVectorInternalRepresentation(int vectorSize, double implicitValue, int[] explicitValueIndexes, double[] explicitValues)
Generate an internal compact representation of a (allegedly) sparse vector.What is returned by this utility method is expected to be wrapped into a
vector
ofdouble
elements, which itself must be wrapped into aReadOnlySparseVector
.Refer to
ReadOnlySparseVector
class Javadoc for more details about the internal representation format and a typical design pattern for loading such data in ActivePivot.- Parameters:
vectorSize
- The size of the vector.implicitValue
- The value to be considered as implicit when looking at vector sparsity.explicitValueIndexes
- The indexes, by ascending order, where we find explicit values. Might benull
if no explicit value at all.explicitValues
- The explicit values (ordered consistently w.r.t.explicitValueIndexes
). Might benull
if no explicit value at all.- Returns:
- Internal compact representation of the (allegedly) sparse vector.
-
getComponentType
public int getComponentType()
- Specified by:
getComponentType
in interfacecom.qfs.vector.IVector
- Overrides:
getComponentType
in classcom.qfs.vector.impl.ADelegateReadOnlyVector
-
getBindingId
public int getBindingId()
-
size
public int size()
- Specified by:
size
in interfacecom.qfs.vector.IVector
- Overrides:
size
in classcom.qfs.vector.impl.ADelegateReadOnlyVector
-
read
public Object read(int position)
- Specified by:
read
in interfacecom.qfs.chunk.IArrayReader
- Overrides:
read
in classcom.qfs.vector.impl.AVector
-
readDouble
public double readDouble(int position)
- Specified by:
readDouble
in interfacecom.qfs.chunk.IArrayReader
- Overrides:
readDouble
in classcom.qfs.vector.impl.AVector
-
toRandomAccessVector
public com.qfs.vector.IRandomAccessVector toRandomAccessVector()
-
subVector
public com.qfs.vector.IVector subVector(int start, int length)
-
copyTo
public void copyTo(double[] dst)
- Specified by:
copyTo
in interfacecom.qfs.vector.IVector
- Overrides:
copyTo
in classcom.qfs.vector.impl.AVector
-
sort
public com.qfs.vector.ITransientVector sort()
-
topK
public com.qfs.util.IPrimitiveIterator topK(int k)
Returns aniterator
of a collection composed of the k biggest elements of the vector. The iterator iterates over the k biggest elements from the smallest element to the biggest element. The algorithm used for this version of topK is a dynamic based algorithm. Because most of the vector is expected to be the implicit value we only need to iterate over the explicit values to find the topK. We should expect the performance of this algorithm to be O(e + k). e is the total count of explicits that exist in this sparse vector and k is the value passed as a parameter for the number of elements to return. If you expect that your implicits will be smaller then your "bottom" most topK then you can also expect this algorithm to just have a performance of O(e). This is because we skip iterating over implicit values entirely in this case. Additionally some temporary memory is used in this algorithm. The memory used will be of size O(k) where k is the value passed as a parameter for the number of elements to return.- Parameters:
k
- The number of elements to return.- Returns:
- An iterator over the k biggest elements.
-
topKIndices
public int[] topKIndices(int kArg)
Return the k indices of the k biggest elements of the vector sorted in ascending order. The algorithm used for this version of topKIndices is a dynamic based algorithm. One of the challenges of finding indices for values, implicit or explicit, is that the only way to know what index a value originally was at is to query the internal bitmap of the underlying array. This requires the sparse vector to iterate through the bitmap to find the necessary explicit and implicit indexes for values. Because this algorithm finds the original indexes for values in the bitmap the performance size of this algorithm is O(n) where n is the original size of the vector uncompressed. However it should be noted that is the worst case scenario performance and can only be achieved in a few ways. One such way is if k = n then o(n) will always be expected. Additionally some temporary memory is used in this algorithm. The memory used will be of size O(k) where k is the value passed as a parameter for the number of elements to return.- Parameters:
kArg
- The number of elements to return.- Returns:
- The k indices of the k biggest elements.
-
bottomK
public com.qfs.util.IPrimitiveIterator bottomK(int k)
Returns aniterator
of a collection composed of the k smallest elements of the vector. The iterator iterates over the k smallest elements from the biggest element to the smallest element. The algorithm used for this version of bottomK is a dynamic based algorithm. Because most of the vector is expected to be the implicit value we only need to iterate over the explicit values to find the topK. We should expect the performance of this algorithm to be O(e + k). e is the total count of explicits that exist in this sparse vector and k is the value passed as a parameter for the number of elements to return. If you expect that your implicits will be greater then your "top" most bottomK then you can also expect this algorithm to just have a performance of O(e). This is because we skip iterating over implicit values entirely in this case. Additionally some temporary memory is used in this algorithm. The memory used will be of size O(k) where k is the value passed as a parameter for the number of elements to return.- Parameters:
k
- The number of elements to return.- Returns:
- An iterator over the k smallest elements.
-
bottomKIndices
public int[] bottomKIndices(int kArg)
Return the k indices of the k smallest elements of the vector sorted in ascending order. The algorithm used for this version of bottomKIndices is a dynamic based algorithm. One of the challenges of finding indices for values, implicit or explicit, is that the only way to know what index a value originally was at is to query the internal bitmap of the underlying array. This requires the sparse vector to iterate through the bitmap to find the necessary explicit and implicit indexes for values. Because this algorithm finds the original indexes for values in the bitmap the performance size of this algorithm is O(n) where n is the original size of the vector uncompressed. However it should be noted that is the worst case scenario performance and can only be achieved in a few ways. One such way is if k = n then o(n) will always be expected. Additionally some temporary memory is used in this algorithm. The memory used will be of size O(k) where k is the value passed as a parameter for the number of elements to return.- Parameters:
kArg
- The number of elements to return.- Returns:
- The k indices of the k smallest elements.
-
quantileDouble
public double quantileDouble(double r)
Compute the quantile of orderr
using the Nearest Rank definition of quantile with rounding.For instance, if
This method is only supported for vectors that can retrieve their content asr = 1/2
, the function returns the median. Forr = 1/4
, the first quartile. Forr = 3/4
, the third quartile.double
without information loss. This implementation for the SparseVector usesbottomK
ortopK
depending on if the rank is higher, equal or less then 0.5. The performance growth of this method is equal to the performance of either of those methods though we never have to iterate more then half of the entire underlying array in practice.- Specified by:
quantileDouble
in interfacecom.qfs.vector.IVector
- Overrides:
quantileDouble
in classcom.qfs.vector.impl.AVector
- Parameters:
r
- the order of the quantile, a double in]0.0, 1.0]
- Returns:
- the quantile of order
r
-
quantileIndex
public int quantileIndex(double r)
Compute the quantile index of orderr
using the Nearest Rank definition of quantile with rounding.For instance, if
This method is only supported for vectors that can retrieve their content asr = 1/2
, the function returns the median. Forr = 1/4
, the first quartile. Forr = 3/4
, the third quartile.double
without information loss. This implementation for the SparseVector usesbottomKIndices
ortopKIndices
depending on if the rank is higher, equal or less then 0.5. The performance growth of this method is equal to the performance of either of those methods though we never have to iterate more then half of the entire underlying array in practice.- Parameters:
r
- the order of the quantile, a double in]0.0, 1.0]
- Returns:
- the quantile of order
r
-
nearestRank
protected int nearestRank(int length, double r)
Convenient method for quantile using the Nearest Rank definition of quantile- Parameters:
length
- The length of the vector.r
- The radix, as a number between 0 and 1.- Returns:
- The nearest rank.
-
average
public double average()
Mean of a vector contains the arithmetic average of its components. The algorithm used for this version of average is a simple iterative based algorithm. The performance size of this algorithm should be expected to be θ(e) where e is the total count of explicits that exist in this sparse vector.- Specified by:
average
in interfacecom.qfs.vector.IVector
- Overrides:
average
in classcom.qfs.vector.impl.ADelegateReadOnlyVector
- Returns:
- the mean of the vector
-
variance
public double variance()
Variance of a vector contains the arithmetic variance of its components. The algorithm used for this version of average is a Shifted Data based algorithm. In such an algorithm we take a value that should be close to the mean (the implicit value) and shift all of our explicit values before summing or squaring them. The larger our vector is the closer to the true variance this algorithm becomes. The performance size of this algorithm should be expected to be θ(e) where e is the total count of explicits that exist in this sparse vector.- Returns:
- the variance of the vector
-
getDoubleArray
public double[] getDoubleArray()
Will return the full double array containing whole SparseVector. This is more efficient than calling getDouble for every value in the vector.- Returns:
- double array containing whole sparse vector
-
-