Class ReadOnlySparseVector

  • All Implemented Interfaces:
    com.qfs.chunk.IArray, com.qfs.chunk.IArrayReader, com.qfs.chunk.IArrayWriter, com.qfs.chunk.IWritableArray, com.qfs.monitoring.memory.IMemoryMonitored, com.qfs.vector.IDelegateVector, com.qfs.vector.IReleasableVector, com.qfs.vector.IVector, Serializable
    Direct Known Subclasses:
    ReadOnlySparseVector.SparseRandomAccessVector

    public class ReadOnlySparseVector
    extends com.qfs.vector.impl.ADelegateReadOnlyVector
    Read-only IVector implementation of a view of an underlying IVector who contains a special, compact, representation of an allegedly-sparse vector.

    A "sparse" vector is a vector where a given single value (so called "implicit value") appears many time in the vector. The more times such value appears the more sparse the vector is considered. The other values in the vector are called the "explicit values".

    From an external (caller) code perspective, this read-only vector implementation provides natural access to the original vector values w/o requiring any knowledge of the internal compact representation of those values.

    The point of using such implementation as opposed to a naive "dense" representation of the original vector is to spare memory, since the internal compact representation gives us a memory footprint that is very close to the footprint of the explicit values only as long as the original vector size is greater than 100.

    The overhead of the compact representation is so small that there very little chance that we end up consuming more memory than if using a naive representation unless the original vector is very dense. E.g., for original vector size = 100, we will consume more only if explicit values account for more than 96% of all values. For size = 10, this will happen if explicit values account for more than 70% of all values.

    A good formula to estimate memory consumption by the internal representation (S values) vs. the naive representation (N values), considering a "density ratio" = # explicit values / # values, is: S/N = density ratio + 0.016 + 3/N.

    One must use generateSparseVectorInternalRepresentation(double[], double) to create an adequate compact representation, based on a given vanilla (expanded) representation of any vector (allegedly sparse). Alternatively, one may use generateSparseVectorInternalRepresentation(int, double, int[], double[]) to generate such compact representation. The compact representation can then be given to ActivePivot for transfer into an off-heap vector (custom code shall not create the off-heap vector directly because it doesn't know on which NUMA node to locate it). Lastly, once ActivePivot has created the proper off-heap vector, one must use custom code to wrap that vector into an instance of ReadOnlySparseVector. This is what guarantees, ultimately, that other components can access the sparse vector content transparently (as per usual vector API), w/o having to know anything about the specifics of the internal compact representation at work.

    Hence, a typical pattern for loading sparse vectors in ActivePivot would be:

    1. Get vanilla (expanded) vector representation from source.
    2. Feed to Datastore (in a vector field) the internal (compact) representation of that vector obtained via generateSparseVectorInternalRepresentation(double[], double) (... or, alternatively, feed the vanilla representation and use insertion-time update-where procedure to do the conversion later on).
    3. Use a commit-time update-where procedure to retrieve the vector automatically created by the core product (which now wraps the double[] that contains the internal compact representation) in the vector field value, and then wrap it into a ReadOnlySparseVector instance and set this as the new the vector field value.

    The internal, compact, representation consists in several double values, where:

    • The first double value must be interpreted as a long value that represents the packing of two int values, and where:
      • The first int value is the size of the original vector ('vanilla' representation).
      • The second int value is the size (in 'words') of the bitmap that tracks implicit values.
    • The second double value is the implicit value used to consider sparsity of the original vector.
    • The B next double values must be interpreted as long values that constitute the bitmap words to track the positioning of implicit values in the original vector.
    • The remaining double values are the explicit values from the original sparse vector.
    See Also:
    Serialized Form
    • Field Summary

      • Fields inherited from class com.qfs.vector.impl.ADelegateReadOnlyVector

        underlying
    • Constructor Summary

      Constructors 
      Constructor Description
      ReadOnlySparseVector​(com.qfs.vector.IVector underlying)  
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      double average()
      Mean of a vector contains the arithmetic average of its components.
      com.qfs.util.IPrimitiveIterator bottomK​(int k)
      Returns an iterator of a collection composed of the k smallest elements of the vector.
      int[] bottomKIndices​(int kArg)
      Return the k indices of the k smallest elements of the vector sorted in ascending order.
      void copyTo​(double[] dst)  
      static double[] generateSparseVectorInternalRepresentation​(double[] vector, double implicitValue)
      Generate an internal compact representation of a (allegedly) sparse vector.
      static double[] generateSparseVectorInternalRepresentation​(int vectorSize, double implicitValue, int[] explicitValueIndexes, double[] explicitValues)
      Generate an internal compact representation of a (allegedly) sparse vector.
      int getBindingId()  
      int getComponentType()  
      double[] getDoubleArray()
      Will return the full double array containing whole SparseVector.
      protected int nearestRank​(int length, double r)
      Convenient method for quantile using the Nearest Rank definition of quantile
      double quantileDouble​(double r)
      Compute the quantile of order r using the Nearest Rank definition of quantile with rounding.
      int quantileIndex​(double r)
      Compute the quantile index of order r using the Nearest Rank definition of quantile with rounding.
      Object read​(int position)  
      double readDouble​(int position)  
      int size()  
      com.qfs.vector.ITransientVector sort()  
      com.qfs.vector.IVector subVector​(int start, int length)  
      com.qfs.util.IPrimitiveIterator topK​(int k)
      Returns an iterator of a collection composed of the k biggest elements of the vector.
      int[] topKIndices​(int kArg)
      Return the k indices of the k biggest elements of the vector sorted in ascending order.
      com.qfs.vector.IRandomAccessVector toRandomAccessVector()  
      double variance()
      Variance of a vector contains the arithmetic variance of its components.
      • Methods inherited from class com.qfs.vector.impl.ADelegateReadOnlyVector

        acquireReference, cloneOnHeap, collect, getMemoryStatistic, getUnderlying, hashCode, releaseReference
      • Methods inherited from class com.qfs.vector.impl.AReadOnlyVector

        addDouble, addFloat, addInt, addLong, copyFrom, divide, divide, minus, minusNegativeValues, minusPositiveValues, plus, plusNegativeValues, plusPositiveValues, scale, scale, scale, scale, translate, translate, translate, translate, write, writeBoolean, writeDouble, writeFloat, writeInt, writeLong
      • Methods inherited from class com.qfs.vector.impl.AVector

        applyAsDouble, applyAsFloat, applyAsInt, applyAsLong, checkIndex, checkIndex, copyFrom, copyFrom, copyFrom, copyFrom, copyFrom, copyTo, copyTo, copyTo, copyTo, equals, isNull, quantileFloat, quantileInt, quantileLong, readBoolean, readFloat, readInt, readLong, sumDouble, sumFloat, sumInt, sumLong, toArray, toDoubleArray, toFloatArray, toIntArray, toLongArray, toString, toString
      • Methods inherited from interface com.qfs.chunk.IArrayReader

        isNull, readBoolean, readFloat, readInt, readLong
      • Methods inherited from interface com.qfs.chunk.IArrayWriter

        addDouble, addFloat, addInt, addLong, writeBoolean, writeDouble, writeFloat, writeInt, writeLong
      • Methods inherited from interface com.qfs.vector.IVector

        applyAsDouble, applyAsFloat, applyAsInt, applyAsLong, copyFrom, copyFrom, copyFrom, copyFrom, copyFrom, copyFrom, copyTo, copyTo, copyTo, copyTo, divide, divide, equals, minus, minusNegativeValues, minusPositiveValues, plus, plusNegativeValues, plusPositiveValues, quantileFloat, quantileInt, quantileLong, scale, scale, scale, scale, sumDouble, sumFloat, sumInt, sumLong, toArray, toDoubleArray, toFloatArray, toIntArray, toLongArray, translate, translate, translate, translate, write
    • Constructor Detail

      • ReadOnlySparseVector

        public ReadOnlySparseVector​(com.qfs.vector.IVector underlying)
    • Method Detail

      • generateSparseVectorInternalRepresentation

        public static double[] generateSparseVectorInternalRepresentation​(double[] vector,
                                                                          double implicitValue)
        Generate an internal compact representation of a (allegedly) sparse vector.

        We use Double.compare(double, double) to judge whether or not a given vector value is equal to the specified implicitValue.

        What is returned by this utility method is expected to be wrapped into a vector of double elements, which itself must be wrapped into a ReadOnlySparseVector.

        Refer to ReadOnlySparseVector class Javadoc for more details about the internal representation format and a typical design pattern for loading such data in ActivePivot.

        Parameters:
        vector - The vanilla (expanded) representation of the vector.
        implicitValue - The value to be considered as implicit when looking at vector sparsity.
        Returns:
        Internal compact representation of the (allegedly) sparse vector.
      • generateSparseVectorInternalRepresentation

        public static double[] generateSparseVectorInternalRepresentation​(int vectorSize,
                                                                          double implicitValue,
                                                                          int[] explicitValueIndexes,
                                                                          double[] explicitValues)
        Generate an internal compact representation of a (allegedly) sparse vector.

        What is returned by this utility method is expected to be wrapped into a vector of double elements, which itself must be wrapped into a ReadOnlySparseVector.

        Refer to ReadOnlySparseVector class Javadoc for more details about the internal representation format and a typical design pattern for loading such data in ActivePivot.

        Parameters:
        vectorSize - The size of the vector.
        implicitValue - The value to be considered as implicit when looking at vector sparsity.
        explicitValueIndexes - The indexes, by ascending order, where we find explicit values. Might be null if no explicit value at all.
        explicitValues - The explicit values (ordered consistently w.r.t. explicitValueIndexes). Might be null if no explicit value at all.
        Returns:
        Internal compact representation of the (allegedly) sparse vector.
      • getComponentType

        public int getComponentType()
        Specified by:
        getComponentType in interface com.qfs.vector.IVector
        Overrides:
        getComponentType in class com.qfs.vector.impl.ADelegateReadOnlyVector
      • getBindingId

        public int getBindingId()
      • size

        public int size()
        Specified by:
        size in interface com.qfs.vector.IVector
        Overrides:
        size in class com.qfs.vector.impl.ADelegateReadOnlyVector
      • read

        public Object read​(int position)
        Specified by:
        read in interface com.qfs.chunk.IArrayReader
        Overrides:
        read in class com.qfs.vector.impl.AVector
      • readDouble

        public double readDouble​(int position)
        Specified by:
        readDouble in interface com.qfs.chunk.IArrayReader
        Overrides:
        readDouble in class com.qfs.vector.impl.AVector
      • toRandomAccessVector

        public com.qfs.vector.IRandomAccessVector toRandomAccessVector()
      • subVector

        public com.qfs.vector.IVector subVector​(int start,
                                                int length)
      • copyTo

        public void copyTo​(double[] dst)
        Specified by:
        copyTo in interface com.qfs.vector.IVector
        Overrides:
        copyTo in class com.qfs.vector.impl.AVector
      • sort

        public com.qfs.vector.ITransientVector sort()
      • topK

        public com.qfs.util.IPrimitiveIterator topK​(int k)
        Returns an iterator of a collection composed of the k biggest elements of the vector. The iterator iterates over the k biggest elements from the smallest element to the biggest element. The algorithm used for this version of topK is a dynamic based algorithm. Because most of the vector is expected to be the implicit value we only need to iterate over the explicit values to find the topK. We should expect the performance of this algorithm to be O(e + k). e is the total count of explicits that exist in this sparse vector and k is the value passed as a parameter for the number of elements to return. If you expect that your implicits will be smaller then your "bottom" most topK then you can also expect this algorithm to just have a performance of O(e). This is because we skip iterating over implicit values entirely in this case. Additionally some temporary memory is used in this algorithm. The memory used will be of size O(k) where k is the value passed as a parameter for the number of elements to return.
        Parameters:
        k - The number of elements to return.
        Returns:
        An iterator over the k biggest elements.
      • topKIndices

        public int[] topKIndices​(int kArg)
        Return the k indices of the k biggest elements of the vector sorted in ascending order. The algorithm used for this version of topKIndices is a dynamic based algorithm. One of the challenges of finding indices for values, implicit or explicit, is that the only way to know what index a value originally was at is to query the internal bitmap of the underlying array. This requires the sparse vector to iterate through the bitmap to find the necessary explicit and implicit indexes for values. Because this algorithm finds the original indexes for values in the bitmap the performance size of this algorithm is O(n) where n is the original size of the vector uncompressed. However it should be noted that is the worst case scenario performance and can only be achieved in a few ways. One such way is if k = n then o(n) will always be expected. Additionally some temporary memory is used in this algorithm. The memory used will be of size O(k) where k is the value passed as a parameter for the number of elements to return.
        Parameters:
        kArg - The number of elements to return.
        Returns:
        The k indices of the k biggest elements.
      • bottomK

        public com.qfs.util.IPrimitiveIterator bottomK​(int k)
        Returns an iterator of a collection composed of the k smallest elements of the vector. The iterator iterates over the k smallest elements from the biggest element to the smallest element. The algorithm used for this version of bottomK is a dynamic based algorithm. Because most of the vector is expected to be the implicit value we only need to iterate over the explicit values to find the topK. We should expect the performance of this algorithm to be O(e + k). e is the total count of explicits that exist in this sparse vector and k is the value passed as a parameter for the number of elements to return. If you expect that your implicits will be greater then your "top" most bottomK then you can also expect this algorithm to just have a performance of O(e). This is because we skip iterating over implicit values entirely in this case. Additionally some temporary memory is used in this algorithm. The memory used will be of size O(k) where k is the value passed as a parameter for the number of elements to return.
        Parameters:
        k - The number of elements to return.
        Returns:
        An iterator over the k smallest elements.
      • bottomKIndices

        public int[] bottomKIndices​(int kArg)
        Return the k indices of the k smallest elements of the vector sorted in ascending order. The algorithm used for this version of bottomKIndices is a dynamic based algorithm. One of the challenges of finding indices for values, implicit or explicit, is that the only way to know what index a value originally was at is to query the internal bitmap of the underlying array. This requires the sparse vector to iterate through the bitmap to find the necessary explicit and implicit indexes for values. Because this algorithm finds the original indexes for values in the bitmap the performance size of this algorithm is O(n) where n is the original size of the vector uncompressed. However it should be noted that is the worst case scenario performance and can only be achieved in a few ways. One such way is if k = n then o(n) will always be expected. Additionally some temporary memory is used in this algorithm. The memory used will be of size O(k) where k is the value passed as a parameter for the number of elements to return.
        Parameters:
        kArg - The number of elements to return.
        Returns:
        The k indices of the k smallest elements.
      • quantileDouble

        public double quantileDouble​(double r)
        Compute the quantile of order r using the Nearest Rank definition of quantile with rounding.

        For instance, if r = 1/2, the function returns the median. For r = 1/4, the first quartile. For r = 3/4, the third quartile.

        This method is only supported for vectors that can retrieve their content as double without information loss. This implementation for the SparseVector uses bottomK or topK depending on if the rank is higher, equal or less then 0.5. The performance growth of this method is equal to the performance of either of those methods though we never have to iterate more then half of the entire underlying array in practice.
        Specified by:
        quantileDouble in interface com.qfs.vector.IVector
        Overrides:
        quantileDouble in class com.qfs.vector.impl.AVector
        Parameters:
        r - the order of the quantile, a double in ]0.0, 1.0]
        Returns:
        the quantile of order r
      • quantileIndex

        public int quantileIndex​(double r)
        Compute the quantile index of order r using the Nearest Rank definition of quantile with rounding.

        For instance, if r = 1/2, the function returns the median. For r = 1/4, the first quartile. For r = 3/4, the third quartile.

        This method is only supported for vectors that can retrieve their content as double without information loss. This implementation for the SparseVector uses bottomKIndices or topKIndices depending on if the rank is higher, equal or less then 0.5. The performance growth of this method is equal to the performance of either of those methods though we never have to iterate more then half of the entire underlying array in practice.
        Parameters:
        r - the order of the quantile, a double in ]0.0, 1.0]
        Returns:
        the quantile of order r
      • nearestRank

        protected int nearestRank​(int length,
                                  double r)
        Convenient method for quantile using the Nearest Rank definition of quantile
        Parameters:
        length - The length of the vector.
        r - The radix, as a number between 0 and 1.
        Returns:
        The nearest rank.
      • average

        public double average()
        Mean of a vector contains the arithmetic average of its components. The algorithm used for this version of average is a simple iterative based algorithm. The performance size of this algorithm should be expected to be θ(e) where e is the total count of explicits that exist in this sparse vector.
        Specified by:
        average in interface com.qfs.vector.IVector
        Overrides:
        average in class com.qfs.vector.impl.ADelegateReadOnlyVector
        Returns:
        the mean of the vector
      • variance

        public double variance()
        Variance of a vector contains the arithmetic variance of its components. The algorithm used for this version of average is a Shifted Data based algorithm. In such an algorithm we take a value that should be close to the mean (the implicit value) and shift all of our explicit values before summing or squaring them. The larger our vector is the closer to the true variance this algorithm becomes. The performance size of this algorithm should be expected to be θ(e) where e is the total count of explicits that exist in this sparse vector.
        Returns:
        the variance of the vector
      • getDoubleArray

        public double[] getDoubleArray()
        Will return the full double array containing whole SparseVector. This is more efficient than calling getDouble for every value in the vector.
        Returns:
        double array containing whole sparse vector