StreamSampling.jl

This package provides general methods to sample from any stream in a single pass through the data, even when the number of items in the stream is unknown.

This has some advantages over other sampling procedures:

  • If the iterable is lazy, the memory required is either a small constant or proportional to the size of the sample, rather than to the whole population.
  • With reservoir methods, at any point of the sampling process the collected sample is a random sample of the portion of the stream seen so far.
  • In some cases, sampling with the techniques implemented in this library can bring considerable performance gains, since the population of items does not need to be stored in memory beforehand.
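
To make the points above concrete, here is a minimal sketch of the classic reservoir technique (often called "Algorithm R") written in plain Julia. It is only an illustration under stated assumptions: the helper name reservoir_sketch is hypothetical and is not part of StreamSampling.jl, and the package's own methods may use different, more efficient algorithms. The sketch makes a single pass, never needs to know the length of the stream, and stores at most k items.

```julia
using Random

# Hypothetical helper for illustration only (not the StreamSampling.jl API):
# unweighted reservoir sampling without replacement, "Algorithm R".
function reservoir_sketch(rng::AbstractRNG, iter, k::Integer)
    reservoir = Vector{eltype(iter)}()
    sizehint!(reservoir, k)
    for (i, x) in enumerate(iter)
        if i <= k
            # The first k items fill the reservoir directly.
            push!(reservoir, x)
        else
            # Item i replaces a random slot with probability k / i.
            j = rand(rng, 1:i)
            if j <= k
                reservoir[j] = x
            end
        end
    end
    return reservoir
end

# Works on a lazy generator: only 5 items are held in memory at any time.
sample = reservoir_sketch(MersenneTwister(42), (x^2 for x in 1:10_000), 5)
```

After every iteration of the loop, the reservoir is itself a uniform random sample of the items seen so far, which is the property described in the second point of the list.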

For information about the available functionality, consult the documentation.

Contributing

Contributions are welcome! If you encounter any issues, have suggestions for improvements, or would like to add new features, feel free to open an issue or submit a pull request.

Overview of the functionalities

The itsample function consumes the whole stream at once and returns the collected sample:

julia> using StreamSampling

julia> st = 1:100;

julia> itsample(st, 5)
5-element Vector{Int64}:
  9
 15
 52
 96
 91

In some cases, you need to control when a ReservoirSample is updated. In that case, use the fit! function to update the reservoir one element at a time, and value to retrieve the current sample:

julia> using StreamSampling

julia> st = 1:100;

julia> rs = ReservoirSample{Int}(5);

julia> for x in st
           fit!(rs, x)
       end

julia> value(rs)
5-element Vector{Int64}:
  7
  9
 20
 49
 74

If the total number of elements in the stream is known beforehand and the sampling is unweighted, it is also possible to iterate over a StreamSample:

julia> using StreamSampling

julia> st = 1:100;

julia> ss = StreamSample{Int}(st, 5, 100);

julia> r = Int[];

julia> for x in ss
           push!(r, x)
       end

julia> r
5-element Vector{Int64}:
 10
 22
 26
 35
 75

The advantage of StreamSample iterators over ReservoirSample is that they require O(1) memory if the sample is not collected, while reservoir techniques require O(k) memory, where k is the number of elements in the sample.
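
One way to see why O(1) memory is possible when the stream length is known is selection sampling (Knuth's "Algorithm S"), sketched below in plain Julia. This is an assumption-laden illustration, not a description of how StreamSample is implemented: the helper name selection_sketch and its callback-based interface are hypothetical. Each item is accepted with probability (items still needed) / (items still remaining), so accepted items can be handed to the caller immediately and nothing has to be buffered.

```julia
using Random

# Hypothetical helper for illustration only (not the StreamSampling.jl API):
# selection sampling of k items from a stream of known length N.
function selection_sketch(f, rng::AbstractRNG, iter, k::Integer, N::Integer)
    needed = k        # items still to be selected
    remaining = N     # items not yet inspected
    for x in iter
        needed == 0 && break
        # Accept x with probability needed / remaining.
        if rand(rng) * remaining < needed
            f(x)      # hand the item to the caller; no buffer is kept
            needed -= 1
        end
        remaining -= 1
    end
    return nothing
end

# Process each sampled element as soon as it is selected.
selection_sketch(MersenneTwister(42), 1:100, 5, 100) do x
    println(x)
end
```

Only two counters are kept, regardless of k. In this sketch the selected elements are emitted in stream order, which is the trade-off of this approach compared with a reservoir: the full sample only exists if the caller chooses to collect it.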

Consult the API page for more information about the package interface.