Performance Tips

Use Immutable Reservoir Samplers

By default, a ReservoirSampler is mutable. An immutable version is also available and supports all the basic operations; it uses Accessors.jl under the hood to update the reservoir.

Let's compare the performance of mutable and immutable samplers with a simple benchmark:

using StreamSampling, BenchmarkTools

function fit_iter!(rs, iter)
    for i in iter
        rs = fit!(rs, i) # the reassignment is necessary when `rs` is immutable
    end
    return rs
end

iter = 1:10^7;
1:10000000

Running it with both versions, we get:

@btime fit_iter!(rs, $iter) setup=(rs = ReservoirSampler{Int}(10, AlgRSWRSKIP(); mutable = true))
  6.535 ms (2 allocations: 144 bytes)
@btime fit_iter!(rs, $iter) setup=(rs = ReservoirSampler{Int}(10, AlgRSWRSKIP(); mutable = false))
  4.816 ms (2 allocations: 144 bytes)

As you can see, in this run the immutable version takes about 26% less time than the mutable one (4.816 ms versus 6.535 ms), which makes it roughly 1.36 times as fast. In general, the smaller the ratio between reservoir size and stream size, the larger the advantage of the immutable version. Be careful, though: calling fit! on an immutable sampler does not update it in place. It returns a new, updated instance, so the result must be assigned back to a variable.
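The reassignment caveat can be illustrated with a minimal sketch. It reuses the constructor and fit! call from the benchmark; the qualified name StreamSampling.nobs (the number of elements a sampler has seen) is an assumption about how that function is exposed and may need adjusting for your version of the package.

```julia
using StreamSampling

rs = ReservoirSampler{Int}(10, AlgRSWRSKIP(); mutable = false)

# fit! returns the updated sampler; bind the result to a name to keep it.
updated = fit!(rs, 42)

# Compare the observation counts reported by the two instances.
# Assumption: `nobs` is reachable under this qualified name.
println("original: ", StreamSampling.nobs(rs))
println("updated:  ", StreamSampling.nobs(updated))
```

In a loop, the same idea takes the form `rs = fit!(rs, x)`, which also works unchanged for mutable samplers.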

Parallel Sampling from Multiple Streams

Let's say that you want to parallelize the sampling of an iterator. If you can split the iterator into partitions, you can update a separate reservoir sample for each partition in parallel and then merge them at the end.

Suppose, for instance, that you have these two iterators:

iters = [1:100, 101:200]
2-element Vector{UnitRange{Int64}}:
 1:100
 101:200

Then create two reservoirs of the same type:

rs = [ReservoirSampler{Int}(10, AlgRSWRSKIP()) for i in 1:length(iters)]
2-element Vector{StreamSampling.MultiAlgRSWRSKIPSampler_Mut{Nothing, Int64, Random.TaskLocalRNG}}:
 MultiAlgRSWRSKIPSampler_Mut: n=0 | value=Int64[]
 MultiAlgRSWRSKIPSampler_Mut: n=0 | value=Int64[]

After that, you can update them in parallel like so:

Threads.@threads for i in 1:length(iters)
    for e in iters[i]
        fit!(rs[i], e)
    end
end

Finally, you can obtain a single reservoir that summarizes the union of the streams with:

merge(rs...)
MultiAlgRSWRSKIPSampler_Mut: n=200 | value=[43, 30, 96, 178, 121, 143, 126, 187, 128, 183]
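
The steps above can be wrapped into a small helper. The following is only a sketch: parallel_sample and its nchunks argument are hypothetical names, not part of StreamSampling, and the helper assumes that data is an indexable, non-empty collection with at least nchunks elements and that nchunks is at least 2. It partitions the input with Iterators.partition, fills one reservoir per chunk, and merges them as shown above.

```julia
using StreamSampling

# Hypothetical helper (not part of StreamSampling): sample `n` elements
# from `data` by processing `nchunks` partitions in parallel.
function parallel_sample(data, n, nchunks)
    chunk_len = cld(length(data), nchunks)             # elements per chunk, rounded up
    chunks = collect(Iterators.partition(data, chunk_len))
    # One independent reservoir per chunk, all of the same type.
    samplers = [ReservoirSampler{eltype(data)}(n, AlgRSWRSKIP()) for _ in chunks]
    Threads.@threads for i in eachindex(chunks)
        for e in chunks[i]
            fit!(samplers[i], e)                       # each iteration touches only its own sampler
        end
    end
    return merge(samplers...)                          # combine the per-chunk summaries
end

merged = parallel_sample(1:10^6, 10, 4)
println(value(merged))
```

Because each loop iteration updates only its own reservoir, no locking is needed. The speedup depends on how many threads Julia was started with (for example via the --threads flag).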