Reservoir sampling is a family of randomized algorithms for randomly choosing k samples from a list of n items, where n is either a very large or unknown number. Typically, n is large enough that the list doesn’t fit into main memory; an example is a list of search queries at Google or Facebook.
To keep things simple, assume we are given a big array (or stream) of numbers and need to write an efficient function that randomly selects k of them, where 1 <= k <= n. Let the input array be stream[].
A simple solution is to create an array reservoir[] of maximum size k. Randomly select items from stream[0..n-1] one at a time. If the selected item has not been selected before, put it in reservoir[]. To check whether an item was previously selected, we need to search for it in reservoir[]. The time complexity of this algorithm is about O(k^2), since each of the k selections needs an O(k) search, which can be costly if k is big. It also needs random access to all n items, so it is not suitable when the input arrives as a stream.
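The simple solution described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `select_k_naive` is hypothetical, and it tracks chosen indices rather than values (an assumption made here so that duplicate values in the input do not confuse the "already selected" check).

```python
import random

def select_k_naive(stream, k):
    """Pick k distinct positions from stream by repeated random draws."""
    n = len(stream)          # requires the full list and its length up front
    chosen = []              # indices picked so far, searched linearly
    reservoir = []           # the selected items
    while len(reservoir) < k:
        idx = random.randrange(n)
        if idx not in chosen:        # O(k) linear search on every draw
            chosen.append(idx)
            reservoir.append(stream[idx])
    return reservoir

print(select_k_naive([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], 5))
```

Note that the sketch calls `len(stream)` and indexes into `stream` at random positions, which is exactly what a one-pass stream cannot offer.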
The problem can be solved in O(n) time with a single pass over the input, which also suits input that arrives as a stream. Following are the steps.
1) Create an array reservoir[0..k-1] and copy the first k items of stream[] to it.
2) Now consider the remaining items one by one, from the (k+1)th item to the nth item, i.e., stream[k] to stream[n-1].
…a) Generate a random number from 0 to i (inclusive), where i is the index of the current item in stream[]. Let the generated random number be j.
…b) If j is in the range 0 to k-1, replace reservoir[j] with stream[i].
Following is the implementation of the above algorithm.
C++
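#include <cstdlib>
#include <ctime>
#include <iostream>
#include <vector>
using namespace std;

// Prints the first n elements of stream[].
void printArray(const int stream[], int n)
{
    for (int i = 0; i < n; i++)
        cout << stream[i] << " ";
    cout << endl;
}

// Selects k items uniformly at random from stream[0..n-1].
void selectKItems(int stream[], int n, int k)
{
    int i; // index for elements in stream[]

    // reservoir is the output array. Initialize it with the
    // first k elements of stream[]. A vector is used because
    // variable-length arrays are not standard C++.
    vector<int> reservoir(k);
    for (i = 0; i < k; i++)
        reservoir[i] = stream[i];

    // Seed the generator so each run gives a different result.
    srand(time(NULL));

    // Iterate from the (k+1)th element to the nth element.
    for (; i < n; i++)
    {
        // Pick a random index from 0 to i. Note that rand() % m
        // is slightly biased for large m; fine for illustration.
        int j = rand() % (i + 1);

        // If the index falls inside the reservoir, replace the
        // element there with the current stream element.
        if (j < k)
            reservoir[j] = stream[i];
    }

    cout << "Following are k randomly selected items \n";
    printArray(reservoir.data(), k);
}

int main()
{
    int stream[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    int n = sizeof(stream) / sizeof(stream[0]);
    int k = 5;
    selectKItems(stream, n, k);
    return 0;
}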
C
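#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Prints the first n elements of stream[].
void printArray(int stream[], int n)
{
    for (int i = 0; i < n; i++)
        printf("%d ", stream[i]);
    printf("\n");
}

// Selects k items uniformly at random from stream[0..n-1].
void selectKItems(int stream[], int n, int k)
{
    int i; // index for elements in stream[]

    // reservoir[] is the output array. Initialize it with
    // the first k elements of stream[].
    int reservoir[k];
    for (i = 0; i < k; i++)
        reservoir[i] = stream[i];

    // Seed the generator so each run gives a different result.
    srand(time(NULL));

    // Iterate from the (k+1)th element to the nth element.
    for (; i < n; i++)
    {
        // Pick a random index from 0 to i.
        int j = rand() % (i + 1);

        // If the index falls inside the reservoir, replace the
        // element there with the current stream element.
        if (j < k)
            reservoir[j] = stream[i];
    }

    printf("Following are k randomly selected items \n");
    printArray(reservoir, k);
}

int main()
{
    int stream[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    int n = sizeof(stream) / sizeof(stream[0]);
    int k = 5;
    selectKItems(stream, n, k);
    return 0;
}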
Java
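import java.util.Arrays;
import java.util.Random;

public class ReservoirSampling {
    // Selects k items uniformly at random from stream[0..n-1].
    static void selectKItems(int stream[], int n, int k)
    {
        int i; // index for elements in stream[]

        // reservoir[] is the output array. Initialize it with
        // the first k elements of stream[].
        int reservoir[] = new int[k];
        for (i = 0; i < k; i++)
            reservoir[i] = stream[i];

        Random r = new Random();

        // Iterate from the (k+1)th element to the nth element.
        for (; i < n; i++) {
            // Pick a random index from 0 to i.
            int j = r.nextInt(i + 1);

            // If the index falls inside the reservoir, replace
            // the element there with the current stream element.
            if (j < k)
                reservoir[j] = stream[i];
        }

        System.out.println(
            "Following are k randomly selected items");
        System.out.println(Arrays.toString(reservoir));
    }

    public static void main(String[] args)
    {
        int stream[]
            = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 };
        int n = stream.length;
        int k = 5;
        selectKItems(stream, n, k);
    }
}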
Python3
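import random

# Prints the first n elements of stream.
def printArray(stream, n):
    for i in range(n):
        print(stream[i], end=" ")
    print()

# Selects k items uniformly at random from stream[0..n-1].
def selectKItems(stream, n, k):
    # reservoir is the output list. Initialize it with
    # the first k elements of stream.
    reservoir = [0] * k
    for i in range(k):
        reservoir[i] = stream[i]

    # Continue from the (k+1)th element, i.e., index k.
    i = k
    while i < n:
        # Pick a random index from 0 to i (inclusive).
        j = random.randrange(i + 1)

        # If the index falls inside the reservoir, replace
        # the element there with the current stream element.
        if j < k:
            reservoir[j] = stream[i]
        i += 1

    print("Following are k randomly selected items")
    printArray(reservoir, k)

if __name__ == "__main__":
    stream = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
    n = len(stream)
    k = 5
    selectKItems(stream, n, k)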
C#
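using System;

public class ReservoirSampling
{
    // Selects k items uniformly at random from stream[0..n-1].
    static void selectKItems(int[] stream, int n, int k)
    {
        int i; // index for elements in stream[]

        // reservoir[] is the output array. Initialize it with
        // the first k elements of stream[].
        int[] reservoir = new int[k];
        for (i = 0; i < k; i++)
            reservoir[i] = stream[i];

        Random r = new Random();

        // Iterate from the (k+1)th element to the nth element.
        for (; i < n; i++)
        {
            // Pick a random index from 0 to i.
            int j = r.Next(i + 1);

            // If the index falls inside the reservoir, replace
            // the element there with the current stream element.
            if (j < k)
                reservoir[j] = stream[i];
        }

        Console.WriteLine("Following are k " +
                          "randomly selected items");
        for (i = 0; i < k; i++)
            Console.Write(reservoir[i] + " ");
    }

    static void Main()
    {
        int[] stream = {1, 2, 3, 4, 5, 6, 7,
                        8, 9, 10, 11, 12};
        int n = stream.Length;
        int k = 5;
        selectKItems(stream, n, k);
    }
}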
Javascript
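<script>
// Prints the first n elements of stream.
function printArray(stream, n)
{
    for (let i = 0; i < n; i++)
        document.write(stream[i] + " ");
    document.write('\n');
}

// Selects k items uniformly at random from stream[0..n-1].
function selectKItems(stream, n, k)
{
    let i; // index for elements in stream

    // reservoir is the output array. Initialize it with
    // the first k elements of stream.
    let reservoir = [];
    for (i = 0; i < k; i++)
        reservoir[i] = stream[i];

    // Iterate from the (k+1)th element to the nth element.
    for (; i < n; i++)
    {
        // Pick a random index from 0 to i (inclusive).
        let j = Math.floor(Math.random() * (i + 1));

        // If the index falls inside the reservoir, replace
        // the element there with the current stream element.
        if (j < k)
            reservoir[j] = stream[i];
    }

    document.write("Following are k randomly " +
                   "selected items \n");
    printArray(reservoir, k);
}

let stream = [ 1, 2, 3, 4, 5, 6,
               7, 8, 9, 10, 11, 12 ];
let n = stream.length;
let k = 5;
selectKItems(stream, n, k);
</script>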
PHP
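<?php
// Prints the first $n elements of $stream.
function printArray($stream, $n)
{
    for ($i = 0; $i < $n; $i++)
        echo $stream[$i] . " ";
    echo "\n";
}

// Selects $k items uniformly at random from $stream[0..$n-1].
function selectKItems($stream, $n, $k)
{
    // $reservoir is the output array. Initialize it with
    // the first $k elements of $stream.
    $reservoir = array_fill(0, $k, 0);
    for ($i = 0; $i < $k; $i++)
        $reservoir[$i] = $stream[$i];

    // Iterate from the (k+1)th element to the nth element.
    for (; $i < $n; $i++)
    {
        // Pick a random index from 0 to $i. rand() includes
        // both bounds, so the upper bound is $i, not $i + 1.
        $j = rand(0, $i);

        // If the index falls inside the reservoir, replace
        // the element there with the current stream element.
        if ($j < $k)
            $reservoir[$j] = $stream[$i];
    }

    echo "Following are k randomly " .
         "selected items\n";
    printArray($reservoir, $k);
}

$stream = array(1, 2, 3, 4, 5, 6, 7,
                8, 9, 10, 11, 12);
$n = count($stream);
$k = 5;
selectKItems($stream, $n, $k);
?>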
Output:
Following are k randomly selected items
6 2 11 8 12
Note: The output will differ on every run, as the program selects and prints random elements.
Time Complexity: O(n)
Auxiliary Space: O(k)
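Because the algorithm only looks at the current item and its index, it does not need to know n in advance. The following is a minimal sketch of that idea, assuming the input is any Python iterable; the function name `reservoir_sample` and its `rng` parameter are hypothetical names chosen for this illustration.

```python
import random
from itertools import islice

def reservoir_sample(items, k, rng=random):
    """Return k items chosen uniformly at random from an iterable
    whose length is not known in advance."""
    it = iter(items)

    # Step 1: fill the reservoir with the first k items.
    # If the input has fewer than k items, all of them are returned.
    reservoir = list(islice(it, k))

    # Step 2: for the item at 0-based index i, pick j in 0..i and
    # overwrite slot j if it falls inside the reservoir.
    for i, item in enumerate(it, start=k):
        j = rng.randrange(i + 1)
        if j < k:
            reservoir[j] = item
    return reservoir

# Usage: sample from a generator without ever calling len() on it.
sample = reservoir_sample((x * x for x in range(1000)), 5)
print(sample)
```

Only the k reservoir slots are kept in memory, which matches the O(k) auxiliary space stated above.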
How does this work?
To prove that this solution works, we must show that the probability that any item stream[i], where 0 <= i < n, is in the final reservoir[] is k/n. Let us divide the proof into two cases, as the first k items are treated differently.
Case 1: For the last n-k stream items, i.e., for stream[i] where k <= i < n
For every such item stream[i], we pick a random index from 0 to i, and if the picked index is one of the first k indexes, we replace the element at the picked index with stream[i].
To simplify the proof, let us first consider the last item. The probability that the last item is in the final reservoir = the probability that one of the first k indexes is picked for the last item = k/n (the probability of picking one of k indexes out of n).
Let us now consider the second-last item. The probability that the second-last item is in the final reservoir[] = [probability that one of the first k indexes is picked in the iteration for stream[n-2]] x [probability that the index picked in the iteration for stream[n-1] is not the same as the index picked for stream[n-2]] = [k/(n-1)] x [(n-1)/n] = k/n.
Similarly, we can consider the other items, from stream[n-1] down to stream[k], and generalize the proof.
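The generalization can be written out as follows. This is a sketch that uses the same 0-based indexing as the code; the symbol t for the index of a later iteration is introduced here for illustration only. An item stream[i] enters the reservoir with probability k/(i+1), and in each later iteration t its slot is overwritten only if the random index equals that slot, which happens with probability 1/(t+1).

```latex
\begin{align*}
P\bigl(\text{stream}[i] \text{ is in the final reservoir}\bigr)
  &= \frac{k}{i+1} \times \prod_{t=i+1}^{n-1} \left(1 - \frac{1}{t+1}\right) \\
  &= \frac{k}{i+1} \times \frac{i+1}{i+2} \times \frac{i+2}{i+3}
     \times \cdots \times \frac{n-1}{n} \\
  &= \frac{k}{n}, \qquad k \le i < n.
\end{align*}
```

For i = n-1 the product is empty and the expression reduces to k/n, and for i = n-2 it reduces to [k/(n-1)] x [(n-1)/n], matching the two cases worked out above.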
Case 2: For the first k stream items, i.e., for stream[i] where 0 <= i < k
The first k items are initially copied to reservoir[] and may be removed later in the iterations for stream[k] to stream[n-1].
When stream[i] is considered, a given reservoir slot is overwritten with probability 1/(i+1), so the item in it survives with probability i/(i+1). The probability that an item from stream[0..k-1] is in the final array = the probability that the item is not replaced when items stream[k], stream[k+1], …, stream[n-1] are considered = [k/(k+1)] x [(k+1)/(k+2)] x [(k+2)/(k+3)] x … x [(n-1)/n] = k/n
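The k/n claim can also be checked empirically. The sketch below is an illustration under stated assumptions rather than part of the proof: the helper names `select_k_items` and `inclusion_frequencies` are hypothetical, and the trial count and seed are arbitrary. It runs the algorithm many times and measures how often each item ends up in the reservoir.

```python
import random
from collections import Counter

def select_k_items(stream, k, rng):
    """One run of the reservoir algorithm; returns the reservoir."""
    reservoir = list(stream[:k])
    for i in range(k, len(stream)):
        j = rng.randrange(i + 1)   # random index from 0 to i
        if j < k:
            reservoir[j] = stream[i]
    return reservoir

def inclusion_frequencies(n, k, trials, seed=0):
    """Fraction of trials in which each of the n items was selected."""
    rng = random.Random(seed)
    stream = list(range(n))
    counts = Counter()
    for _ in range(trials):
        counts.update(select_k_items(stream, k, rng))
    return [counts[x] / trials for x in stream]

freqs = inclusion_frequencies(n=12, k=5, trials=20000)
# Largest deviation of any item's frequency from the expected k/n = 5/12.
print(max(abs(f - 5 / 12) for f in freqs))
```

With enough trials, every frequency should sit close to 5/12 (about 0.417), for both the first k items and the later ones.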
References:
http://en.wikipedia.org/wiki/Reservoir_sampling
Last Updated: 30 Oct, 2023