Reservoir sampling is an algorithm for sampling elements with equal probability from a data stream. Greg Grothaus has written a great post explaining the magic behind this simple but elegant algorithm. I strongly recommend you to refer to this post if reservoir sampling is a new term to you.
So far, I suppose you have a rough idea of how reservoir sampling works. Our story starts from here. Several months ago, I was asked the question of reservoir sampling in a tech interview. After digging into the math behind this algorithm, my interviewer threw out an open-ended interesting question to me, “since we had just proved that reservoir sampling is mathematically sound, can we adopt this algorithm in any real-world scenarios from the perspective of product and user experience?” To be honest, the first reaction arose in my mind was “why not? (since it’s so mathematically perfect and so easy to implement)”. But wait, “from the perspective of product and user experience” — what does that mean exactly?
Then my interviewer raised a scenario as a hint. Suppose we are having a lottery and there will be three winners. We want to tell all participants whether they are the lucky dogs or not once they enter the lottery. To put it more intuitive, imagine there is a big screen displaying the names of three winners. The participants enter the lottery sequentially and the screen will update the names accordingly. If we model this process within the context of reservoir sampling, it would be “sampling 3 elements with equal probability from a data stream of unknown size”. Then the question comes, does reservoir sampling really fits in this scenario? If not, what kind of problem it may cause?
OK, let’s go through the process from scratch. Suppose Amy, Bob and Cathy are the first three participants entering the lottery. Unsurprisingly, their names will be displayed on the big screen. Then Doreen, the fourth participant, enters the lottery. According to the rule of reservoir sampling, we generate a random number between (0,1). If this number is less equal than 3/4, we will kick out one of Amy, Bob and Cathy and put Doreen in the corresponding position. Since 3/4 is not a very small number, we are very likely to see the screen update dynamically by replacing Amy or Bob or Cathy with Doreen. Then the fifths participant Eddie comes. Following the same paradigm as before, Eddie has 3/5 probability to kick out one of the current winners, so on and so forth. Then what’s the probability for the
i‘th participant to seize a winner place?
3/i, of course.
There comes the point. The
3/i produces an illusion that the later you come, the smaller probability you will have to be the lucky dog since
i is increasing and
3/i is decreasing. Then what will the audience and all participants see from the big screen? They will see that at the very beginning, the names of three winners are updated very frequently: Bob is replaced by Doreen, and Doreen is replaced by Tom, etc. Everything seems really “random” and we take “random” as a significant feature of a fair game. But in the later stage, the screen seems to maintain a static state: there are always Lily, Matt and Jeff. It seems like no one has the power to kick them out any more! That is a little bit frustrating because the game seems not to be fair any more - the newcomers are doomed to be losers, even though we have mathematically soundly proved that everyone indeed have totally the same chance to win the lottery!
I hope this example give both you and me an “aha!” moment. Reservoir sampling is a rigorous algorithm but it does not fit in some real-world cases like the lottery discussed above. After all, not every one is a mathematician / statistician / data scientist, who would love to sit down at a desk, draw a paper, and derive those formulas. They even don’t want to wasting their time listening to your explanation! Under the assumption that most of our users are not armed with solid domain expertise, they are more likely to believe what they actually see, which follow their intuition. Thus things that technically sound does not definitely mean they will be the perfect choice in real-world user-oriented product.