2020-02-08

Fisher-Yates algorithm for sampling




TL;DR

No matter how simple your code may be, there's no substitute for testing it to make sure it's actually doing what you think it is.




When asked to perform hold-out validation, Vita takes the input dataframe, containing all the available expressions, and moves a given percentage of random examples to a validation dataframe.

Initially the two steps of the partitioning were:
  1. a complete shuffle of the input dataframe via std::shuffle;
  2. a shift of the last portion of the shuffled input dataframe to a new validation dataframe.

Step 1 was sub-optimal since the validation dataset is usually small compared to the training set. We can easily perform a partial shuffle with the Fisher-Yates algorithm.

Fisher-Yates (modern) algorithm


The general idea is to start from the last element and swap it with a randomly selected element from the whole array including the last one (so in the $0$ to $n-1$ range). Then consider the slice from $0$ to $n-2$ (size reduced by $1$), and repeat the process till hitting the first element:


from random import randint

def shuffle(arr):
    for i in range(len(arr)-1, 0, -1):
        j = randint(0, i)
        arr[i], arr[j] = arr[j], arr[i]

    return arr


and, for a partial shuffle:

from random import randint

def shuffle(arr, skip):
    for i in range(len(arr)-1, skip-1, -1):
        j = randint(0, i)
        arr[i], arr[j] = arr[j], arr[i]

    return arr

After such a partial shuffle, the slice $arr[k], arr[k+1], ..., arr[n-1]$ contains a randomly shuffled uniform sample of $n-k$ values drawn, without replacement, from the original array arr.

The entire shuffle can be performed in-place, without any extra space. Time complexity is $O(k)$ (the algorithm needs to shuffle each item only once).

Hint of correctness

The probability that a given element of the original array (say the $i$-th element) goes to last position of the shuffled array is $1/n$, because we randomly pick an element in first iteration.

The probability that $i$-th element goes to second last position can be proved to be $1/n$ considering the two cases:

  1. $i = n-1$ (index of last element).

    The probability of last element going to second last position is:

    $$
    \mathbb{A} = "last\ element\ doesn't\ stay\ at\ its\ original\ position"\\
    \mathbb{B} = "index\ picked\ in\ previous\ step\ is\ picked\ again"\\
    \begin{eqnarray}P(\mathbb{A}) \cdot P(\mathbb{B}) = \frac{n-1}{n} \cdot \frac{1}{n-1} = \frac{1}{n}\end{eqnarray}
    $$
  2. $0 < i < n-1$ (index of non-last).

    The probability of $i$-th element going to second position is:

    $$
    \mathbb{A} = "i-th\ element\ is\ not\ picked\ in\ previous\ iteration"\\
    \mathbb{B} = "i-th\ element\ is\ picked\ in\ this\ iteration"\\
    \begin{eqnarray}P(\mathbb{A}) \cdot P(\mathbb{B}) = \frac{n-1}{n} \cdot \frac{1}{n-1} = \frac{1}{n}\end{eqnarray}
    $$

The proof can be generalized for any other position.

The algorithm can also be rewritten in a recursive way:


from random import randint

def shuffle(arr):
    def shuffle(arr, N):
        j = randint(0, N-1)

        arr[N-1], arr[j] = arr[j], arr[N-1]

        if N > 2:
            shuffle(arr, N-1)

    shuffle(arr, len(arr))
    return arr

Now proving correctness isn't too difficult. The last element is chosen with uniform probability. Recursively, given fixed values for $arr[i+1..N-1]$, $arr[i]$ is chosen from the remaining elements uniformly as well. Overall, this means that all $N!$ permutations are equally probable.

Considerations

Despite its simplicity, when developers attempt to code this from scratch it's extremely common that they make ‘off by one’ errors, which results in a notably biased permutation.

Actually this was the case for the initial implementation of the algorithm in Vita: one of the element of the input dataframe was always selected.

That is the same old story: it's possible for the code to be both simple and wrong and everything must be tested.

See also

No comments:

Post a comment