TL;DR

No matter how simple your code may be, there's no substitute for testing it to make sure it's actually doing what you think it is.When asked to perform hold-out validation, Vita takes the input

`dataframe`

, containing all the available expressions, and moves a given percentage of random examples to a validation `dataframe`

.Initially the two steps of the partitioning were:

- a complete shuffle of the input
`dataframe`

via`std::shuffle`

; - a shift of the last portion of the shuffled input
`dataframe`

to a new validation`dataframe`

.

Step 1 was sub-optimal since the validation dataset is usually small compared to the training set. We can

*easily*perform a partial shuffle with the Fisher-Yates algorithm.

### Fisher-Yates (modern) algorithm

The general idea is to start from the last element and swap it with a randomly selected element from the whole array including the last one (so in the $0$ to $n-1$ range). Then consider the slice from $0$ to $n-2$ (size reduced by $1$), and repeat the process till hitting the first element:

from random import randint def shuffle(arr): for i in range(len(arr)-1, 0, -1): j = randint(0, i) arr[i], arr[j] = arr[j], arr[i] return arr

and, for a

*partial*shuffle:

from random import randint def shuffle(arr, skip): for i in range(len(arr)-1, skip-1, -1): j = randint(0, i) arr[i], arr[j] = arr[j], arr[i] return arr

After such a partial shuffle, the slice $arr[k], arr[k+1], ..., arr[n-1]$ contains a randomly shuffled uniform sample of $n-k$ values drawn, without replacement, from the original array

`arr`

.The entire shuffle can be performed in-place, without any extra space. Time complexity is $O(k)$ (the algorithm needs to shuffle each item only once).

### Hint of correctness

The probability that a given element of the original array (say the $i$-th element) goes to last position of the shuffled array is $1/n$, because we randomly pick an element in first iteration.The probability that $i$-th element goes to second last position can be proved to be $1/n$ considering the two cases:

**$i = n-1$ (index of last element)**.

The probability of last element going to second last position is:

$$

\mathbb{A} = "last\ element\ doesn't\ stay\ at\ its\ original\ position"\\

\mathbb{B} = "index\ picked\ in\ previous\ step\ is\ picked\ again"\\

\begin{eqnarray}P(\mathbb{A}) \cdot P(\mathbb{B}) = \frac{n-1}{n} \cdot \frac{1}{n-1} = \frac{1}{n}\end{eqnarray}

$$

**$0 < i < n-1$ (index of non-last)**.

The probability of $i$-th element going to second position is:

$$

\mathbb{A} = "i-th\ element\ is\ not\ picked\ in\ previous\ iteration"\\

\mathbb{B} = "i-th\ element\ is\ picked\ in\ this\ iteration"\\

\begin{eqnarray}P(\mathbb{A}) \cdot P(\mathbb{B}) = \frac{n-1}{n} \cdot \frac{1}{n-1} = \frac{1}{n}\end{eqnarray}

$$

The

*proof*can be generalized for any other position.

The algorithm can also be rewritten in a recursive way:

from random import randint def shuffle(arr): def shuffle(arr, N): j = randint(0, N-1) arr[N-1], arr[j] = arr[j], arr[N-1] if N > 2: shuffle(arr, N-1) shuffle(arr, len(arr)) return arr

Now proving correctness isn't too difficult. The last element is chosen with uniform probability. Recursively, given fixed values for $arr[i+1..N-1]$, $arr[i]$ is chosen from the remaining elements uniformly as well. Overall, this means that all $N!$ permutations are equally probable.

### Considerations

Despite its simplicity, when developers attempt to code this from scratch it's extremely common that they make ‘off by one’ errors, which results in a notably biased permutation.Actually this was the case for the initial implementation of the algorithm in Vita: one of the element of the input dataframe was always selected.

That is the same old story: it's possible for the code to be both simple and wrong and everything must be tested.

## No comments:

## Post a Comment