It is there to process the last <32 elements. The vectorized loop processes up to 32 elements per iteration. An iteration does not run if there are fewer than 32 elements left, because it loads 32 bytes of input at a time. This is very typical of vectorized loops: a main loop processes N elements per iteration, and a second loop handles the tail of <N elements.
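To make the shape concrete, here is a minimal sketch of the main-loop-plus-tail pattern in portable C. It is not the linked code: `count_byte` and `BLOCK` are made-up names, and the fixed-length inner loop only stands in for a single 32-byte SIMD load and compare.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 32 };  /* assumed block width: 32 bytes per iteration */

/* Hypothetical example: count how many bytes in buf[0..n) equal needle. */
size_t count_byte(const uint8_t *buf, size_t n, uint8_t needle)
{
    size_t count = 0;
    size_t i = 0;

    /* Main loop: runs only while a full block of 32 bytes remains. */
    for (; i + BLOCK <= n; i += BLOCK) {
        /* Fixed trip count; stands in for one 32-byte load + compare. */
        for (size_t j = 0; j < BLOCK; j++)
            count += (buf[i + j] == needle);
    }

    /* Tail loop: the remaining n % 32 bytes, one at a time. */
    for (; i < n; i++)
        count += (buf[i] == needle);

    return count;
}
```

With n = 70 the main loop runs twice (64 bytes) and the tail loop picks up the last 6.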
'Tail handling' in general is an annoying aspect of SIMD. Masks are great, but no panacea; in particular, if you unroll, you cannot take care of the tail with a single masked instruction. There are various solutions to this. I favour overlapping accesses where that's feasible (following a great deal of evangelism from Mateusz Guzik); a colleague uses a variant of Duff's device; you can also just generate multiple masks.
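For anyone who has not seen the overlapping-access trick, a rough sketch in portable C follows. The names `add_one`, `add_one_block` and `BLOCK` are invented for illustration, the block helper stands in for one 32-byte SIMD load, add and store, and the sketch assumes `dst` and `src` do not alias.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 32 };  /* assumed block width */

/* Stand-in for one 32-byte SIMD load, add, store. */
static void add_one_block(uint8_t *restrict dst, const uint8_t *restrict src)
{
    for (size_t j = 0; j < BLOCK; j++)
        dst[j] = (uint8_t)(src[j] + 1);
}

/* Hypothetical map: dst[i] = src[i] + 1 for i in [0, n). */
void add_one(uint8_t *restrict dst, const uint8_t *restrict src, size_t n)
{
    if (n < BLOCK) {
        /* Shorter than one block, so there is nothing to overlap with. */
        for (size_t i = 0; i < n; i++)
            dst[i] = (uint8_t)(src[i] + 1);
        return;
    }

    size_t i = 0;
    for (; i + BLOCK <= n; i += BLOCK)
        add_one_block(dst + i, src + i);

    /* Tail: one more full block that ends exactly at n. It overlaps the
       previous block and recomputes a few outputs from the same inputs. */
    if (i < n)
        add_one_block(dst + (n - BLOCK), src + (n - BLOCK));
}
```

The short-input fallback is the "where that's feasible" part. Running this in place would increment the overlapped bytes twice unless the tail block were loaded before the last main-loop store, which is why the sketch keeps `src` and `dst` separate.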
I would expect the linked code is just intended as a quick PoC, so it does not bother to be optimal.
Roughly speaking, while you do need idempotence for reductions, you do not need it for maps. (Of course, plenty of things don't look quite like reductions or maps--or they look like both at once--and each is unique.)
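Reading this as a point about overlapping tail accesses (my assumption), here is a small sketch of the reduction case in portable C. `max_byte`, `max_block` and `BLOCK` are made-up names, and returning 0 for an empty input is an arbitrary choice for the sketch.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 32 };  /* assumed block width */

/* Stand-in for one 32-byte SIMD load plus a horizontal max. */
static uint8_t max_block(const uint8_t *p)
{
    uint8_t m = 0;
    for (size_t j = 0; j < BLOCK; j++)
        if (p[j] > m)
            m = p[j];
    return m;
}

/* Hypothetical reduction: largest byte in buf[0..n), or 0 if n == 0. */
uint8_t max_byte(const uint8_t *buf, size_t n)
{
    uint8_t best = 0;

    if (n < BLOCK) {
        for (size_t i = 0; i < n; i++)
            if (buf[i] > best)
                best = buf[i];
        return best;
    }

    size_t i = 0;
    for (; i + BLOCK <= n; i += BLOCK) {
        uint8_t m = max_block(buf + i);
        if (m > best)
            best = m;
    }

    /* Overlapping tail: some bytes are visited twice. That is harmless
       because max(x, x) == x; a sum would count those bytes twice. */
    if (i < n) {
        uint8_t m = max_block(buf + (n - BLOCK));
        if (m > best)
            best = m;
    }
    return best;
}
```

Swap the max for a sum and the overlapped bytes get counted twice, so that reduction needs a different tail (a mask or a scalar loop). An out-of-place map just rewrites the same outputs from the same inputs, so it does not care.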