I guess you were expecting an algorithm faster than O(n)? That was the best we could do, given n processors with O(1) space each :-D