There isn't any evidence that models are doing any kind of "system 2 thinking" here. The model's response is conditioned on both the prompt and the output it has generated so far, so when you tell it to reason step by step, the final answer is guided by the reasoning text it has already written. The "second best answer" is just something it came up with because you asked; the model has no second best answer to give. These answers always seem strange because the model doesn't know what it means to come up with a second best answer: it 'believes' the output it gave is the correct answer and helpfully tries to fulfill your request anyway. Sometimes the second best answer is right, but most of the time it's completely nonsensical, and there is no reliable way to tell the two apart. If you ask it to choose, it will be strongly influenced by the framing of its prior response and won't be able to spot logical errors.
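To make the "guided by its current output" point concrete, here is a minimal toy sketch of an autoregressive loop. Nothing in it is a real model or library API: `generate` and `toy_next_token` are hypothetical names, and the "model" is a hand-written rule set that I'm assuming as a stand-in. It only illustrates that each new token is picked from the prompt plus everything emitted so far, so the final answer follows whatever reasoning text is already in the context, right or wrong.

```python
from typing import Callable, List


def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_new: int = 20) -> List[str]:
    """Greedy autoregressive loop: every step sees prompt + prior output."""
    context = list(prompt)
    output: List[str] = []
    for _ in range(max_new):
        tok = next_token(context)
        if tok == "<eos>":
            break
        output.append(tok)
        context.append(tok)  # the model's own output becomes its input
    return output


def toy_next_token(context: List[str]) -> str:
    """Hypothetical stand-in for a language model (hand-written rules).

    Expects a prompt shaped like ["is", "7", "even", "?"].
    """
    last = context[-1]
    if last == "?":
        return "reasoning:"
    if last == "reasoning:":
        n = int(context[1])
        return "even" if n % 2 == 0 else "odd"
    if last in ("even", "odd") and context[-2] == "reasoning:":
        return "answer:"
    if last == "answer:":
        # The answer is read off the reasoning text in the context,
        # not recomputed from the question itself.
        claim = context[context.index("reasoning:") + 1]
        return "yes" if claim == "even" else "no"
    return "<eos>"


prompt = ["is", "7", "even", "?"]

# Normal run: the toy writes its own reasoning, then answers from it.
clean = generate(toy_next_token, prompt)

# Same question, but with a wrong reasoning step already in the context.
# The answer follows the text, so it comes out wrong.
poisoned = generate(toy_next_token, prompt + ["reasoning:", "even"])
```

In this toy, `clean` ends in "no" while `poisoned` ends in "yes" for the same question. A real model is far less mechanical than this, but the dependence on already-generated text is the same kind of thing.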

Asking it to do lateral thinking and provide examples isn't really helpful, because its final output is mostly driven by the step-by-step reasoning text, not by the examples it has generated. At best, the examples are all wrong but it ignores them and spits out the right answer. At worst, it becomes confused and gives the wrong answer.

I've seen GPT-4 make all kinds of errors with prompts like this. Sometimes all the reasoning is wrong but the answer is right, and vice versa.