And yet it is often able to make surprising references to previous text. It is not just a Markov chain, and it is capable of what the author describes as chain of thought. I think there are deeper relationships encoded in the model that allow it to keep to a consistent narrative for a very long time. Its beliefs may change between queries but, generally, do not change within the context of a single conversation.
The attention mechanism lets it look backwards to "understand" what was said before and predict what could possibly come next. Whatever consistency it has is due to studying the preceding text.
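The backwards-only lookup can be sketched as causal attention: each position mixes information from itself and earlier positions, never later ones. This is a toy single-head version with made-up inputs, not any particular model's implementation:

```python
import numpy as np

def causal_attention(q, k, v):
    """Toy single-head attention: each position attends only to itself
    and earlier positions (a causal mask), so the model can 'look
    backwards' at the preceding text but never forwards."""
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask out future positions before the softmax.
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))        # 5 token embeddings of width 4
out, w = causal_attention(x, x, x)
print(np.round(w, 2))              # row i is zero beyond column i
```

Each row of `w` is a probability distribution over the tokens seen so far, which is the sense in which whatever consistency the model has comes from studying the preceding text.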
Thinking ahead is different. All it needs to do is calculate the probability that there is any reasonable completion starting with a particular word. It doesn't need to decide what it's going to say beyond that; it can decide later.
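Deciding later can be sketched as sampling one word at a time from a next-word distribution. The prefix and probabilities here are invented for illustration; real models score a whole vocabulary, but the one-step-at-a-time logic is the same:

```python
import random

# Hypothetical next-word distribution after the prefix "The cat sat on the".
# The model only scores one word ahead; it never commits to a full sentence.
next_word_probs = {"mat": 0.55, "sofa": 0.25, "roof": 0.15, "moon": 0.05}

def pick_next_word(probs, rng):
    """Sample one word in proportion to its probability.
    Any word with nonzero mass admits some reasonable continuation;
    what follows it is decided only on the next step."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(42)
sentence = "The cat sat on the " + pick_next_word(next_word_probs, rng)
```

Note that nothing in `pick_next_word` plans past the current word; the next call starts fresh from whatever prefix now exists.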
Have you ever played a game where players take turns adding one more word to a sentence? When it's your turn and you're choosing the next word, you don't need to think ahead very much. Also, you don't necessarily need to have the same thing in mind as the player who went before you.
In improv there is a "yes, and" principle, where you are always building on what happened before. These algorithms are doing improv all the time.
The algorithm doesn't know or care who wrote the words that came before. It will find a continuation regardless.