"the previous tokens are absolutely needed before predicting the next token"
Maybe this is the key contribution of this paper: demonstrating, via consistency training, that LLMs can predict the next n tokens even when some of the preceding tokens are incorrect guesses?
On the other hand, while it is mathematically true that p(x_t | x_1, ..., x_{t-1}) depends on all of x_1 through x_{t-1}, in practice predicting x_t may rely only on x_1 through x_{t-2}, with minimal attention paid to x_{t-1}. In that case, predicting x_t from x_1 through x_{t-2} together with an inaccurate x_{t-1} remains feasible.
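A toy NumPy sketch (my own illustration, not from the paper) of this point: if the attention weight on x_{t-1} is near zero, replacing x_{t-1} with a wrong embedding barely moves the attention output used to predict x_t. The dimensions, seed, and the score offset forcing a tiny weight are all arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
ctx = rng.normal(size=(4, d))      # embeddings for x_1 .. x_{t-1} (last row is x_{t-1})
query = rng.normal(size=d)         # query vector used when predicting x_t

def attend(context, q):
    """Single-head dot-product attention over the context."""
    scores = context @ q
    scores = scores.copy()
    scores[-1] -= 10.0             # artificially force near-zero weight on x_{t-1}
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ context, w

out_clean, w = attend(ctx, query)

corrupted = ctx.copy()
corrupted[-1] = rng.normal(size=d)  # replace x_{t-1} with an "incorrect guess"
out_bad, _ = attend(corrupted, query)

print("attention weight on x_{t-1}:", w[-1])
print("output drift from corrupting x_{t-1}:", np.linalg.norm(out_clean - out_bad))
```

With the weight on x_{t-1} suppressed, the drift in the attended output is tiny, so the downstream prediction of x_t is essentially unchanged despite the wrong x_{t-1}.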