
Are there any ways to show the source of the information retrieved by the model? For instance, the LLM forms a sentence and it points to a stackoverflow answer with the same or similar content.


As I understand it, that is essentially impossible. With a single training datum, sure, attribution is trivial. But as soon as a second one is fed in, the weights are already a kind of blend of the two (so to speak).


It's not impossible, but it's definitely difficult. There is some overlap with the methods used to detect benchmark data contamination, though that's not quite the same problem. In the detection use case, you already know the text you're looking for, and you're just trying to demonstrate that the model has "seen" it in its training set. The challenge is showing that it is statistically improbable the model could stochastically generate the same tokens without having seen them during training.
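As a rough illustration of that "statistically improbable" test, here is a toy sketch of one published membership-inference heuristic, Min-K% Prob: average the log-probabilities of the k fraction of tokens the model found least likely. The per-token log-probs below are made-up placeholders; a real pipeline would obtain them from the model's logits on the candidate text.

```python
import math

def min_k_percent_prob(token_logprobs, k=0.2):
    """Min-K% Prob membership score: the mean log-prob of the k fraction
    of tokens the model found *least* likely. Text the model memorized
    tends to contain fewer very-surprising tokens, so a higher (less
    negative) score suggests the text appeared in the training data."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

# Hypothetical per-token log-probs (placeholders, not real model output):
seen_like   = [-0.1, -0.3, -0.2, -0.5, -0.4, -0.2]  # uniformly confident
unseen_like = [-0.1, -4.2, -0.3, -5.1, -0.4, -3.8]  # occasional big surprises

score_seen = min_k_percent_prob(seen_like)
score_unseen = min_k_percent_prob(unseen_like)
# score_seen > score_unseen: the first passage looks more like seen data
```

The decision threshold (and whether a gap like this is significant) is exactly the hard statistical part the comment describes; real detectors calibrate it against text the model provably never saw.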

Some great research exists in this area [1], and I expect much of it may be repurposed for black-box attribution in the future (in addition to all the work being done in the mechanistic interpretability field).

[1] https://arxiv.org/abs/2311.04850



