Wow this was indeed super comprehensive. A few things I noticed:
- In the cold start section, a couple of the synthetic_data responses say 'context does not provide info...'
- It's strange that retrieval_score would decrease while quality_score increases at the higher chunk sizes. Could this just be that the retrieved chunk is starting to be larger than the reference?
- GPT-3.5 pricing looks out of date; it's currently $0.0015 per 1K input tokens for the 4K-context model
- Interesting that pricing needs to be shown on a log scale: GPT-4 is 46x more expensive than Llama 2 70B for a ~0.3 score increase. Training a simple classifier to route queries seems like a great way to handle this.
- I wonder how stable the quality_score assessment is given the exact same configuration. I guess the score differences between falcon-180b, llama-2-70b and gpt-3.5 are insignificant?
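To make the routing idea from the pricing bullet concrete: a minimal sketch of a query-complexity classifier that sends easy queries to the cheap model and hard ones to GPT-4. The training data, features, and routing labels here are all made up for illustration; in practice you'd train something like logistic regression over query embeddings with labels derived from where the cheap model's quality_score was already good enough.

```python
# Hypothetical sketch: route queries between a cheap and an expensive model
# using a tiny bag-of-words perceptron. All training examples below are
# invented; real labels would come from evaluation scores per query.

def featurize(query: str) -> dict[str, int]:
    # Binary bag-of-words features (a stand-in for real embeddings).
    return {tok: 1 for tok in query.lower().split()}

def train_perceptron(data: list[tuple[str, int]], epochs: int = 10) -> dict[str, int]:
    # label 1 = "hard, needs the expensive model", 0 = "easy".
    w: dict[str, int] = {}
    for _ in range(epochs):
        for query, label in data:
            feats = featurize(query)
            pred = 1 if sum(w.get(f, 0) for f in feats) > 0 else 0
            if pred != label:
                for f in feats:
                    w[f] = w.get(f, 0) + (1 if label else -1)
    return w

def route(query: str, w: dict[str, int]) -> str:
    score = sum(w.get(f, 0) for f in featurize(query))
    return "gpt-4" if score > 0 else "llama-2-70b"

# Toy labeled queries (hypothetical).
train = [
    ("what is ray", 0),
    ("what is ray serve", 0),
    ("how do i install ray", 0),
    ("explain the tradeoffs of actor based scheduling under failure", 1),
    ("derive the optimal autoscaling policy for bursty workloads", 1),
    ("compare consistency guarantees across the object store designs", 1),
]
w = train_perceptron(train)
print(route("what is ray tune", w))                      # cheap model
print(route("explain the tradeoffs of scheduling", w))   # expensive model
```

Even a weak classifier helps here, since the 46x price gap means you only need to be right often enough to shift most traffic to the cheap model.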
Is there a similarly comprehensive deep dive into chunking methods anywhere? Especially for queries that require multiple chunks to answer at all. Producing more relevant chunks would have a massive impact on response quality, I imagine.
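On the chunk-size hypothesis above (retrieval_score dropping while quality_score rises): here's a toy illustration, using token-level F1 as a stand-in metric since I don't know how the post's retrieval_score is actually computed. If the metric compares the retrieved chunk against a reference passage, growing the chunk beyond the reference hurts precision (and any similarity-style score) even though the reference is still fully contained, i.e. recall stays perfect and the answer quality can still improve.

```python
# Toy metric (my own sketch, not the post's actual retrieval_score):
# token-level F1 and recall between a retrieved chunk and a reference.
def token_f1(retrieved: str, reference: str) -> tuple[float, float]:
    r, g = set(retrieved.split()), set(reference.split())
    overlap = len(r & g)
    precision = overlap / len(r)
    recall = overlap / len(g)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, recall

reference = "ray serve supports autoscaling deployments"
small_chunk = reference  # chunk exactly matches the reference
big_chunk = reference + " across many replicas with fractional gpu resources"

f1_small, recall_small = token_f1(small_chunk, reference)
f1_big, recall_big = token_f1(big_chunk, reference)
print(f1_small, recall_small)  # 1.0 1.0
print(f1_big, recall_big)      # F1 drops below 1.0, recall stays 1.0
```

So a larger chunk can look worse to a similarity-based retrieval metric while actually giving the generator more useful context, which would explain the diverging curves.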