Wow this was indeed super comprehensive. A few things I noticed:
- In the cold start section, a couple of the synthetic_data responses say 'context does not provide info...'
- It's strange that retrieval_score would decrease while quality_score increases at the higher chunk sizes. Could this just be that the retrieved chunk is starting to be larger than the reference?
- GPT-3.5 pricing looks out of date; it's currently $0.0015 per 1K input tokens for the 4K-context model
- Interesting that pricing needs to be shown on a log scale: GPT-4 is 46x more expensive than Llama 2 70B for a ~0.3 score increase. Training a simple classifier to route queries seems like a great way to handle this.
- I wonder how stable the quality_score assessment is given the exact same configuration. I guess the score differences between falcon-180b, llama-2-70b and gpt-3.5 are insignificant?
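To make the routing idea from the pricing bullet concrete: a minimal sketch of a query-complexity classifier that sends easy queries to the cheap model and hard ones to GPT-4. The training data, features, and routing labels here are all made up for illustration; in practice you'd train something like logistic regression over query embeddings with labels derived from where the cheap model's quality_score was already good enough.

```python
# Hypothetical sketch: route queries between a cheap and an expensive model
# using a tiny bag-of-words perceptron. All training examples below are
# invented; real labels would come from evaluation scores per query.

def featurize(query: str) -> dict[str, int]:
    # Binary bag-of-words features (a stand-in for real embeddings).
    return {tok: 1 for tok in query.lower().split()}

def train_perceptron(data: list[tuple[str, int]], epochs: int = 10) -> dict[str, int]:
    # label 1 = "hard, needs the expensive model", 0 = "easy".
    w: dict[str, int] = {}
    for _ in range(epochs):
        for query, label in data:
            feats = featurize(query)
            pred = 1 if sum(w.get(f, 0) for f in feats) > 0 else 0
            if pred != label:
                for f in feats:
                    w[f] = w.get(f, 0) + (1 if label else -1)
    return w

def route(query: str, w: dict[str, int]) -> str:
    score = sum(w.get(f, 0) for f in featurize(query))
    return "gpt-4" if score > 0 else "llama-2-70b"

# Toy labeled queries (hypothetical).
train = [
    ("what is ray", 0),
    ("what is ray serve", 0),
    ("how do i install ray", 0),
    ("explain the tradeoffs of actor based scheduling under failure", 1),
    ("derive the optimal autoscaling policy for bursty workloads", 1),
    ("compare consistency guarantees across the object store designs", 1),
]
w = train_perceptron(train)
print(route("what is ray tune", w))                      # cheap model
print(route("explain the tradeoffs of scheduling", w))   # expensive model
```

Even a weak classifier helps here, since the 46x price gap means you only need to be right often enough to shift most traffic to the cheap model.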
Is there a similarly comprehensive deep dive into chunking methods anywhere? Especially for queries that require multiple chunks to answer at all. Producing more relevant chunks would have a massive impact on response quality, I imagine.
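On the chunk-size hypothesis above (retrieval_score dropping while quality_score rises): here's a toy illustration, using token-level F1 as a stand-in metric since I don't know how the post's retrieval_score is actually computed. If the metric compares the retrieved chunk against a reference passage, growing the chunk beyond the reference hurts precision (and any similarity-style score) even though the reference is still fully contained, i.e. recall stays perfect and the answer quality can still improve.

```python
# Toy metric (my own sketch, not the post's actual retrieval_score):
# token-level F1 and recall between a retrieved chunk and a reference.
def token_f1(retrieved: str, reference: str) -> tuple[float, float]:
    r, g = set(retrieved.split()), set(reference.split())
    overlap = len(r & g)
    precision = overlap / len(r)
    recall = overlap / len(g)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, recall

reference = "ray serve supports autoscaling deployments"
small_chunk = reference  # chunk exactly matches the reference
big_chunk = reference + " across many replicas with fractional gpu resources"

f1_small, recall_small = token_f1(small_chunk, reference)
f1_big, recall_big = token_f1(big_chunk, reference)
print(f1_small, recall_small)  # 1.0 1.0
print(f1_big, recall_big)      # F1 drops below 1.0, recall stays 1.0
```

So a larger chunk can look worse to a similarity-based retrieval metric while actually giving the generator more useful context, which would explain the diverging curves.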