My question is: if I want to use an LLM to help me sift through a large amount of structured data, say for example all the logs for a bunch of different applications from a certain cloud environment, each with their own idiosyncrasies and specific formats (many GBs of data), can the RAG pattern be useful here?
Some of my concerns:
1) Is sentence embedding using an off-the-shelf embedding model going to capture the "meaning" of my logs? My answer is "probably not". For example, if a portion of my logs is in this format
Will I be able to get meaningful embeddings that satisfy a query such as "what components in my system exhibited an anomalously high latency lately?" (this is just an example among many different queries I’d have)
Based on the little I know, it seems to me off-the-shelf embeddings wouldn't be able to match the embedding of my query with the embeddings for the relevant log lines, given the complexity of this task.
2) Is it going to be even feasible (cost/performance-wise) to use embeddings when one has a firehose of data coming through, or is it better suited for a mostly-static corpus of data (e.g. your typical corporate documentation or product catalog)?
I know that I can achieve something similar with a Code Interpreter-like approach, so in theory I could build a multi-step reasoning agent that, starting from my query and the data, would try to (1) discover the schema and then (2) crunch the data to try to get to my answer, but I don't know how scalable this approach would be in practice.
Just to clarify - are you wanting the LLM itself to identify what an "anomalous latency" would be based on the data itself? If so, then I don't think this will help you at all until we can actually fit the logs into the context.
What RAG is doing here is using embeddings and a vector store to identify close pieces of information. For example, "in this django project add a textfield" will be very close to documentation in the django docs that says "textfield", and it will then add that to the prompt so the LLM has the relevant docs in its context.
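To make the mechanics concrete, here is a deliberately minimal sketch of that retrieval step. The bag-of-words "embedding" is a toy stand-in for a real sentence-embedding model (which is what an actual RAG system would use); the docs and query are made up for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline would call a
    # trained sentence-embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "TextField is a large text field for the Django ORM",
    "ForeignKey defines a many-to-one relationship",
]
query = "in this django project add a textfield"
q = embed(query)
# Retrieval: pick the closest doc, then append it to the LLM prompt.
best = max(docs, key=lambda d: cosine(q, embed(d)))
```

The retrieved `best` snippet is what gets prepended to the user's question before it ever reaches the model - the model itself plays no part in the retrieval.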
The problem is that you'll need a heuristic to identify at least "potentially anomalous" and even then you'll still have to make sure there's enough context for it to know "is this a normal daily fluctuation".
A multi-step agent is definitely what you want. You could have it build a SQL query itself: for "were there any high-latency requests yesterday?" it might identify that it should filter on time, and possibly design the query to determine what counts as "high".
---
At the moment I don't think it's well suited to identifying when the "latency is abnormally high". However, if you have some other system/human identify heuristics to feed to the LLM, it may then be able to at least answer the query.
I was trying to understand if there is an opportunity to introduce some of this technology to solve "anomaly detection" on large amounts of structured data, where "anomaly" might be an incredibly overloaded term (it might imply a performance regression, a security issue, etc). That is a business need I have today.
It seems that what is possible today is an assistant that can aid a user to get to these answers faster (by, for instance, suggesting a SQL query based on the schema, etc). Again, roughly the equivalent of what Code Interpreter does, just without the local environment limitations.
From your questions it looks like you are only interested in the R part. RAG implies the retrieval step is then used to augment a user prompt.
To answer 1, a good heuristic would be "can a human reasonably familiar with the terminology answer questions about the meaning?" If a human would need extra info to make sense of your data then so would an LLM.
This is where RAG typically comes in. For example, if you had documentation about ClassName and FunctionName, a retrieval model might be able to find the most likely candidates from a file containing full definitions of these classes and functions, then pass that info into the LLM appended to your query.
For 2: it depends on whether the firehose is the queries or the data. If queries are coming in very quickly, you can probably keep up as long as the volume isn't too high, since you can batch requests and get responses fairly quickly.
If the firehose is the data going into the vector DB, then you might have some difficulty inserting and indexing the data fast enough.
For this kind of structured data and these kinds of structured queries, it may be more useful to stick to a data query language (SQL, or some analytics engine).
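To illustrate why plain SQL goes a long way here: a single query can already flag outliers without any model involved. The rule below (latency above 3x the per-component average) is one deliberately crude, assumed definition of "anomalous" - and note the caveat that a big outlier inflates its own baseline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (component TEXT, latency_ms REAL)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("api", 50), ("api", 55), ("api", 60), ("api", 900),
                  ("db", 10), ("db", 12)])

# Flag rows whose latency exceeds 3x the per-component average --
# one possible (and deliberately crude) anomaly heuristic.
sql = """
SELECT l.component, l.latency_ms
FROM logs l
JOIN (SELECT component, AVG(latency_ms) AS avg_ms
      FROM logs GROUP BY component) a
  ON l.component = a.component
WHERE l.latency_ms > 3 * a.avg_ms
"""
anomalies = conn.execute(sql).fetchall()
```

A real analytics engine would let you use windows, percentiles, or a trailing baseline instead of a global average, but the shape of the solution is the same: the heuristic lives in the query, not in an embedding.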
Thanks. I wonder if a reasonable approach could then be to first insert the data into a data-warehouse-like database suitable for analytics, and then use an LLM application to (1) generate SQL queries that could answer my question, reasoning about the schema, and (2) potentially summarize the output result set. It could still result in a significant boost in productivity.
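A minimal sketch of step (1), assuming a SQLite-style warehouse for illustration: introspect the live schema and splice it into the prompt, so the model reasons about real columns rather than guessed ones. The table name, question, and prompt wording are all hypothetical; the model's reply would be executed and its result set summarized in a second call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE latency (ts TEXT, component TEXT, p99_ms REAL)")

# Step 1a: pull the actual schema so the LLM sees real column names.
schema = "\n".join(
    row[0] for row in conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table'")
)

# Step 1b: build the prompt. The model's SQL answer (not shown) would
# be executed, and the result set fed back for summarization.
question = "which components regressed in p99 latency this week?"
prompt = f"Schema:\n{schema}\n\nWrite a SQL query answering: {question}"
```

Grounding the prompt in the introspected schema is what keeps the generated SQL from hallucinating tables and columns that don't exist.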
Indeed, that is a promising path. Fundamentally you still want to rely on a human to figure out what analytics are interesting to consider, then have the LLM act as a helper that generates queries corresponding to the analytics.