My question is: if I want to use an LLM to help me sift through a large amount of structured data, say for example all the logs for a bunch of different applications from a certain cloud environment, each with their own idiosyncrasies and specific formats (many GBs of data), can the RAG pattern be useful here?
Some of my concerns:
1) Is sentence embedding using an off-the-shelf embedding model going to capture the "meaning" of my logs? My answer is "probably not". For example, if a portion of my logs is in this format
Will I be able to get meaningful embeddings that satisfy a query such as "what components in my system exhibited an anomalously high latency lately?" (this is just an example among many different queries I’d have)
Based on the little I know, it seems to me off-the-shelf embeddings wouldn't be able to match the embedding of my query with the embeddings for the relevant log lines, given the complexity of this task.
2) Is it going to be even feasible (cost/performance-wise) to use embeddings when one has a firehose of data coming through, or is it better suited for a mostly-static corpus of data (e.g. your typical corporate documentation or product catalog)?
I know that I can achieve something similar with a Code Interpreter-like approach, so in theory I could build a multi-step reasoning agent that, starting from my query and the data, would try to (1) discover the schema and then (2) crunch the data to try to get to my answer, but I don't know how scalable this approach would be in practice.
Just to clarify - are you wanting the LLM itself to identify what an "anomalous latency" would be based on the data itself? If so, then I don't think this will help you at all until we can actually fit the logs into the context.
What RAG is doing here is using embeddings and a vector store to identify close pieces of information. For example, "in this django project add a textfield" will be very close to documentation in the django docs that says "textfield", and it will then add that to the prompt so the LLM has the relevant docs in its context.
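To make the mechanics concrete, here is a deliberately minimal sketch of that retrieval step. The bag-of-words "embedding" is a toy stand-in for a real sentence-embedding model (which is what an actual RAG system would use); the docs and query are made up for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline would call a
    # trained sentence-embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "TextField is a large text field for the Django ORM",
    "ForeignKey defines a many-to-one relationship",
]
query = "in this django project add a textfield"
q = embed(query)
# Retrieval: pick the closest doc, then append it to the LLM prompt.
best = max(docs, key=lambda d: cosine(q, embed(d)))
```

The retrieved `best` snippet is what gets prepended to the user's question before it ever reaches the model - the model itself plays no part in the retrieval.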
The problem is that you'll need a heuristic to identify at least "potentially anomalous" and even then you'll still have to make sure there's enough context for it to know "is this a normal daily fluctuation".
A multi-step agent is definitely what you want. You could have it build a SQL query itself: for "were there any high-latency requests yesterday?" it might identify that it should filter on time, and possibly design the query to determine what counts as "high".
---
At the moment I don't think it's well suited to identifying when the "latency is abnormally high". However, if you have some other system/human identify heuristics to feed to the LLM, it may then be able to at least answer the query.
I was trying to understand if there is an opportunity to introduce some of this technology to solve "anomaly detection" on large amounts of structured data, where "anomaly" might be an incredibly overloaded term (it might imply a performance regression, a security issue, etc). That is a business need I have today.
It seems that what is possible today is an assistant that can aid a user to get to these answers faster (by, for instance, suggesting a SQL query based on the schema, etc). Again, roughly the equivalent of what Code Interpreter does, just without the local environment limitations.
From your questions it looks like you are only interested in the R part. RAG implies the retrieval step is then used to augment a user prompt.
To answer 1, a good heuristic would be "can a human reasonably familiar with the terminology answer questions about the meaning?" If a human would need extra info to make sense of your data then so would an LLM.
This is where RAG typically comes in. For example, if you had documentation about ClassName and FunctionName, a retrieval model might be able to find the most likely candidates from a file containing full definitions of these classes and functions, then pass that info into the LLM appended to your query.
For 2: it depends on whether the firehose is the queries or the data. If queries are coming in very quickly, you can probably keep up as long as the volume isn't too high, since you can batch requests and get responses fairly quickly.
If the firehose is the data going into the vector DB, then you might have some difficulty inserting and indexing the data fast enough.
For this kind of structured data and these kinds of structured queries, it may be more useful to stick to a data query language (SQL, or some analytics engine).
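To illustrate why plain SQL goes a long way here: a single query can already flag outliers without any model involved. The rule below (latency above 3x the per-component average) is one deliberately crude, assumed definition of "anomalous" - and note the caveat that a big outlier inflates its own baseline.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (component TEXT, latency_ms REAL)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("api", 50), ("api", 55), ("api", 60), ("api", 900),
                  ("db", 10), ("db", 12)])

# Flag rows whose latency exceeds 3x the per-component average --
# one possible (and deliberately crude) anomaly heuristic.
sql = """
SELECT l.component, l.latency_ms
FROM logs l
JOIN (SELECT component, AVG(latency_ms) AS avg_ms
      FROM logs GROUP BY component) a
  ON l.component = a.component
WHERE l.latency_ms > 3 * a.avg_ms
"""
anomalies = conn.execute(sql).fetchall()
```

A real analytics engine would let you use windows, percentiles, or a trailing baseline instead of a global average, but the shape of the solution is the same: the heuristic lives in the query, not in an embedding.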
Thanks. I wonder if a reasonable approach could then be to first insert the data into a data-warehouse-like database suitable for analytics, and then use an LLM application to (1) generate SQL queries that could answer my question, reasoning about the schema, and (2) potentially summarize the output result set. It could still result in a significant boost in productivity.
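A minimal sketch of step (1), assuming a SQLite-style warehouse for illustration: introspect the live schema and splice it into the prompt, so the model reasons about real columns rather than guessed ones. The table name, question, and prompt wording are all hypothetical; the model's reply would be executed and its result set summarized in a second call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE latency (ts TEXT, component TEXT, p99_ms REAL)")

# Step 1a: pull the actual schema so the LLM sees real column names.
schema = "\n".join(
    row[0] for row in conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table'")
)

# Step 1b: build the prompt. The model's SQL answer (not shown) would
# be executed, and the result set fed back for summarization.
question = "which components regressed in p99 latency this week?"
prompt = f"Schema:\n{schema}\n\nWrite a SQL query answering: {question}"
```

Grounding the prompt in the introspected schema is what keeps the generated SQL from hallucinating tables and columns that don't exist.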
Indeed, that is a promising path. Fundamentally you still want to rely on a human to figure out what analytics are interesting to consider, then have the LLM act as a helper that generates queries corresponding to the analytics.