A simple R package that helps annotate single-cell experiments such as single-cell RNA-seq. Given the up- and down-regulated genes for each cell cluster, a local LLM guesses the cell type annotation and generates a detailed overall report.
Can anyone explain how an LLM is useful here? The clustering is done traditionally, right? Then the LLM is given the centroids and asked to give a label? The assumption being that the LLM's training corpus already contained some mapping from gene up/down-regulation to clusters of differentiation?
I’ve been doing this for the past three years, and it’s very challenging. I think most of these tools do well on very broad cell types like those shown on the GitHub page. But if you’re working with, e.g., immune cell types, you can effortlessly label high-level clusters yourself.
The real challenge is identifying fine cell subsets, like different types of CD8 T cells: naïve, central memory, effector memory, Temra, etc. I don’t think it’s a problem that can be solved by a tool, though. One issue is that “classical” cell type definitions are based on flow cytometry, which uses antibodies to define cell types. These definitions don’t translate well to scRNA-seq, as it’s a completely different protocol. For example, naïve and central memory cells are separated based on CD45 isoforms, and this information isn’t available in single-cell gene expression data from 10x Genomics (the most popular protocol).
Another issue is that people use ad hoc cell type definitions. It’s common to pick a random gene as a cell state marker, which means the definitions aren’t comparable between studies and a lot of manual curation is required. That makes some sense, because “cell type” is an abstraction. In the real world, cell types and states are much more complex and are often continuous rather than discrete.
Taken together, building a cell type classifier is a very difficult task that depends much more on the data quality, the context (which tissue the data comes from), and the training labels than on the model itself. You can build a very decent classifier with a regression model if you have good data.
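To make the regression point concrete, here is a minimal sketch of a multinomial (softmax) logistic regression on marker-gene expression, written in plain Python. The expression values are invented for illustration and the three-gene panel is an assumption; a real classifier would be trained on thousands of cells and genes, most likely with an established library.

```python
import math

# Toy log-normalized expression for three well-known marker genes.
# The numbers are invented for illustration; they are not measured data.
GENES = ["CD3E", "CD19", "LYZ"]
LABELS = ["T cell", "B cell", "Monocyte"]
X_TRAIN = [
    [2.5, 0.1, 0.2], [2.0, 0.0, 0.3],  # T cells
    [0.1, 2.2, 0.1], [0.2, 2.8, 0.0],  # B cells
    [0.0, 0.1, 3.0], [0.3, 0.2, 2.4],  # monocytes
]
Y_TRAIN = [0, 0, 1, 1, 2, 2]  # indices into LABELS


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(W, b, x):
    """One linear score per class: W[k] . x + b[k]."""
    return [sum(w * v for w, v in zip(row, x)) + bk for row, bk in zip(W, b)]


def train(X, y, n_classes, lr=0.1, epochs=200):
    """Fit softmax regression with plain stochastic gradient descent."""
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, target in zip(X, y):
            probs = softmax(logits_for(W, b, x))
            for k in range(n_classes):
                # Gradient of cross-entropy w.r.t. the class-k logit.
                err = probs[k] - (1.0 if k == target else 0.0)
                for j in range(n_features):
                    W[k][j] -= lr * err * x[j]
                b[k] -= lr * err
    return W, b


def predict(W, b, x):
    """Return the label with the highest linear score."""
    scores = logits_for(W, b, x)
    return LABELS[scores.index(max(scores))]


W, b = train(X_TRAIN, Y_TRAIN, n_classes=len(LABELS))
print(predict(W, b, [2.2, 0.1, 0.1]))  # a new cell with high CD3E
```

The learned weights form a small genes-by-classes table, so you can inspect which genes drive each label.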
You’re right on most points; however, a fine-tuned model will be able to pick up nuances that a general model misses. For instance, AVP is used as a marker for HSCs, yet it has nothing to do with flow cytometry. If you fine-tune an LLM once with experts by your side, it will be able to give you fine-grained cell types such as T-cell subsets. Plus, a regression model won’t give you the reasoning behind a given cell type annotation.
Users of these kinds of tools should check that their marker genes are associated with the labelled cell types. There are known markers for many cell types across multiple organisms.
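One way to run that check is to measure how many known markers for the assigned label actually appear among a cluster's up-regulated genes. The sketch below assumes a tiny hand-written marker table and a hypothetical cluster; in practice you would pull markers from a curated database for your organism and tissue.

```python
# Commonly cited human marker genes, hard-coded here as an assumption.
# A real check would load a curated marker database instead.
KNOWN_MARKERS = {
    "T cell": {"CD3D", "CD3E", "CD3G"},
    "B cell": {"CD79A", "CD79B", "MS4A1"},
    "Monocyte": {"CD14", "LYZ", "S100A8"},
}


def marker_support(label, up_genes, known=KNOWN_MARKERS):
    """Return (fraction of known markers found, set of markers found)."""
    expected = known.get(label)
    if not expected:
        # Unknown label: nothing to check against.
        return 0.0, set()
    found = expected & set(up_genes)
    return len(found) / len(expected), found


def best_supported_label(up_genes, known=KNOWN_MARKERS):
    """Label whose known markers overlap most with the up-regulated genes."""
    return max(known, key=lambda lab: marker_support(lab, up_genes, known)[0])


# Hypothetical up-regulated genes for one cluster that a tool labelled "T cell".
cluster_up = ["CD3E", "CD3D", "IL7R", "CCR7"]
frac, found = marker_support("T cell", cluster_up)
print(f"T cell: {frac:.2f} of known markers up-regulated: {sorted(found)}")
```

A low fraction for the assigned label, or a higher fraction for a different label, is a cue to look at that cluster by hand.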
I'm surprised that this is using plain llama3.1 rather than a fine-tune. Have you checked the accuracy of the results on the common benchmarks? Also, given that it provides answers based only on the up/down lists (or did I miss something?), isn't that something that could be extracted into a more efficient lookup with only a 2D grid of weights? (Or 3D if there are group-of-genes effects.)
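A rough sketch of what such a 2D lookup could look like: one weight per (gene, cell type) pair, added for up-regulated genes and subtracted for down-regulated ones. The weights and the three-gene table are hypothetical placeholders, not values extracted from any model.

```python
CELL_TYPES = ["T cell", "B cell", "Monocyte"]

# Hypothetical 2D grid: one row per gene, one column per entry in CELL_TYPES.
WEIGHTS = {
    "CD3E":  [1.0, -0.5, -0.5],
    "MS4A1": [-0.5, 1.0, -0.5],
    "LYZ":   [-0.5, -0.5, 1.0],
}


def score(up, down):
    """Sum weights for up-regulated genes and subtract them for down-regulated ones."""
    totals = [0.0] * len(CELL_TYPES)
    for gene in up:
        for i, w in enumerate(WEIGHTS.get(gene, [])):  # unknown genes are skipped
            totals[i] += w
    for gene in down:
        for i, w in enumerate(WEIGHTS.get(gene, [])):
            totals[i] -= w
    return dict(zip(CELL_TYPES, totals))


def annotate(up, down):
    """Return the cell type with the highest total score."""
    scores = score(up, down)
    return max(scores, key=scores.get)


print(annotate(up=["MS4A1"], down=["CD3E", "LYZ"]))
```

Group-of-genes effects would need an extra index (gene pair by cell type), which is where the 3D version comes in.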
This is really useful, thanks for sharing. My students and I tend to waste a lot of time annotating clusters and have not found a reasonable solution yet. This will be fun to try.
I have written a neural network architecture (way smaller than llama) that can be trained to automate this process. Check out the Custom-Data-Tutorial in the repo!
Could this also be adapted for gene set enrichment? For example, if I had one or more sets of genes from an ATAC-seq experiment, would it be able to guess their function or cell types?
If you set the seed and temperature to 0, it is. I had no intention of publishing it, but I might consider a two-page Bioinformatics paper if I have time.