Show HN: CeLLama – Single cell annotation with local LLMs (github.com/celvoxes)
132 points by celltalk on July 28, 2024 | 22 comments
A simple R package that helps annotate single-cell experiments such as single-cell RNA-seq. Given the up- and down-regulated genes for each cell cluster, a local LLM guesses the cell type annotation and generates an extensive overall report.


Can anyone explain how an LLM is useful here? The clustering is done traditionally, right? Then the LLM is given the centroids and asked to give a label? The assumption being that the LLM's training corpus already contained some mapping from gene up/down regulation to clusters of differentiation?


Yes, it basically automates the cell type annotation process, plus gives a reasoning for the label.
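
To make the exchange above concrete, here is a minimal sketch of how one cluster's marker genes could be turned into a text prompt for a local LLM. The function name, prompt wording, and example genes are assumptions for illustration, not the package's actual prompt or API.

```python
def build_annotation_prompt(cluster_id, up_genes, down_genes, tissue=None):
    """Format one cluster's marker genes as a plain-text prompt.

    Hypothetical helper: the wording below is an assumption, not the
    prompt used by the package discussed in this thread.
    """
    lines = ["You are annotating clusters from a single-cell RNA-seq experiment."]
    if tissue:
        lines.append(f"Tissue: {tissue}")
    lines.append(f"Cluster: {cluster_id}")
    lines.append("Up-regulated genes: " + ", ".join(up_genes))
    lines.append("Down-regulated genes: " + ", ".join(down_genes))
    lines.append("Name the most likely cell type and briefly explain your reasoning.")
    return "\n".join(lines)


# Example cluster with illustrative genes; the resulting string would be
# sent to whatever local model is in use.
prompt = build_annotation_prompt(
    "3", ["CD3E", "CD8A", "GZMK"], ["MS4A1", "CD79A"], tissue="PBMC"
)
print(prompt)
```

The clustering and differential expression steps stay outside the LLM; the model only sees the gene lists and returns a label plus its reasoning.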


How easy is it to check the results of cell annotations for mistakes?

Is it easy for a person to do, and this will save them a bunch of time getting a baseline? Or could this lead to a bunch of mislabeled data?


I’ve been doing this for the past three years, and it’s very challenging. I think most of these tools do well on very broad cell types like those shown on the GitHub page. But if you’re working with, e.g., immune cell types, you can effortlessly label high-level clusters yourself.

The real challenge is identifying fine cell subsets, like different types of CD8 T cells: naïve, central memory, effector memory, Temra, etc. I don’t think it’s a problem that can be solved by a tool, though. One issue is that “classical” cell type definitions are based on flow cytometry, which uses antibodies to define cell types. These definitions don’t translate that well to scRNA-seq, as it’s a completely different protocol. For example, naïve and central memory cells are separated based on CD45 isoforms, and this information isn’t available in single-cell gene expression data from 10x Genomics (the most popular protocol).

Another issue is that people use ad-hoc cell type definitions. It’s common to come up with a random gene as a cell state marker, which means the definitions aren’t comparable between studies and a lot of manual curation is required. That makes some sense because “cell type” is an abstraction. In the real world, cell types and states are much more complex and are often continuous rather than discrete.

Taken together, building a cell type classifier is a very difficult task that depends much more on the data quality, context (which tissue the data comes from), and training labels. You can build a very decent classifier with a regression model if you have good data.


You’re right in most of your comments; however, a fine-tuned model will be able to pick up nuances that a general model misses. For instance, AVP is used as a marker for HSCs, yet it is not related to flow cytometry at all. If you fine-tune an LLM once with experts by your side, it will be able to give you fine-grained cell types such as T-cell subsets. Plus, a regression model won’t give you the reasoning behind a given cell type annotation.


Users of these kinds of tools should check that their marker genes are associated with the labelled cell types. There are known markers for many cell types across multiple organisms.
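
One way to run the sanity check suggested above is to compare each predicted label against a reference marker list. This is a rough sketch: the `KNOWN_MARKERS` table is a tiny, incomplete illustration (a real check would use a curated marker database), and `marker_support` is a hypothetical helper name.

```python
# Tiny illustrative reference; real marker lists are much larger and
# depend on tissue and organism.
KNOWN_MARKERS = {
    "T cell": {"CD3D", "CD3E", "CD2"},
    "B cell": {"MS4A1", "CD79A", "CD79B"},
    "Monocyte": {"CD14", "LYZ", "S100A8"},
}


def marker_support(label, cluster_up_genes, known=KNOWN_MARKERS):
    """Return (markers found, fraction of reference markers found).

    The fraction is None when no reference exists for `label`, so a
    missing reference is not confused with zero support.
    """
    expected = known.get(label)
    if not expected:
        return set(), None
    found = expected & set(cluster_up_genes)
    return found, len(found) / len(expected)


# A label backed by its markers versus one that is not.
supported = marker_support("B cell", ["MS4A1", "CD79A", "IGHM"])
unsupported = marker_support("B cell", ["CD3E", "CD8A"])
print(supported, unsupported)
```

Labels with low or zero support are the ones worth reviewing by hand before trusting the annotation.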


It is still not 100% accurate, but it should be useful for baseline annotations.


Do you have any benchmark comparisons to e.g. the CellTypist corpus?


No, but the help is appreciated!


I'm surprised that this is using plain llama3.1 rather than a fine-tune. Have you checked the accuracy of the results on the common benchmarks? Also, given that it provides answers based only on the up/down lists (or did I miss something?), isn't that something that could be extracted into a more efficient lookup with only a 2D grid of weights? (Or 3D if there are group-of-genes effects.)
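
For readers wondering what such a lookup might look like, here is a minimal sketch of a gene-by-cell-type weight grid scored against the up/down lists. The weights are made-up numbers for illustration, not learned values, and `score_cluster` is a hypothetical name.

```python
# Hypothetical 2D grid: gene -> cell type -> weight. The numbers are
# invented for illustration only.
WEIGHTS = {
    "CD3E": {"T cell": 1.0, "B cell": 0.0},
    "CD8A": {"T cell": 0.8, "B cell": 0.0},
    "MS4A1": {"T cell": 0.0, "B cell": 1.0},
    "CD79A": {"T cell": 0.0, "B cell": 0.9},
}


def score_cluster(up_genes, down_genes, weights=WEIGHTS):
    """Add weights for up-regulated genes and subtract them for
    down-regulated genes, per cell type. Unknown genes are ignored."""
    scores = {}
    for sign, genes in ((1.0, up_genes), (-1.0, down_genes)):
        for gene in genes:
            for cell_type, w in weights.get(gene, {}).items():
                scores[cell_type] = scores.get(cell_type, 0.0) + sign * w
    return scores


scores = score_cluster(["CD3E", "CD8A"], ["MS4A1"])
best = max(scores, key=scores.get)
print(best, scores)
```

A grid like this is fast and inspectable, but it only knows the genes and cell types it was built with and gives no free-text reasoning. Group-of-genes effects would need an extra dimension or interaction terms.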


I don’t have any benchmarking yet, but any help is appreciated. We do have a fine-tuned model for anyone interested.


This looks very cool and extremely useful; where can I get my hands on the fine-tuned model?


This is really useful, thanks for sharing. My students and I tend to waste a lot of time annotating clusters and have not found a reasonable solution yet. This will be fun to try.


I have written a neural network architecture (way smaller than llama) that can be trained to automate this process. Check out the Custom-Data-Tutorial in the repo!

GitHub: https://github.com/wwu-mmll/gatenet Paper: https://www.sciencedirect.com/science/article/pii/S001048252...


Will check it out. Thanks a lot


Thank you. Hope it is useful!


Could this also be adapted for gene set enrichment? For example, if I had one or more sets of genes from an ATAC-seq experiment, would it be able to guess their function or cell types?


It should be okay if you edit the base prompt properly.


Cool thanks


Interesting.

Is this deterministic? Any plans for publishing?


If you set the seed and temperature to 0, it is. I did not have any intentions to publish it, but I might think about a 2-pager Bioinformatics paper if I have time.
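
As a toy illustration of why those two settings matter, the sketch below samples from a fixed list of scores. It is not how any particular LLM runtime is implemented; `sample_token` and the example logits are assumptions, and real inference stacks may have other sources of variation.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick an index from `logits`.

    Temperature 0 is treated as greedy argmax, so no randomness is used.
    Otherwise a softmax over logits / temperature is sampled with `rng`.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.3]

# Temperature 0: always the highest-scoring index, whatever the seed.
greedy = [sample_token(logits, 0, random.Random()) for _ in range(5)]

# Temperature > 0: repeatable only when the seed is fixed.
rng_a = random.Random(42)
run_a = [sample_token(logits, 1.0, rng_a) for _ in range(10)]
rng_b = random.Random(42)
run_b = [sample_token(logits, 1.0, rng_b) for _ in range(10)]
print(greedy, run_a == run_b)
```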


Thanks for sharing!



