Show HN: CeLLama – Single cell annotation with local LLMs

bob88jg · on July 28, 2024

Can anyone explain how an LLM is useful here? The clustering is done traditionally, right? Then the LLM is given the centroids and asked to give a label? The assumption being that the LLM's training corpus already contained some mapping from gene up/down-regulation to clusters of differentiation?

celltalk · on July 28, 2024

Yes, it basically automates the cell type annotation process, plus gives a reasoning for the label.
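
A rough sketch of that workflow, assuming the per-cluster up- and down-regulated genes have already been computed by a conventional pipeline. The function name, the prompt wording, and the example genes are illustrative assumptions and are not taken from the CeLLama code:

```python
def build_annotation_prompt(cluster_id, up_genes, down_genes, tissue=None):
    """Format one cluster's marker genes as a question for an LLM.

    Hypothetical helper: the wording is an assumption, not CeLLama's prompt.
    """
    lines = [f"Cluster {cluster_id} from a single-cell RNA-seq dataset."]
    if tissue:
        lines.append(f"Tissue: {tissue}.")
    lines.append("Upregulated genes: " + ", ".join(up_genes) + ".")
    lines.append("Downregulated genes: " + ", ".join(down_genes) + ".")
    # Asking for the reasoning is what yields an explanation next to the label.
    lines.append("Name the most likely cell type and briefly explain why.")
    return "\n".join(lines)


# Example genes chosen for illustration only.
prompt = build_annotation_prompt(
    "3", ["CD3E", "CD3D", "IL7R"], ["MS4A1", "LYZ"], tissue="blood"
)
print(prompt)
```

The resulting string would then be sent to whichever local model is in use, and the reply parsed into a label plus an explanation.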

ibash · on July 28, 2024

How easy is it to check the results of cell annotations for mistakes?

Is it easy for a person to do, so that this mainly saves them a bunch of time getting a baseline? Or could this lead to a bunch of mislabeled data?

f6v · on July 29, 2024

I’ve been doing this for the past three years, and it’s very challenging. I think most of these tools do well on very broad cell types like those shown on the GitHub page. But if you’re working with, e.g., immune cell types, you can effortlessly label high-level clusters yourself.

The real challenge is identifying fine cell subsets, like different types of CD8 T cells: naïve, central memory, effector memory, Temra, etc. I don’t think it’s a problem that can be solved by a tool, though. One issue is that “classical” cell type definitions are based on flow cytometry, which uses antibodies to define cell types. These definitions don’t translate that well to scRNA-seq, as it’s a completely different protocol. For example, naïve and central memory cells are separated based on CD45 isoforms, and this information isn’t available in single-cell gene expression data from 10x Genomics (the most popular protocol).

Another issue is that people use ad-hoc cell type definitions. It’s common to come up with a random gene as a cell state marker, which means the definitions aren’t comparable between studies and a lot of manual curation is required. That makes some sense because “cell type” is an abstraction. In the real world, cell types and states are much more complex and are often continuous rather than discrete.

Taken together, building a cell type classifier is a very difficult task that depends much more on the data quality, context (which tissue the data comes from), and training labels. You can build a very decent classifier with a regression model if you have good data.

celltalk · on July 29, 2024

You’re right in most of your comments; however, a fine-tuned model will be able to pick up nuances that are missed by a general model. For instance, AVP is used as a marker for HSCs, yet it is not related to flow cytometry at all. If you fine-tune an LLM once with experts by your side, it will be able to give you fine-grained cell types such as T-cell subsets. Plus, a regression model won’t give you the reasoning behind a given cell type annotation.

gww · on July 28, 2024

Users of these kinds of tools should check that their marker genes are associated with the labelled cell types. There are known markers for many cell types across multiple organisms.
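
A minimal sketch of such a check, assuming you keep a small reference table of known markers per cell type. The table below is a tiny illustrative subset, and the function names and the 0.5 threshold are hypothetical choices, not part of any tool mentioned here:

```python
# Illustrative reference only; a real table would come from curated sources.
REFERENCE_MARKERS = {
    "T cell": {"CD3D", "CD3E", "CD2"},
    "B cell": {"MS4A1", "CD79A", "CD79B"},
    "Monocyte": {"LYZ", "CD14", "S100A8"},
}


def marker_support(label, cluster_genes, reference=REFERENCE_MARKERS):
    """Fraction of the label's known markers found among the cluster's genes.

    Returns None when the label is missing from the reference table.
    """
    known = reference.get(label)
    if not known:
        return None
    return len(known & set(cluster_genes)) / len(known)


def flag_suspect_labels(annotations, min_support=0.5, reference=REFERENCE_MARKERS):
    """Return cluster ids whose label is unknown or poorly supported."""
    suspects = []
    for cluster_id, (label, genes) in annotations.items():
        support = marker_support(label, genes, reference)
        if support is None or support < min_support:
            suspects.append(cluster_id)
    return suspects


annotations = {
    "0": ("T cell", ["CD3D", "CD3E", "IL7R"]),  # 2 of 3 known markers present
    "1": ("B cell", ["LYZ", "CD14"]),           # no B cell markers present
    "2": ("NK cell", ["NKG7"]),                 # label not in the reference
}
print(flag_suspect_labels(annotations))
```

Clusters flagged this way are the ones worth inspecting by hand first.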

celltalk · on July 28, 2024

It is still not 100% accurate, but it should be useful for baseline annotations.

dunomaybe · on July 28, 2024

Do you have any benchmark comparisons to e.g. the CellTypist corpus?

celltalk · on July 28, 2024

No, but the help is appreciated!

viraptor · on July 28, 2024

I'm surprised that this is using plain llama3.1 rather than a fine-tune. Have you checked the accuracy of the results on the common benchmarks? Also, given that it provides answers based only on the up/down lists (or did I miss something?), isn't that something that could be extracted into a more efficient lookup with only a 2D grid of weights? (Or 3D if there are group-of-genes effects.)
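
To make the 2D idea concrete, here is a toy sketch of what I mean: one weight per (cell type, gene) pair, with the score being the summed weights of the upregulated genes minus those of the downregulated genes. All names and numbers are made up for illustration, and nothing here is learned from data:

```python
# Hypothetical weight grid: rows are cell types, columns are genes.
WEIGHTS = {
    "T cell":   {"CD3E": 2.0,  "IL7R": 1.0,  "MS4A1": -1.5, "LYZ": -1.0},
    "B cell":   {"CD3E": -1.5, "IL7R": -0.5, "MS4A1": 2.0,  "LYZ": -1.0},
    "Monocyte": {"CD3E": -1.0, "IL7R": -0.5, "MS4A1": -1.0, "LYZ": 2.0},
}


def score_cell_types(up_genes, down_genes, weights=WEIGHTS):
    """Score each cell type; genes missing from the grid contribute 0."""
    scores = {}
    for cell_type, row in weights.items():
        up = sum(row.get(g, 0.0) for g in up_genes)
        down = sum(row.get(g, 0.0) for g in down_genes)
        scores[cell_type] = up - down
    return scores


def predict(up_genes, down_genes, weights=WEIGHTS):
    """Return the highest-scoring cell type."""
    scores = score_cell_types(up_genes, down_genes, weights)
    return max(scores, key=scores.get)


print(predict(["CD3E", "IL7R"], ["MS4A1"]))
```

A lookup like this would be cheap, but unlike the LLM it would not produce a written rationale for the label.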

celltalk · on July 28, 2024

I don’t have any benchmarking yet, but any help is appreciated. We do have a fine-tuned model for anyone interested.

givinguflac · on July 28, 2024

This looks very cool and extremely useful; where can I get my hands on the fine-tuned model?

gww · on July 28, 2024

This is really useful, thanks for sharing. My students and I tend to waste a lot of time annotating clusters and have not found a reasonable solution yet. This will be fun to try.

codingfisch · on July 28, 2024

I have written a neural network architecture (way smaller than llama) that can be trained to automate this process. Check out the Custom-Data-Tutorial in the repo!

GitHub: https://github.com/wwu-mmll/gatenet Paper: https://www.sciencedirect.com/science/article/pii/S001048252...

gww · on July 28, 2024

Will check it out. Thanks a lot

celltalk · on July 28, 2024

Thank you. Hope it is useful!

gww · on July 28, 2024

Could this also be adapted for gene set enrichment? For example, if I had one or more sets of genes from an ATAC-seq experiment, would it be able to guess their function or cell types?

celltalk · on July 28, 2024

It should be okay if you edit the base prompt properly.

gww · on July 28, 2024

Cool thanks

j_bum · on July 28, 2024

Interesting.

Is this deterministic? Any plans for publishing?

celltalk · on July 28, 2024

If you fix the seed and set the temperature to 0, it is. I did not have any intentions to publish it, but I might think about a two-page Bioinformatics paper if I have time.
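
For intuition on why those two settings matter, here is a toy sampler. It is only a sketch of the general idea and not how any particular LLM runtime implements decoding; the labels, logits, and function names are invented for illustration:

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick one label from a dict of logits.

    Temperature 0 is treated as greedy decoding (always the top logit);
    otherwise the choice is drawn from a temperature-scaled softmax.
    """
    if temperature == 0:
        return max(logits, key=logits.get)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = [math.exp(v - top) for v in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


def sample_sequence(logits, temperature, seed, n):
    """Draw n labels using a private random generator seeded with `seed`."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for _ in range(n)]


logits = {"T cell": 2.0, "NK cell": 1.5, "B cell": 0.1}  # made-up values

# Temperature 0: the same label regardless of the seed.
print(sample_sequence(logits, 0, seed=1, n=3))
# Temperature > 0: reproducible only when the seed is fixed.
print(sample_sequence(logits, 1.0, seed=7, n=5) == sample_sequence(logits, 1.0, seed=7, n=5))
```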

j_bum · on July 28, 2024

Thanks for sharing!