NSCCN/PlantCaduceus: a plant DNA foundation model for representation learning on plant genome sequences and cross-species functional prediction.

🚀 PlantCAD2 is here! (paper)

A new DNA foundation model for angiosperms, with LoRA fine-tuned models for accessible chromatin, gene expression, and protein translation.

PlantCAD overview
Quick Start
Model summary
Installation
Quick example
Usage guides
Citations

PlantCAD overview

PlantCaduceus (PlantCAD for short) is a plant DNA language model based on the Caduceus architecture, which extends the efficient, linear-time Mamba sequence-modeling framework with bidirectionality and reverse-complement equivariance, both designed specifically for DNA sequences. PlantCAD is pre-trained on a curated dataset of 16 angiosperm genomes. It showed state-of-the-art cross-species performance in predicting translation initiation sites (TIS), translation termination sites (TTS), splice donor sites, and splice acceptor sites. Used zero-shot, PlantCAD can identify genome-wide deleterious mutations and known causal variants in Arabidopsis, sorghum, and maize.
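
To make "reverse-complement equivariance" concrete, here is a minimal toy sketch. It is not PlantCAD code: `gc_track` is a hypothetical stand-in for a per-base model, chosen only because it treats both strands symmetrically. The property being illustrated is that scoring the reverse-complement strand gives the same per-base values as scoring the forward strand and reversing them.

```python
# Toy illustration of reverse-complement (RC) equivariance; not PlantCAD code.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_track(seq: str) -> list:
    """Hypothetical per-base 'model': 1 for G or C, 0 otherwise."""
    return [1 if base in "GC" else 0 for base in seq]


seq = "ATGGCGTTA"
rc = reverse_complement(seq)

# Equivariance: scores on the RC strand equal the reversed forward-strand scores.
assert gc_track(rc) == gc_track(seq)[::-1]
print(seq, "->", rc)
```

In a real RC-equivariant network the output channels are also permuted when the strand is swapped; the toy above has a single channel, so only the position axis is reversed.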

Quick Start

New to PlantCAD? Try our Google Colab demo - no installation required!

For local use: follow the installation instructions, then open notebooks/examples.ipynb to get started.

Model summary

Pre-trained models have been uploaded to HuggingFace 🤗: PlantCAD and PlantCAD2.

Model	Max Input Length	Model Size	Embedding Size
PlantCAD
PlantCaduceus_l20	512bp	20M	384
PlantCaduceus_l24	512bp	40M	512
PlantCaduceus_l28	512bp	128M	768
PlantCaduceus_l32	512bp	225M	1024
PlantCAD2
PlantCAD2-Small	8192bp	88M	768
PlantCAD2-Medium	8192bp	311M	1024
PlantCAD2-Large	8192bp	694M	1536

⚠️ Important: The “Max Input Length” is a hard limit — your input sequences cannot exceed this length. Use -contextSize 512 for PlantCAD models and up to -contextSize 8192 for PlantCAD2 models. See Model Recommendations for guidance on which model to use.
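
Because the maximum input length is a hard limit, longer sequences have to be cut into windows before they reach the model. The sketch below shows one way to do that. `split_into_windows` and its `stride` parameter are hypothetical helpers, not part of the PlantCAD tooling, and the sketch assumes one token per base with no extra special tokens counted against the limit.

```python
from typing import List, Optional


def split_into_windows(
    seq: str, max_len: int = 512, stride: Optional[int] = None
) -> List[str]:
    """Split seq into windows of at most max_len bases.

    stride is the step between window starts; it defaults to max_len,
    which gives non-overlapping windows.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    step = max_len if stride is None else stride
    if step <= 0:
        raise ValueError("stride must be positive")

    windows = []
    for start in range(0, len(seq), step):
        windows.append(seq[start:start + max_len])
        if start + max_len >= len(seq):
            break  # this window already reaches the end of the sequence
    return windows


genome = "ACGT" * 300  # 1,200 bp toy sequence
windows = split_into_windows(genome, max_len=512)
overlapping = split_into_windows(genome, max_len=512, stride=256)
print([len(w) for w in windows])
print(len(overlapping))
```

Use `max_len=512` for PlantCAD models and up to `max_len=8192` for PlantCAD2 models. An overlapping stride keeps some flanking context around bases that would otherwise sit at a window edge, at the cost of more forward passes.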

Installation

Option	Best for
Google Colab	Beginners — no installation required
Local installation	Regular use — requires NVIDIA GPU
Docker	Reproducible environments

Quick example

Get sequence embeddings with PlantCAD:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

device = 'cuda:0'
tokenizer = AutoTokenizer.from_pretrained('kuleshov-group/PlantCaduceus_l32')
model = AutoModelForMaskedLM.from_pretrained(
    'kuleshov-group/PlantCaduceus_l32', trust_remote_code=True
).to(device)

sequence = "CTTAATTAATATTGCCTTTGTAATAACGCGCGAAACACAAATCTTCTCTGCCTAATGCAGTAGTCATGTGTTGACTCCTTCAAAATTTCCAAGAAGTTAGTGGCTGGTGTGTCATTGTCTTCATCTTTTTTTTTTTTTTTTTAAAAATTGAATGCGACATGTACTCCTCAACGTATAAGCTCAATGCTTGTTACTGAAACATCTCTTGTCTGATTTTTTCAGGCTAAGTCTTACAGAAAGTGATTGGGCACTTCAATGGCTTTCACAAATGAAAAAGATGGATCTAAGGGATTTGTGAAGAGAGTGGCTTCATCTTTCTCCATGAGGAAGAAGAAGAATGCAACAAGTGAACCCAAGTTGCTTCCAAGATCGAAATCAACAGGTTCTGCTAACTTTGAATCCATGAGGCTACCTGCAACGAAGAAGATTTCAGATGTCACAAACAAAACAAGGATCAAACCATTAGGTGGTGTAGCACCAGCACAACCAAGAAGGGAAAAGATCGATGATCG"

input_ids = tokenizer.encode_plus(
    sequence, return_tensors="pt", return_attention_mask=False,
    return_token_type_ids=False
)["input_ids"].to(device)

with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)

embeddings = outputs.hidden_states[-1].to(torch.float32).cpu().numpy()

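# Average forward and reverse complement embeddings.
# The second half of the channels holds the reverse-complement representation
# with its channel order flipped, so flip the channel axis to align it.
hidden_size = embeddings.shape[-1] // 2
forward = embeddings[..., 0:hidden_size]
reverse = embeddings[..., hidden_size:][..., ::-1]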
averaged_embeddings = (forward + reverse) / 2
print(averaged_embeddings.shape)

See notebooks/examples.ipynb for more detailed examples.

Usage guides

Guide	Description
Zero-shot SNP & Region Scoring	Score variants (VCF) or genomic regions (BED) using log-likelihood ratios
Zero-shot SV Scoring	Score structural variants (deletions & insertions)
XGBoost Classifiers	Train or use pre-trained classifiers for TIS, TTS, splice sites
In-silico Mutagenesis	Large-scale simulation and analysis of genetic variants
Fine-tuned PlantCAD2 Models	LoRA models for chromatin, expression, translation
Zero-shot Evaluation	PlantCAD2 zero-shot benchmark results
Pre-training	Pre-train PlantCAD models from scratch, or fine-tune them
Model Recommendations	Which model to use, inference speed benchmarks, GPU memory guide
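
The zero-shot scoring and in-silico mutagenesis guides both rest on comparing allele probabilities at a position. As a minimal sketch of that arithmetic, the example below enumerates every single-base substitution of a short sequence and computes a log-likelihood ratio from a probability table. The function names, the `log(P(alt) / P(ref))` sign convention, and the probability values are assumptions made for illustration; the guides define the actual scripts and conventions.

```python
import math
from typing import Dict, Iterator, Tuple

BASES = "ACGT"


def enumerate_snvs(seq: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (position, ref, alt) for every possible single-base substitution."""
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield pos, ref, alt


def log_likelihood_ratio(probs: Dict[str, float], ref: str, alt: str) -> float:
    """Return log(P(alt) / P(ref)) for one position.

    Under this (assumed) convention a negative score means the model finds
    the alternate allele less likely than the reference allele.
    """
    return math.log(probs[alt]) - math.log(probs[ref])


# Hypothetical model output at one position; values are made up for illustration.
probs = {"A": 0.70, "C": 0.10, "G": 0.15, "T": 0.05}
score = log_likelihood_ratio(probs, ref="A", alt="T")
snvs = list(enumerate_snvs("ACGT"))
print(round(score, 3), len(snvs))
```

A sequence of length n has 3n possible single-base substitutions, which is why large-scale mutagenesis is usually batched on a GPU.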

Citations

If you find PlantCAD useful for your research, please consider citing our papers:

Zhai, J., Gokaslan, A., Schiff, Y., Berthel, A., Liu, Z. Y., Lai, W. L., Miller, Z. R., Scheben, A., Stitzer, M. C., Romay, M. C., Buckler, E. S., & Kuleshov, V. (2025). Cross-species modeling of plant genomes at single nucleotide resolution using a pretrained DNA language model. Proceedings of the National Academy of Sciences, 122(24), e2421738122. https://doi.org/10.1073/pnas.2421738122
Zhai, J., Gokaslan, A., Hsu, S. K., Chen, S. P., Liu, Z. Y., Marroquin, E., Czech, E., Cannon, B., Berthel, A., Romay, M. C., Pennell, M., Kuleshov, V.*, & Buckler, E. S.* (2025, November 19). PlantCAD2: A Long-Context DNA Language Model for Cross-Species Functional Annotation in Angiosperms. bioRxiv. https://doi.org/10.1101/2025.08.27.672609

Contact

Maintained by Jingjing Zhai.

For collaboration inquiries: jz963@cornell.edu or zhaijingjing603@gmail.com
General questions, bug reports, and feature requests: please open an issue