🚀 PlantCAD2 is here! (paper)

A new DNA foundation model for angiosperms, with LoRA fine-tuned models for accessible chromatin, gene expression, and protein translation.

PlantCAD overview

PlantCaduceus (PlantCAD for short) is a plant DNA language model built on the Caduceus architecture. Caduceus extends Mamba, an efficient linear-time sequence modeling framework, with bi-directionality and reverse complement equivariance, two properties designed specifically for DNA sequences. PlantCAD is pre-trained on a curated dataset of 16 angiosperm genomes. It shows state-of-the-art cross-species performance in predicting translation initiation sites (TIS), translation termination sites (TTS), splice donor sites, and splice acceptor sites. Its zero-shot scores also identify genome-wide deleterious mutations and known causal variants in Arabidopsis, sorghum, and maize.
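To make "reverse complement" concrete: a DNA sequence and its reverse complement describe the two strands of the same molecule, and a reverse complement equivariant model treats them consistently. The sketch below only illustrates the string operation itself. The helper name `reverse_complement` is hypothetical and is not part of the PlantCAD codebase.

```python
# Illustrative helper (not part of PlantCAD): compute the reverse complement
# of a DNA string. N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    forward = "ATGGCC"
    print(reverse_complement(forward))  # GGCCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check when preparing inputs for either strand.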

Quick Start

New to PlantCAD? Try our Google Colab demo - no installation required!

For local usage: See installation instructions here, then use notebooks/examples.ipynb to get started.

Model summary

Pre-trained models have been uploaded to HuggingFace 🤗: PlantCAD and PlantCAD2.

| Model | Max Input Length | Model Size | Embedding Size |
|---|---|---|---|
| PlantCAD | | | |
| PlantCaduceus_l20 | 512 bp | 20M | 384 |
| PlantCaduceus_l24 | 512 bp | 40M | 512 |
| PlantCaduceus_l28 | 512 bp | 128M | 768 |
| PlantCaduceus_l32 | 512 bp | 225M | 1024 |
| PlantCAD2 | | | |
| PlantCAD2-Small | 8192 bp | 88M | 768 |
| PlantCAD2-Medium | 8192 bp | 311M | 1024 |
| PlantCAD2-Large | 8192 bp | 694M | 1536 |

⚠️ Important: The “Max Input Length” is a hard limit — your input sequences cannot exceed this length. Use -contextSize 512 for PlantCAD models and up to -contextSize 8192 for PlantCAD2 models. See Model Recommendations for guidance on which model to use.
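Because the input length is a hard limit, longer sequences have to be cut down before they reach the model. The following is a minimal sketch of one way to do that, assuming you want a window centered on a position of interest. The function `center_window` and the two constants are hypothetical names for illustration and are not part of the PlantCAD scripts.

```python
# Limits taken from the model summary table (hypothetical constant names).
PLANTCAD_MAX_BP = 512
PLANTCAD2_MAX_BP = 8192


def center_window(sequence: str, position: int, context_size: int = PLANTCAD_MAX_BP) -> str:
    """Return at most `context_size` bases around the 0-based `position`.

    Near either end of the sequence the window is shifted inward, so the
    result is shorter than `context_size` only when the sequence itself is.
    """
    if not 0 <= position < len(sequence):
        raise ValueError("position is outside the sequence")
    half = context_size // 2
    start = max(0, position - half)
    end = min(len(sequence), start + context_size)
    start = max(0, end - context_size)  # re-anchor if the right edge was hit
    return sequence[start:end]


if __name__ == "__main__":
    genome = "ACGT" * 5000  # synthetic 20 kb sequence
    print(len(center_window(genome, 10_000)))                    # 512
    print(len(center_window(genome, 10_000, PLANTCAD2_MAX_BP)))  # 8192
```

A window produced this way never exceeds the chosen context size, so it can be tokenized without hitting the limit.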

Installation

| Option | Best for |
|---|---|
| Google Colab | Beginners (no installation required) |
| Local installation | Regular use (requires NVIDIA GPU) |
| Docker | Reproducible environments |

Quick example

Get sequence embeddings with PlantCAD:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

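# Local use requires an NVIDIA GPU (see Installation)
device = 'cuda:0'
tokenizer = AutoTokenizer.from_pretrained('kuleshov-group/PlantCaduceus_l32')
# trust_remote_code=True loads the custom model code shipped with the checkpoint
model = AutoModelForMaskedLM.from_pretrained(
    'kuleshov-group/PlantCaduceus_l32', trust_remote_code=True
).to(device)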

sequence = "CTTAATTAATATTGCCTTTGTAATAACGCGCGAAACACAAATCTTCTCTGCCTAATGCAGTAGTCATGTGTTGACTCCTTCAAAATTTCCAAGAAGTTAGTGGCTGGTGTGTCATTGTCTTCATCTTTTTTTTTTTTTTTTTAAAAATTGAATGCGACATGTACTCCTCAACGTATAAGCTCAATGCTTGTTACTGAAACATCTCTTGTCTGATTTTTTCAGGCTAAGTCTTACAGAAAGTGATTGGGCACTTCAATGGCTTTCACAAATGAAAAAGATGGATCTAAGGGATTTGTGAAGAGAGTGGCTTCATCTTTCTCCATGAGGAAGAAGAAGAATGCAACAAGTGAACCCAAGTTGCTTCCAAGATCGAAATCAACAGGTTCTGCTAACTTTGAATCCATGAGGCTACCTGCAACGAAGAAGATTTCAGATGTCACAAACAAAACAAGGATCAAACCATTAGGTGGTGTAGCACCAGCACAACCAAGAAGGGAAAAGATCGATGATCG"

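# Tokenize the sequence and move the token IDs to the GPU
input_ids = tokenizer.encode_plus(
    sequence, return_tensors="pt", return_attention_mask=False,
    return_token_type_ids=False
)["input_ids"].to(device)

# Run the model without gradient tracking and keep all hidden states
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)

# Last-layer hidden states as a float32 NumPy array
embeddings = outputs.hidden_states[-1].to(torch.float32).cpu().numpy()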

# Average forward and reverse complement embeddings
hidden_size = embeddings.shape[-1] // 2
forward = embeddings[..., 0:hidden_size]
reverse = embeddings[..., hidden_size:][:, ::-1, :]
averaged_embeddings = (forward + reverse) / 2
print(averaged_embeddings.shape)

See notebooks/examples.ipynb for more detailed examples.
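Downstream classifiers often need one fixed-size vector per sequence rather than one vector per token. A minimal sketch of mean pooling is shown below, assuming the input has the `(batch, seq_len, hidden)` layout printed by the quick example. The function name `mean_pool` is hypothetical, and the random array is only a stand-in for real model output.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Average over the sequence axis: (batch, seq_len, hidden) -> (batch, hidden)."""
    if token_embeddings.ndim != 3:
        raise ValueError("expected an array of shape (batch, seq_len, hidden)")
    return token_embeddings.mean(axis=1)


if __name__ == "__main__":
    # Synthetic stand-in for `averaged_embeddings` from the quick example.
    rng = np.random.default_rng(0)
    fake_embeddings = rng.normal(size=(1, 512, 1024)).astype(np.float32)
    print(mean_pool(fake_embeddings).shape)  # (1, 1024)
```

Mean pooling is only one option. Taking the embedding at a single position of interest may suit site-level tasks better.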

Usage guides

| Guide | Description |
|---|---|
| Zero-shot SNP & Region Scoring | Score variants (VCF) or genomic regions (BED) using log-likelihood ratios |
| Zero-shot SV Scoring | Score structural variants (deletions & insertions) |
| XGBoost Classifiers | Train or use pre-trained classifiers for TIS, TTS, splice sites |
| In-silico Mutagenesis | Large-scale simulation and analysis of genetic variants |
| Fine-tuned PlantCAD2 Models | LoRA models for chromatin, expression, translation |
| Zero-shot Evaluation | PlantCAD2 zero-shot benchmark results |
| Pre-training | Pre-train or fine-tune PlantCAD models from scratch |
| Model Recommendations | Which model to use, inference speed benchmarks, GPU memory guide |
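The zero-shot SNP guide scores variants with log-likelihood ratios. The sketch below shows the arithmetic on its own, assuming the score is the log of the alternate-allele probability divided by the reference-allele probability at a masked position. That definition, the function name `log_likelihood_ratio`, and the example probabilities are assumptions for illustration. The guide itself documents how the repository's scripts compute the score.

```python
import math
from typing import Mapping


def log_likelihood_ratio(probs: Mapping[str, float], ref: str, alt: str) -> float:
    """Return log(p_alt / p_ref) for one masked position.

    `probs` maps each nucleotide to the probability the model assigns to it.
    A negative score means the alternate allele is less likely than the reference.
    """
    for allele in (ref, alt):
        if allele not in probs:
            raise KeyError(f"no probability for allele {allele!r}")
        if probs[allele] <= 0.0:
            raise ValueError(f"probability for {allele!r} must be positive")
    return math.log(probs[alt]) - math.log(probs[ref])


if __name__ == "__main__":
    # Made-up probabilities, not real model output.
    masked_probs = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
    print(round(log_likelihood_ratio(masked_probs, ref="A", alt="C"), 4))  # -1.9459
```

Under this definition, strongly negative scores flag substitutions the model considers unlikely in that sequence context.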

Citations

If you find PlantCAD useful for your research, please consider citing our paper:

  • Zhai, J., Gokaslan, A., Schiff, Y., Berthel, A., Liu, Z. Y., Lai, W. L., Miller, Z. R., Scheben, A., Stitzer, M. C., Romay, M. C., Buckler, E. S., & Kuleshov, V. (2025). Cross-species modeling of plant genomes at single nucleotide resolution using a pretrained DNA language model. Proceedings of the National Academy of Sciences, 122(24), e2421738122. https://doi.org/10.1073/pnas.2421738122
  • Zhai J., Gokaslan A., Hsu SK., Chen SP., Liu ZY., Marroquin E., Czech E., Cannon B., Berthel A., Romay MC., Pennell M., Kuleshov V.*, Buckler ES.* PlantCAD2: A Long-Context DNA Language Model for Cross-Species Functional Annotation in Angiosperms. bioRxiv. 2025 Nov 19. https://doi.org/10.1101/2025.08.27.672609

Contact

Maintained by Jingjing Zhai.
