Multimodal-cfDNA-cfAI

Author: Song Liyang (SonglyPKU@163.com)
License: Apache-2.0

Overview

cfAI is a Transformer-based framework that represents each cfDNA (cell-free DNA) fragment as tokenized, multimodal vectors and uses a MultiTask Transformer to score ctDNA (circulating tumor DNA) likelihoods at molecule, gene, and sample levels.

The model was developed to (1) profile multi-omic cross-talk at single-cfDNA-molecule resolution, (2) increase the signal-to-noise ratio (SNR) through ctDNA enrichment, and (3) improve early cancer screening and liquid biopsy. cfAI achieved roughly 10-fold enrichment of cancer-derived signals over background noise, along with strong multi-cancer discrimination performance.

Key highlights:

Single-molecule multi-omics & vectorization: each cfDNA fragment is tokenized and vectorized across methylation, fragmentomics, end-motifs, histone-mark proxies, gene semantics, and 3D features.
Genome2Vec annotation: prebuilt vectorized genomic and epigenomic annotations provide semantic context attached to reads.
Transformer-based modeling: a MultiTask Transformer integrates cross-modal signals to produce molecule-, gene-, and sample-level scores.

This repository provides the code for processing NGS reads, performing cfDNA annotation, and running model pretraining and inference. It also includes the prebuilt vector annotation database and the trained model.

Repository layout (key files)

Multimodal-cfDNA-cfAI/
├─ LICENSE
├─ README.md
├─ paper/
├─ src/
│  ├─ bam2vec.py
│  ├─ reads_anno.py
│  ├─ data_preparing.py
│  └─ ai_dev.py
├─ embeds/
├─ configs/
├─ models/
└─ examples/
   ├─ sample_info.tsv
   └─ example.bam

0. Prepare environment

cfAI runs on Python 3.9. Install the dependencies, including pysam, numpy, pandas, pybedtools, torch, umap-learn, and scikit-learn, plus the bedtools command-line tool. Pin versions to those in the provided YAML file.
torch and pytorch-cuda are required; choose versions that match your GPU.
The code supports multi-GPU execution.
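
Before running the pipeline, it can help to confirm that the dependencies are discoverable. The following is a minimal sketch, not part of the repository: the function name `check_environment` is hypothetical, and the import names `umap` (for umap-learn) and `sklearn` (for scikit-learn) are the usual ones for those packages.

```python
import importlib.util
import shutil

# Import names for the packages listed above. umap-learn is imported as
# `umap` and scikit-learn as `sklearn`.
PYTHON_MODULES = ["pysam", "numpy", "pandas", "pybedtools", "torch", "umap", "sklearn"]
# bedtools is a command-line program rather than a Python module.
EXECUTABLES = ["bedtools"]


def check_environment(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return {name: bool} telling whether each dependency can be found."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    for exe in executables:
        # shutil.which returns None when the program is not on PATH.
        status[exe] = shutil.which(exe) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:12s} {'ok' if ok else 'MISSING'}")
```

This only checks presence, not versions; the provided YAML file remains the reference for exact versions.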

1. Convert cfDNA NGS reads to multi‑omics reads file

This step converts paired-end BAM files into a structured BED6+ file that contains per-read multi-omic features used downstream by Genome2Vec and the model pipeline.

Convert BAM -> BED6+

python src/bam2vec.py -i data/sample.bam -r data/hg38.fa -o output/sample_bed/ --motif_len 4 --min_mapq 30 --unmeth_clip 34

bam2vec.py extracts: genomic coordinates, insert size, strand, base one-hot (A/T/C/G), mismatches/indels (E), CpG methylation (M/U from XM tag), and 5’/3’ end motifs. Output is a tabular BED-like file with the following 15 columns: chr, start, end, read_name, insert_size, strand, A, T, C, G, E, M, U, motif_up, motif_down.
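
For downstream inspection, the 15-column output can be read back with a few lines of Python. This is a sketch under stated assumptions: the file is tab-delimited with no header row, and only start, end, and insert_size are converted to integers because the exact encoding of the remaining columns is not specified here. The names `parse_bed_line` and `read_bed` are hypothetical helpers, not functions from bam2vec.py.

```python
# Column order as listed above. Tab delimiter and missing header row are
# assumptions about the file layout.
BED_COLUMNS = [
    "chr", "start", "end", "read_name", "insert_size", "strand",
    "A", "T", "C", "G", "E", "M", "U", "motif_up", "motif_down",
]
# Only these columns are cast to int; all others are kept as strings.
INT_COLUMNS = ("start", "end", "insert_size")


def parse_bed_line(line):
    """Turn one tab-delimited line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(BED_COLUMNS):
        raise ValueError(f"expected {len(BED_COLUMNS)} columns, got {len(fields)}")
    record = dict(zip(BED_COLUMNS, fields))
    for key in INT_COLUMNS:
        record[key] = int(record[key])
    return record


def read_bed(path):
    """Yield one record per non-empty line of a BED6+ file."""
    with open(path) as handle:
        for line in handle:
            if line.strip():
                yield parse_bed_line(line)
```

For large files, `pandas.read_csv(path, sep="\t", header=None, names=BED_COLUMNS)` would be the more common choice under the same assumptions.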

Requirements:

Coordinate-sorted and indexed BAM
Bisulfite XM tags compatible with Bismark when methylation features are expected
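
To illustrate what the M and U columns are derived from: in a Bismark XM string, `Z` marks a methylated cytosine in CpG context and `z` an unmethylated one. The sketch below counts both. Whether bam2vec.py counts them in exactly this way is an assumption, the effect of `--unmeth_clip` is not modeled, and `count_cpg_calls` is a hypothetical helper name.

```python
def count_cpg_calls(xm_tag):
    """Count CpG methylation calls in a Bismark XM string.

    'Z' = methylated CpG, 'z' = unmethylated CpG. Other characters
    (CHG/CHH calls, '.', unknown context) are ignored.
    """
    methylated = xm_tag.count("Z")
    unmethylated = xm_tag.count("z")
    return methylated, unmethylated


# With pysam, the string would typically come from read.get_tag("XM")
# on an aligned read; a literal is used here to stay dependency-free.
m, u = count_cpg_calls("..Z..z.h.Z..x")
print(m, u)
```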

2. Annotate reads with Genome2Vec embeddings

reads_anno.py maps reads to the Genome2Vec annotation database stored in embeds/ and appends contextual embeddings/values to each read. Genome2Vec is a collection of vectorized genomic and epigenomic annotations designed to provide compact semantic context for each cfDNA fragment.

Genome2Vec contents (files stored in embeds/)

| Feature | Filename | Resolution / Notes | Description |
| --- | --- | --- | --- |
| Gene name & coordinates | `gene_name.bed` | gene TSS / loci | Nearest gene name, strand and distance to TSS. Optionally attach scGPT 512-d gene embedding. |
| Chromatin state (UMAP) | `chromHMM_200bp_UMAPembed.bed` | 200 bp; 4-d UMAP | chromHMM emission matrix reduced via UMAP → 4-dim embedding per state. |
| Insulation score (INS) | `40k_is.sort.bed` | 40 kb | Local insulation metric. |
| Directionality index (DI) | `40k_di.sort.bed` | 40 kb | DI for TAD/interaction directionality. |
| FIRE (frequently interacting regions) | `40k_fire.sort.bed` | 40 kb | FIRE score per bin. |
| A/B compartment | `250k_hesc_ab.sort.bed` | 250 kb | AB compartment call and score (hESC). |
| Hi-C 3D coordinates | `20k_hic.sort.bed` | 20 kb / diploid | Mat/pat 3D coordinates from Hi-C/Dip-C processed files. |
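
Because most Genome2Vec tracks are stored at a fixed resolution, one simple way to picture the lookup is to assign each fragment to the bin that contains its midpoint. The sketch below shows that idea only. Midpoint assignment is an assumption, reads_anno.py may instead rely on interval intersection through bedtools, and the dictionary keys in `RESOLUTIONS` are hypothetical labels for the resolutions listed above.

```python
# Bin sizes in base pairs, taken from the resolutions listed above.
# The keys are illustrative labels, not column or file names.
RESOLUTIONS = {
    "chromHMM": 200,
    "insulation_di_fire": 40_000,
    "hic_3d": 20_000,
    "ab_compartment": 250_000,
}


def bin_interval(start, end, resolution):
    """Return (bin_start, bin_end) of the fixed-size bin holding the fragment midpoint."""
    midpoint = (start + end) // 2
    bin_start = (midpoint // resolution) * resolution
    return bin_start, bin_start + resolution


def bins_for_fragment(start, end):
    """Map each track label to the bin that would annotate this fragment."""
    return {name: bin_interval(start, end, res) for name, res in RESOLUTIONS.items()}


print(bins_for_fragment(100_050, 100_217))
```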

Usage

reads_anno.py appends annotation fields such as near_gene_name, near_gene_strand, dist_TSS, chromHMM_name, chromHMM_UMAPemb_1..4, is_value, di_value, fi_value, ab_value, hic_mat{x,y,z}, hic_fat{x,y,z} and, if enabled, scGPT gene embedding columns (e.g. scGPT_emb_1..512).

Run:

python src/reads_anno.py -i output/sample_bed/sample.bed -r embeds -o output/anno/

The annotated output of multiple samples can then be processed in batch with data_preparing.py, which handles filtering, feature calculation, standardization, and metadata integration.

Run:

python src/data_preparing.py -s examples/sample_info.tsv -i output/anno/ -o data/prep/

data_preparing.py filters cfDNA fragments by distance to the TSS (default: dist_TSS ∈ [-8192, 8192]), calculates methylation ratios, and standardizes features.
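
The three operations can be sketched in plain Python as follows. This is an illustration under assumptions, not the code in data_preparing.py: the methylation ratio is assumed to be M / (M + U), standardization is assumed to be a z-score (shown on insert_size only), and the helper names and the `meth_ratio` / `insert_size_z` keys are hypothetical.

```python
import statistics

TSS_WINDOW = 8192  # default range stated above


def within_tss_window(dist_tss, window=TSS_WINDOW):
    """True when the fragment lies within [-window, window] of the TSS."""
    return -window <= dist_tss <= window


def methylation_ratio(m, u):
    """Assumed definition: M / (M + U); None when there are no CpG calls."""
    total = m + u
    return m / total if total else None


def zscore(values):
    """Standardize values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


def prepare(records):
    """Filter by TSS window, then add meth_ratio and a standardized insert size."""
    kept = [dict(r) for r in records if within_tss_window(r["dist_TSS"])]
    if not kept:
        return kept
    scaled = zscore([r["insert_size"] for r in kept])
    for record, z in zip(kept, scaled):
        record["meth_ratio"] = methylation_ratio(record["M"], record["U"])
        record["insert_size_z"] = z
    return kept
```

In practice the same steps map naturally onto pandas operations; the standard library is used here only to keep the sketch dependency-free.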

3. Model training & inference (`ai_dev.py`)

ai_dev.py contains the full training, resume and test logic. Configuration (paths, hyperparameters such as d_model, nhead, num_layers, seq_length, batch_size, lr, dna_length, proj_dims, etc.) is defined at the top of ai_dev.py in the Config class; modify those fields directly in the script prior to large runs.
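
As an illustration of what such a configuration block can look like, here is a minimal sketch using a dataclass. The field names follow the list above, but every default value is a placeholder, the field types (including `proj_dims` as a tuple) are assumptions, and the class name `ExampleConfig` is hypothetical. Consult the actual Config class in ai_dev.py for the real values.

```python
from dataclasses import dataclass


@dataclass
class ExampleConfig:
    # All defaults below are illustrative placeholders, not the
    # values shipped in ai_dev.py.
    d_model: int = 256
    nhead: int = 8
    num_layers: int = 4
    seq_length: int = 512
    batch_size: int = 32
    lr: float = 1e-4
    dna_length: int = 128
    proj_dims: tuple = (64, 32)
    checkpoint_dir: str = "models/"
    log_dir: str = "logs/"

    def validate(self):
        """Catch common mistakes before launching a long run."""
        # Multi-head attention splits d_model evenly across heads.
        if self.d_model % self.nhead != 0:
            raise ValueError("d_model must be divisible by nhead")
        if self.lr <= 0:
            raise ValueError("lr must be positive")
        if self.batch_size <= 0:
            raise ValueError("batch_size must be positive")


cfg = ExampleConfig(batch_size=64)
cfg.validate()
```

A cheap validation step of this kind is worth running before large jobs, since a mismatched d_model and nhead otherwise only fails once the model is built.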

Commands:

# train
python src/ai_dev.py train

# resume (provide --ckpt)
python src/ai_dev.py resume --ckpt models/ckpt_10000.pt

# test / inference
python src/ai_dev.py test --ckpt models/best.pt
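
For readers who want to wrap or extend these modes, the command shape above corresponds to a sub-command style interface. The sketch below shows one way such an interface could be declared with argparse. It is an assumption about structure, not a copy of ai_dev.py, and it assumes `--ckpt` is mandatory for both resume and test.

```python
import argparse


def build_parser():
    """Declare train / resume / test sub-commands mirroring the commands above."""
    parser = argparse.ArgumentParser(prog="ai_dev.py")
    sub = parser.add_subparsers(dest="mode", required=True)

    sub.add_parser("train", help="train from scratch")

    resume = sub.add_parser("resume", help="continue training from a checkpoint")
    resume.add_argument("--ckpt", required=True, help="checkpoint to resume from")

    test = sub.add_parser("test", help="run inference with a trained model")
    test.add_argument("--ckpt", required=True, help="checkpoint to evaluate")
    return parser


args = build_parser().parse_args(["resume", "--ckpt", "models/ckpt_10000.pt"])
print(args.mode, args.ckpt)
```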

Outputs and logs are controlled by paths set in ai_dev.py::Config (e.g. checkpoint_dir, log_dir). Test mode produces test_predictions.csv (location controlled by Config) with per-batch predictions and columns including: sample_id, gene, prdicted_origin, origin_confidence, health, reads_scores, last layer <cls> embeds.

The reads_scores column holds the tumor-derived prediction for the reads, while prdicted_origin gives the predicted tissue of origin of the cancer. The last layer <cls> embeds column holds the hidden representation of the batch of cfDNA fragments at the corresponding gene.
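
As a quick way to inspect the predictions, the rows can be grouped by sample and the predicted origins tallied. This is illustrative post-processing only and is not how cfAI derives its sample-level scores. It assumes a comma-delimited file with a header row containing the sample_id and prdicted_origin columns, and `origin_votes` is a hypothetical helper name.

```python
import csv
from collections import Counter, defaultdict


def origin_votes(handle):
    """Count predicted origins per sample from an open test_predictions.csv handle."""
    votes = defaultdict(Counter)
    for row in csv.DictReader(handle):
        # Column names are used exactly as they appear in the output file.
        votes[row["sample_id"]][row["prdicted_origin"]] += 1
    return votes


def summarize(votes):
    """Return {sample_id: (most frequent origin, number of rows supporting it)}."""
    return {sample: counts.most_common(1)[0] for sample, counts in votes.items()}
```

Typical use would be `with open(path, newline="") as fh: print(summarize(origin_votes(fh)))`, where `path` points to the test_predictions.csv location set in Config.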

Credits and citation

cfAI was written by Song Liyang at ByteDance. Follow the author at https://www.linkedin.com/in/liyang-song/.

Please cite the work:

Song Liyang et al., Multimodal AI for Single cfDNA Profiling and Cancer Screening (manuscript).

Contributing & contact

Please open issues or pull requests for bugs or feature requests. For data requests and direct correspondence, contact SonglyPKU@163.com.

License

Apache License 2.0