cfAI is a Transformer-based framework that represents each cfDNA (cell-free DNA) fragment as tokenized, multimodal vectors and uses a MultiTask Transformer to score ctDNA (circulating tumor DNA) likelihoods at molecule, gene, and sample levels.
The model was developed to: 1) profile multi‑omic cross‑talk at single‑cfDNA‑molecule resolution, 2) increase signal‑to‑noise ratio (SNR) by ctDNA enrichment, 3) enhance early cancer screening and liquid biopsies. cfAI achieved ~10‑fold enrichment of cancer‑derived signals over background noise and reached strong multi‑cancer discrimination performance.
Key highlights:
Single-molecule multi-omics & vectorization: each cfDNA fragment is tokenized and vectorized across methylation, fragmentomics, end-motifs, histone-mark proxies, gene semantics, and 3D features.
Genome2Vec annotation: prebuilt vectorized genomic and epigenomic annotations provide semantic context attached to reads.
Transformer-based modeling: a MultiTask Transformer integrates cross-modal signals to produce molecule-, gene-, and sample-level scores.
This repository provides the code for processing NGS reads, performing cfDNA annotation, and running model pretraining and inference. It also includes the prebuilt vector annotation database and the trained model.
0. Prepare environment
cfAI runs on Python 3.9. Install the dependencies, including pysam, numpy, pandas, bedtools, pybedtools, torch, umap-learn, and scikit-learn, at the versions pinned in the provided YAML file. torch and pytorch-cuda are required; choose versions that match your GPU. The code supports multi‑GPU runs.
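Before running the pipeline, it can help to confirm that the listed packages are importable. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical and not part of this repository. bedtools is a command-line tool rather than a Python module, so it is not covered here.

```python
import importlib.util
import sys

# Import names for the Python packages listed above. Note that umap-learn is
# imported as `umap` and scikit-learn as `sklearn`.
REQUIRED_MODULES = ["pysam", "numpy", "pandas", "pybedtools", "torch", "umap", "sklearn"]


def check_environment(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


print("Python", ".".join(map(str, sys.version_info[:3])))
for name, found in check_environment().items():
    print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Any module reported as MISSING should be installed from the provided YAML file before continuing.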
1. Convert cfDNA NGS reads to multi‑omics reads file
This step converts paired-end BAM files into a structured BED6+ file that contains per-read multi-omic features used downstream by Genome2Vec and the model pipeline.
bam2vec.py extracts: genomic coordinates, insert size, strand, base one-hot (A/T/C/G), mismatches/indels (E), CpG methylation (M/U from XM tag), and 5’/3’ end motifs. Output is a tabular BED-like file with the following 15 columns: chr, start, end, read_name, insert_size, strand, A, T, C, G, E, M, U, motif_up, motif_down.
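To make the 15-column layout concrete, here is a minimal sketch of a reader for this file. It assumes the file is tab-separated with no header row and that the numeric columns hold integers; neither detail is confirmed by the text above. The function name `read_bed6_plus` and the example row are hypothetical.

```python
import csv

# Column order as listed above for the bam2vec.py output.
BED_COLUMNS = [
    "chr", "start", "end", "read_name", "insert_size", "strand",
    "A", "T", "C", "G", "E", "M", "U", "motif_up", "motif_down",
]
# Assumption: these columns hold integers; the rest are kept as strings.
INT_COLUMNS = {"start", "end", "insert_size", "A", "T", "C", "G", "E", "M", "U"}


def read_bed6_plus(lines):
    """Yield one dict per read from an iterable of tab-separated lines."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if len(row) != len(BED_COLUMNS):
            raise ValueError(f"expected {len(BED_COLUMNS)} columns, got {len(row)}")
        record = dict(zip(BED_COLUMNS, row))
        for name in INT_COLUMNS:
            record[name] = int(record[name])
        yield record


# A made-up row for illustration only.
example = ["chr1\t10000\t10167\tread_001\t167\t+\t40\t45\t41\t41\t1\t3\t2\tCCCA\tTGGA\n"]
records = list(read_bed6_plus(example))
print(records[0]["read_name"], records[0]["insert_size"], records[0]["M"], records[0]["U"])
```

For a real file, pass an open file handle instead of the `example` list.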
Requirements:
Coordinate-sorted and indexed BAM
Bisulfite XM tags compatible with Bismark when methylation features are expected
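For readers unfamiliar with the Bismark XM tag: it is a per-base string in which `Z` marks a methylated cytosine in CpG context and `z` an unmethylated one. The sketch below shows one way the M and U counts could be derived from such a string. It is an illustration under that assumption, not necessarily how bam2vec.py computes them, and `count_cpg_calls` is a hypothetical name.

```python
def count_cpg_calls(xm_tag):
    """Count methylated (Z) and unmethylated (z) CpG calls in a Bismark XM string."""
    methylated = xm_tag.count("Z")
    unmethylated = xm_tag.count("z")
    return methylated, unmethylated


# With pysam, the string would typically come from read.get_tag("XM").
m, u = count_cpg_calls("..Z...z....Z..h..x.")
print(m, u)  # 2 1
```

Note that in Bismark's own alphabet `u`/`U` denote cytosines in unknown context. This sketch assumes the U column in the output table instead means unmethylated CpG, as the "M/U" description suggests.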
2. Annotate reads with Genome2Vec embeddings
reads_anno.py maps reads to the Genome2Vec annotation database stored in embeds/ and appends contextual embeddings/values to each read. Genome2Vec is a collection of vectorized genomic and epigenomic annotations designed to provide compact semantic context for each cfDNA fragment.
Genome2Vec contents (files stored in embeds/)

| Feature | Filename | Resolution / Notes | Description |
| --- | --- | --- | --- |
| Gene name & coordinates | gene_name.bed | gene TSS / loci | Nearest gene name, strand and distance to TSS. Optionally attach scGPT 512-d gene embedding. |
| Chromatin state (UMAP) | chromHMM_200bp_UMAPembed.bed | 200 bp; 4-d UMAP | chromHMM emission matrix reduced via UMAP → 4-dim embedding per state. |
| Insulation score (INS) | 40k_is.sort.bed | 40 kb | Local insulation metric. |
| Directionality index (DI) | 40k_di.sort.bed | 40 kb | DI for TAD/interaction directionality. |
| FIRE (frequently interacting regions) | 40k_fire.sort.bed | 40 kb | FIRE score per bin. |
| A/B compartment | 250k_hesc_ab.sort.bed | 250 kb | AB compartment call and score (hESC). |
| Hi-C 3D coordinates | 20k_hic.sort.bed | 20 kb / diploid | Mat/pat 3D coordinates from Hi-C/Dip-C processed files. |
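Most of these annotations are stored as fixed-resolution bins, so annotating a read amounts to finding the bin that contains it. The sketch below illustrates that lookup with a binary search. It assumes non-overlapping, half-open BED intervals and assigns each read by its midpoint. reads_anno.py may use a different rule, such as bedtools intersection. The helper names and bin values are made up for illustration.

```python
import bisect
from collections import defaultdict


def build_index(rows):
    """Group (chrom, start, end, value) rows by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end, value in rows:
        by_chrom[chrom].append((start, end, value))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [s for s, _, _ in intervals]
        index[chrom] = (starts, intervals)
    return index


def lookup(index, chrom, position):
    """Return the value of the bin containing position, or None if uncovered."""
    if chrom not in index:
        return None
    starts, intervals = index[chrom]
    i = bisect.bisect_right(starts, position) - 1
    if i < 0:
        return None
    start, end, value = intervals[i]
    return value if start <= position < end else None


# Toy 40 kb insulation-score bins (values are invented).
bins = [("chr1", 0, 40000, 0.12), ("chr1", 40000, 80000, -0.35)]
index = build_index(bins)
read_start, read_end = 39950, 40117
midpoint = (read_start + read_end) // 2
is_value = lookup(index, "chr1", midpoint)
print(midpoint, is_value)  # 40033 -0.35
```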
Usage
reads_anno.py appends annotation fields such as near_gene_name, near_gene_strand, dist_TSS, chromHMM_name, chromHMM_UMAPemb_1..4, is_value, di_value, fi_value, ab_value, hic_mat{x,y,z}, hic_fat{x,y,z} and, if enabled, scGPT gene embedding columns (e.g. scGPT_emb_1..512).
The annotated output from multiple samples can then be processed in batch with data_preparing.py, which handles filtering, feature calculation, standardization, and metadata integration.
Specifically, data_preparing.py filters cfDNA reads by distance to the TSS (default dist_TSS ∈ [-8192,8192]), calculates methylation ratios, and standardizes features.
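The three operations named above can be sketched in a few lines. This is an illustration only: it assumes the methylation ratio is defined as M / (M + U) and that standardization means a z-score, neither of which is stated explicitly here. The helper names and the example reads are hypothetical.

```python
import statistics

TSS_WINDOW = 8192  # default range stated above: dist_TSS in [-8192, 8192]


def within_tss_window(read, window=TSS_WINDOW):
    """True if the read lies within the inclusive TSS window."""
    return -window <= read["dist_TSS"] <= window


def methylation_ratio(read):
    """M / (M + U); None when the read carries no CpG calls."""
    total = read["M"] + read["U"]
    return read["M"] / total if total else None


def zscore(values):
    """Standardize to zero mean and unit variance (population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up reads for illustration.
reads = [
    {"dist_TSS": -120, "M": 3, "U": 1, "insert_size": 167},
    {"dist_TSS": 9000, "M": 0, "U": 4, "insert_size": 150},
    {"dist_TSS": 8192, "M": 0, "U": 0, "insert_size": 180},
    {"dist_TSS": 40, "M": 1, "U": 3, "insert_size": 154},
]
kept = [r for r in reads if within_tss_window(r)]
ratios = [methylation_ratio(r) for r in kept]
scaled = zscore([r["insert_size"] for r in kept])
print(len(kept), ratios)  # 3 [0.75, None, 0.25]
```

Returning None for reads without CpG calls keeps "no information" distinct from "fully unmethylated". How data_preparing.py treats such reads is not specified here.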
3. Model training & inference (ai_dev.py)
ai_dev.py contains the full training, resume and test logic. Configuration (paths, hyperparameters such as d_model, nhead, num_layers, seq_length, batch_size, lr, dna_length, proj_dims, etc.) is defined at the top of ai_dev.py in the Config class; modify those fields directly in the script prior to large runs.
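As a rough picture of what such a configuration class can look like, here is a sketch built from the field names mentioned above. Every value is an illustrative placeholder rather than a repository default, the type of proj_dims is a guess, and the class is named `ExampleConfig` to make clear it is not the actual Config in ai_dev.py.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExampleConfig:
    # All values are illustrative placeholders, not the repository defaults.
    d_model: int = 256
    nhead: int = 8
    num_layers: int = 4
    seq_length: int = 512
    batch_size: int = 32
    lr: float = 1e-4
    dna_length: int = 200
    proj_dims: List[int] = field(default_factory=lambda: [128, 64])  # assumed type
    checkpoint_dir: str = "checkpoints/"
    log_dir: str = "logs/"

    def __post_init__(self):
        # Multi-head attention splits d_model evenly across heads.
        if self.d_model % self.nhead != 0:
            raise ValueError("d_model must be divisible by nhead")


cfg = ExampleConfig(d_model=128, nhead=4)
print(cfg.d_model // cfg.nhead)  # 32
```

The divisibility check reflects a general constraint of multi-head attention and catches a common misconfiguration before a long run starts.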
Outputs and logs are controlled by paths set in ai_dev.py::Config (e.g. checkpoint_dir, log_dir). Test mode produces test_predictions.csv (location controlled by Config) with per-batch predictions and columns including: sample_id, gene, prdicted_origin, origin_confidence, health, reads_scores, last layer <cls> embeds.
reads_scores holds the tumor-derived prediction scores for the reads, while prdicted_origin gives the predicted tissue of origin of the cancer. last layer <cls> embeds holds the hidden representation of the batch of cfDNA reads at the corresponding gene.
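One possible downstream use of test_predictions.csv is to collapse the per-gene rows into a single tissue-of-origin call per sample. The sketch below sums origin_confidence per predicted origin and takes the largest total. This aggregation rule is an assumption for illustration and is not part of the repository. It also assumes origin_confidence parses as a float. The rows and tissue labels are made up.

```python
import csv
import io
from collections import defaultdict


def summarize_origins(csv_file):
    """Pick, per sample, the origin with the largest summed origin_confidence."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in csv.DictReader(csv_file):
        totals[row["sample_id"]][row["prdicted_origin"]] += float(row["origin_confidence"])
    return {sample: max(origins, key=origins.get) for sample, origins in totals.items()}


# Made-up rows using a subset of the columns named above.
demo = io.StringIO(
    "sample_id,gene,prdicted_origin,origin_confidence\n"
    "S1,TP53,lung,0.9\n"
    "S1,KRAS,colon,0.4\n"
    "S1,EGFR,lung,0.7\n"
    "S2,APC,colon,0.8\n"
)
result = summarize_origins(demo)
print(result)  # {'S1': 'lung', 'S2': 'colon'}
```

The column name prdicted_origin is spelled as it appears in the output file.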
Multimodal-cfDNA-cfAI
Author: Song Liyang (SonglyPKU@163.com)
License: Apache-2.0
Credits and citation
cfAI was written by Song Liyang at ByteDance. Please follow https://www.linkedin.com/in/liyang-song/.
Please cite the work:
Contributing & contact
Please open issues or pull requests for bugs or feature requests. For data requests and direct correspondence contact: SonglyPKU@163.com.
License
Apache License 2.0