DNAcycP2 R package

Maintainer: Ji-Ping Wang, <jzwang@northwestern.edu>; Brody Kendall <curtiskendall2025@u.northwestern.edu>; Keren Li, <keren.li@northwestern.edu>

License: Artistic-2.0

Cite DNAcycP2 package:

Kendall, B., Jin, C., Li, K., Ruan, F., Wang, X.A., Wang, J.-P., DNAcycP2: improved estimation of intrinsic DNA cyclizability through data augmentation, Nucleic Acids Research, gkaf145, 2025

Introduction

DNAcycP2, short for DNA cyclizability Prediction v2, is an R package (a Python version is also available) developed for precise and unbiased prediction of intrinsic DNA cyclizability scores. The tool builds on a deep learning framework that integrates Inception and Residual network architectures with an LSTM layer, providing a robust and accurate prediction mechanism.

DNAcycP2 is an updated version of the earlier DNAcycP tool released by Li et al. in 2021. While DNAcycP was trained on loop-seq data from Basu et al. (2021), DNAcycP2 improves upon it by training on smoothed predictions derived from this dataset. The predicted score, termed C-score, exhibits high accuracy when compared with experimentally measured cyclizability scores obtained from the loop-seq assay. This makes DNAcycP2 a valuable tool for researchers studying DNA mechanics and structure.

Key differences between DNAcycP2 and DNAcycP

Following the release of DNAcycP, it was found that the intrinsic cyclizability scores derived from Basu et al. (2021) retained residual bias from the biotin effect, resulting in inaccuracies (Kendall et al., 2025). To address this, we combined data augmentation with moving-average smoothing to produce unbiased estimates of intrinsic DNA cyclizability for each sequence in the original training dataset. DNAcycP2 is a new model trained on this corrected data, using the same architecture as DNAcycP. This version also improves computational efficiency through parallelization options. Further details are available in Kendall et al. (2025).

To demonstrate the differences, we compared predictions from DNAcycP and DNAcycP2 in a yeast genomic region at base-pair resolution (Figure 1). The predicted biotin-dependent scores ($\tilde C_{26}$, $\tilde C_{29}$, and $\tilde C_{31}$, each from a separately trained model) show 10-bp periodic oscillations due to biotin biases, each with a distinct phase. DNAcycP's predictions improve on the biotin-dependent scores but still show substantial local fluctuations, likely caused by residual bias in the training data (the originally called intrinsic cyclizability score $\hat C_0$ from Basu et al. 2021). In contrast, DNAcycP2, trained on corrected intrinsic cyclizability scores, produces much smoother local-scale predictions, indicating a further improvement in removing the biotin bias.

The DNAcycP2 package retains all prediction functions from the original DNAcycP. The improved prediction model, based on smoothed data, can be accessed using the argument smooth=TRUE in the main function (see usage below).

Figure 1: Visualization of the differences between DNAcycP2 and DNAcycP.

Available formats of DNAcycP2 and DNAcycP

DNAcycP2 is available in three formats: a web server at http://DNAcycP.stats.northwestern.edu for real-time prediction and visualization of C-scores for sequences up to 20K bp; a standalone Python package, available for free download from https://github.com/jipingw/DNAcycP2-Python; and a new R package that will be available for free download from Bioconductor (as well as from https://github.com/jipingw/DNAcycP2).

Architecture of DNAcycP2

The core of DNAcycP2 is a deep learning architecture that combines an Inception-ResNet structure with an LSTM layer (IR+LSTM, Figure 2). It processes the sequence and its reverse complement separately; the two results are then averaged and detrended to produce the predicted intrinsic score.

Figure 2: A diagram of the DNAcycP2 architecture.
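The strand-averaging step can be illustrated with a small base-R sketch. It is not the package's implementation: `reverse_complement`, `toy_score`, and `strand_averaged_score` are hypothetical names, `toy_score` is only a stand-in for the trained network, and the detrending step is omitted because it is not specified here.

```r
# Reverse complement of a DNA string (base R only; A<->T, C<->G).
reverse_complement <- function(dna) {
  complemented <- chartr("ACGT", "TGCA", toupper(dna))
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

# Hypothetical stand-in for the trained network: fraction of "A" bases.
toy_score <- function(dna) {
  mean(strsplit(toupper(dna), "")[[1]] == "A")
}

# Score both strands separately, then average the two results.
strand_averaged_score <- function(dna, score_fn = toy_score) {
  mean(c(score_fn(dna), score_fn(reverse_complement(dna))))
}

seq50 <- "CATGACTGCAGCTAAAACGTTGACCTAGTCGTCAGTCTACGTACTAGCGT"
strand_averaged_score(seq50)
```

Because both strands contribute equally, the averaged score is the same whichever strand is supplied as input.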

DNAcycP2 required packages

  • basilisk
  • reticulate

Installation

Current best practice is to install from Bioconductor via BiocManager:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DNAcycP2")

Usage

The DNAcycP2 R package supports input sequences in two formats:

  • FASTA files: Sequence names must begin with >.
  • R objects: Input directly as an R object.

Unlike the web server, which processes only one sequence at a time, the R package allows multiple sequences in a single input. For R object inputs, each sequence (≥ 50bp) is treated as an individual input for prediction. However, for best performance, sequences of exactly 50bp are recommended.
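Before calling the prediction functions, it can help to check sequence lengths. The following is a minimal base-R sketch with made-up sequences; the variable names are illustrative and not part of the package.

```r
# Made-up example sequences of different lengths.
seqs <- c(
  short = "ACGTACGT",            # 8 bp, too short for prediction
  exact = strrep("ACGTA", 10),   # 50 bp, yields one prediction
  long  = strrep("ACGTA", 15)    # 75 bp, yields one prediction per 50 bp window
)

seq_lengths <- nchar(seqs)

# Keep only sequences that meet the 50 bp minimum.
usable <- seqs[seq_lengths >= 50]

# Number of 50 bp windows (and hence predictions) per usable sequence.
n_windows <- nchar(usable) - 50 + 1
n_windows
```

A sequence of length n (n ≥ 50) contains n - 49 overlapping 50bp windows, so a 50bp input produces exactly one score.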

Main Functions

The package provides two primary functions for cyclizability prediction:

  1. cycle: Takes an R object as input.
  2. cycle_fasta: Takes a file path as input.

Selecting the Prediction Model

Both functions use the smooth argument to specify the prediction model:

  • smooth=TRUE: DNAcycP2 (trained on smoothed data, recommended).
  • smooth=FALSE: DNAcycP (trained on original data).

Parallelization with cycle_fasta

The cycle_fasta function is designed for handling larger files and supports parallelization. To enable parallelization, use the following arguments:

  • n_cores: Number of cores to use (default: 1).
  • chunk_length: Sequence length (in bp) each core processes at a time (default: 100,000).
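How the package splits a sequence internally is not described here. As a rough illustration of the idea only, the sketch below splits a long sequence into chunks of at most `chunk_length` bp that overlap by 49 bp, so that every 50bp window falls in exactly one chunk. The function `chunk_ranges` is hypothetical and is not part of DNAcycP2.

```r
# Hypothetical helper: start/end coordinates of chunks that together
# cover every 50 bp window of a sequence exactly once.
chunk_ranges <- function(seq_length, chunk_length, window = 50L) {
  stopifnot(seq_length >= window, chunk_length >= window)
  n_windows <- seq_length - window + 1L    # windows in the whole sequence
  per_chunk <- chunk_length - window + 1L  # windows that fit in one chunk
  first_window <- seq(1L, n_windows, by = per_chunk)
  last_window <- pmin(first_window + per_chunk - 1L, n_windows)
  data.frame(start = first_window, end = last_window + window - 1L)
}

# A 1000 bp sequence split into chunks of at most 300 bp.
chunk_ranges(1000L, 300L)
```

Under this scheme, smaller chunks spread the work across cores more evenly, at the cost of re-reading 49 bp at each chunk boundary.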

We provide two simple example files with the package to show proper usage:

Example 1: fasta file input

ex1_file <- system.file("extdata", "ex1.fasta", package = "DNAcycP2")
ex1_smooth <- DNAcycP2::cycle_fasta(ex1_file,smooth=TRUE,n_cores=2,chunk_length=1000)
ex1_original <- DNAcycP2::cycle_fasta(ex1_file,smooth=FALSE,n_cores=2,chunk_length=1000)

cycle_fasta takes the file path as input (ex1_file). smooth=TRUE specifies that DNAcycP2 be used to make predictions. smooth=FALSE specifies that DNAcycP be used to make predictions. n_cores=2 specifies that 2 cores are to be used in parallel. chunk_length=1000 specifies that each core will predict on sequences of length 1000 at a given time.

Example 2: txt file input

ex2_file <- system.file("extdata", "ex2.txt", package = "DNAcycP2")
ex2 <- read.csv(ex2_file, header = FALSE)
ex2_smooth <- DNAcycP2::cycle(ex2$V1, smooth=TRUE)
ex2_original <- DNAcycP2::cycle(ex2$V1, smooth=FALSE)

cycle takes the sequences themselves as input. Here, ex2.txt is a text file in which each line is a DNA sequence. We first read the file (ex2_file) and then provide the sequences as input (ex2$V1).

Example 3: single sequence

If you want to predict C-scores for a single sequence, you can follow the same protocol as Example 1 or 2, depending on the input format. We have included two example files representing the same 1000bp stretch of S. cerevisiae sacCer3 Chromosome I (1:1000) in .fasta and .txt format.

First, we will consider the .fasta format:

ex3_fasta_file <- system.file(
    "extdata", "ex3_single_seq.fasta", package = "DNAcycP2"
)
ex3_fasta_smooth <- DNAcycP2::cycle_fasta(ex3_fasta_file,smooth=TRUE)
ex3_fasta_original <- DNAcycP2::cycle_fasta(ex3_fasta_file,smooth=FALSE)

The output (ex3_fasta_smooth or ex3_fasta_original) is a list with 1 entry named “cycle_1”.

Suppose we are interested only in the smooth (DNAcycP2), normalized predictions for the first 100bp of the sequence. This region contains 51 subsequences of length 50bp, defined by regions [1,50], [2,51], …, and [51,100], whose predictions are reported at positions 25, 26, …, and 75. We can access the outputs for these subsequences using the following command:

ex3_fasta_smooth[[1]][1:51,c("position", "C0S_norm")]

Or, equivalently,

ex3_fasta_smooth$cycle_1[1:51,c("position", "C0S_norm")]
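The relationship between 50bp windows and reported positions can be checked with a short base-R sketch. The helper `window_table` is hypothetical; it assumes, as in the example above, that the window starting at base i is reported at position i + 24.

```r
# Hypothetical helper: list every 50 bp window of a sequence together
# with the position at which its prediction is reported.
window_table <- function(seq_length, window = 50L) {
  start <- seq_len(seq_length - window + 1L)
  data.frame(
    start = start,
    end = start + window - 1L,
    position = start + window %/% 2L - 1L  # start + 24 for 50 bp windows
  )
}

windows_100bp <- window_table(100L)
head(windows_100bp, 3)
tail(windows_100bp, 1)
```

For a 100bp region this gives 51 rows, which is why the commands above select rows 1:51.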

Next, we will consider the .txt format:

ex3_txt_file <- system.file(
    "extdata", 
    "ex3_single_seq.txt", 
    package = "DNAcycP2"
)
ex3_txt <- read.csv(ex3_txt_file, header = FALSE)
ex3_txt_smooth <- DNAcycP2::cycle(ex3_txt$V1, smooth=TRUE)
ex3_txt_original <- DNAcycP2::cycle(ex3_txt$V1, smooth=FALSE)

The output (ex3_txt_smooth or ex3_txt_original) is a list with 1 entry (unnamed).

Note that ex3_fasta_smooth and ex3_txt_smooth are essentially equivalent. The only differences are possible slight rounding differences arising from the computation, and that the list ex3_fasta_smooth has named entries ('cycle_1') while ex3_txt_smooth does not. The same applies to ex3_fasta_original and ex3_txt_original.

Therefore, we can use a similar command to access the outputs for our subsequence of interest:

ex3_txt_smooth[[1]][1:51,c("position", "C0S_norm")]

If there is a sequence (or group of sequences) we want to make predictions on, we can also input them directly as strings. For example:

input_seq1 = 
    "CATGACTGCAGCTAAAACGTTGACCTAGTCGTCAGTCTACGTACTAGCGTAGCTATATCGAGTCTAGCGTCTAG"
input_seq2 = "ATCTTTTGTATATCAAAAGACTAGATCGATTAGCGTACGCCCCTGACTAGATAGATCG"
seq1_smooth = DNAcycP2::cycle(c(input_seq1), smooth=TRUE)
both_seqs_smooth = DNAcycP2::cycle(c(input_seq1, input_seq2), smooth=TRUE)

Example 4: DNAStringSet object input

library(Biostrings)
ex4_string_set <- readDNAStringSet(system.file("extdata", "ex1.fasta", package="DNAcycP2"))
ex4_smooth_output <- DNAcycP2::cycle(ex4_string_set, smooth=TRUE)

Here, ex4_string_set is a DNAStringSet object created with the readDNAStringSet function from the Biostrings package.

DNAcycP2 output – Normalized vs unnormalized

Both cycle_fasta and cycle output the prediction as a list object. Each item in the list (e.g. ex1_smooth$cycle_1) is a data.frame object with three columns. The first column is always position. When smooth=TRUE, the second and third columns are C0S_norm and C0S_unnorm; when smooth=FALSE, they are C0_norm and C0_unnorm.
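To illustrate working with output of this shape, the sketch below builds a mock list of data frames with the column names given above and stacks them into one table. The scores are made-up placeholders rather than real predictions, and `mock_output` and `combined` are illustrative names only.

```r
# Mock output shaped like the list described above; the scores are
# made-up placeholders, not real predictions.
mock_output <- list(
  cycle_1 = data.frame(
    position = 25:27,
    C0S_norm = c(-0.31, -0.12, 0.08),
    C0S_unnorm = c(-0.40, -0.29, -0.17)
  ),
  cycle_2 = data.frame(
    position = 25:26,
    C0S_norm = c(0.52, 0.47),
    C0S_unnorm = c(0.09, 0.06)
  )
)

# Stack the per-sequence data frames and record the source sequence of each row.
combined <- do.call(rbind, lapply(names(mock_output), function(id) {
  cbind(sequence = id, mock_output[[id]])
}))
rownames(combined) <- NULL

# Mean normalized score per sequence.
tapply(combined$C0S_norm, combined$sequence, mean)
```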

In DNAcycP, the model was trained on the originally called intrinsic cyclizability score ($\hat C_0$ from Basu et al. 2021). The prediction based on this score is referred to as C0_unnorm. However, by definition, the cyclizability score from different loop-seq libraries may be subject to a systematic, library-specific constant difference (see Basu et al. 2021); it is therefore a relative measure and not directly comparable between libraries. For this reason, DNAcycP also provides a normalized version of the intrinsic cyclizability score. We standardized the $C_0$ score from the Tiling Library of loop-seq data (to have mean 0 and standard deviation 1) before model training. As a result, the 50bp sequences from the yeast genome have an intrinsic cyclizability score with mean roughly 0 and standard deviation roughly 1. Thus, for any sequence under prediction, the normalized C-score can be more informative about its cyclizability relative to the population.
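The standardization described above is an ordinary z-score transformation. The sketch below shows it on simulated placeholder values; these are not loop-seq data, and `normalize_with_reference` is a hypothetical helper.

```r
set.seed(1)

# Simulated placeholder scores standing in for a reference library.
raw_scores <- rnorm(1000, mean = -0.2, sd = 0.4)

# Standardize to mean 0 and standard deviation 1.
ref_mean <- mean(raw_scores)
ref_sd <- sd(raw_scores)
normalized <- (raw_scores - ref_mean) / ref_sd

# A new score is placed on the same scale using the reference mean and sd.
normalize_with_reference <- function(x) (x - ref_mean) / ref_sd
normalize_with_reference(c(-0.6, -0.2, 0.2))
```

On this scale, a value of 1 means the sequence scores one standard deviation above the reference population average.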

Likewise, in DNAcycP2, we obtained an improved estimate of the intrinsic cyclizability score for the Tiling Library loop-seq data (referred to as $\hat C_0^s$) through data augmentation and smoothing. The predictions from models trained on the normalized and unnormalized $\hat C_0^s$ values are referred to as C0S_norm and C0S_unnorm, respectively.
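As a toy illustration of why moving-average smoothing suppresses a periodic bias, the sketch below adds a 10-sample oscillation to a constant baseline and averages over a window of 10. The simulated signal and the window width are assumptions chosen for illustration; this is not the augmentation and smoothing procedure of Kendall et al. (2025).

```r
# Simulated signal: constant baseline plus a period-10 oscillation
# standing in for a periodic bias (illustrative values only).
position <- 1:200
baseline <- 0.5
periodic_bias <- 0.3 * sin(2 * pi * position / 10)
biased <- baseline + periodic_bias

# Moving average over 10 consecutive positions; the ends are NA
# because a full window is not available there.
smoothed <- as.numeric(stats::filter(biased, rep(1 / 10, 10), sides = 2))

range(biased)
range(smoothed, na.rm = TRUE)
```

Any window that spans exactly one full period averages the oscillation to zero, leaving only the baseline.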

If every sequence has length exactly 50bp, both items in the list will be vectors of doubles corresponding to the predicted value for the sequence at the relevant index.

Otherwise (if there is at least one sequence with length >50bp), both items in the list will be lists of vectors corresponding to the predicted values for each subsequence of length 50bp at the relevant list index. For example, as ex2 contains 100 sequences each of length 250bp, ex2_smooth$C0S_norm[[1]] contains the normalized C-scores for every 50bp subsequence of the first sequence in ex2, in order. That is, ex2_smooth$C0S_norm[[1]][1] corresponds to positions 1-50 of the first sequence in ex2, ex2_smooth$C0S_norm[[1]][2] corresponds to positions 2-51, and so forth.

Other References

  • Li, K., Carroll, M., Vafabakhsh, R., Wang, X.A. and Wang, J.-P. (2021) DNAcycP: a deep learning tool for DNA cyclizability prediction. Nucleic Acids Research.

  • Basu, A., Bobrovnikov, D.G., Qureshi, Z., Kayikcioglu, T., Ngo, T.T.M., Ranjan, A., Eustermann, S., Cieza, B., Morgan, M.T., Hejna, M. et al. (2021) Measuring DNA mechanics on the genome scale. Nature, 589, 462-467.
