tar zxfv cycle_finder_<version>.tar.gz
cd cycle_finder_<version>
make
cp cycle_finder <installation_path>
SYNOPSIS
single mode
cycle_finder all -f <SHORT_READS>.fastq
compare mode
cycle_finder all -f1 <SHORT_READS1.fastq> -f2 <SHORT_READS2.fastq>
TEST
# Get the test dataset, which includes simulated reads from repeats-inserted E. coli genomes.
wget https://github.com/rkajitani/cycle_finder/releases/download/v1.0.0/test_data.tar.gz
tar xzfv test_data.tar.gz
# Test for tandem repeats.
# The reads, genome, tandem repeat unit (253 bp; mutation rate, 2%), and correct result are
# reads.fq, genome.fa, rep_unit.fa, and result/*, respectively.
cd test_data/tandem/
bash cmd.sh
# Output: out_T.fa out_T.tsv
cd ../..
# Test for interspersed repeats.
# The reads, genome, interspersed repeat unit (253 bp; mutation rate, 2%), and correct result are
# reads.fq, genome.fa, rep_unit.fa, and result/*, respectively.
cd test_data/interspersed/
bash cmd.sh
# Output: out_I.fa out_I.tsv
TRF (Tandem Repeat Finder; >= 4.07b) G. Benson, “Tandem repeats finder : a program to analyze DNA sequence,”
vol. 27, no. 2, pp. 573–580, 1999.
BLAST+ (>= 2.2.31+) S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller,
and D. J. Lipman, “Gapped BLAST and PSI-BLAST: A new generation of proten
database search programs,” Nucleic Acids Res., vol. 25, no. 17
pp. 3389–3402, 1997.
CD-HIT (>= 4.6) W. Li, L. Fu, B. Niu, S. Wu, and J. Wooley, “Ultrafast clustering algorithm
for metagenomic sequence analysis,” Brief. Bioinform., vol. 13, no. 6, pp
656–668, 2012.
USAGE
COMMON OPTIONS
-t INT : Number of threads (default 1)
-o STR : Prefix of output files (default out)
cycle_finder extract [OPTIONS]
Extracting high frequency k-mer from short reads.
INPUT OPTIONS
single mode
-f FILE1 [FILE2 ...]: Reads file (fasta or fastq format)
compare mode
-f1 FILE1 [FILE2 ...]: Reads file (fasta or fastq format)
-f2 FILE1 [FILE2 ...]: Reads file (fasta or fastq format)
OTHER OPTIONS
-c1 INT : copy difference for detecting repeat
-c2 INT : copy difference for copy estimation
single mode
-jf FILE : if you have jf file that output by jellyfish,
you can skip jellyfish count.
-dm FILE : if you have dump file that output by jellyfish,
you can skip jellyfish dump.
-hs FILE : if you have histo file that output by jellyfish,
you can skip jellyfish histo.
compare mode
-jf1 FILE : if you have jf file that output by jellyfish,
you can skip jellyfish count(corresponds to FILE1).
-jf2 FILE : if you have jf file that output by jellyfish,
you can skip jellyfish count(corresponds to FILE2).
-dm1 FILE : if you have dump file that output by jellyfish,
you can skip jellyfish dump(corresponds to FILE1).
-dm2 FILE : if you have dump file that output by jellyfish,
you can skip jellyfish dump(corresponds to FILE2).
-hs1 FILE : if you have histo file that output by jellyfish,
you can skip jellyfish histo(corresponds to FILE1).
-hs2 FILE : if you have histo file that output by jellyfish,
you can skip jellyfish histo(corresponds to FILE2).
OUTPUT FILES
PREFIX_for_detect${c1}.fa : high frequency k-mer for detecting repeat
PREFIX_for_estimate${c2}.fa : high frequency k-mer for copy estimating
PREFIX_kmer_peak : k-mer peak
PREFIX.jf /PREFIX.jf1 / PREFIX.jf2 : jf file output by jellyfish
PREFIX.dm / PREFIX.dm1 / PREFIX.dm2 : dump file output by jellyfish
PREFIX.hs / PREFIX.hs1 / PREFIX.hs2 : histogram of k-mer frequency
cycle_finder cycle [OPTIONS]
Finding cycle from de Bruijn graph and detect tandem repeats.
INPUT OPTIONS
-f FILE : high frequency k-mer for detecting repeat
-r FASTQ : short read fastq file(only one file)
-c INT : single mode->1 compare mode->2
-l INT : threshold of detecting repeat length
-n INT : threshold of searching nodes
-d INT : threshold of searching depth
-p INT : k-mer peak (single mode)
-p1 INT : k-mer peak (compare mode corresponds to FILE1)
-p2 INT : k-mer peak (compare mode corresponds to FILE2)
-rc FLOAT : Down sampling coverage (default 0.5)
Clustering and copy estimating from detected repeat.
INPUT OPTIONS
-f1 FILE : high frequency k-mer for copy estimating
-f2 FILE : temporal copy estimation
-f3 FILE : temporal minimum copy estimation
-c INT : single mode->1 compare mode->2
-p INT : k-mer peak (single mode)
-p1 INT : k-mer peak (compare mode corresponds to FILE1)
-p2 INT : k-mer peak (compare mode corresponds to FILE2)
-m INT : threshold mismatch of kmer alignment for copy estimation (default 0)
OUTPUT FILES
tandem repeat
PREFIX_T_blst.blastn : k-mer alignment file
PREFIX_T.fa : detected tandem repeat
PREFIX_T.tsv : copy esitimation
PREFIX_T_clst.fa : fasta file of detected tandem repeats
interspersed repeat
PREFIX_I.fa : detected interspersed repeat
PREFIX_I.tsv : copy estimation
PREFIX_I_clst.fa : fasta file of detected tandem repeats
cycle_finder intersperse [OPTIONS]
Finding path from de Bruijn graph that removed cycle and detect interspersed
repeats.
INPUT OPTIONS
-f FILE : high frequency k-mer for detecting repeat
-b FILE : alignment file (output from cycle command)
-r FASTQ : short read fastq file(only one file)
-c INT : single mode->1 compare mode->2
-L INT : threshold of detecting repeat length
-N INT : threshold of searching nodes
-D INT : threshold of searching depth
-p INT : k-mer peak (single mode)
-p1 INT : k-mer peak (compare mode corresponds to FILE1)
-p2 INT : k-mer peak (compare mode corresponds to FILE2)
-rc FLOAT : Down sampling coverage (default 0.5)
OUTPUT FILES
PREFIX_for_detect${c1}.fa_no_cycle : high frequency k-mer for detect without cycle
PREFIX_I_repeat_num : temporal copy estimation
PREFIX_I_repeat_num_min : temporal minimum copy estimation
-f FILE1 [FILE2 ...]: Reads file (fasta or fastq format)
compare mode
-f1 FILE1 [FILE2 ...]: Reads file (fasta or fastq format)
-f2 FILE1 [FILE2 ...]: Reads file (fasta or fastq format)
OTHER OPTIONS
-c1 INT : copy difference for detecting repeat
-c2 INT : copy difference for copy estimation
single mode
-jf FILE : if you have jf file that output by jellyfish,
you can skip jellyfish count.
-dm FILE : if you have dump file that output by jellyfish,
you can skip jellyfish dump.
-hs FILE : if you have histo file that output by jellyfish,
you can skip jellyfish histo.
compare mode
-jf1 FILE : if you have jf file that output by jellyfish,
you can skip jellyfish count(corresponds to FILE1).
-jf2 FILE : if you have jf file that output by jellyfish,
you can skip jellyfish count(corresponds to FILE2).
-dm1 FILE : if you have dump file that output by jellyfish,
you can skip jellyfish dump(corresponds to FILE1).
-dm2 FILE : if you have dump file that output by jellyfish,
you can skip jellyfish dump(corresponds to FILE2).
-hs1 FILE : if you have histo file that output by jellyfish,
you can skip jellyfish histo(corresponds to FILE1).
-hs2 FILE : if you have histo file that output by jellyfish,
you can skip jellyfish histo(corresponds to FILE2).
-l INT : threshold of detecting repeat length
-n INT : threshold of searching nodes
-d INT : threshold of searching depth
-L INT : threshold of detecting repeat length
-N INT : threshold of searching nodes
-D INT : threshold of searching depth
-rc FLOAT : Down sampling coverage (default 0.5)
-m INT : threshold mismatch of kmer alignment for copy estimation (default 0)
OUTPUT FILES
PREFIX_for_detect${c1}.fa : high frequency k-mer for detecting repeat
PREFIX_for_estimate${c2}.fa : high frequency k-mer for copy estimating
PREFIX_kmer_peak : k-mer peak
PREFIX.jf(PREFIX.jf1, PREFIX.jf2) : jf file output by jellyfish
PREFIX.dm / PREFIX.dm1 / PREFIX.dm2 : dump file output by jellyfish
PREFIX.hs / PREFIX.hs1 / PREFIX.hs2 : histogram of k-mer frequency
[tandem repeat]
PREFIX_T_repeat_num : temporal copy estimation
PREFIX_T_repeat_num_min : temporal minimum copy estimation
PREFIX_T.fa : detected tandem repeat
PREFIX_T.tsv : copy esitimation
[interspersed repeat]
PREFIX_I_repeat_num : temporal copy estimation
PREFIX_I_repeat_num_min : temporal minimum copy estimation
PREFIX_I.fa : detected interspersed repeat
PREFIX_I.tsv : copy estimation
cycle_finder README
AUTHOR
Yoshiki Tanaka wrote the original code.
VERSION
1.0.0
DESCRIPTION
cycle_finder is a tool for detecting tandem repeat and interspersed repeat from short reads. The execusion procedures are as follows:
INSTALATION
Using Bioconda (Linux)
From source
SYNOPSIS
single mode
compare mode
TEST
DEPENDENCY
GCC
OpenMP
Jellyfish (>= 2.2.6)
G. Marçais and C. Kingsford, “A fast, lock-free approach for efficien parallel counting of occurrences of k-mers,” Bioinformatics, vol. 27 no. 6, pp. 764–770, 2011. https://academic.oup.com/bioinformatics/article/27/6/764/234905
TRF (Tandem Repeat Finder; >= 4.07b)
G. Benson, “Tandem repeats finder : a program to analyze DNA sequence,” vol. 27, no. 2, pp. 573–580, 1999.
BLAST+ (>= 2.2.31+)
S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: A new generation of proten database search programs,” Nucleic Acids Res., vol. 25, no. 17 pp. 3389–3402, 1997.
CD-HIT (>= 4.6)
W. Li, L. Fu, B. Niu, S. Wu, and J. Wooley, “Ultrafast clustering algorithm for metagenomic sequence analysis,” Brief. Bioinform., vol. 13, no. 6, pp 656–668, 2012.
USAGE
COMMON OPTIONS
cycle_finder extract [OPTIONS]
Extracting high frequency k-mer from short reads.
INPUT OPTIONS
single mode
compare mode
OTHER OPTIONS
single mode
compare mode
OUTPUT FILES
cycle_finder cycle [OPTIONS]
Finding cycle from de Bruijn graph and detect tandem repeats.
INPUT OPTIONS
OUTPUT FILES
cycle_finder cluster [OPTIONS]
Clustering and copy estimating from detected repeat.
INPUT OPTIONS
OUTPUT FILES
tandem repeat
interspersed repeat
cycle_finder intersperse [OPTIONS]
Finding path from de Bruijn graph that removed cycle and detect interspersed repeats.
INPUT OPTIONS
OUTPUT FILES
cycle_finder all [OPTIONS]
Run whole pipeline:
extract -> cycle -> cluster -> intersperse > cluster
INPUT OPTIONS
single mode
compare mode
OTHER OPTIONS
single mode
compare mode
OUTPUT FILES
FILE FORMAT
PREFIX_T_repeat_num / PREFIX_T_repeat_num_min / PREFIX_I_repeat_num / PREFIX_I_repeat_num_min
column1 : sequence
column2 : sequence length
column3 : explanation
column4 : difference between column5 and column4
column5 : the copy number (FILE or FILE1)
column6 : the copy number (FILE2 ; in single mode, this column shows 0)
column7 : explanation
column8 : difference between column9 and column10
column9 : the number of bases (FILE or FILE1)
column10 : the number of bases (FILE2 ; in single mode, this column shows 0)
PREFIX_T.tsv / PREFIX_I.tsv
column1 : “Family” name (the number next to ‘_’ shows “Family number”, length,
column2 : difference between column3 and column4
column3 : the number of bases consists of the “Family” (FILE or FILE1)
column4 : the number of bases consists of the “Family” (FILE2 ; in single mode,
column5 : difference between normalized column3 and normalized column4
PREFIX_T_clst.fa / PREFIX_I_clst.fa
fasta file of detected repeats. The sequence which has “*” is a representative sequence of “family”.
PREFIX_T_blst.blastn
column1 : sequence name
column2 : k-mer aligned to the repeat
column3 : the number of match
PREFIX_T.element / PREFIX_I.element
after “>” shows Family, and other lines shows the “family number”.