Specifically, you can check the help message and the options for each subcommand with
caspeak [subcommand] -h
Quick Start
Before running caspeak, you need to prepare the following required input file in FASTA or FASTQ format:
raw Cas9 targeted sequencing data,
the reference genome (e.g., GRCh38),
the consensus sequence of the mobile element.
And of course, you should know the location of the Cas9 target site in the consensus sequence.
If you don’t care about the intermediate files and procedures, exec subcommand provides a shortcut containing all the functions for you:
cd /your/work/dir
caspeak exec \
--read /path/to/read.fa.gz \
--ref /path/to/reference.fa \
--insert /path/to/consensus_seq.fa \
--target-start <NUM> \
--target-end <NUM> \
--thread <NUM> # set the number of parallel CPUs \
--vcf # addtional output in VCF format \
--bedtools-genome
Regarding the option --bedtools-genome, caspeak can locate the genome folder if it is downloaded with bedtools, and pick human.hg38.genome as the default. But in some other conditions, like installation via conda, you may need to manually specify a file containing all the chromosome names and lengths.
Also, you can run align, peak, and valid sequentially like this script:
After running valid or exec, the final result is named with the prefix validate in peak dir.
Finally, plot subcommand can make a dotplot for each peak according to the result MAF file:
caspeak plot --maf result/validate.maf
Subcommands
align
caspeak align utilizes LAST to align your reads to the reference genome and the insertion sequence. It outputs several Multiple Alignment Format (MAF) files.
Option
Description
--read <FILE>
Outer-Cas9 targeted nanopore sequencing data in FASTA or FASTQ format (required).
--ref <FILE>
Reference genome of the sequenced species in FASTA or FASTQ format (required).
--insert <FILE>
Consensus sequence of the mobile element in FASTA or FASTQ format (required).
--thread <INT>
Specify the threads running in parallel (default: 1).
--workdir <DIR>
Specify the working directory for caspeak output (default: current directory). It is recommended that all the commands should be executed in the same directory.
-v, --verbose
Print more progress messages and data to stdout.
peak
caspeak peak can filter the reads by alignments to both the reference genome and the insert consensus sequence, and calculate the coverage peaks. Simultaneously, it prepares several files for caspeak valid to avoid duplicate parameters.
| Option | Description |
| — | — |
| --read <FILE> | Outer-Cas9 targeted nanopore sequencing data in FASTA or FASTQ format (required). |
| --ref <FILE> | Reference genome of the sequenced species in FASTA or FASTQ format (required). |
| --insert <FILE> | Consensus sequence of the mobile element in FASTA or FASTQ format (required). |
| --target-start <INT> | The start position of the Cas9 target site in the consensus sequence (required). |
| --target-end <INT> | The end position of the Cas9 target site in the consensus sequence (required). |
| --genome-maf <FILE> | The read alignment to the reference genome in MAF format. In general, caspeak align outputs it as lastal/read_to_ref.maf under your working directory. You can ignore it if you specify the same --workdir for caspeak align and caspeak peak. |
| --insert-maf <FILE> | The read alignment to the consensus sequence in MAF format. In general, caspeak align outputs it as lastal/read_to_insert.maf and it can also be ignored like --genome-maf. |
| --bedtools-genome <FILE> | A genome file passed to bedtools genomecov -g. In general, several genome files should be included in genomes dir under your path to bedtools, and human.hg38.genome will be used if you ignore this option. |
| -x, --exog | Specify the mobile element as an exogenous element, which means there is no identical sequence in the reference genome. |
| --mask <FILE> | The region to be masked in either RepeatMasker .out format or UCSC’s rmsk format. You can use it to make caspeak execute faster and control the number of false-positive cases, but the false-negative cases might be a bit more. |
| --thread <INT> | Specify the threads running in parallel (default: 1). |
| --workdir <DIR> | Specify the working directory for caspeak output (default: current directory). It is recommended that all the commands should be executed in the same directory. |
||Parameters for filtering reads|
| --min-read-length <INT> | Specify the minimum read length N (default: 500). The reads less than N bases will be filtered out. |
| --max-prop <NUM> | Specify the maximum proportion of the longest continuous alignment on the reference genome to the whole read (default: 0.99). |
| --min-prop <NUM> | Specify the minimum proportion of the longest continuous alignment on the reference genome to the whole read (default: 0.4). |
||Parameters for trimming reads|
| --max-trim-length <INT> | Specify the maximum length N to be trimmed before the first alignment to the insert sequence (default: 100). Reads with distance between the first base and the first alignment longer than N will be filtered out. |
| --padding <INT> | Specify the padding size of the target region (default: 20). This option enables the reads that are not exactly targeted. |
||Parameters for peak detection|
| --min-cov <INT> | Specify the minimum coverage to be considered in peak detection (default: 2). |
| -v, --verbose | Print more progress messages and data to stdin. |
valid
caspeak valid validates peaks listed in BED format. For each peak, caspeak valid collects the reads involved, assembles them by lamassemble, and re-aligns the assembly sequence to validate whether the peak indicates a real non-reference insertion. And it generates the final result to result dir only in MAF format if --vcf is ignored.
| Option | Description |
| — | — |
| --trim-read <FILE> | The read file after filtering and trimming, which is usually generated by caspeak peak and named as peak/trimmed_reads.fasta under your working directory. You can ignore it if the same --workdir is specified in caspeak peak and caspeak valid. |
| --peak-bed <FILE> | The peak file in BED format, usually generated by caspeak peak and named as peak/peaks.bed under your working directory. You can ignore it if the same --workdir is specified in caspeak peak and caspeak valid. |
| --thread <INT> | Specify the threads running in parallel (default: 1). |
| --workdir <DIR> | Specify the working directory for caspeak output (default: current directory). It is recommended that all the commands should be executed in the same directory. |
| --sample <INT> | Specify that at most N pairs of reads are assembled (default: 20). |
| --min-insprop <INT>| Specify the minimum proportion of the mobile element in the detected insert sequence (default: 0.2). That is, at least 20% of the insert sequence should derive from the mobile element by default. |
| --min-inslen <INT>| Specify the minimum length of the mobile element in the detected insert sequence (default: 80). That is, the length of mobile element in the insert sequence should be longer than 80 bp by default. |
| --min-asb-seq <INT> | Specify the minimum number of sequences participating the assembly, i.e, the minimum read number to one direction (default: 1). It is more effective than --min-cov in caspeak peak. |
| --lib <LIB> | Use a sequence set of mobile element ancestral lineage (FASTA/FASTAQ format) for validation, instead of a single mobile element sequence specified by --insert in previous steps. For example, a set of sequences containing L1HS, L1PA2, L1PA3, …, L1MA1, … is suitable for L1HS detection with --names L1HS. Some lib files have been prepared in lib directory. |
| --names <NAME>| In the LIB file, only the sequences specified in this option are treated as real mobile element for --min-insert calculation. Multiple sequence names can be assigned like --names A --names B --names C. |
| --names-re <REGEXP> | Treat all the sequences in lib file with name matching REGEXP as real mobile element. Either --names or --names-re should be set if --lib option is specified. |
| --vcf | Indicate an extra output in VCF format. |
| -v, --verbose | Print more progress messages and data to stdout. |
exec
caspeak exec actually provided a shortcut and wrapper for caspeak align, caspeak peak and caspeak valid. It improves speed by skipping several I/O operations.
caspeak plot makes a dotplot for each peak from the result file in MAF format. It originates from last-dotplot and inherits all options except the input.
| Option | Description |
| — | — |
| --maf <FILE> | The result MAF file used for plotting, usually generated by caspeak valid or caspeak exec (required). |
| --prefix <PREFIX> | Specify the name prefix for plots (default: fig/peak). By default, fig dir will be made if it doesn’t exist, and the plots will be stored as peak1.png, peak2.png, … in fig dir. |
| other options | See last-dotplot document.|
CasPeak
This is a pipeline for finding non-reference mobile element insertions (MEIs) based on outer-Cas9 targeted Nanopore sequencing and peak detection.
Installation
You can install CasPeak (together with all requirements) from Bioconda:
You can also download the source code by
git clone(latest version) or from release.Requirements
The following software is required for CasPeak:
Usage
caspeakconsists of several subcommands:You can use
caspeaklike:Specifically, you can check the help message and the options for each subcommand with
Quick Start
Before running caspeak, you need to prepare the following required input file in FASTA or FASTQ format:
And of course, you should know the location of the Cas9 target site in the consensus sequence.
If you don’t care about the intermediate files and procedures,
execsubcommand provides a shortcut containing all the functions for you:Regarding the option
--bedtools-genome, caspeak can locate thegenomefolder if it is downloaded with bedtools, and pickhuman.hg38.genomeas the default. But in some other conditions, like installation via conda, you may need to manually specify a file containing all the chromosome names and lengths.Also, you can run
align,peak, andvalidsequentially like this script:After running
validorexec, the final result is named with the prefix validate in peak dir.Finally,
plotsubcommand can make a dotplot for each peak according to the result MAF file:Subcommands
align
caspeak alignutilizes LAST to align your reads to the reference genome and the insertion sequence. It outputs several Multiple Alignment Format (MAF) files.--read<FILE>--ref<FILE>--insert<FILE>--thread<INT>--workdir<DIR>-v,--verbosepeak
caspeak peakcan filter the reads by alignments to both the reference genome and the insert consensus sequence, and calculate the coverage peaks. Simultaneously, it prepares several files forcaspeak validto avoid duplicate parameters. | Option | Description | | — | — | |--read<FILE> | Outer-Cas9 targeted nanopore sequencing data in FASTA or FASTQ format (required). | |--ref<FILE> | Reference genome of the sequenced species in FASTA or FASTQ format (required). | |--insert<FILE> | Consensus sequence of the mobile element in FASTA or FASTQ format (required). | |--target-start<INT> | The start position of the Cas9 target site in the consensus sequence (required). | |--target-end<INT> | The end position of the Cas9 target site in the consensus sequence (required). | |--genome-maf<FILE> | The read alignment to the reference genome in MAF format. In general,caspeak alignoutputs it as lastal/read_to_ref.maf under your working directory. You can ignore it if you specify the same--workdirforcaspeak alignandcaspeak peak. | |--insert-maf<FILE> | The read alignment to the consensus sequence in MAF format. In general,caspeak alignoutputs it as lastal/read_to_insert.maf and it can also be ignored like--genome-maf. | |--bedtools-genome<FILE> | A genome file passed tobedtools genomecov -g. In general, several genome files should be included in genomes dir under your path tobedtools, and human.hg38.genome will be used if you ignore this option. | |-x,--exog| Specify the mobile element as an exogenous element, which means there is no identical sequence in the reference genome. | |--mask<FILE> | The region to be masked in either RepeatMasker .out format or UCSC’s rmsk format. You can use it to make caspeak execute faster and control the number of false-positive cases, but the false-negative cases might be a bit more. | |--thread<INT> | Specify the threads running in parallel (default: 1). | |--workdir<DIR> | Specify the working directory for caspeak output (default: current directory). It is recommended that all the commands should be executed in the same directory. | ||Parameters for filtering reads| |--min-read-length<INT> | Specify the minimum read length N (default: 500). The reads less than N bases will be filtered out. | |--max-prop<NUM> | Specify the maximum proportion of the longest continuous alignment on the reference genome to the whole read (default: 0.99). | |--min-prop<NUM> | Specify the minimum proportion of the longest continuous alignment on the reference genome to the whole read (default: 0.4). | ||Parameters for trimming reads| |--max-trim-length<INT> | Specify the maximum length N to be trimmed before the first alignment to the insert sequence (default: 100). Reads with distance between the first base and the first alignment longer than N will be filtered out. | |--padding<INT> | Specify the padding size of the target region (default: 20). This option enables the reads that are not exactly targeted. | ||Parameters for peak detection| |--min-cov<INT> | Specify the minimum coverage to be considered in peak detection (default: 2). | |-v,--verbose| Print more progress messages and data to stdin. |valid
caspeak validvalidates peaks listed in BED format. For each peak,caspeak validcollects the reads involved, assembles them by lamassemble, and re-aligns the assembly sequence to validate whether the peak indicates a real non-reference insertion. And it generates the final result to result dir only in MAF format if--vcfis ignored. | Option | Description | | — | — | |--trim-read<FILE> | The read file after filtering and trimming, which is usually generated bycaspeak peakand named as peak/trimmed_reads.fasta under your working directory. You can ignore it if the same--workdiris specified incaspeak peakandcaspeak valid. | |--peak-bed<FILE> | The peak file in BED format, usually generated bycaspeak peakand named as peak/peaks.bed under your working directory. You can ignore it if the same--workdiris specified incaspeak peakandcaspeak valid. | |--thread<INT> | Specify the threads running in parallel (default: 1). | |--workdir<DIR> | Specify the working directory for caspeak output (default: current directory). It is recommended that all the commands should be executed in the same directory. | |--sample<INT> | Specify that at most N pairs of reads are assembled (default: 20). | |--min-insprop<INT>| Specify the minimum proportion of the mobile element in the detected insert sequence (default: 0.2). That is, at least 20% of the insert sequence should derive from the mobile element by default. | |--min-inslen<INT>| Specify the minimum length of the mobile element in the detected insert sequence (default: 80). That is, the length of mobile element in the insert sequence should be longer than 80 bp by default. | |--min-asb-seq<INT> | Specify the minimum number of sequences participating the assembly, i.e, the minimum read number to one direction (default: 1). It is more effective than--min-covincaspeak peak. | |--lib<LIB> | Use a sequence set of mobile element ancestral lineage (FASTA/FASTAQ format) for validation, instead of a single mobile element sequence specified by--insertin previous steps. For example, a set of sequences containing L1HS, L1PA2, L1PA3, …, L1MA1, … is suitable for L1HS detection with--names L1HS. Some lib files have been prepared inlibdirectory. | |--names<NAME>| In the LIB file, only the sequences specified in this option are treated as real mobile element for--min-insertcalculation. Multiple sequence names can be assigned like--names A --names B --names C. | |--names-re<REGEXP> | Treat all the sequences inlibfile with name matching REGEXP as real mobile element. Either--namesor--names-reshould be set if--liboption is specified. | |--vcf| Indicate an extra output in VCF format. | |-v,--verbose| Print more progress messages and data to stdout. |exec
caspeak execactually provided a shortcut and wrapper forcaspeak align,caspeak peakandcaspeak valid. It improves speed by skipping several I/O operations.--read--ref--insert--target-start--target-end-x,--exog--bedtools-genome--mask--workdir--thread--min-read-length--max-prop--min-prop--max-trim-length--padding--min-cov--sample--min-insprop--min-inslen--min-asb-seq--lib--names--names-re--vcf-v,--verboseplot
caspeak plotmakes a dotplot for each peak from the result file in MAF format. It originates from last-dotplot and inherits all options except the input. | Option | Description | | — | — | |--maf<FILE> | The result MAF file used for plotting, usually generated bycaspeak validorcaspeak exec(required). | |--prefix<PREFIX> | Specify the name prefix for plots (default: fig/peak). By default, fig dir will be made if it doesn’t exist, and the plots will be stored as peak1.png, peak2.png, … in fig dir. | | other options | See last-dotplot document.|