CDSKIT (/sidieskit/) is a Python program that processes DNA sequences, especially protein-coding sequences. Many of its functions handle DNA sequences in units of codons (sets of three nucleotides) and can therefore edit coding sequences without causing a frameshift. All sequence formats supported by Biopython are available for both input and output.
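To illustrate what working in codon units means, here is a minimal Python sketch. It is not taken from the CDSKIT source code, and the helper names `split_codons` and `mask_stop_codons` are hypothetical. It assumes the standard genetic code and an input that is already in frame, and it replaces each in-frame stop codon with a same-length mask so that the sequence length and reading frame stay unchanged.

```python
# Illustrative sketch only; not taken from the CDSKIT source code.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def split_codons(seq):
    """Split an in-frame sequence into codons (length must be a multiple of 3)."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def mask_stop_codons(seq, mask="NNN"):
    """Replace each in-frame stop codon with a same-length mask."""
    codons = split_codons(seq.upper())
    return "".join(mask if codon in STOP_CODONS else codon for codon in codons)


cds = "ATGTAAGCCTGA"
print(mask_stop_codons(cds))  # ATGNNNGCCNNN
```

Because every edit swaps one whole codon for three characters, downstream codons keep their positions, which is the property that prevents frameshifts.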
Installation
The latest version of CDSKIT is available from Bioconda. For users requiring a conda installation, please refer to Miniforge for a lightweight conda environment.
Install from Bioconda
conda install bioconda::cdskit
Verify the installation by displaying the available options
cdskit -h
(For advanced users) Install the development version from GitHub
Subcommands
See Wiki for detailed descriptions.
accession2fasta: Retrieving fasta sequences from a list of GenBank accessions
aggregate: Extracting the longest sequences combined with a sequence name regex
backalign: Back-aligning CDS from unaligned CDS + aligned proteins
backtrim: Back-translating a trimmed protein alignment
codonstats: Printing codon-aware per-sequence and aggregate codon-usage statistics
degeneracy: Extracting aligned 0/2/3/4-fold degenerate nucleotide positions
filter: Filtering CDS by sequence-level quality rules
gapjust: Adjusting consecutive Ns to the fixed length
hammer: Removing less-occupied codon columns from a gappy alignment
intersection: Dropping non-overlapping sequence labels between two sequences files or between a sequence file and a gff file
label: Modifying sequence labels
longestorf: Finding the longest ORF by six-frame translation (+/- strands, 3 frames each)
mask: Masking ambiguous and/or stop codons
maxalign: Removing sequences to maximize codon-based alignment area (MaxAlign)
pad: Making nucleotide sequences in-frame by head and tail paddings
parsegb: Converting the GenBank format
plot: Plotting aligned CDS summaries, codon-state maps, or nucleotide alignment views with consensus codon/AA and AA frequency logos using matplotlib (--mode summary|map|msa; default output is PDF, override with --format)
printseq: Print a subset of sequences with a regex
rmseq: Removing a subset of sequences by using a sequence name regex and by detecting problematic sequence characters
split: Splitting 1st, 2nd, and 3rd codon positions
stats: Printing sequence statistics
translate: Translating CDS nucleotide sequences to amino acids
trimcodon: Trimming aligned CDS codon columns by occupancy and ambiguity thresholds
validate: Validating aligned CDS quality and reporting issues
Streamlined analysis
CDSKIT is designed for data flow through standard input and output. Streamlined processing may be combined with other sequence processing tools, such as SeqKit, with pipes (|).
Parallel execution
All subcommands support --threads INT for multi-threaded processing.
--threads 1: single-threaded (default)
--threads 2 or larger: multi-threaded
--threads 0: auto-detect available CPU count
Citation
There is no published paper on CDSKIT itself, but we used and cited CDSKIT in several papers including Fukushima & Pollock (2023, Nat Ecol Evol 7: 155-170).
Licensing
This program is BSD-licensed (3 clause). See LICENSE for details.