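NSCCN/cdskit: A Python toolkit for processing coding DNA sequences (CDS), providing sequence format conversion, alignment, statistics, and manipulation functions.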

Overview

CDSKIT (/sidieskit/) is a Python program that processes DNA sequences, especially protein-coding sequences. Many of its functions handle DNA sequences in units of codons (sets of three nucleotides) and therefore edit coding sequences without causing frameshifts. All sequence formats supported by Biopython are available for both input and output.
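To illustrate why editing codon by codon keeps the reading frame intact, here is a minimal plain-Python sketch. It is not CDSKIT's implementation; the function names `split_codons` and `drop_codons` are hypothetical and exist only for this example.

```python
def split_codons(seq):
    """Split an in-frame nucleotide sequence into a list of codons."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def drop_codons(seq, codon_indices):
    """Remove whole codons (0-based indices) so the frame is preserved."""
    skip = set(codon_indices)
    kept = [c for i, c in enumerate(split_codons(seq)) if i not in skip]
    return "".join(kept)


cds = "ATGGCCTAAGGA"
edited = drop_codons(cds, [2])  # remove the third codon (TAA)
print(split_codons(cds))
print(edited, len(edited) % 3 == 0)
```

Because only complete codons are removed, the edited sequence still has a length divisible by three, and every downstream codon stays in its original frame.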

Installation

The latest version of CDSKIT is available from Bioconda. If conda is not yet installed, Miniforge provides a lightweight conda environment.

Install from Bioconda

conda install bioconda::cdskit

Verify the installation by displaying the available options

cdskit -h

(For advanced users) Install the development version from GitHub

pip install git+https://github.com/kfuku52/cdskit

Subcommands

See Wiki for detailed descriptions.

accession2fasta: Retrieving fasta sequences from a list of GenBank accessions
aggregate: Extracting the longest sequences combined with a sequence name regex
backalign: Back-aligning CDS from unaligned CDS + aligned proteins
backtrim: Back-translating a trimmed protein alignment
codonstats: Printing codon-aware per-sequence and aggregate codon-usage statistics
degeneracy: Extracting aligned 0/2/3/4-fold degenerate nucleotide positions
filter: Filtering CDS by sequence-level quality rules
gapjust: Adjusting consecutive Ns to a fixed length
hammer: Removing less-occupied codon columns from a gappy alignment
intersection: Dropping non-overlapping sequence labels between two sequence files or between a sequence file and a GFF file
label: Modifying sequence labels
longestorf: Finding the longest ORF by six-frame translation (+/- strands, 3 frames each)
mask: Masking ambiguous and/or stop codons
maxalign: Removing sequences to maximize codon-based alignment area (MaxAlign)
pad: Making nucleotide sequences in-frame by head and tail paddings
parsegb: Converting the GenBank format
plot: Plotting aligned CDS summaries, codon-state maps, or nucleotide alignment views with consensus codon/AA and AA frequency logos using matplotlib (--mode summary|map|msa; default output is PDF, override with --format)
printseq: Printing a subset of sequences with a regex
rmseq: Removing a subset of sequences by using a sequence name regex and by detecting problematic sequence characters
split: Splitting 1st, 2nd, and 3rd codon positions
stats: Printing sequence statistics
translate: Translating CDS nucleotide sequences to amino acids
trimcodon: Trimming aligned CDS codon columns by occupancy and ambiguity thresholds
validate: Validating aligned CDS quality and reporting issues
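
As a conceptual illustration of one entry in the list above, the idea behind splitting codon positions (the `split` subcommand) can be sketched in plain Python. This is only an assumption-based sketch of the concept: the function name `split_codon_positions` is hypothetical, and the code does not reproduce CDSKIT's actual options or output format.

```python
def split_codon_positions(seq):
    """Return the 1st, 2nd, and 3rd codon-position nucleotides as three strings."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    # Slicing with a step of 3 collects every nucleotide at the same codon position.
    return seq[0::3], seq[1::3], seq[2::3]


first, second, third = split_codon_positions("ATGGCCTAA")
print(first, second, third)
```

For the codons ATG, GCC, and TAA, the first positions are A, G, and T, so `first` is `AGT`; the other two strings follow the same pattern.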

Streamlined analysis

CDSKIT is designed to read from standard input and write to standard output, so it can be chained with other sequence processing tools, such as SeqKit, using pipes (|).

# Example
seqkit seq input.fasta.gz | cdskit pad | cdskit mask | seqkit translate | cdskit aggregate -x ":.*" > output.fasta
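
The same stream-in, stream-out pattern makes it easy to add a custom step to such a pipeline. The following is a minimal sketch of a FASTA filter written in plain Python; it is not part of CDSKIT, and the function names (`read_fasta`, `keep_in_frame`, `write_fasta`) are hypothetical. The demonstration uses in-memory text streams so that it runs on its own.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def keep_in_frame(records):
    """Pass through only records whose length is a multiple of three."""
    for header, seq in records:
        if len(seq) % 3 == 0:
            yield header, seq


def write_fasta(records, handle):
    """Write records as single-line FASTA entries."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")


# Demonstration with in-memory streams standing in for stdin and stdout.
demo_input = io.StringIO(">seq1\nATGGCC\nTAA\n>seq2\nATGG\n")
demo_output = io.StringIO()
write_fasta(keep_in_frame(read_fasta(demo_input)), demo_output)
print(demo_output.getvalue(), end="")
```

Replacing the two `io.StringIO` objects with `sys.stdin` and `sys.stdout` would let a script like this sit between piped commands in the same way as the tools in the example above.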

Parallel execution

All subcommands support --threads INT for multi-threaded processing.

--threads 1: single-threaded (default)
--threads 2 or larger: multi-threaded
--threads 0: auto-detect available CPU count
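
The three cases above can be summarized with a small Python sketch. How CDSKIT detects the CPU count internally is not stated here, so the use of `os.cpu_count()` and the function name `resolve_threads` are assumptions made for illustration only.

```python
import os


def resolve_threads(requested):
    """Map a --threads style value to the number of workers to use."""
    if requested < 0:
        raise ValueError("thread count must be 0 or a positive integer")
    if requested == 0:
        # Assumption: auto-detection falls back to 1 if the count is unknown.
        return os.cpu_count() or 1
    return requested


for value in (1, 4, 0):
    print(value, "->", resolve_threads(value))
```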

Citation

There is no published paper on CDSKIT itself, but we have used and cited it in several papers, including Fukushima & Pollock (2023, Nat Ecol Evol 7: 155-170).

Licensing

This program is distributed under the 3-clause BSD license. See LICENSE for details.