NSCCN/seq2onehot: Convert biological sequences (such as DNA, RNA, and protein) to one-hot encoded format

Description

seq2onehot is a command-line tool that encodes DNA, RNA, or protein sequences as a one-hot NumPy array.

All sequences must be the same length.
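
Because of this requirement, it can help to check a FASTA file before encoding it. The following is a minimal sketch in plain Python; `read_fasta` and `sequence_lengths` are hypothetical helpers written for this illustration and are not part of seq2onehot.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Sequence lines of one record may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sequence_lengths(records):
    """Return the set of distinct sequence lengths found in the records."""
    return {len(seq) for _, seq in records}


fasta_text = """>read1
ACGTACGT
ACGTACGT
>read2
CCCCCCCCTTTTTTTT
"""
records = read_fasta(fasta_text.splitlines())
lengths = sequence_lengths(records)
print(lengths)  # {16}
```

If the resulting set holds more than one value, the sequences differ in length and would need to be trimmed or padded first.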

To decode a one-hot numpy array to sequences, use onehot2seq.
https://github.com/akikuno/onehot2seq

Installation

You can install seq2onehot using pip or bioconda:

pip install seq2onehot

conda install -c bioconda seq2onehot

Usage

seq2onehot [options] -t/--type <dna/rna/protein> -i/--input <in.fasta> -o/--output <out.npy>

Options

-a/--ambiguous: include ambiguous characters

The ambiguous characters are:

XBZJ for amino acid
NVHDBMRWSYK for DNA and RNA

The ambiguous characters are described in detail here:
https://meme-suite.org/meme/doc/alphabets.html

Examples

# DNA sequences
seq2onehot -t dna -i example/dna.fasta -o dna.npy

# RNA sequences
seq2onehot -t rna -i example/rna.fasta -o rna.npy

# Protein sequences
seq2onehot -t protein -i example/protein.fasta -o protein.npy

One-hot array

The output file contains a 3D one-hot array of shape R x N x L (reads x nucleotide/amino acid positions x letters).

The order of nucleotides is ACGT (+ NVHDBMRWSYK) for DNA and ACGU (+ NVHDBMRWSYK) for RNA.
The order of amino acids is ACDEFGHIKLMNPQRSTVWY (+ XBZJ).
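
To make the layout concrete, here is a rough sketch of how such an array could be built with NumPy using the letter orders listed above. It is not seq2onehot's actual implementation: `one_hot_encode` is a hypothetical helper, and two details are assumptions, namely that the ambiguous letters occupy extra columns after the standard ones and that the values are stored as float32.

```python
import numpy as np

# Letter orders as listed above: (standard letters, ambiguous letters).
ALPHABETS = {
    "dna": ("ACGT", "NVHDBMRWSYK"),
    "rna": ("ACGU", "NVHDBMRWSYK"),
    "protein": ("ACDEFGHIKLMNPQRSTVWY", "XBZJ"),
}


def one_hot_encode(seqs, seq_type="dna", ambiguous=False):
    """Illustrative encoder (hypothetical, not seq2onehot's own code)."""
    standard, extra = ALPHABETS[seq_type]
    # Assumption: ambiguous letters are appended after the standard ones.
    letters = standard + (extra if ambiguous else "")
    index = {letter: i for i, letter in enumerate(letters)}
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must be the same length")
    # Shape is reads x positions x letters; float32 is an assumption.
    onehot = np.zeros((len(seqs), len(seqs[0]), len(letters)), dtype=np.float32)
    for r, seq in enumerate(seqs):
        for n, letter in enumerate(seq.upper()):
            onehot[r, n, index[letter]] = 1.0  # KeyError on unknown letters
    return onehot


onehot = one_hot_encode(["ACGTACGTACGTACGT", "CCCCCCCCTTTTTTTT"])
print(onehot.shape)  # (2, 16, 4)
```

Under these assumptions, passing `ambiguous=True` for DNA widens the last axis from 4 to 15 columns (4 standard letters plus 11 ambiguous ones).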

# Original sequences:
## ACGTACGTACGTACGT
## CCCCCCCCTTTTTTTT

import numpy as np

onehot = np.load("dna.npy")

onehot.shape
# (2, 16, 4) <- 2 reads x 16 nucleotides x 4 letters (ACGT)

onehot
# array([[[1., 0., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 0., 0., 1.],
#         [1., 0., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 0., 0., 1.],
#         [1., 0., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 0., 0., 1.],
#         [1., 0., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 0., 0., 1.]],

#        [[0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.]]])
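
An array like the one shown above can be mapped back to letters by taking the index of the 1 in each row. The sketch below does this with `np.argmax`; `one_hot_decode` is a hypothetical helper for illustration only (onehot2seq is the dedicated tool for this), and the small `demo` array stands in for a loaded `.npy` file.

```python
import numpy as np


def one_hot_decode(onehot, letters="ACGT"):
    """Map each one-hot row back to its letter (illustrative helper)."""
    # argmax over the last axis gives the column index of the 1 per position.
    indices = np.argmax(onehot, axis=-1)
    return ["".join(letters[i] for i in row) for row in indices]


# Stand-in array: 2 reads x 4 positions x 4 letters, built from identity rows.
demo = np.eye(4)[np.array([[0, 1, 2, 3], [1, 1, 3, 3]])]
print(one_hot_decode(demo))  # ['ACGT', 'CCTT']
```

Note that `np.argmax` returns 0 for an all-zero row, so a position with no letter set would be decoded as the first letter; real data may need an explicit check for that case.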