Description

seq2onehot is a command-line tool that encodes DNA/RNA/protein sequences into a one-hot numpy array.
All input sequences must have the same length.
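Because of the equal-length requirement, it can help to check a FASTA file before encoding it. The following is a minimal sketch, not part of seq2onehot: `parse_fasta` and `distinct_lengths` are hypothetical helper names, and the parser assumes a plain FASTA layout (header lines starting with `>`, sequences possibly wrapped over several lines).

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # wrapped sequence lines are joined
    if header is not None:
        yield header, "".join(chunks)


def distinct_lengths(lines):
    """Return the set of sequence lengths; one element means all are equal."""
    return {len(seq) for _, seq in parse_fasta(lines)}


# Hypothetical in-memory example; pass an open file handle for a real file.
example = [">read1", "ACGTACGT", "ACGTACGT", ">read2", "CCCCCCCCTTTTTTTT"]
print(distinct_lengths(example))  # {16}
```

A result with more than one element means the file mixes sequence lengths and would need trimming or padding first.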
To decode a one-hot numpy array back to sequences, use onehot2seq: https://github.com/akikuno/onehot2seq
Installation

You can install seq2onehot using pip or bioconda:

    pip install seq2onehot
    conda install -c bioconda seq2onehot
Usage

    seq2onehot [options] -t/--type <dna/rna/protein> -i/--input <in.fasta> -o/--output <out.npy>
Options

-a/--ambiguous: include ambiguous characters

The ambiguous characters are:

XBZJ for amino acids
NVHDBMRWSYK for DNA and RNA

The ambiguous characters are described in detail here: https://meme-suite.org/meme/doc/alphabets.html
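The effect of `-a/--ambiguous` on the size of the letter axis can be illustrated with a small sketch. The `alphabet` function below is hypothetical and only mirrors the letter sets listed in this README; appending the ambiguous letters after the standard ones is an assumption about column order, not a confirmed detail of the tool.

```python
# Letter sets as listed in this README.
BASE = {
    "dna": "ACGT",
    "rna": "ACGU",
    "protein": "ACDEFGHIKLMNPQRSTVWY",
}
AMBIGUOUS = {
    "dna": "NVHDBMRWSYK",
    "rna": "NVHDBMRWSYK",
    "protein": "XBZJ",
}


def alphabet(seq_type, ambiguous=False):
    """Return the letters for a sequence type (hypothetical helper)."""
    letters = BASE[seq_type]
    if ambiguous:
        # Assumed order: standard letters first, ambiguous letters after.
        letters += AMBIGUOUS[seq_type]
    return letters


print(len(alphabet("dna")))                      # 4
print(len(alphabet("dna", ambiguous=True)))      # 15
print(len(alphabet("protein", ambiguous=True)))  # 24
```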
Examples

    # DNA sequences
    seq2onehot -t dna -i example/dna.fasta -o dna.npy

    # RNA sequences
    seq2onehot -t rna -i example/rna.fasta -o rna.npy

    # Protein sequences
    seq2onehot -t protein -i example/protein.fasta -o protein.npy
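One-hot array

The output file contains a 3D one-hot array of shape RxNxL (Read x Nucleotide/Amino acid x Letter). The letters are:

ACGT (+NVHDBMRWSYK) for DNA
ACGU (+NVHDBMRWSYK) for RNA
ACDEFGHIKLMNPQRSTVWY (+XBZJ) for protein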
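To make the RxNxL layout concrete, here is a minimal NumPy sketch of one-hot encoding. It is an illustration under stated assumptions, not the tool's actual implementation: `one_hot` is a hypothetical function, and the `float32` dtype is a guess.

```python
import numpy as np


def one_hot(sequences, letters="ACGT"):
    """Encode equal-length sequences as a (reads, positions, letters) array."""
    lengths = {len(seq) for seq in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    index = {letter: i for i, letter in enumerate(letters)}
    encoded = np.zeros(
        (len(sequences), lengths.pop(), len(letters)), dtype=np.float32
    )
    for r, seq in enumerate(sequences):
        for n, letter in enumerate(seq):
            # A letter outside `letters` raises KeyError in this sketch.
            encoded[r, n, index[letter]] = 1.0
    return encoded


encoded = one_hot(["ACGTACGTACGTACGT", "CCCCCCCCTTTTTTTT"])
print(encoded.shape)  # (2, 16, 4)
print(encoded[0, 0])  # [1. 0. 0. 0.]  (A at the first position)
```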
    import numpy as np

    # Original sequences:
    ## ACGTACGTACGTACGT
    ## CCCCCCCCTTTTTTTT

    onehot = np.load("dna.npy")

    onehot.shape
    # (2, 16, 4) <- 2 reads x 16 nucleotides x 4 letters (ACGT)

    onehot
    # array([[[1., 0., 0., 0.],
    #         [0., 1., 0., 0.],
    #         [0., 0., 1., 0.],
    #         [0., 0., 0., 1.],
    #         [1., 0., 0., 0.],
    #         [0., 1., 0., 0.],
    #         [0., 0., 1., 0.],
    #         [0., 0., 0., 1.],
    #         [1., 0., 0., 0.],
    #         [0., 1., 0., 0.],
    #         [0., 0., 1., 0.],
    #         [0., 0., 0., 1.],
    #         [1., 0., 0., 0.],
    #         [0., 1., 0., 0.],
    #         [0., 0., 1., 0.],
    #         [0., 0., 0., 1.]],
    #        [[0., 1., 0., 0.],
    #         [0., 1., 0., 0.],
    #         [0., 1., 0., 0.],
    #         [0., 1., 0., 0.],
    #         [0., 1., 0., 0.],
    #         [0., 1., 0., 0.],
    #         [0., 1., 0., 0.],
    #         [0., 1., 0., 0.],
    #         [0., 0., 0., 1.],
    #         [0., 0., 0., 1.],
    #         [0., 0., 0., 1.],
    #         [0., 0., 0., 1.],
    #         [0., 0., 0., 1.],
    #         [0., 0., 0., 1.],
    #         [0., 0., 0., 1.],
    #         [0., 0., 0., 1.]]])
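A quick way to sanity-check such an array is to map each position back to its letter with `argmax` along the last axis. This is only a sketch for inspection: `decode` is a hypothetical helper, it assumes the column order ACGT shown above, and it is not how onehot2seq is implemented.

```python
import numpy as np


def decode(onehot, letters="ACGT"):
    """Turn a (reads, positions, letters) one-hot array back into strings."""
    indices = onehot.argmax(axis=2)  # index of the 1 at each position
    return ["".join(letters[i] for i in row) for row in indices]


# Hypothetical stand-in for np.load("dna.npy"): a single read "ACGT".
demo = np.eye(4)[np.newaxis, :, :]
print(decode(demo))  # ['ACGT']
```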
Converts biological sequences (such as DNA, RNA, and protein) into one-hot encoded format.