Description

seq2onehot is a command-line tool that encodes DNA/RNA/protein sequences into a one-hot NumPy array.

All sequences must be the same length.
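
Because the input must contain equal-length sequences, it can help to check a FASTA file before encoding it. The following is a minimal Python sketch and not part of seq2onehot; `read_fasta` and `sequence_lengths` are hypothetical helpers, and the records are given inline so the block runs without any file. Records with duplicate headers would overwrite each other in this simple parser.

```python
from collections import Counter
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {header: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; the header is the text after ">"
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so concatenate them
            records[header] += line
    return records


def sequence_lengths(records: Dict[str, str]) -> Counter:
    """Count how many sequences have each length."""
    return Counter(len(seq) for seq in records.values())


example = [">read1", "ACGTACGT", "ACGTACGT", ">read2", "CCCCCCCCTTTTTTTT"]
lengths = sequence_lengths(read_fasta(example))
if len(lengths) > 1:
    print("Sequences differ in length:", dict(lengths))
else:
    print("All sequences have the same length")
```

To check a real file, pass an open file handle instead of the inline list, for example `read_fasta(open("in.fasta"))`.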

To decode a one-hot numpy array to sequences, use onehot2seq.
https://github.com/akikuno/onehot2seq

Installation

You can install seq2onehot using pip or bioconda:

pip install seq2onehot
conda install -c bioconda seq2onehot

Usage

seq2onehot [options] -t/--type <dna/rna/protein> -i/--input <in.fasta> -o/--output <out.npy>

Options

-a/--ambiguous: include ambiguous characters

The ambiguous characters are:

  • XBZJ for amino acid
  • NVHDBMRWSYK for DNA and RNA

Details of the ambiguous characters are described here:
https://meme-suite.org/meme/doc/alphabets.html
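
To illustrate how including ambiguous characters widens the alphabet, here is a minimal NumPy sketch. It assumes the letter order given in the One-hot array section (ACGT followed by NVHDBMRWSYK). It is an illustration only, not seq2onehot's actual implementation, and `encode` is a hypothetical function.

```python
import numpy as np

DNA = "ACGT"
DNA_AMBIGUOUS = "NVHDBMRWSYK"  # appended after the standard letters (assumed order)


def encode(seqs, alphabet):
    """One-hot encode equal-length sequences as (reads, positions, letters)."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    onehot = np.zeros((len(seqs), len(seqs[0]), len(alphabet)))
    for r, seq in enumerate(seqs):
        for p, letter in enumerate(seq):
            # Set the column that corresponds to this letter
            onehot[r, p, index[letter]] = 1.0
    return onehot


arr = encode(["ACGN", "NNTT"], DNA + DNA_AMBIGUOUS)
print(arr.shape)  # (2, 4, 15): 4 standard letters + 11 ambiguous letters
```

In this sketch, a letter outside the chosen alphabet raises a `KeyError`, which is one simple way to surface unexpected characters.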

Examples

# DNA sequences
seq2onehot -t dna -i example/dna.fasta -o dna.npy

# RNA sequences
seq2onehot -t rna -i example/rna.fasta -o rna.npy

# Protein sequences
seq2onehot -t protein -i example/protein.fasta -o protein.npy

One-hot array

The output file contains a 3D one-hot array of shape R x N x L (Reads x Nucleotides/Amino acids x Letters).

  • The letter order for nucleotides is ACGT (+ NVHDBMRWSYK) for DNA and ACGU (+ NVHDBMRWSYK) for RNA
  • The letter order for amino acids is ACDEFGHIKLMNPQRSTVWY (+ XBZJ)

# Original sequences:
## ACGTACGTACGTACGT
## CCCCCCCCTTTTTTTT

import numpy as np

onehot = np.load("dna.npy")

onehot.shape
# (2, 16, 4) <- 2 reads x 16 nucleotides x 4 letters (ACGT)

onehot
# array([[[1., 0., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 0., 0., 1.],
#         [1., 0., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 0., 0., 1.],
#         [1., 0., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 0., 0., 1.],
#         [1., 0., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 0., 0., 1.]],

#        [[0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.],
#         [0., 0., 0., 1.]]])
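
For a quick sanity check in Python, the array can be mapped back to letters by taking the index of the 1 along the last axis. This is a minimal sketch that assumes the ACGT letter order above and no ambiguous characters. `decode` is a hypothetical helper, not the onehot2seq API, and a small array is built inline so the block runs without dna.npy.

```python
import numpy as np

ALPHABET = "ACGT"  # letter order assumed from the description above


def decode(onehot, alphabet=ALPHABET):
    """Map a (reads, positions, letters) one-hot array back to strings."""
    indices = onehot.argmax(axis=2)  # position of the 1 in each row
    return ["".join(alphabet[i] for i in row) for row in indices]


# Build a small one-hot array for the sequences ACGT and CCTT
onehot = np.eye(4)[np.array([[0, 1, 2, 3], [1, 1, 3, 3]])]
print(decode(onehot))  # ['ACGT', 'CCTT']
```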