
complexcgr

This library provides classes built around the Chaos Game Representation (CGR) of DNA sequences.

The Frequency matrix of CGR (FCGR) helps to visualize a k-mer distribution. The FCGR of a sequence is an image showing the distribution of its $k$-mers for a chosen $k$. The frequencies of all $k$-mers are placed in a matrix of size $2^k \times 2^k$, which has one cell for each of the $4^k$ possible $k$-mers.

The position of a $k$-mer in the matrix depends on the encoding given by the CGR.
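As a rough illustration of how a CGR encoding can fix the cell of each k-mer, the sketch below maps a k-mer to a (row, column) pair in a 2^k x 2^k grid. The corner assignment (A=(1,1), C=(-1,1), G=(-1,-1), T=(1,-1)) is an assumption inferred from the `CGR().encode("ACGT")` output shown later in this README, and the helper `kmer_to_cell` with its row/column orientation is hypothetical; the layout used by the library may differ.

```python
from itertools import product

# Assumed corner of each nucleotide in the square [-1, 1] x [-1, 1].
CORNERS = {"A": (1, 1), "C": (-1, 1), "G": (-1, -1), "T": (1, -1)}


def kmer_to_cell(kmer):
    """Return (row, col) of a k-mer in a 2^k x 2^k grid (hypothetical layout)."""
    k = len(kmer)
    x = y = 0
    # Integer form of the CGR walk: the i-th letter contributes 2^i * corner,
    # so x and y end up as odd integers in [-(2^k - 1), 2^k - 1].
    for i, base in enumerate(kmer):
        cx, cy = CORNERS[base]
        x += cx * 2 ** i
        y += cy * 2 ** i
    side = 2 ** k
    col = (x + side - 1) // 2   # leftmost x maps to column 0
    row = (side - 1 - y) // 2   # largest y maps to row 0 (top of the image)
    return row, col


# Every one of the 4^k possible k-mers should land in its own cell.
k = 3
cells = {kmer_to_cell("".join(p)) for p in product("ACGT", repeat=k)}
print(len(cells) == 4 ** k)
```

Integer arithmetic is used here only to keep the cell index exact; the same idea works with the floating-point CGR coordinates for small k.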

Some examples of bacterial assemblies (see reference) are shown below. The name of the species and the sample_id are in the title of each image (see the first image for an example). These images were created using the 6-mers of each assembly and the class FCGR of this library.

10 different species of bacteria represented by their FCGR (6-mers)

Installation

PyPI


pip install complexcgr

To update to the latest version:

pip install complexcgr --upgrade

How to use


1. CGR Chaos Game Representation of DNA

from complexcgr import CGR

# Instantiate class CGR
cgr = CGR()

# encode a sequence
cgr.encode("ACGT")
# > CGRCoords(N=4, x=0.1875, y=-0.5625)

# recover a sequence from CGR coordinates
cgr.decode(N=4,x=0.1875,y=-0.5625)
# > "ACGT"
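To make the encoding above less opaque, here is a minimal from-scratch sketch of a CGR walk: start at the origin and, for each nucleotide, move halfway towards its corner. The corner assignment is an assumption chosen because it reproduces the coordinates shown above; `cgr_encode`, `cgr_decode`, and `Coords` are hypothetical names and are not the library's implementation.

```python
from typing import NamedTuple

# Assumed corners; they reproduce the README output for "ACGT".
CORNERS = {"A": (1, 1), "C": (-1, 1), "G": (-1, -1), "T": (1, -1)}


class Coords(NamedTuple):
    N: int
    x: float
    y: float


def cgr_encode(seq):
    """Move halfway towards the corner of each nucleotide, starting at (0, 0)."""
    x = y = 0.0
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
    return Coords(len(seq), x, y)


def cgr_decode(N, x, y):
    """Undo the walk: the quadrant of (x, y) gives the last nucleotide."""
    corner_to_base = {corner: base for base, corner in CORNERS.items()}
    bases = []
    for _ in range(N):
        cx = 1 if x > 0 else -1
        cy = 1 if y > 0 else -1
        bases.append(corner_to_base[(cx, cy)])
        x, y = 2 * x - cx, 2 * y - cy   # step back to the previous point
    return "".join(reversed(bases))


print(cgr_encode("ACGT"))            # Coords(N=4, x=0.1875, y=-0.5625)
print(cgr_decode(4, 0.1875, -0.5625))  # ACGT
```

Because each step halves the coordinates, floating-point precision limits how long a sequence can be recovered exactly with this approach.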

2. FCGR Frequency Matrix of Chaos Game Representation of DNA

The input for FCGR only accepts sequences over $\{A,C,G,T,N\}$; any $k$-mer that contains an $N$ is not considered when computing the frequency matrix.
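The rule about N can be sketched with a plain sliding window. This is only an illustration of the counting rule described above, under the assumption that every window containing an N is skipped; `count_kmers` is a hypothetical helper, not a function of the library.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count all k-mers of seq, skipping any window that contains an N."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            counts[kmer] += 1
    return counts


# The windows CGN, GNA and NAC contain an N and are ignored.
print(count_kmers("ACGNACGT", k=3))
```

A single N therefore removes up to k windows from the count, which matters for sequences with many ambiguous bases.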

import random; random.seed(42)
from complexcgr import FCGR

# set the k-mer size
fcgr = FCGR(k=8) # (256x256) array

# Generate a random sequence without T's
seq = "".join(random.choice("ACG") for _ in range(300_000))
chaos = fcgr(seq) # an array with the frequencies of each k-mer
fcgr.plot(chaos)
FCGR representation for a sequence without T’s

You can save the image with

fcgr.save_img(chaos, path="img/ACG.jpg")

Formats allowed are defined by PIL.

You can also generate the image with 16 bits (or more) to avoid losing information about the k-mer frequencies.

# Generate image in 16-bits (default is 8-bits)
fcgr = FCGR(k=8, bits=16) # (256x256) array. When using plot() it will be rescaled to [0,65535] colors
# Generate a random sequence without T's and with lots of N's
seq = "".join(random.choice("ACGN") for _ in range(300_000))
chaos = fcgr(seq) # an array with the frequencies of each k-mer
fcgr.plot(chaos)
FCGR representation for a sequence without T’s and lots of N’s
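To see why a higher bit depth can preserve more of the k-mer frequencies, the sketch below rescales a matrix of counts to the range [0, 2^bits - 1] and compares how many distinct grey levels survive. The min-max rescaling with integer floor division is an assumption made for this illustration, and `rescale` is a hypothetical helper; the library may rescale or round differently.

```python
def rescale(matrix, bits=8):
    """Min-max rescale integer counts to the range [0, 2^bits - 1]."""
    top = 2 ** bits - 1
    flat = [value for row in matrix for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant matrix: avoid dividing by zero
        return [[0 for _ in row] for row in matrix]
    return [[(value - lo) * top // (hi - lo) for value in row] for row in matrix]


# A toy matrix holding 1001 distinct counts (0..1000).
counts = [list(range(0, 501)), list(range(500, 1001))]
levels_8 = {v for row in rescale(counts, bits=8) for v in row}
levels_16 = {v for row in rescale(counts, bits=16) for v in row}
print(len(levels_8), len(levels_16))  # 256 1001
```

With 8 bits, several different counts collapse onto the same grey level, while 16 bits keep all of them apart in this example.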

3. iCGR integer Chaos Game Representation of DNA

from complexcgr import iCGR

# Instantiate class iCGR
icgr = iCGR()

# encode a sequence
icgr.encode("ACGT")
# > CGRCoords(N=4, x=3, y=-9)

# recover a sequence from iCGR coordinates
icgr.decode(N=4,x=3,y=-9)
# > "ACGT"
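The integer variant can be sketched in the same spirit: instead of halving the coordinates at every step, the i-th nucleotide adds 2^i times its corner, so all values stay exact integers. The corner assignment is again an assumption that happens to reproduce the output above, and `icgr_encode` / `icgr_decode` are hypothetical names, not the library's code.

```python
# Assumed corners; they reproduce the README output for "ACGT".
CORNERS = {"A": (1, 1), "C": (-1, 1), "G": (-1, -1), "T": (1, -1)}


def icgr_encode(seq):
    """Sum 2^i * corner over the sequence; returns (N, x, y) as integers."""
    x = y = 0
    for i, base in enumerate(seq):
        cx, cy = CORNERS[base]
        x += cx * 2 ** i
        y += cy * 2 ** i
    return len(seq), x, y


def icgr_decode(N, x, y):
    """Peel off the largest power of two first: its sign fixes the last base."""
    corner_to_base = {corner: base for base, corner in CORNERS.items()}
    bases = []
    for i in reversed(range(N)):
        cx = 1 if x > 0 else -1
        cy = 1 if y > 0 else -1
        bases.append(corner_to_base[(cx, cy)])
        x -= cx * 2 ** i
        y -= cy * 2 ** i
    return "".join(reversed(bases))


print(icgr_encode("ACGT"))   # (4, 3, -9)
print(icgr_decode(4, 3, -9))  # ACGT
```

Under these assumptions the integer coordinates are the CGR coordinates multiplied by 2^N (3/16 = 0.1875 and -9/16 = -0.5625), and since Python integers have arbitrary precision, long sequences can be recovered exactly.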

4. ComplexCGR Complex Chaos Game Representation of DNA

from complexcgr import ComplexCGR

# Instantiate class ComplexCGR
ccgr = ComplexCGR()

# encode a sequence
ccgr.encode("ACGT")
# > CGRCoords(k=228,N=4)

# recover a sequence from ComplexCGR coordinates
ccgr.decode(k=228,N=4)
# > "ACGT"
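One way to read the pair (k, N) is as a base-4 number: N is the length of the sequence and k is the value obtained when each nucleotide is a base-4 digit. The sketch below assumes the digit values A=0, C=1, G=2, T=3 with the first nucleotide as the least significant digit, because that choice reproduces k=228 for "ACGT"; `complex_encode` and `complex_decode` are hypothetical names and not the library's implementation.

```python
# Assumed digit values; with the first base as the least significant digit
# they give k = 0*1 + 1*4 + 2*16 + 3*64 = 228 for "ACGT".
VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def complex_encode(seq):
    """Return (k, N): k is the sequence read as a base-4 number."""
    k = sum(VALUE[base] * 4 ** i for i, base in enumerate(seq))
    return k, len(seq)


def complex_decode(k, N):
    """Recover the sequence by extracting N base-4 digits from k."""
    bases = []
    for _ in range(N):
        k, digit = divmod(k, 4)
        bases.append("ACGT"[digit])
    return "".join(bases)


print(complex_encode("ACGT"))   # (228, 4)
print(complex_decode(228, 4))   # ACGT
```

N is needed in addition to k because leading A's contribute nothing to k: "A" and "AA" both give k=0.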

5. ComplexFCGR Frequency Matrix of Complex Chaos Game Representation of DNA

The input for ComplexFCGR only accepts sequences over $\{A,C,G,T,N\}$; any $k$-mer that contains an $N$ is not considered when computing the frequency matrix.

import random; random.seed(42)
from complexcgr import ComplexFCGR

# set the k-mer desired
cfcgr = ComplexFCGR(k=8) # 8-mers

# Generate a random sequence without T's
seq = "".join(random.choice("ACG") for _ in range(300_000))
fig = cfcgr(seq)
ComplexFCGR representation for a sequence without T’s

You can save the image with

cfcgr.save(fig, path="img/ACG-ComplexCGR.png")

Currently, the plot must be saved as PNG.


Advice for real applications

Counting k-mers can be the bottleneck for large sequences (> 100000 bp). Note that the class FCGR (and ComplexCGR) implements a naive approach to count k-mers. This is intentional, since in practice state-of-the-art tools like KMC or Jellyfish are used to count k-mers very efficiently.

We provide the class FCGRKmc, which takes as input the file generated by the following pipeline using KMC3.

Make sure kmc is installed. One recommended way is to create a conda environment and install it there.

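kmer_size=6
input="path/to/sequence.fa"
output="path/to/count-kmers.txt"

# count k-mers; the KMC database is written next to the input file
mkdir -p tmp-kmc
kmc -v -k"$kmer_size" -m4 -sm -ci0 -cs100000 -b -t4 -fa "$input" "$input" "tmp-kmc"
# dump the KMC database to a text file
kmc_tools -t4 -v transform "$input" dump "$output"
# remove the intermediate KMC database files
rm -r "$input.kmc_pre" "$input.kmc_suf"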

The output file path/to/count-kmers.txt can be used with FCGRKmc:

from complexcgr import FCGRKmc

kmer = 6
fcgr = FCGRKmc(kmer)

arr = fcgr("path/to/count-kmers.txt") # k-mer counts ordered in a matrix of 2^k x 2^k


# to visualize the distribution of k-mers. 
# Frequencies are scaled between [min, max] values. 
# White color corresponds to the minimum value of frequency
# Black color corresponds to the maximum value of frequency
fcgr.plot(arr) 

# Save it with numpy
import numpy as np
np.save("path_save/fcgr.npy",arr)
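If you want to inspect the counts before building the FCGR, the text file can be read directly. The sketch below assumes that each line of the dump holds one k-mer followed by its count, separated by whitespace; `read_kmc_dump` is a hypothetical helper and not part of complexcgr.

```python
def read_kmc_dump(path):
    """Read '<k-mer> <count>' lines from a text dump into a dict."""
    counts = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:          # skip empty lines
                continue
            kmer, count = line.split()
            counts[kmer] = int(count)
    return counts
```

For example, `read_kmc_dump("path/to/count-kmers.txt")` would return a dictionary that can be checked for unexpected k-mer lengths or missing k-mers.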

Videos

CGR encoding

CGR encoding of a sequence

CGR encoding of all k-mers

How are k-mers distributed for different k

ComplexCGR encoding

How are k-mers ordered based on lexicographic order for k=2

ComplexCGR and Symmetry

Conjugate of a complex number has a meaning (reverse sequence)

Functionalities/TODO list


version 0.8.0:
The available classes and functionalities are listed below:

Encoders: the encoders are functions that map a sequence $s$ over $\{A,C,G,T\}$ to a point in the plane. They are CGR, iCGR, and ComplexCGR.

CGR Chaos Game Representation: encodes a DNA sequence in 3 numbers $(N,x,y)$.

  • encode a sequence.
  • recover a sequence from a CGR encoding.

iCGR integer CGR: encodes a DNA sequence in 3 integers $(N,x,y)$.

  • encode a sequence
  • recover a sequence from an iCGR encoding

ComplexCGR: encodes a DNA sequence in 2 integers $(k,N)$.

  • encode a sequence
  • recover a sequence from a ComplexCGR encoding
  • plot sequence of ComplexCGR encodings

Image for distribution of k-mers

Author

complexcgr is developed by Jorge Avila Cartes

Related publications
