————————————————————————————————————————————————————————————————————————————————
At a glance:
Commet is used for de novo comparisons of raw read sets. It is the successor of
the compareads tool.
————————————————————————————————————————————————————————————————————————————————
This software is a computer program whose purpose is to find all the similar
reads between two sets of NGS reads. It also provides a similarity score between
the two samples.
Copyright (C) 2014 INRIA
This program is free software: you can redistribute it and/or modify it under
the terms of the GNU Affero General Public License as published by the Free
Software Foundation, either version 3 of the License, or any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along
with this program. If not, see http://www.gnu.org/licenses/.
————————————————————————————————————————————————————————————————————————————————
SCRIPTS
Commet.py:
  - input:
      - A file containing a list of read sets (see below)
  - output:
      - A bit vector for each intersection of pairs of read sets
      - 3 matrices in CSV format
          - plain: line i, column j contains the number of reads
            from set i similar to at least one read of set j
          - percentage: line i, column j contains the percentage
            of reads from set i similar to at least one read of
            set j
          - normalized: line i, column j contains the percentage
            of reads similar between sets i and j with respect to
            the total number of reads in sets i and j
      - 3 heatmaps (png format), computed from the three previously
        mentioned matrices
      - A dendrogram (png format) of the input read sets, obtained from
        the normalized matrix by complete-linkage hierarchical
        clustering
  - misc.:
      - This script needs python 2.7
      - This script can parallelize computations on SGE clusters using
        the --sge option. In this case the analysis of the CSV
        matrices, which generates the four png figures (three heatmaps
        and one dendrogram), is performed once all jobs are finished,
        using the Commet_analysis.py script.
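To make the three matrix definitions concrete, here is a minimal Python sketch.
It is not taken from Commet.py: the function name and the input layout are
hypothetical, and the normalized formula is one reading of the description
above (it assumes that "reads similar between sets i and j" means the sum of
the two directed counts).

```python
def similarity_matrices(shared, totals):
    """Build plain, percentage and normalized matrices (illustrative only).

    shared[i][j]: number of reads of set i similar to at least one read of set j
    totals[i]:    total number of reads in set i
    """
    n = len(totals)
    # plain: the directed counts themselves
    plain = [row[:] for row in shared]
    # percentage: directed count relative to the size of set i
    percentage = [[100.0 * shared[i][j] / totals[i] for j in range(n)]
                  for i in range(n)]
    # normalized (assumed reading): both directions over both set sizes
    normalized = [[100.0 * (shared[i][j] + shared[j][i]) / (totals[i] + totals[j])
                   for j in range(n)]
                  for i in range(n)]
    return plain, percentage, normalized


# Toy data: two read sets of 1000 and 500 reads.
totals = [1000, 500]
shared = [[1000, 200],
          [150, 500]]
plain, percentage, normalized = similarity_matrices(shared, totals)
```

Under this reading the normalized matrix is symmetric, which is what a
clustering step such as the dendrogram needs, whereas the plain and percentage
matrices are not symmetric in general.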
————————————————————————————————————————————————————————————————————————————————
LISTS OF READ SETS
Read sets may be composed of several files in different formats (fasta or
fastq, possibly compressed with gzip). To give these read sets to Commet, we
use the following format:
- Each line corresponds to a read set.
- Each line begins with an identifier for the set directly followed by a
colon (:).
- Then, each file of a set is separated by a semicolon (;).
- A bit vector may be associated to a file by adding its name after a
comma (,).
Example:
Name1:path_set1.1.fq.gz;path_set_1.2.fq.gz,bv1.2.bv;path_set1.3.fq;…
Name2:path_set2.1.fq.gz;path_set_2.2.fq.gz
Name3:path_set3.1.fq,bv3.1.bv;path_set_3.2.fq,bv3.2.bv;path_set3.3.fq.gz
…
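The format rules above can be illustrated with a short Python sketch. This is
not Commet's own parser; the function name parse_read_set_line is hypothetical,
and the sketch assumes that identifiers and paths contain no colon, semicolon
or comma.

```python
def parse_read_set_line(line):
    """Split one line of a read set list into (identifier, files).

    files is a list of (path, bit_vector) pairs; bit_vector is None when
    no bit vector is attached to the file.
    """
    # The identifier is everything before the first colon.
    name, _, files_part = line.strip().partition(":")
    files = []
    # Files of a set are separated by semicolons.
    for entry in files_part.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # An optional bit vector name follows a comma.
        path, _, bit_vector = entry.partition(",")
        files.append((path, bit_vector or None))
    return name, files


example = "Name3:path_set3.1.fq,bv3.1.bv;path_set_3.2.fq,bv3.2.bv;path_set3.3.fq.gz"
name, files = parse_read_set_line(example)
```

Here name is "Name3" and files holds three entries, the last one without an
associated bit vector.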
————————————————————————————————————————————————————————————————————————————————
APPLICATIONS SUMMARY (compiled in the ‘bin’ directory)
filter_reads:
  input:
    A file containing reads (nucleotide sequences) in fasta or fastq format.
    The input file may be compressed with gzip.
    Four selection parameters:
      - minimum size [default=0]
      - maximum number of N [default=any]
      - minimum Shannon index [default=0]
      - maximum number of selected reads [default=all]
  output:
    A file containing a vector of bits.
    Each bit of the vector corresponds to a read in the input file.
    If the bit is 1 the read is selected; if it is 0 the read was filtered out.
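The selection logic can be sketched in Python as follows. This is only an
illustration, not the C++ implementation: the function and parameter names are
hypothetical, the Shannon index is assumed to be the base-2 entropy of the
character frequencies of the read, and the bit vector is kept as a list in
memory instead of being written to a file.

```python
import math
from collections import Counter


def shannon_index(read):
    """Base-2 entropy of the character frequencies of a read (assumed definition)."""
    total = len(read)
    if total == 0:
        return 0.0
    counts = Counter(read.upper())
    return -sum((c / float(total)) * math.log(c / float(total), 2)
                for c in counts.values())


def select_reads(reads, min_size=0, max_n=None, min_shannon=0.0, max_selected=None):
    """Return one bit per read: 1 if the read is selected, 0 if filtered out."""
    bits = []
    selected = 0
    for read in reads:
        keep = (
            len(read) >= min_size
            and (max_n is None or read.upper().count("N") <= max_n)
            and shannon_index(read) >= min_shannon
            and (max_selected is None or selected < max_selected)
        )
        bits.append(1 if keep else 0)
        if keep:
            selected += 1
    return bits


reads = ["ACGTACGTAC", "AAAAAAAAAA", "ACGNNNNNGT", "ACG"]
bits = select_reads(reads, min_size=5, max_n=2, min_shannon=1.0)
```

With these toy reads only the first one passes all filters: the second has a
Shannon index of 0, the third contains five N, and the fourth is too short.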
Contributors:
  Pierre PETERLONGO, pierre.peterlongo@inria.fr [24/07/14]
  Nicolas MAILLET, nicolas.maillet@inria.fr [24/07/14]
  Guillaume Collet, guillaume@gcollet.fr [24/07/14]
————————————————————————————————————————————————————————————————————————————————
REQUIREMENTS
————————————————————————————————————————————————————————————————————————————————
INSTALLATION
Commet’s applications are written in C++. A makefile that uses g++ is provided
to compile them:
    make
If you want to install them in /usr/local/bin, run the install step with root
permissions (use sudo):
    make
    sudo make install
You can then test Commet on the ABCDE_bench data by running the following
command:
    python Commet.py ABCDE_bench/sets_config.txt -k 32
————————————————————————————————————————————————————————————————————————————————
USER GUIDE
For complete descriptions of modules and usage, a user guide is provided in the doc directory.
————————————————————————————————————————————————————————————————————————————————
APPLICATIONS SUMMARY (compiled in the ‘bin’ directory)
  filter_reads
  index_and_search
  extract_reads
  bvop
————————————————————————————————————————————————————————————————————————————————
FILTER_READS
input:
output:
————————————————————————————————————————————————————————————————————————————————
INDEX_AND_SEARCH
input:
output:
————————————————————————————————————————————————————————————————————————————————
BVOP
input:
output:
————————————————————————————————————————————————————————————————————————————————
EXTRACT_READS
input:
A file containing reads (nucleotide sequences) in fasta or fastq format. The
input file may be compressed with gzip.
A file containing a vector of bits of the same size, i.e. one bit per read of
the input file.
output: