Medusa
A draft genome scaffolder that uses multiple reference genomes in a graph-based approach.
Availability and dependencies
This document is a short guide to the stand-alone version of the software Medusa. This software has not yet been published. A web interface is available at http://combo.dbe.unifi.it/medusa. The source code, a precompiled version, and this manual are accessible at https://github.com/combogenomics/medusa.
Medusa depends on the following packages being installed on your system and available in your PATH:
MUMmer: this software is available at http://mummer.sourceforge.net/.
Python (from 2.6) and BioPython (from 1.61).
Java (from 1.6).
The following Python packages should be present:
Networkx
Numpy
Biopython
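Before running Medusa it can help to confirm that the dependencies listed above are actually reachable. The following is a minimal sketch, not part of Medusa, written for Python 3. The executable name nucmer (one of the programs shipped with MUMmer) is an assumption about what needs to be on the PATH, and the function name check_dependencies is hypothetical.

```python
import importlib.util
import shutil

# Executables looked up on PATH. "nucmer" ships with MUMmer; that Medusa
# invokes this particular program is an assumption of this sketch.
REQUIRED_EXECUTABLES = ["java", "python", "nucmer"]
# Import names of the Python packages listed above (Biopython imports as "Bio").
REQUIRED_MODULES = ["networkx", "numpy", "Bio"]


def check_dependencies(executables=REQUIRED_EXECUTABLES, modules=REQUIRED_MODULES):
    """Return a dict mapping each dependency name to True if it was found."""
    found = {}
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH.
        found[exe] = shutil.which(exe) is not None
    for mod in modules:
        # find_spec returns None when a top-level module cannot be imported.
        found[mod] = importlib.util.find_spec(mod) is not None
    return found


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

The check only reports presence, not versions. It also inspects the interpreter that runs it, which may differ from the one Medusa's scripts use.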
The archive Medusa.tar.gz contains the following files:
A runnable .jar file (medusa.jar). This is the program you will run.
A sub-folder with the python scripts needed to run the program (medusa_scripts). Keep it in the same folder as the .jar file.
A sub-folder with a dataset (test) that can be used to test the tool.
A sub-folder with scripts useful for benchmarking the tool.
Input and Output
The following inputs are required:
The targetGenome file: a draft genome in fasta format. This is the genome you are interested in scaffolding.
An arbitrarily long list of auxiliaryDraft files: other draft genomes in fasta format. The more closely related these organisms are to the target, the better the results will be. These files are expected to be collected in a single directory. The path to this directory can be specified with the “-f” option (see the next section).
The following output files will be produced:
targetGenome_SUMMARY: a text file containing information about your data (number of scaffolds, N50 value, etc.).
targetGenomeScaffold.fasta: a fasta file with the sequences grouped in scaffolds. Contigs in the same scaffold are separated by 100 Ns by default, or by a variable number of Ns (an estimate of the distance between the contigs) if the option “-d” is used.
The following output files can optionally be produced:
targetGenome_distanceTable: a tabular file with the estimated distance (in bp) between successive contigs.
targetGenome_network.gexf: the contig network in gexf format.
targetGenome_cover.gexf: the final path cover in gexf format.
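The gap convention described above for targetGenomeScaffold.fasta (100 Ns by default, or an estimated number of Ns with “-d”) can be illustrated with a small Python sketch. This is not Medusa's own code. The function name join_contigs and the handling of non-positive estimates are assumptions made for illustration only.

```python
def join_contigs(contigs, gaps=None, default_gap=100):
    """Concatenate contig sequences into one scaffold, separated by runs of N.

    contigs: list of sequence strings, already ordered and oriented.
    gaps: optional list of estimated distances (bp), one per adjacent pair;
          None reproduces the fixed default spacing.
    """
    if gaps is None:
        gaps = [default_gap] * (len(contigs) - 1)
    if len(gaps) != max(len(contigs) - 1, 0):
        raise ValueError("need exactly one gap per pair of adjacent contigs")
    parts = []
    for i, seq in enumerate(contigs):
        parts.append(seq)
        if i < len(gaps):
            # Assumption: a non-positive estimate yields no Ns at all; the
            # manual does not say how Medusa treats such cases.
            parts.append("N" * max(int(gaps[i]), 0))
    return "".join(parts)
```

For example, join_contigs(["ACGT", "TTAA"]) gives the two contigs separated by 100 Ns, while passing gaps=[250] would separate them by 250 Ns instead.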
Usage
The project folder must contain:
the targetGenome in fasta format.
the medusa.jar file
the scripts sub-folder “medusa_scripts”.
the comparison genomes sub-folder “drafts”. (Alternatively, you can specify another path for this folder using the “-f” option.)
Medusa can be run with the following parameters:
The option -i is required and indicates the name of the target genome file.
The option -o is optional and indicates the name of the output fasta file.
The option -v (recommended) prints to the console the information produced by the MUMmer package. This option is strongly recommended for checking whether MUMmer is running properly.
The option -f is optional and indicates the path to the comparison drafts folder.
The option -random is available (not required). This option allows the user to run a given number of cleaning rounds and keep the best solution. Since the variability is small, 5 rounds are usually sufficient to find the best score.
The option -w2 is optional and enables a weighting scheme based on sequence similarity. Using a different weighting scheme may lead to better results.
The option -d allows for the estimation of the distance between pairs of contigs based on the reference genome(s): in this case the scaffolded contigs will be separated by a number of N characters equal to this estimate. The estimated distances are also saved in the “*_distanceTable” file. By default the scaffolded contigs are separated by 100 Ns.
The option -gexf is optional. With this option the contig network and the path cover are also provided in gexf format.
The option -n50 computes the N50 statistic of a FASTA file. In this case the usage is: java -jar medusa.jar -n50 followed by the name of the FASTA file. All other options will be ignored.
Finally the -h option provides a small recap of the previous ones.
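For readers unfamiliar with the N50 statistic mentioned for the -n50 option, the sketch below computes it in Python using the usual definition: the largest length L such that sequences of length at least L account for at least half of the total assembly length. This is an independent illustration. Medusa's own implementation is not described in this manual, and the function names are hypothetical.

```python
def fasta_lengths(lines):
    """Return the sequence lengths found in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first length where the cumulative sum reaches half the total.
        if 2 * running >= total:
            return length
    return 0
```

An open file handle can be passed directly, for example n50(fasta_lengths(open(path))), where path is the FASTA file of interest.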
The Medusa archive
When the Medusa archive is extracted, the following files will be present:
the medusa.jar file.
the scripts sub-folder “medusa_scripts”.
the utility test scripts folder “medusa_testing”.
a folder “test”, containing one bacterial test dataset.
Running an example
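As a hedged illustration of how a run could be assembled from the options described in the Usage section, the Python sketch below builds the command line as a list of arguments. The helper build_medusa_command is hypothetical and not part of Medusa. That the number of rounds is given right after -random is an assumption, and any file or folder names passed to it are placeholders rather than actual names from the test folder.

```python
def build_medusa_command(target, drafts_dir=None, output=None, verbose=True,
                         rounds=None, w2=False, distances=False, gexf=False,
                         jar="medusa.jar"):
    """Return the Medusa command line as a list of arguments."""
    cmd = ["java", "-jar", jar, "-i", target]  # -i is the only required option
    if output:
        cmd += ["-o", output]
    if drafts_dir:
        cmd += ["-f", drafts_dir]
    if verbose:
        cmd.append("-v")
    if rounds is not None:
        # Assumption: the number of rounds follows the flag as its argument.
        cmd += ["-random", str(rounds)]
    if w2:
        cmd.append("-w2")
    if distances:
        cmd.append("-d")
    if gexf:
        cmd.append("-gexf")
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) from the project folder, so that medusa.jar and medusa_scripts are found next to each other.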
Additional datasets for benchmarking
Additional datasets can be retrieved from the medusa_datasets repository at https://github.com/combogenomics/medusa_datasets, for example by cloning it with git.
Compile
The project can be compiled by calling ant in the top-level directory.