MDAsim 2 extends the originally published MDAsim 1.2 by simulating single-nucleotide copy errors and writing a log file of the errors it introduces.
Installation
bioconda
The easiest way to install MDAsim is via bioconda. For the required one-time setup of conda and bioconda, please refer to bioconda’s installation instructions. Once this is done, all you need to do is issue the following command:
conda install mdasim
build from source
To build MDAsim from source, download the latest tagged release and unpack it. Alternatively, you can get the latest state of the repository by cloning it with:
git clone https://github.com/hzi-bifo/mdasim.git
Enter the created directory:
cd mdasim
To install in a bin folder in this source directory, just run:
make
To install to a bin folder at a custom location, run:
make prefix=path/to/desired/build/folder
To be able to run the software from anywhere on your system, make sure that the created bin directory is in your $PATH.
Usage
Once installed, you can get the full command line usage message with:
mdasim --help
For an initial quick test run, you can use the provided example files (please note: the = between each command line argument and its value is required):
cd examples
mdasim --input=example_input.fa --primers=primerList.fasta --coverage=15 \
--output=example_mdasim_out_prefix_ --log=example_mdasim_errors.log >example_mdasim_run.log
Note that the file provided under --input must contain exactly one sequence, since MDAsim does not process input files with more than one sequence.
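Because MDAsim expects exactly one sequence in the input file, it can help to check this before starting a long run. The following is a minimal sketch, not part of MDAsim; the function names are hypothetical, and it assumes that every FASTA record starts with a header line beginning with `>`.

```python
def count_fasta_records(fasta_text: str) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def is_single_sequence(fasta_text: str) -> bool:
    """Return True if the text holds exactly one FASTA record."""
    return count_fasta_records(fasta_text) == 1


# In-memory examples; for a real file, read its content first, e.g.
# fasta_text = open("example_input.fa").read()
single = ">chr1\nACGT\nACGT\n"
multi = ">chr1\nACGT\n>chr2\nGGCC\n"
print(is_single_sequence(single))  # True
print(is_single_sequence(multi))   # False
```

If the check fails, the file would need to be split into one file per sequence before running MDAsim on each of them.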
RAM requirements and runtime
The required memory is linearly proportional to the size of the input genome and to the final coverage requested. For example, for the S. aureus sample, whose genome size is on the order of 3 Mbp, reaching 50x average coverage needs about 6 GB of RAM. Consequently, a 3 Gbp genome at 50x average coverage would need approximately 6 TB of RAM. One way to reduce the memory required for large genomes is to break the genome into smaller pieces (possibly with some overlap) and then run MDAsim on each piece separately.
The runtime of MDAsim increases exponentially with the length of the simulated sequence. Simulating the amplification of 60 Mbp at a target coverage of 20 takes around 8 hours.
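The linear scaling and the splitting suggestion can be sketched in a few lines. This is a rough back-of-the-envelope helper, not part of MDAsim: it takes the single data point quoted above (a genome of about 3 million bases at 50x needing about 6 GB) as its reference and assumes the relation is exactly linear. The function names and the half-open interval convention are assumptions made for illustration.

```python
# Reference point taken from the text above: ~3 million bases at 50x, ~6 GB.
REF_GENOME_BP = 3_000_000
REF_COVERAGE = 50
REF_RAM_GB = 6.0


def estimate_ram_gb(genome_bp: int, coverage: float) -> float:
    """Scale the reference point linearly in genome size and coverage."""
    return REF_RAM_GB * (genome_bp / REF_GENOME_BP) * (coverage / REF_COVERAGE)


def split_with_overlap(length: int, piece: int, overlap: int):
    """Return half-open (start, end) intervals that cover [0, length).

    Consecutive intervals share `overlap` bases; the last one may be shorter.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if not 0 <= overlap < piece:
        raise ValueError("overlap must be non-negative and smaller than piece")
    step = piece - overlap
    intervals = []
    start = 0
    while True:
        end = min(start + piece, length)
        intervals.append((start, end))
        if end >= length:
            break
        start += step
    return intervals


print(estimate_ram_gb(3_000_000_000, 50))  # 6000.0 (GB, i.e. about 6 TB)
print(split_with_overlap(10, 4, 1))        # [(0, 4), (3, 7), (6, 10)]
```

Each interval can then be written to its own single-sequence FASTA file and simulated separately. Note that positions reported by MDAsim for a piece are relative to that piece, so the interval start has to be added to map them back to the full genome.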
Output formats
<out_prefix>Amplicons.fasta
The format is fasta, with the ID line of each amplicon as follows:
R<IOA>: amplicon name consisting of R and an amplicon index counter that starts at 1 (i.e. the last one shows the total number of amplicons in the file). Within the output file, the name of a fragment is unique.
<LA>: length of the amplicon
<POS>: position on the original input sequence where this fragment can be aligned to. Positions start at 0.
<S>: + or -, indicating the positive or negative strand. To align a negative-strand amplicon with the original input sequence, it must be reverse-complemented.
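To make the strand handling concrete, here is a minimal sketch of how a negative-strand amplicon could be brought back onto the input sequence's strand. It is not MDAsim code; the function names are hypothetical, and it assumes sequences contain only A, C, G, T and N (other characters are left unchanged by the translation table).

```python
# Complement table for unambiguous DNA bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_input_strand(seq: str, strand: str) -> str:
    """Return the amplicon as it would read on the input sequence's strand."""
    if strand == "+":
        return seq
    if strand == "-":
        return reverse_complement(seq)
    raise ValueError("strand must be '+' or '-'")


print(reverse_complement("AACG"))    # CGTT
print(to_input_strand("AACG", "+"))  # AACG
print(to_input_strand("AACG", "-"))  # CGTT
```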
--log errors.log
The log file for single nucleotide substitution errors is tab-separated, with the following format:
#pos\tref\tsub
pos: position on the original input sequence (0-based)
ref: reference nucleotide in the original input sequence that was replaced
sub: nucleotide that is generated in the strand of the original input sequence
For consistent reference back to the original input sequence, both ref and sub will be reported as if incorporated into the input sequence’s strand. I.e., if the substitution happens in the complementary strand, both nucleotides will be complemented before logging.
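The log described above is simple enough to read with a few lines of code. The following sketch is not shipped with MDAsim; it assumes that the header line starts with `#` and that every other non-empty line holds exactly the three tab-separated fields pos, ref and sub. The names `Substitution`, `parse_error_log` and `substitution_spectrum` are hypothetical.

```python
from collections import Counter
from typing import List, NamedTuple


class Substitution(NamedTuple):
    pos: int  # 0-based position on the original input sequence
    ref: str  # nucleotide that was replaced
    sub: str  # nucleotide that replaced it


def parse_error_log(text: str) -> List[Substitution]:
    """Parse tab-separated log lines, skipping blank lines and '#' headers."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        pos, ref, sub = line.split("\t")
        records.append(Substitution(int(pos), ref, sub))
    return records


def substitution_spectrum(records: List[Substitution]) -> Counter:
    """Count how often each (ref, sub) pair occurs."""
    return Counter((r.ref, r.sub) for r in records)


# Made-up example content, only to show the shape of the data.
log_text = "#pos\tref\tsub\n2\tG\tA\n5\tT\tC\n9\tG\tA\n"
records = parse_error_log(log_text)
print(len(records))                               # 3
print(substitution_spectrum(records)["G", "A"])   # 2
```

Because both nucleotides are already reported on the input sequence's strand, the counts can be compared directly against the input sequence without any further complementing.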
caveats
Floating point exception on small inputs
Small input fasta sequences combined with a low target coverage can trigger a Floating point exception. In the example run provided above, setting --coverage=10 showcases this. While we assume numerical issues, we have resisted the urge of being nerd-sniped in this particular case. But if you want to investigate, please contribute to the issue we use to track this.
Citation
MDAsim 2 extends the original MDAsim 1.2. Whenever you use MDAsim 2, please cite both versions:
MDAsim 2 citation: Until there is a publication to cite, please cite the link to the tagged version that you use (e.g.: https://github.com/hzi-bifo/mdasim/releases/v2.1.1) or the exact bioconda version of the package that you use.
The original MDAsim 1.2 can be found on SourceForge, and we have kept its README_mdasim1-2.txt for reference. For credits for different features, please refer to CREDITS_mdasim.txt and the change logs of the releases. The license, as set by the original MDAsim 1.2, is provided in LICENSE.txt.