build(deps): bump rand from 0.10.0 to 0.10.1 (#158)
Bumps rand from 0.10.0 to 0.10.1.
updated-dependencies:
- dependency-name: rand dependency-version: 0.10.1 dependency-type: direct:production …
Signed-off-by: dependabot[bot] support@github.com Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
版权所有:中国计算机学会技术支持:开源发展技术委员会
京ICP备13000930号-9
京公网安备 11010802032778号
Randomly subsample sequencing reads or alignments.
Table of Contents
Install
Precompiled binary
The quickest way to get
rasusais with the install script:The installer will identify the architecture and OS, download the latest binary for your system, and move it to
/usr/local/bin(on Linux/macOS).If you would like to install it to a different location, you can pass the
-bor--bin-diroption to the script.You can also pass options to the script like so
cargoAlternatively, for more modern Rust-based binary management, we recommend using
cargo-binstall:condaPrerequisite:
conda(and bioconda channel correctly set up)Container
Docker images are hosted on GitHub Container Registry.
apptainerPrerequisite:
apptainerThe above will use the latest version. If you want to specify a version then use a tag (or commit) like so.
dockerPrerequisite:
dockerYou can find all the available tags on the container registry. Note: versions prior to 0.4.0 were housed on Docker Hub. And versions from 0.4.0 to 2.2.0 were on quay.io.
homebrewPrerequisite:
homebrewBuild locally
Prerequisite:
rusttoolchainUsage
Basic usage - reads
Subsample FASTQ reads or unaligned SAM/BAM/CRAM.
rasusa readsalso supports unaligned SAM/BAM/CRAM input:The above commands will output the subsampled file to
stdout.Or, if you have paired Illumina
For more details on the above options, and additional options, see below.
Basic usage - alignments
Subsample alignments
this will subsample each position in the alignment to 30x coverage.
Required parameters
rasusahas three required options for thereadscommand, and two required options for thealncommand.Input
This positional argument specifies the file(s) containing the reads or unaligned alignments you would like to subsample.
For the
readscommand, the file(s) can be in FASTA, FASTQ, or unaligned SAM, BAM, or CRAM format. FASTA and FASTQ files can be compressed (with a tool such asgzip). If two files are passed toreads,rasusawill assume they are paired-end reads.For the
alncommand, the file must be a valid coordinate-sorted SAM/BAM/CRAM file.Coverage
-c,--coverageThis option is used to determine the minimum coverage to subsample the reads to. For the
readscommand, it can be specified as an integer (100), a decimal/float (100.7), or either of the previous suffixed with an ‘x’ (100x). For thealncommand, it is an integer only.Due to the method for determining how many bases are required to achieve the desired coverage in the
readscommand, the actual coverage, in the end, could be slightly higher than requested. For example, if the last included read is very long. The log messages should inform you of the actual coverage in the end.For the
alncommand, the coverage is the maximum number of reads that should be present at each position in the alignment. If a position has fewer than the requested number of reads, all reads at that position will be included.Genome size
-g,--genome-sizeThe genome size of the input is also required. It is used to determine how many bases are necessary to achieve the desired coverage. This can, of course, be as precise or rough as you like.
Genome size can be passed in many ways. As a plain old integer (1600), or with a metric suffix (1.6kb). All metric suffixes can have an optional ‘b’ suffix and be lower, upper, or mixed case. So ‘Kb’, ‘kb’ and ‘k’ would all be inferred as ‘kilo’. Valid metric suffixes include:
Alternatively, a FASTA/Q index file can be given and the genome size will be set to the sum of all reference sequences in it.
Optional parameters
Output
-o,--outputreadsBy default,
rasusawill output the subsampled file tostdout(if one file is given). If you would prefer to specify an output file path, then use this option.Output for Illumina paired files must be specified using
--outputtwice --o out.r1.fq -o out.r2.fqThe ordering of the output files is assumed to be the same as the input.
rasusa readswill attempt to automatically infer the output format and whether compression of the output file(s) is required. It does this by detecting any of the supported extensions:.fa/.fasta: FASTA format.fq/.fastq: FASTQ format.bam: BAM format.cram: CRAM format.sam: SAM format.gz: will compress the output withgzip.bzor.bz2: will compress the output withbzip2.lzma: will compress the output with thexzLZMA algorithm.zst: will compress the output withzstdalnFor the
alncommand, the output file format will be the same as the input if writing to stdout, otherwise it will be inferred from the file extension.Output compression/format
readsrasusa readsprovides two options for explicitly setting the output format and compression.-Z,--compress-typeUse this option to manually set the compression algorithm to use for the output file(s). It will override any format automatically detected from the output path. Note: this is only used for FASTA/FASTQ output.
Valid options are:
g:gzipb:bzip2lorx:xzLZMA algorithmz:zstdu: no compression-O,--output-formatUse this option to manually set the output format to use for the output file(s). It will override any format automatically detected from the output path.
Valid options are:
forfasta: FASTAqorfastq: FASTQsorsam: SAMborbam: BAMcorcram: CRAMaln-O,--output-formatUse this option to manually set the output file format for
rasusa aln. By default, the same format as the input will be used, or the format will be guessed from the--outputpath extension if given.Valid options are:
sorsam: SAMborbam: BAMcorcram: CRAMAll values to this option are case insensitive.
Compresion level
-l,--compress-levelCompression level to use if compressing the output. By default this is set to the default for the compression type being output.
Target number of bases
-b,--basesExplicitly set the number of bases required in the subsample. This option takes the number in the same format as genome size.
Number of reads
-n,--numExplicitly set the number of reads in the subsample. This option takes the number in the same format as genome size.
When providing paired reads as input, this option will sample this many total read pairs. For example, when passing
-n 20 r1.fq r2.fq, the two output files will have 20 reads each, and the read ids will be the same in both.Note: if this option is given, genome size and coverage are not required.
Fraction of reads
-f,--fracExplicitly set the fraction of total reads in the subsample. The value given to this option can be a float or a percentage - i.e.,
-f 0.5and-f 50will both take half of the reads.Random seed
-s,--seedThis option allows you to specify the random seed used by the random subsampler. By explicitly setting this parameter, you make the subsample for the input reproducible. You only need to pass this parameter if you are likely to want to subsample the same input file again in the future and want the same subset of reads. However, if you forget to use this option, the seed generated by the system will be printed to the log output, allowing you to use it in the future.
Strictly adhere to requested coverage
-e,--strictIf the requested coverage, total bases, number of reads, or fraction of reads cannot be met, an error will be thrown. By default, a warning is displayed, and the maximum possible coverage, total bases, number of reads, or fraction of reads is used.
Subsampling strategy
--strategyBy default,
rasusa alnuses thestreamstrategy, which implements a fast sweep-line algorithm with random priority. It processes a coordinate-sorted alignment file in a single pass (two passes if using paired-end data) while maintaining an active set of reads in a heap, ensuring that no position exceeds the target depth N.This strategy provides the option
--swap-distance(default: 5 bp), which limits the allowed distance when swapping between reads encountered in the current scan and reads already in the heap.Alternatively, users can select the
fetchstrategy. This approach repeatedly fetches overlapping reads, shuffles them, and samples to the target depth N.The fetch strategy provides additional controls:
--batch-size(default: 10 kb): size of genomic window cached in memory--step-size(default: 100 bp): step size used when scanning along the chromosome to find overlapping readsIn most cases, the default
streamstrategy is recommended due to its speed and low memory usage. See this PR for a discussion of performance and behavior differences between these two strategies.Verbosity
-vAdding this optional flag will make the logging more verbose. By default, logging will produce messages considered “info” or above (see here for more details). If verbosity is switched on, you will additionally get “debug” level logging messages.
Full usage
readscommandalncommandBenchmark
The real question is: will
rasusajust needlessly eat away at your precious time on earth?To do this benchmark, I am going to use hyperfine.
The data I used comes from
Single long read input
Download and rename the FASTQ
The file size is 2.9G, and it has 379,547 reads.
We benchmark against
filtlongusing the same strategy outlined in Motivation.Results
filtlong --target_bases 220576600 tb.fqrasusa reads tb.fq -c 50 -g 4411532 -s 1Summary:
rasusaran 21.77 ± 0.29 times faster thanfiltlong.Paired-end input
Download and then deinterleave the FASTQ with
pyfastaqEach file’s size is 179M and has 283,590 reads.
For this benchmark, we will use
seqtk. We will also testseqtk‘s 2-pass mode as this is analogous torasusa reads.Results
seqtk sample -s 1 r1.fq 140000 > /tmp/r1.fq; seqtk sample -s 1 r2.fq 140000 > /tmp/r2.fq;seqtk sample -2 -s 1 r1.fq 140000 > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq 140000 > /tmp/r2.fq;rasusa reads r1.fq r2.fq -n 140000 -s 1 -o /tmp/r1.fq -o /tmp/r2.fqSummary:
rasusa readsran 1.84 times faster thanseqtk(1-pass) and 1.77 times faster thanseqtk(2-pass)So,
rasusa readsis faster thanseqtkbut doesn’t require a fixed number of reads - allowing you to avoid doing maths to determine how many reads you need to downsample to a specific coverage. 🤓Contributing
If you would like to help improve
rasusayou are very welcome!For changes to be accepted, they must pass the CI and coverage checks. These include:
rustfmt. This can be done by runningcargo fmtin the project directory.cargo clippy --all-features --all-targets -- -D warningstarpaulin.Citing
If you use
rasusain your research, it would be appreciated if you could cite it.Bibtex
You can get the following citation by running
rasusa cite