NSCCN/rasusa：随机下采样FASTQ文件以生成指定覆盖深度的子集

Randomly subsample sequencing reads or alignments.

Hall, M. B., (2022). Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software, 7(69), 3941, https://doi.org/10.21105/joss.03941

Table of Contents
Install
Usage
Benchmark
Contributing
Citing

Install

GitHub Downloads (all assets, all releases)

Precompiled binary

The quickest way to get rasusa is with the install script:

curl -sSL https://github.com/mbhall88/rasusa/releases/latest/download/install.sh | sh
# or with wget
wget -nv -O - https://github.com/mbhall88/rasusa/releases/latest/download/install.sh | sh

The installer will identify the architecture and OS, download the latest binary for your system, and move it to /usr/local/bin (on Linux/macOS).

If you would like to install it to a different location, you can pass the -b or --bin-dir option to the script.

curl -sSL https://github.com/mbhall88/rasusa/releases/latest/download/install.sh | sh -s -- -b /my/custom/bin

You can also pass options to the script like so

$ curl -sSL https://github.com/mbhall88/rasusa/releases/latest/download/install.sh | sh -s -- --help
install.sh [option]

Fetch and install the latest version of rasusa, if rasusa is already
installed it will be updated to the latest version.

Options
        -V, --verbose
                Enable verbose output for the installer

        -f, -y, --force, --yes
                Skip the confirmation prompt during installation

        -p, --platform
                Override the platform identified by the installer [default: apple-darwin]

        -b, --bin-dir
                Override the bin installation directory [default: /usr/local/bin]

        -a, --arch
                Override the architecture identified by the installer [default: x86_64]

        -B, --base-url
                Override the base URL used for downloading releases [default: https://github.com/mbhall88/rasusa/releases]

        -h, --help
                Display this help message

`cargo`

Crates.io Total Downloads

cargo install rasusa

Alternatively, for more modern Rust-based binary management, we recommend using cargo-binstall:

cargo binstall rasusa

`conda`

Conda

Prerequisite: conda (and bioconda channel correctly set up)

conda install rasusa

Container

Docker images are hosted on GitHub Container Registry.

`apptainer`

Prerequisite: apptainer

URI="docker://ghcr.io/mbhall88/rasusa"
apptainer exec "$URI" rasusa --help

The above will use the latest version. If you want to specify a version then use a tag (or commit) like so.

VERSION="2.2.2"
URI="docker://ghcr.io/mbhall88/rasusa:${VERSION}"

`docker`

Prerequisite: docker

docker pull ghcr.io/mbhall88/rasusa
docker run ghcr.io/mbhall88/rasusa --help

You can find all the available tags on the container registry. Note: versions prior to 0.4.0 were housed on Docker Hub. And versions from 0.4.0 to 2.2.0 were on quay.io.

`homebrew`

Homebrew Homebrew Downloads

Prerequisite: homebrew

brew install rasusa

Build locally

Prerequisite: rust toolchain

git clone https://github.com/mbhall88/rasusa.git
cd rasusa
cargo build --release
target/release/rasusa --help
# if you want to check everything is working ok
cargo test --all

Usage

Basic usage - reads

Subsample FASTQ reads or unaligned SAM/BAM/CRAM.

rasusa reads --coverage 30 --genome-size 4.6mb in.fq

rasusa reads also supports unaligned SAM/BAM/CRAM input:

rasusa reads --coverage 30 --genome-size 4.6mb in.bam

The above commands will output the subsampled file to stdout.

Or, if you have paired Illumina

rasusa reads --coverage 30 --genome-size 4g -o out.r1.fq -o out.r2.fq r1.fq r2.fq

For more details on the above options, and additional options, see below.

Basic usage - alignments

Subsample alignments

rasusa aln --coverage 30 in.bam | samtools sort -o out.bam

this will subsample each position in the alignment to 30x coverage.

Required parameters

rasusa has three required options for the reads command, and two required options for the aln command.

Input

This positional argument specifies the file(s) containing the reads or unaligned alignments you would like to subsample.

For the reads command, the file(s) can be in FASTA, FASTQ, or unaligned SAM, BAM, or CRAM format. FASTA and FASTQ files can be compressed (with a tool such as gzip). If two files are passed to reads, rasusa will assume they are paired-end reads.

For the aln command, the file must be a valid coordinate-sorted SAM/BAM/CRAM file.

Bash wizard tip 🧙: Let globs do the work for you r*.fq

Coverage

`-c`, `--coverage`

Not required if --bases is present for reads

This option is used to determine the minimum coverage to subsample the reads to. For the reads command, it can be specified as an integer (100), a decimal/float (100.7), or either of the previous suffixed with an ‘x’ (100x). For the aln command, it is an integer only.

Due to the method for determining how many bases are required to achieve the desired coverage in the reads command, the actual coverage, in the end, could be slightly higher than requested. For example, if the last included read is very long. The log messages should inform you of the actual coverage in the end.

For the aln command, the coverage is the maximum number of reads that should be present at each position in the alignment. If a position has fewer than the requested number of reads, all reads at that position will be included.

[!NOTE] For paired-end data: To ensure 100% pair retention (i.e., no orphan reads) during subsampling, the aln command uses a two-pass strategy. It first subsamples the first segment (Read 1) at half the target coverage (coverage / 2) and then recovers the corresponding mates (Read 2).

Because this approach is constrained by pairing rather than per-position depth alone, the resulting coverage is not guaranteed to be perfectly uniform across all positions. Instead, some fluctuations around the requested depth should be expected.

If the requested coverage is a strict minimum requirement, we recommend setting --coverage slightly higher to account for these fluctuations. See the discussion in this PR for additional details.

Genome size

`-g`, `--genome-size`

Not valid for aln

Not required if --bases is present for reads

The genome size of the input is also required. It is used to determine how many bases are necessary to achieve the desired coverage. This can, of course, be as precise or rough as you like.
Genome size can be passed in many ways. As a plain old integer (1600), or with a metric suffix (1.6kb). All metric suffixes can have an optional ‘b’ suffix and be lower, upper, or mixed case. So ‘Kb’, ‘kb’ and ‘k’ would all be inferred as ‘kilo’. Valid metric suffixes include:

Base (b) - multiplies by 1
Kilo (k) - multiplies by 1,000
Mega (m) - multiplies by 1,000,000
Giga (g) - multiplies by 1,000,000,000
Tera (t) - multiplies by 1,000,000,000,000

Alternatively, a FASTA/Q index file can be given and the genome size will be set to the sum of all reference sequences in it.

[!TIP] If you want to use rasusa in a scenario where you don’t know what the genome size is, such as in an automated pipeline that can take in any kind of organism, you could estimate the genome size with something like lrge (#shamelessplug).
$ gsize=$(lrge reads.fq)
$ rasusa reads -g $gsize -c 10 reads.fq
lrge is designed for long reads. If you want to estimate the genome size from short reads, you could use something like Mash or GenomeScope2. See the lrge docs for examples of how Mash/GenomeScope2 can be used for this task.

Optional parameters

Output

`-o`, `--output`

reads

[!IMPORTANT] This parameter is required if passing paired Illumina data to reads.

By default, rasusa will output the subsampled file to stdout (if one file is given). If you would prefer to specify an output file path, then use this option.

Output for Illumina paired files must be specified using --output twice - -o out.r1.fq -o out.r2.fq

The ordering of the output files is assumed to be the same as the input.

rasusa reads will attempt to automatically infer the output format and whether compression of the output file(s) is required. It does this by detecting any of the supported extensions:

.fa/.fasta: FASTA format
.fq/.fastq: FASTQ format
.bam: BAM format
.cram: CRAM format
.sam: SAM format
.gz: will compress the output with gzip
.bz or .bz2: will compress the output with bzip2
.lzma: will compress the output with the xz LZMA algorithm
.zst: will compress the output with zstd

[!NOTE] If no extension is matched, the output will be in the same format as the input. You can convert between formats by specifying the desired extension (e.g., input BAM and output FASTQ). However, you cannot convert from FASTA/FASTQ to unaligned SAM/BAM/CRAM–though you can convert from unaligned SAM/BAM/CRAM to FASTA/FASTQ.

aln

For the aln command, the output file format will be the same as the input if writing to stdout, otherwise it will be inferred from the file extension.

[!NOTE] The output alignment will most likely not be sorted. You can use samtools sort to sort the output. e.g.,
rasusa aln -c 5 in.bam | samtools sort -o out.bam

Output compression/format

reads

rasusa reads provides two options for explicitly setting the output format and compression.

`-Z`, `--compress-type`

Use this option to manually set the compression algorithm to use for the output file(s). It will override any format automatically detected from the output path. Note: this is only used for FASTA/FASTQ output.

Valid options are:

g: gzip
b: bzip2
l or x: xz LZMA algorithm
z: zstd
u: no compression

`-O`, `--output-format`

Use this option to manually set the output format to use for the output file(s). It will override any format automatically detected from the output path.

Valid options are:

f or fasta: FASTA
q or fastq: FASTQ
s or sam: SAM
b or bam: BAM
c or cram: CRAM

aln

`-O`, `--output-format`

Use this option to manually set the output file format for rasusa aln. By default, the same format as the input will be used, or the format will be guessed from the --output path extension if given.

Valid options are:

s or sam: SAM
b or bam: BAM
c or cram: CRAM

All values to this option are case insensitive.

Compresion level

`-l`, `--compress-level`

reads only

Compression level to use if compressing the output. By default this is set to the default for the compression type being output.

Target number of bases

`-b`, `--bases`

reads only

Explicitly set the number of bases required in the subsample. This option takes the number in the same format as genome size.

[!NOTE] If this option is given, genome size and coverage are not required.

Number of reads

`-n`, `--num`

reads only

Explicitly set the number of reads in the subsample. This option takes the number in the same format as genome size.

When providing paired reads as input, this option will sample this many total read pairs. For example, when passing -n 20 r1.fq r2.fq, the two output files will have 20 reads each, and the read ids will be the same in both.

Note: if this option is given, genome size and coverage are not required.

Fraction of reads

`-f`, `--frac`

reads only

Explicitly set the fraction of total reads in the subsample. The value given to this option can be a float or a percentage - i.e., -f 0.5 and -f 50 will both take half of the reads.

[!NOTE] If this option is given, genome size and coverage are not required.

Random seed

`-s`, `--seed`

This option allows you to specify the random seed used by the random subsampler. By explicitly setting this parameter, you make the subsample for the input reproducible. You only need to pass this parameter if you are likely to want to subsample the same input file again in the future and want the same subset of reads. However, if you forget to use this option, the seed generated by the system will be printed to the log output, allowing you to use it in the future.

Strictly adhere to requested coverage

`-e`, `--strict`

reads only

If the requested coverage, total bases, number of reads, or fraction of reads cannot be met, an error will be thrown. By default, a warning is displayed, and the maximum possible coverage, total bases, number of reads, or fraction of reads is used.

Subsampling strategy

`--strategy`

aln only

By default, rasusa aln uses the stream strategy, which implements a fast sweep-line algorithm with random priority. It processes a coordinate-sorted alignment file in a single pass (two passes if using paired-end data) while maintaining an active set of reads in a heap, ensuring that no position exceeds the target depth N.

This strategy provides the option --swap-distance (default: 5 bp), which limits the allowed distance when swapping between reads encountered in the current scan and reads already in the heap.

Alternatively, users can select the fetch strategy. This approach repeatedly fetches overlapping reads, shuffles them, and samples to the target depth N.

The fetch strategy provides additional controls:

--batch-size (default: 10 kb): size of genomic window cached in memory
--step-size (default: 100 bp): step size used when scanning along the chromosome to find overlapping reads

In most cases, the default stream strategy is recommended due to its speed and low memory usage. See this PR for a discussion of performance and behavior differences between these two strategies.

Verbosity

`-v`

reads only

Adding this optional flag will make the logging more verbose. By default, logging will produce messages considered “info” or above (see here for more details). If verbosity is switched on, you will additionally get “debug” level logging messages.

Full usage

$ rasusa --help
Randomly subsample reads or alignments

Usage: rasusa [OPTIONS] <COMMAND>

Commands:
  reads  Randomly subsample reads (FASTA/Q, unaligned SAM/BAM/CRAM)
  aln    Randomly subsample alignments to a specified depth of coverage
  cite   Get a bibtex formatted citation for this package
  help   Print this message or the help of the given subcommand(s)

Options:
  -v             Switch on verbosity
  -h, --help     Print help
  -V, --version  Print version

`reads` command

$ rasusa reads --help
Randomly subsample reads (FASTA/Q, unaligned SAM/BAM/CRAM)

Usage: rasusa reads [OPTIONS] <FILE(S)>...

Arguments:
  <FILE(S)>...
          The FASTA/FASTQ or unaligned SAM/BAM/CRAM file(s) to subsample.
          
          For paired Illumina, the order matters. i.e., R1 then R2. Single-file paired-end is also supported for unaligned SAM/BAM/CRAM.

Options:
  -o, --output <OUTPUT>
          Output filepath(s); stdout if not present.
          
          For paired Illumina pass this flag twice `-o o1.fq -o o2.fq`
          
          NOTE: The order of the pairs is assumed to be the same as the input - e.g., R1 then R2.
          
          This option is required for paired input.

  -g, --genome-size <size|faidx>
          Genome size to calculate coverage with respect to. e.g., 4.3kb, 7Tb, 9000, 4.1MB
          
          Alternatively, a FASTA/Q index file can be provided and the genome size will be set to the sum of all reference sequences.
          
          If --bases is not provided, this option and --coverage are required

  -c, --coverage <FLOAT>
          The desired depth of coverage to subsample the reads to
          
          If --bases is not provided, this option and --genome-size are required

  -b, --bases <bases>
          Explicitly set the number of bases required e.g., 4.3kb, 7Tb, 9000, 4.1MB
          
          If this option is given, --coverage and --genome-size are ignored

  -n, --num <INT>
          Subsample to a specific number of reads
          
          If paired-end reads are passed, this is the number of (matched) reads from EACH file. This option accepts the same format as genome size - e.g., 1k will take 1000 reads

  -f, --frac <FLOAT>
          Subsample to a fraction of the reads - e.g., 0.5 samples half the reads
          
          Values >1 and <=100 will be automatically converted - e.g., 25 => 0.25

  -e, --strict
          Exit with an error if the requested coverage/bases/reads is not possible

  -s, --seed <INT>
          Random seed to use

  -v
          Switch on verbosity

  -Z, --compress-type <u|b|g|l|x|z>
          u: uncompressed; b: Bzip2; g: Gzip; l: Lzma; x: Xz (Lzma); z: Zstd
          
          Rasusa will attempt to infer the output compression format automatically from the filename extension. This option is used to override that. If writing to stdout, the default is uncompressed. Note: this is only used for FASTA/FASTQ output.

  -O, --output-format <OUTPUT_FORMAT>
          Explicitly set the output format.
          
          If not provided, Rasusa will attempt to infer the format from the filename extension.
          
          [possible values: fasta, fastq, sam, bam, cram]

  -l, --compress-level <1-21>
          Compression level to use if compressing output. Uses the default level for the format if not specified

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

`aln` command

$ rasusa aln --help
Randomly subsample alignments to a specified depth of coverage

Usage: rasusa aln [OPTIONS] --coverage <INT> <FILE>

Arguments:
  <FILE>
          Path to the input alignment file (SAM/BAM/CRAM) to subsample
          
          Note: An index (.bai) is required when using '--strategy fetch'.

Options:
  -o, --output <FILE>
          Path to the output subsampled alignment file. Defaults to stdout (same format as input)
          
          The output is not guaranteed to be sorted. We recommend piping the output to `samtools sort`

  -O, --output-format <FMT>
          Output format. Rasusa will attempt to infer the format from the output file extension if not provided

  -c, --coverage <INT>
          The desired depth of coverage to subsample the alignment to

  -s, --seed <INT>
          Random seed to use

      --strategy <STRATEGY>
          Subsampling strategy

          Possible values:
          - stream: A linear scan approach using sweep line algorithm with random priority. Requires sorted alignment input
          - fetch:  A fetching approach to randomly subsample reads given read overlap position. Requires indexed input (.bai)
          
          [default: stream]

      --swap-distance <INT>
          [Stream] A maximum distance (bp) allowed between start position of new read and the worst read in the heap to consider them to be 'swappable'.
          
          Larger values allow swapping reads over greater distances, but may cause local undersampling. A value of `0` means only allows swap between reads that have the same start position.
          
          [default: 5]

      --step-size <INT>
          [Fetch] When a region has less than the desired coverage, the step size to move along the chromosome to find more reads.
          
          The lowest of the step and the minimum end coordinate of the reads in the region will be used. This parameter can have a significant impact on the runtime of the subsampling process.
          
          [default: 100]

      --batch-size <INT>
          [Fetch] The size of the genomic window (bp) to cache into memory at once.
          
          Larger values reduce disk seeking, but at the cost of high memory usage. The minimum value is 1,000 bp to avoid small region queries.
          
          [default: 10000]

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Benchmark

“Time flies like an arrow; fruit flies like a banana.”
― Anthony G. Oettinger

The real question is: will rasusa just needlessly eat away at your precious time on earth?

To do this benchmark, I am going to use hyperfine.

The data I used comes from

Bainomugisa, Arnold, et al. “A complete high-quality MinION nanopore assembly of an extensively drug-resistant Mycobacterium tuberculosis Beijing lineage strain identifies novel variation in repetitive PE/PPE gene regions.” Microbial genomics 4.7 (2018).

[!NOTE] These benchmarks are for reads only as there is no other tool that replicates the functionality of aln.

Single long read input

Download and rename the FASTQ

URL="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR649/008/SRR6490088/SRR6490088_1.fastq.gz"
wget "$URL" -O - | gzip -d -c > tb.fq

The file size is 2.9G, and it has 379,547 reads.
We benchmark against filtlong using the same strategy outlined in Motivation.

TB_GENOME_SIZE=4411532
COVG=50
TARGET_BASES=$(( TB_GENOME_SIZE * COVG ))
FILTLONG_CMD="filtlong --target_bases $TARGET_BASES tb.fq"
RASUSA_CMD="rasusa reads tb.fq -c $COVG -g $TB_GENOME_SIZE -s 1"
hyperfine --warmup 3 --runs 10 --export-markdown results-single.md \
     "$FILTLONG_CMD" "$RASUSA_CMD"

Results

Command	Mean [s]	Min [s]	Max [s]	Relative
`filtlong --target_bases 220576600 tb.fq`	21.685 ± 0.055	21.622	21.787	21.77 ± 0.29
`rasusa reads tb.fq -c 50 -g 4411532 -s 1`	0.996 ± 0.013	0.983	1.023	1.00

Summary: rasusa ran 21.77 ± 0.29 times faster than filtlong.

Paired-end input

Download and then deinterleave the FASTQ with pyfastaq

URL="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR648/008/SRR6488968/SRR6488968.fastq.gz"
wget "$URL" -O - | gzip -d -c - | fastaq deinterleave - r1.fq r2.fq

Each file’s size is 179M and has 283,590 reads.
For this benchmark, we will use seqtk. We will also test seqtk‘s 2-pass mode as this is analogous to rasusa reads.

NUM_READS=140000
SEQTK_CMD_1="seqtk sample -s 1 r1.fq $NUM_READS > /tmp/r1.fq; seqtk sample -s 1 r2.fq $NUM_READS > /tmp/r2.fq;"
SEQTK_CMD_2="seqtk sample -2 -s 1 r1.fq $NUM_READS > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq $NUM_READS > /tmp/r2.fq;"
RASUSA_CMD="rasusa reads r1.fq r2.fq -n $NUM_READS -s 1 -o /tmp/r1.fq -o /tmp/r2.fq"
hyperfine --warmup 10 --runs 100 --export-markdown results-paired.md \
     "$SEQTK_CMD_1" "$SEQTK_CMD_2" "$RASUSA_CMD"

Results

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`seqtk sample -s 1 r1.fq 140000 > /tmp/r1.fq; seqtk sample -s 1 r2.fq 140000 > /tmp/r2.fq;`	907.7 ± 23.6	875.4	997.8	1.84 ± 0.62
`seqtk sample -2 -s 1 r1.fq 140000 > /tmp/r1.fq; seqtk sample -2 -s 1 r2.fq 140000 > /tmp/r2.fq;`	870.8 ± 54.9	818.2	1219.8	1.77 ± 0.61
`rasusa reads r1.fq r2.fq -n 140000 -s 1 -o /tmp/r1.fq -o /tmp/r2.fq`	492.2 ± 165.4	327.4	887.4	1.00

Summary: rasusa reads ran 1.84 times faster than seqtk (1-pass) and 1.77 times faster than seqtk (2-pass)

So, rasusa reads is faster than seqtk but doesn’t require a fixed number of reads - allowing you to avoid doing maths to determine how many reads you need to downsample to a specific coverage. 🤓

Contributing

If you would like to help improve rasusa you are very welcome!

For changes to be accepted, they must pass the CI and coverage checks. These include:

Code is formatted with rustfmt. This can be done by running cargo fmt in the project directory.
There are no compiler errors/warnings. You can check this by running cargo clippy --all-features --all-targets -- -D warnings
Code coverage has not reduced. If you want to check coverage before pushing changes, I use tarpaulin.

Citing

If you use rasusa in your research, it would be appreciated if you could cite it.

Hall, M. B., (2022). Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software, 7(69), 3941, https://doi.org/10.21105/joss.03941

Bibtex

You can get the following citation by running rasusa cite

@article{Hall2022,
  doi = {10.21105/joss.03941},
  url = {https://doi.org/10.21105/joss.03941},
  year = {2022},
  publisher = {The Open Journal},
  volume = {7},
  number = {69},
  pages = {3941},
  author = {Michael B. Hall},
  title = {Rasusa: Randomly subsample sequencing reads to a specified coverage},
  journal = {Journal of Open Source Software}
}

Table of Contents

Install

Precompiled binary

cargo

conda

Container

apptainer

docker

homebrew

Build locally

Usage

Basic usage - reads

Basic usage - alignments

Required parameters

Input

Coverage

-c, --coverage

Genome size

-g, --genome-size

Optional parameters

Output

-o, --output

Output compression/format

-Z, --compress-type

-O, --output-format

-O, --output-format

Compresion level

-l, --compress-level

Target number of bases

-b, --bases

Number of reads

-n, --num

Fraction of reads

-f, --frac

Random seed

-s, --seed

Strictly adhere to requested coverage

-e, --strict

Subsampling strategy

--strategy

Verbosity

-v

Full usage

reads command

aln command

Benchmark

Single long read input

Results

Paired-end input

Results

Contributing

Citing

Bibtex

`cargo`

`conda`

`apptainer`

`docker`

`homebrew`

`-c`, `--coverage`

`-g`, `--genome-size`

`-o`, `--output`

`-Z`, `--compress-type`

`-O`, `--output-format`

`-O`, `--output-format`

`-l`, `--compress-level`

`-b`, `--bases`

`-n`, `--num`

`-f`, `--frac`

`-s`, `--seed`

`-e`, `--strict`

`--strategy`

`-v`

`reads` command

`aln` command