Some collaborators wanted to know how long they needed to sequence on the Nanopore
device until they got “sufficient” data (sufficient is obviously application-dependent).
They were just going to do multiple runs for different amounts of time. So instead, I
created ontime to easily grab reads from the first hour, first two hours, first three
hours etc. and run those subsets through the analysis pipeline for the intended
application. This way they only needed to do one (longer) run.
Install
tl;dr: precompiled binary
curl -sSL ontime.mbh.sh | sh
# or with wget
wget -nv -O - ontime.mbh.sh | sh
You can also pass options to the script like so
$ curl -sSL ontime.mbh.sh | sh -s -- --help
install.sh [option]
Fetch and install the latest version of ontime, if ontime is already
installed it will be updated to the latest version.
Options
    -V, --verbose
        Enable verbose output for the installer
    -f, -y, --force, --yes
        Skip the confirmation prompt during installation
    -p, --platform
        Override the platform identified by the installer [default: apple-darwin]
    -b, --bin-dir
        Override the bin installation directory [default: /usr/local/bin]
    -a, --arch
        Override the architecture identified by the installer [default: x86_64]
    -B, --base-url
        Override the base URL used for downloading releases [default: https://github.com/mbhall88/ssubmit/releases]
    -h, --help
        Display this help message
I want to save the output to a Gzip-compressed file
$ ontime --to 2h -o out.fq.gz in.fq
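The workflow described in the motivation (first hour, first two hours, first three hours) can be scripted with the same options. The following is a minimal sketch rather than part of ontime itself: subset_cmd is a hypothetical helper, and in.fq and the first_<N>h.fq.gz output names are placeholders. It only prints the commands so you can check them first.

```shell
# Hypothetical helper: print the ontime command for the first N hours of a run.
# $1: number of hours from the start of the run, $2: input file (placeholder name)
subset_cmd() {
  printf 'ontime --to %sh -o first_%sh.fq.gz %s\n' "$1" "$1" "$2"
}

# Dry run: print one command per cumulative window (first 1h, 2h and 3h).
for hours in 1 2 3; do
  subset_cmd "$hours" in.fq
done
```

Once the printed commands look right, piping the loop's output to sh would run them.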
Usage
Extract subsets of ONT (Nanopore) reads based on time
Usage: ontime [OPTIONS] <FILE>
Arguments:
  <FILE>  Input fastq/fasta/BAM/SAM file
Options:
  -o, --output <FILE>           Output file name [default: stdout]
  -O, --output-type <u|b|g|l>   (fastq/a output only) u: uncompressed; b: Bzip2; g: Gzip; l: Lzma
  -L, --compress-level <1-21>   Compression level to use if compressing fastq output [default: 6]
  -f, --from <DATE/DURATION>    Earliest start time; otherwise the earliest time is used
  -t, --to <DATE/DURATION>      Latest start time; otherwise the latest time is used
  -s, --show                    Show the earliest and latest start times in the input and exit
  -h, --help                    Print help (see more with '--help')
  -V, --version                 Print version
Specifying a time range
The --from and --to options are used to restrict the timeframe you want reads from.
These options accept two different formats: duration and timestamp.
Duration: The most human-friendly way to provide a range is with duration. For
example, 1h means 1 hour. Passing --from 1h says “I want reads that were generated 1
hour or more after sequencing started” - i.e. the earliest start time in the file plus 1
hour. Likewise, passing --to 2h says “I only want reads that were generated within the
first 2 hours of sequencing” - i.e. up to the earliest start time in the file plus 2
hours. Using --from and --to in combination gives you a range.
We support a range of time/duration units and they can be combined. For example,
3h45m to indicate 3 hours and 45 minutes. See the duration-str docs for the full list
of supported duration units.
Negative durations are also allowed. A negative duration subtracts that duration from
the latest start time in the file. So --to -1h will exclude reads that were
sequenced in the last hour of the run. Negative ranges are also valid -
i.e. --from -2h --to -1h will give you the reads sequenced in the penultimate hour of
the run.
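To make the arithmetic concrete, here is a small sketch of how positive and negative durations map onto absolute cutoffs. It illustrates the rule described above and is not ontime's own code. The earliest and latest values are made up, and the date calls assume GNU date (BSD/macOS date uses different flags).

```shell
# Made-up earliest and latest read start times for a run (UTC).
earliest="2022-12-12 18:00:00"
latest="2022-12-13 06:00:00"

# Convert both to seconds since the epoch (GNU date syntax).
earliest_s=$(date -u -d "$earliest" +%s)
latest_s=$(date -u -d "$latest" +%s)

fmt="+%Y-%m-%dT%H:%M:%SZ"

# --from 1h: a positive duration counts forward from the earliest start time.
from_1h=$(date -u -d "@$((earliest_s + 3600))" "$fmt")

# --to -1h: a negative duration counts back from the latest start time.
to_minus_1h=$(date -u -d "@$((latest_s - 3600))" "$fmt")

echo "--from 1h starts at $from_1h"
echo "--to -1h ends at    $to_minus_1h"
```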
Timestamp: If you want to provide date and time for your ranges, that is acceptable
in --from/--to also. See the formatting guide for more information.
To make using timestamps a little easier, you can first run ontime --show <in.fq> to
get the earliest and latest timestamps in the file.
Time format
The times that ontime extracts are the start_time=<time> or st:Z:<time> section contained in the
description of each fastq read.
The format of this time has changed a few times, so if you come across a file
which ontime cannot parse, please raise an issue so I can make it work.
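If you want to eyeball the field yourself, the following sketch pulls start_time out of a fastq header with sed. The header line is invented for illustration, and this is only a quick inspection aid, not how ontime parses the file.

```shell
# An invented fastq header line carrying a start_time=<time> field.
header='@read1 runid=abc123 ch=42 start_time=2022-12-12T18:39:09Z'

# Print whatever follows start_time= up to the next space.
start_time=$(printf '%s\n' "$header" | sed -n 's/.*start_time=\([^ ]*\).*/\1/p')
echo "$start_time"
```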
All times printed by ontime and accepted by the --from/--to options
are UTC time. More recent versions of Guppy also have UTC offsets in
their start_time; for simplicity’s sake, these offsets are ignored by ontime. So, if
you want to provide a timestamp to --from/--to based on a timeframe in your local
time, please first convert it to UTC time.
In general, ontime accepts any timestamp that is RFC 3339-compliant.
The basic (recommended) format is <YEAR>-<MONTH>-<DAY>T<HOUR>:<MINUTE>:<SECONDS>Z -
e.g. 2022-12-12T18:39:09Z. Feel free to get precise with
subseconds though if you like…
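Since --from/--to expect UTC, a local time needs converting first. One way to do that is sketched below, assuming GNU date (BSD/macOS date takes different flags). The timestamp and its +10:00 offset are example values only.

```shell
# A local timestamp written with its UTC offset (example value).
local_ts="2022-12-13T04:39:09+10:00"

# GNU date: parse the offset-aware timestamp and print it in UTC
# using the recommended format.
utc_ts=$(date -u -d "$local_ts" +%Y-%m-%dT%H:%M:%SZ)
echo "$utc_ts"
```

With GNU date this prints 2022-12-12T18:39:09Z, which can then be passed on, e.g. ontime --from "$utc_ts" in.fq (in.fq being a placeholder).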
Full usage
Extract subsets of ONT (Nanopore) reads based on time
Usage: ontime [OPTIONS] <FILE>
Arguments:
  <FILE>
          Input fastq/fasta/BAM/SAM file
Options:
  -o, --output <FILE>
          Output file name [default: stdout]
          Note: you cannot output a fastq if a BAM/SAM input is given and vice versa. Use samtools for post-processing. However, you can output SAM if the input is BAM and vice versa.
  -O, --output-type <u|b|g|l>
          (fastq/a output only) u: uncompressed; b: Bzip2; g: Gzip; l: Lzma
          ontime will attempt to infer the output compression format automatically from the output extension. If writing to stdout, the default is uncompressed (u)
  -L, --compress-level <1-21>
          Compression level to use if compressing fastq output
          [default: 6]
  -f, --from <DATE/DURATION>
          Earliest start time; otherwise the earliest time is used
          This can be a timestamp - e.g. 2022-11-20T18:00:00 - or a duration from the start - e.g. 2h30m (2 hours and 30 minutes from the start). See the docs for more examples
  -t, --to <DATE/DURATION>
          Latest start time; otherwise the latest time is used
          See --from (and docs) for examples
  -s, --show
          Show the earliest and latest start times in the input and exit
  -h, --help
          Print help (see a summary with '-h')
  -V, --version
          Print version
ONTime
Extract subsets of ONT (Nanopore) reads based on time
Install
Conda
Cargo
Container
Docker images are hosted at quay.io.
singularity
Prerequisite: singularity
The above will use the latest version. If you want to specify a version then use a tag (or commit) like so.
docker
Prerequisite: docker
You can find all the available tags on the quay.io repository.
Build from source
Examples
I want the reads that were sequenced in the first hour
The same, but using a BAM file as input
I want the reads that were sequenced after the first hour
I want all reads except those sequenced in the last hour
I want reads sequenced between the third and fourth hours
Check what the earliest and latest start times in the fastq are
I like to be specific, give me the reads that were sequenced while I was eating dinner (see note on time formats)
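Going by the options documented under Usage, the examples above map onto invocations along the following lines. Treat this as a sketch: in.fq and in.bam are placeholder file names, the two dinner-time timestamps are invented, and the block only prints the commands rather than running them.

```shell
# Placeholder input names (assumptions); substitute your own files.
fq="in.fq"
bam="in.bam"

# One candidate command per example above, in the same order.
# The timestamps on the last line are invented for illustration.
examples="ontime --to 1h $fq
ontime --to 1h $bam
ontime --from 1h $fq
ontime --to -1h $fq
ontime --from 3h --to 4h $fq
ontime --show $fq
ontime --from 2022-12-12T18:00:00Z --to 2022-12-12T19:00:00Z $fq"

# Dry run: print the commands instead of executing them.
printf '%s\n' "$examples"
```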
Cite
ontime is archived at Zenodo. When citing it, replace the version number with whichever you used.