readucks
Nanopore read de-multiplexer (read demux -> readux -> readucks, innit).
This package is inspired by the demultiplexing options in porechop (by Ryan Wick) but without the adapter trimming options - it just demuxes. It uses the parasail library, through its Python bindings, for pairwise alignment, which provides a considerable speed-up over the seqan library used by porechop because parasail makes low-level use of vector processor instructions.
Additional speed-ups come from specifying exactly which barcodes are present, so that searching is limited to those (this is usually something you know, after all).
There is also more flexibility in how double barcodes are called (i.e., when a read’s barcode is only called if it has the matching barcode sequence at both ends, to reduce the chance of mis-called reads). In readucks you can specify a lower identity threshold for the secondary barcode than for the primary. The primary barcode is defined as the one with the highest match (either at the start or end of the read). The secondary barcode is then the other one. The secondary barcode must match the primary (i.e., be from the same pair) but may have more read errors.
We are currently investigating the optimal settings for this, to trade off sensitivity against specificity for double barcoding.
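The double-barcode rule described above can be sketched in a few lines of Python. This is only an illustration of the logic as worded in this README, not the code readucks itself uses: the function names, the barcode labels (`BC01`, `BC02`) and the choice to take the best hit at each end before comparing them are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple


def best_hit(identities: Dict[str, float]) -> Tuple[Optional[str], float]:
    """Return the (barcode, identity) pair with the highest percent identity."""
    if not identities:
        return None, 0.0
    name = max(identities, key=identities.get)
    return name, identities[name]


def call_double_barcode(
    start_identities: Dict[str, float],
    end_identities: Dict[str, float],
    threshold: float = 90.0,
    secondary_threshold: float = 70.0,
) -> Optional[str]:
    """Call a barcode only if both read ends agree (hypothetical helper).

    The inputs map barcode labels to percent identity at the start and at
    the end of one read. Returns the called barcode, or None if unassigned.
    """
    start = best_hit(start_identities)
    end = best_hit(end_identities)

    # The primary barcode is whichever end has the higher identity;
    # the secondary barcode is the best hit at the other end.
    primary, secondary = (start, end) if start[1] >= end[1] else (end, start)

    # The primary must pass the main threshold.
    if primary[0] is None or primary[1] < threshold:
        return None
    # The secondary must be the same barcode, but may be a weaker match.
    if secondary[0] != primary[0] or secondary[1] < secondary_threshold:
        return None
    return primary[0]
```

With the default thresholds, a read matching `BC01` at 95% at one end and 75% at the other would be called as `BC01`, whereas the same read with only 65% at the second end, or with a different barcode there, would be left unassigned.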
One source file, misc.py, is from porechop and is provided with its original licencing information.
Installation
Install dependencies
Install Readucks
Run Readucks
Running
Command line options:
Main options:
Provide a path to an input file or directory of input files to be processed. These should be either FASTQ or FASTA files with appropriate file extensions (.fastq or .fasta).
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
Provide the path of a directory into which output files will be placed. If this is not specified, the output files will go into the current working directory (except for annotation CSV files, which are placed alongside their matching read files).
-b, --bin_barcodes
This option will bin reads into files according to their assigned barcodes. One file (either FASTQ or FASTA depending on the input file types) will be produced for each barcode that is called (and one for unassigned reads).
-a, --annotate_files
This option writes a CSV file for each input file (with a corresponding file name) containing the barcode assignment for each read.
-e, --extended_info
When provided with the -a option, this writes much more detailed information about the barcode matches to the annotation (CSV) file.
-p PREFIX, --prefix PREFIX
Give an optional prefix string that will be prepended to each output file.
-t THREADS, --threads THREADS
The number of parallel threads to use (1 to turn off multithreading) (default: automatic)
-v VERBOSITY, --verbosity VERBOSITY
Specify the level of output information: 0 = none, 1 = some, 2 = lots (default: 1)
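As a rough picture of what binning with a prefix amounts to, the sketch below groups read names by the output file they would land in. It is not taken from the readucks source: the `bin_reads` helper, the `unassigned` label and the `<prefix><barcode><extension>` file-name pattern are assumptions chosen for illustration, and the real output names may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def bin_reads(
    assignments: Iterable[Tuple[str, Optional[str]]],
    prefix: str = "",
    extension: str = ".fastq",
) -> Dict[str, List[str]]:
    """Group read names by a hypothetical output file name.

    Each assignment is (read_name, barcode); a barcode of None means the
    read was not assigned and goes into a single 'unassigned' bin.
    """
    bins: Dict[str, List[str]] = defaultdict(list)
    for read_name, barcode in assignments:
        label = barcode if barcode is not None else "unassigned"
        # Assumed naming pattern: prefix + barcode label + extension.
        bins[f"{prefix}{label}{extension}"].append(read_name)
    return dict(bins)
```

One bin is created per barcode that is actually called, plus one for unassigned reads, which mirrors the behaviour described for the binning option.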
Demuxing options:
--single
Only attempts to match a single barcode at one end (default: double)
--native_barcodes
Only attempts to match the 24 native nanopore barcodes (default)
Only attempts to match the 96 PCR barcodes
Only attempts to match the 12 rapid barcodes
Specify a list of barcodes to look for (numbers, indexed from 1, refer to native, PCR or rapid barcodes as specified)
Barcode search settings:
--threshold THRESHOLD
A read must have at least this percent identity to a barcode (default: 90.0)
--secondary_threshold SECONDARY_THRESHOLD
When double barcoding, the second barcode (the one with the lower identity of the two) must have at least this percent identity (and match the first one) (default: 70.0)
--scoring_scheme SCORING_SCHEME
Scoring scheme for the pairwise alignment. A comma-delimited string of alignment scores: match, mismatch, gap open, gap extend (default: 3,-6,-5,-2)
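To make the scoring-scheme string and the identity thresholds concrete, here is a small sketch that splits the four comma-delimited scores into named fields and computes a percent identity from two already-aligned strings. The names `ScoringScheme`, `parse_scoring_scheme` and `percent_identity` are hypothetical, and the identity definition used here (matching columns divided by alignment length) is an assumption; this README does not say exactly how readucks computes it.

```python
from typing import NamedTuple


class ScoringScheme(NamedTuple):
    match: int
    mismatch: int
    gap_open: int
    gap_extend: int


def parse_scoring_scheme(text: str = "3,-6,-5,-2") -> ScoringScheme:
    """Parse 'match,mismatch,gap open,gap extend' into named integer fields."""
    parts = [part.strip() for part in text.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-delimited scores, got {len(parts)}")
    return ScoringScheme(*(int(part) for part in parts))


def percent_identity(aligned_barcode: str, aligned_read: str) -> float:
    """Percent of alignment columns that match, for two gapped strings.

    Both strings must have the same length, with '-' marking a gap.
    """
    if len(aligned_barcode) != len(aligned_read):
        raise ValueError("aligned strings must have the same length")
    if not aligned_barcode:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned_barcode, aligned_read) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_barcode)
```

Under this definition, a 10-column alignment with one mismatch scores exactly 90.0, which would just pass the default primary threshold.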
Example commands
This will demux the reads in my_reads.fastq, producing bin files in a directory called demuxed (which must already exist), giving some feedback information to the screen.