tcdemux

Demultiplex files and prepare reads for the target capture analysis pipeline.
1. Check external barcodes.
2. If the libraries are pooled, demultiplex by internal barcode.
3. Verify pairing.
4. Trim adaptors.
5. Mask low-complexity regions, and optionally trim low-quality bases.
6. Collect stats on steps 1–5.
Demultiplexing by internal barcode is done with cutadapt.
The other steps are done with bbmap scripts.
Read pairing is maintained by processing the R1 and R2 files together.
Singletons generated by adaptor trimming are also output to a file with unpaired in the name.
Installation
tcdemux is in bioconda, so you can install it with conda. You can also use the biocontainer hosted on quay.io with Docker or Apptainer/Singularity.
Manual installation is not supported, but if you need to do it, here are the steps:
Install bbmap and make sure it’s in your path.
Install R with the packages data.table, bit64, ggplot2 and viridis.
Install tcdemux with python3 -m pip install git+https://github.com/tomharrop/tcdemux.git. pip will install the Python 3 dependencies biopython, cutadapt, pandas and snakemake.
Usage
External barcodes only
tcdemux requires a sample_data file in csv format with the fields name, i5_index, i7_index, r1_file, and r2_file.
Provide the csv to tcdemux using the --sample_data argument.
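As an illustration, a sample_data file with the required fields could be generated as in the sketch below. The column names come from the text above. The index sequences and read file names are hypothetical placeholders, not values from a real run, and `sample_data_csv` is a helper name made up for this example.

```python
import csv
import io

# Required columns, as listed in the text above.
FIELDS = ["name", "i5_index", "i7_index", "r1_file", "r2_file"]

# Hypothetical samples: index sequences and file names are placeholders.
samples = [
    {"name": "sample1", "i5_index": "AGCGCTAG", "i7_index": "CCGCGGTT",
     "r1_file": "sample1_R1.fastq.gz", "r2_file": "sample1_R2.fastq.gz"},
    {"name": "sample2", "i5_index": "GATATCGA", "i7_index": "TTATAACC",
     "r1_file": "sample2_R1.fastq.gz", "r2_file": "sample2_R2.fastq.gz"},
]


def sample_data_csv(rows, fields=FIELDS):
    """Return the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = sample_data_csv(samples)
print(csv_text)
```

Writing `csv_text` to a file such as sample_data.csv gives something that can be passed to --sample_data.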
tcdemux will process the sample1 and sample2 files separately, resulting in output files called sample1.r1.fastq.gz, sample1.r2.fastq.gz and sample1.unpaired.fastq.gz, and the equivalents for sample2.
tcdemux does not demultiplex the samples in this case.
Additional, internal barcodes
If the sample_data file also has a pool_name field, tcdemux will demultiplex the pools by internal index sequence.
This also requires the internal_index_sequence field in the csv.
In this case, sample1 and sample2 are multiplexed in pool1 with internal barcodes.
tcdemux will demultiplex the pool before trimming and masking, resulting in the same files as above.
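The column requirements for pooled libraries can be sketched as a small header check. This is not tcdemux's own validation code, only an illustration of the rule described above. The pooled rows are hypothetical: the index sequences and file names are placeholders, and the sketch assumes that samples in the same pool share external indexes and read files.

```python
import csv
import io

# Columns needed in every sample_data file, as listed in the text above.
BASE_FIELDS = {"name", "i5_index", "i7_index", "r1_file", "r2_file"}


def missing_fields(header):
    """Return the required columns that are absent from a sample_data header."""
    required = set(BASE_FIELDS)
    # Pooled libraries also need the internal index sequence.
    if "pool_name" in header:
        required.add("internal_index_sequence")
    return sorted(required - set(header))


# Hypothetical pooled sample_data: all values are placeholders.
pooled_csv = """\
name,pool_name,i5_index,i7_index,internal_index_sequence,r1_file,r2_file
sample1,pool1,AGCGCTAG,CCGCGGTT,ACGTAC,pool1_R1.fastq.gz,pool1_R2.fastq.gz
sample2,pool1,AGCGCTAG,CCGCGGTT,TGCATG,pool1_R1.fastq.gz,pool1_R2.fastq.gz
"""

rows = list(csv.DictReader(io.StringIO(pooled_csv)))
header = list(rows[0].keys())
missing = missing_fields(header)
pools = sorted({row["pool_name"] for row in rows})
print("missing fields:", missing)
print("pools:", pools)
```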
Sample names
Sample names will be checked for characters that are not uppercase or lowercase
letters, digits, or underscores. The names will also be checked for double
underscores. If a name contains a disallowed character or a double underscore,
the pipeline will print a message and exit.
These characters cause issues for other software used in target capture
analysis.
You can fix this by changing the names in the sample_data and running tcdemux
again.
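To check names before running the pipeline, the rules above can be approximated with a short script. This is a sketch of the stated rules, not the check tcdemux itself runs. The function name `name_problems` and the example names are made up for illustration.

```python
import re

# Only ASCII letters, digits and underscores are accepted.
VALID_NAME = re.compile(r"[A-Za-z0-9_]+")


def name_problems(name):
    """Return a list of reasons why a sample name would be rejected."""
    problems = []
    if not VALID_NAME.fullmatch(name):
        problems.append("contains characters other than letters, digits or underscores")
    if "__" in name:
        problems.append("contains a double underscore")
    return problems


# Hypothetical names to try the check on.
for name in ["sample1", "sample_1", "sample-1", "sample__1", "sample 1"]:
    print(name, name_problems(name) or "ok")
```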
tcdemux does not allow barcode errors
External barcodes are checked for errors before trimming and masking, and reads with barcode errors are discarded.
Barcode errors are sometimes allowed in the Illumina workflow.
You can check whether your fastq files contain barcode errors by tallying the barcodes recorded in the read headers of each file.
If you see more than one barcode, then barcode errors were allowed in the Illumina workflow.
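One way to tally the barcodes is sketched below in Python. It assumes the standard Illumina header layout, where the index sequence is the last colon-separated field of each header line. The function name `count_barcodes` and the three example reads are hypothetical.

```python
from collections import Counter


def count_barcodes(fastq_lines):
    """Count the index sequences recorded in fastq header lines."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:  # every fourth line, starting with the first, is a header
            # Assumption: the barcode is the last colon-separated field.
            counts[line.strip().rsplit(":", 1)[-1]] += 1
    return counts


# Hypothetical reads: the third one carries a barcode with one mismatch.
example = [
    "@M00001:1:000000000-AAAAA:1:1101:1000:2000 1:N:0:CCGCGGTT+AGCGCTAG",
    "ACGT", "+", "IIII",
    "@M00001:1:000000000-AAAAA:1:1101:1001:2001 1:N:0:CCGCGGTT+AGCGCTAG",
    "ACGT", "+", "IIII",
    "@M00001:1:000000000-AAAAA:1:1101:1002:2002 1:N:0:CCGCGGTA+AGCGCTAG",
    "ACGT", "+", "IIII",
]

counts = count_barcodes(example)
for barcode, n in counts.most_common():
    print(n, barcode)
```

For a real file, open it with `gzip.open(path, "rt")` and pass the file handle to `count_barcodes`.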
tcdemux uses exact barcode matches with no errors allowed when it demultiplexes by internal barcode.
Other options
You also need to provide paths to the raw read directory and an output directory, and at least one adaptor file for trimming.
If you want to keep the intermediate files, pass the --keep_intermediate_files argument.
The pipeline uses 5 threads and about 8 GB of RAM per sample.
Provide multiples of these using the --threads and --mem_gb arguments.
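The arithmetic can be sketched as follows. The per-sample figures and the argument names come from this README. The file and directory names are hypothetical, and the idea that a multiple of n lets n samples run at once is an assumption about how the pipeline schedules jobs.

```python
import shlex

# Per-sample requirements stated in the text above.
THREADS_PER_SAMPLE = 5
MEM_GB_PER_SAMPLE = 8


def resources(parallel_samples):
    """Return (threads, mem_gb) for the given number of samples at a time."""
    return (THREADS_PER_SAMPLE * parallel_samples,
            MEM_GB_PER_SAMPLE * parallel_samples)


def tcdemux_command(sample_data, read_directory, adaptors, outdir,
                    parallel_samples=1):
    """Build a tcdemux argument list using the options documented below."""
    threads, mem_gb = resources(parallel_samples)
    return [
        "tcdemux",
        "--sample_data", sample_data,
        "--read_directory", read_directory,
        "--adaptors", *adaptors,
        "--outdir", outdir,
        "--threads", str(threads),
        "--mem_gb", str(mem_gb),
    ]


# Hypothetical paths; two samples at a time needs 10 threads and 16 GB.
command = tcdemux_command(
    "sample_data.csv", "reads", ["adaptors.fa"], "output", parallel_samples=2
)
print(shlex.join(command))
```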
usage: tcdemux [-h] [-n] [--threads int] [--mem_gb int] [--restart_times RESTART_TIMES]
               --sample_data SAMPLE_DATA_FILE --read_directory READ_DIRECTORY --adaptors
               ADAPTOR_FILES [ADAPTOR_FILES ...] --outdir OUTDIR
               [--keep_intermediate_files | --no-keep_intermediate_files]

options:
  -h, --help            show this help message and exit
  -n                    Dry run
  --threads int         Number of threads.
  --mem_gb int          Amount of RAM in GB.
  --restart_times RESTART_TIMES
                        number of times to restart failing jobs (default 0)
  --sample_data SAMPLE_DATA_FILE
                        Sample csv (see README)
  --read_directory READ_DIRECTORY
                        Directory containing the read files
  --adaptors ADAPTOR_FILES [ADAPTOR_FILES ...]
                        FASTA file(s) of adaptors. Multiple adaptor files can be used.
  --outdir OUTDIR       Output directory
  --keep_intermediate_files, --no-keep_intermediate_files

Overview
[Figure: workflow with internal barcodes]
[Figure: workflow with only external barcodes]