Before opening a new issue here, please check the appropriate help channel on the KneadData bioBakery Support Forum and consider opening or commenting on a thread there.
KneadData is a tool designed to perform quality control on metagenomic and
metatranscriptomic sequencing data, especially data from microbiome experiments.
In these experiments, samples are typically taken from a host in hopes of
learning something about the microbial community on the host. However,
sequencing data from such experiments will often contain a high ratio of host to
bacterial reads. This tool aims to perform principled in silico separation of
bacterial reads from these “contaminant” reads, be they from the host, from
bacterial 16S sequences, or other user-defined sources. Additionally, KneadData
can be used for other filtering tasks. For example, if one is trying to clean
data derived from a human sequencing experiment, KneadData can be used to
separate the human and the non-human reads.
If you use the KneadData software, please cite our manuscript: TBA
SAMTools (only required if input file is in BAM format)
Memory (>= 4 Gb if using Bowtie2, >= 8 Gb if using BMTagger)
Operating system (Linux or Mac)
Optionally, BMTagger can be used instead of Bowtie2.
The executables for the required software packages should be installed in your $PATH. Alternatively, you can provide the location of the Bowtie2 install ($BOWTIE2_DIR) with the following KneadData option “--bowtie2 $BOWTIE2_DIR”.
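Before running KneadData, it can help to confirm that the required executables are actually visible on your $PATH. The following is a minimal sketch of such a check. The function name find_missing_tools is hypothetical, and the executable names (java, bowtie2, samtools) are assumptions that may differ on your system.

```shell
# Print each named executable that cannot be found on PATH, one per line.
# find_missing_tools is a hypothetical helper, not part of KneadData.
find_missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Executable names are assumptions; adjust them to match your installs.
missing=$(find_missing_tools java bowtie2 samtools)
if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing"
fi
```

If a tool is reported as missing, either add its directory to your $PATH or, for Bowtie2, pass its location with the “--bowtie2” option described above.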
Installation
Before installing KneadData, please install the Java Runtime Environment (JRE). First download the JRE for your platform. Then follow the instructions for your platform: Linux 64-bit or Mac OS. At the end of the installation, add the location of the java executable to your $PATH.
Download KneadData
You can download the latest KneadData release or the development version. The source contains example files. If installing with pip, it is optional to first download the KneadData source.
Option 1: Latest Release (Recommended)
Download kneaddata.tar.gz and unpack the latest release of KneadData.
Note: Creating a clone of the repository requires Git to be installed.
Install KneadData
Install with pip
$ pip install kneaddata
This command will automatically install Trimmomatic and Bowtie2. To bypass the install of dependencies, add the option “--install-option='--bypass-dependencies-install'”.
If you do not have write permissions to ‘/usr/lib/’, then add the option “--user” to the install command. This will install the python package into subdirectories of ‘$HOME/.local’. Please note when using the “--user” install option on some platforms, you might need to add ‘$HOME/.local/bin/’ to your $PATH as it might not be included by default. You will know if it needs to be added if you see the following message kneaddata: command not found when trying to run KneadData after installing with the “--user” option.
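If you hit the kneaddata: command not found message after a “--user” install, one way to handle it is to prepend ‘$HOME/.local/bin’ to $PATH only when it is not already listed. This is a sketch for the current shell session; path_contains is a hypothetical helper name.

```shell
# Return success if directory $1 is already an entry in the
# colon-separated list $2. path_contains is a hypothetical helper.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

user_bin="$HOME/.local/bin"
if ! path_contains "$user_bin" "$PATH"; then
    # Prepend so the user-level install is found first.
    PATH="$user_bin:$PATH"
    export PATH
fi
```

To make the change persistent, the same lines could be placed in your shell startup file. Which file that is depends on your shell and platform.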
Install from source
Follow the instructions to download KneadData
Move to the KneadData source directory: $ cd kneaddata
Install KneadData
$ python setup.py install
This command will automatically install Trimmomatic and Bowtie2. To bypass the install of dependencies, add the option “--bypass-dependencies-install”.
If you do not have write permissions to ‘/usr/lib/’, then add the option “--user” to the install command. This will install the python package into subdirectories of ‘$HOME/.local’. Please note when using the “--user” install option on some platforms, you might need to add ‘$HOME/.local/bin/’ to your $PATH as it might not be included by default. You will know if it needs to be added if you see the following message kneaddata: command not found when trying to run KneadData after installing with the “--user” option.
Download the database
It is recommended that you download the human (Homo_sapiens_hg39_T2T_Bowtie2_v0.1.tar.gz - source ) reference database (approx. size = 3.6 GB). However, this step is not required if you are using your own custom reference database or if you will not be running with a reference database.
The dog reference database (German Shepherd dog assembly) is also available for download (approx. size = 2.5 GB). This database is based on the genomic DNA sequences for the Canis lupus familiaris assembly version UU_Cfam_GSD_1.0 (accession GCF_011100685.1). This file includes the nucleotide sequences of the assembled chromosomes and unplaced scaffolds.
The dog reference database (domestic dog) is also available for download (approx. size = 1.4 GB). This database is based on the genomic DNA sequences for the Canis familiaris (domestic dog) assembly version ROS_Cfam_1.0. This file includes the nucleotide sequences of the assembled chromosomes and unplaced scaffolds.
The cat reference database is available for download (approx. size = 3.7 GB). This database is based on the genomic DNA sequences for Felis catus (domestic cat). This link includes the nucleotide sequences of the assembled chromosomes and unplaced scaffolds.
A reference database can be downloaded to use when running KneadData. Alternatively, you can create your own custom reference database.
Select Reference Sequences
First you must select reference sequences for the contamination you are trying to
remove. Say you wish to filter reads from a particular “host.” Broadly
defined, the host can be an organism, or a set of organisms, or just a set of
sequences. Then, you simply must generate a reference database for KneadData from a
FASTA file containing these
sequences. Usually, researchers want to remove reads from the human genome, the
human transcriptome, or ribosomal RNA. You can access some of these FASTA files
using the resources below:
Ribosomal RNA: Silva provides a comprehensive
database for ribosomal RNA sequences spanning all three domains of life
(Bacteria, Archaea, and Eukarya).
Human Genome & Transcriptome: Information about the newest assembly of human
genomic data can be found at the NCBI project
page. UCSC
provides a convenient website to download
this data.
Generating KneadData Databases
KneadData requires that your reference sequences (FASTA files) be indexed to
form KneadData databases beforehand. This only needs to be done once per
reference sequence.
For certain common databases, we provide indexed files. If you use these, you
can skip the manual build steps below. Alternatively, if you would like to bypass
the reference alignment portion of the workflow, you do not need to provide a
database when running KneadData.
To download the indexed human reference database, run the following command:
$ kneaddata_database --download human_genome bowtie2 $DIR
When running this command, $DIR should be replaced with the full path to the directory you have selected to store the database.
Creating a Bowtie2 Database
Simply run the bowtie2-build indexer included with Bowtie2 as follows:
$ bowtie2-build <reference> <db-name>
Where <reference> is the reference FASTA file, and <db-name> is the name you
wish to call your Bowtie2 database. For more details, refer
to the bowtie2-build documentation.
Note: Creating SILVA ribosomal_RNA Database
Creating the SILVA ribosomal_RNA database requires one additional step. Before running bowtie2-build, run the following Python program, which converts the “U”s to “T”s in the FASTA sequences. Script link: modify_RNA_to_DNA.py
$ python -u modify_RNA_to_DNA.py input.fasta output.fa
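To illustrate what the “U” to “T” conversion involves, here is a minimal shell sketch that rewrites sequence lines while leaving FASTA header lines untouched. It is a stand-in for illustration only, not the modify_RNA_to_DNA.py script, whose exact behavior may differ. The function name rna_to_dna and the sample record are hypothetical.

```shell
# Convert RNA bases to DNA bases in FASTA text read from stdin.
# Header lines (starting with ">") are passed through unchanged, so a "U"
# in a sequence identifier is not altered.
rna_to_dna() {
    awk '/^>/ { print; next } { gsub(/U/, "T"); gsub(/u/, "t"); print }'
}

# Tiny illustrative input (not real SILVA data).
printf '>seq1 example rRNA\nACGUUGCA\n' | rna_to_dna
```

With a real file this would be used as rna_to_dna < input.fasta > output.fa, followed by bowtie2-build on output.fa.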
Creating a BMTagger Database
KneadData includes kneaddata_build_database, an executable that
will automatically generate these databases for BMTagger. Simply run
$ kneaddata_build_database reference.fasta
By default, this will generate the reference databases, whose names are prefixed
with reference.fasta.
A note on PATH: The above command will fail if the tools in the BMTagger suite
(specifically, bmtool and srprism) and the NCBI BLAST executables are not in
your PATH. If this is the case, you can specify a path to these tools using the
-b, -s, and -m options. Run
$ kneaddata_build_database --help
for more details.
Example Custom Database Build
Say you want to remove human reads from your metagenomic sequencing data.
You downloaded the human genome in a file called Homo_sapiens.fasta.
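Then, you can generate the KneadData database by executing
$ bowtie2-build Homo_sapiens.fasta Homo_sapiens_db
for Bowtie2, or
$ kneaddata_build_database Homo_sapiens.fasta -o Homo_sapiens_db
for BMTagger. All of the required KneadData database files will have file names prefixed by
Homo_sapiens_db and have various file extensions.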
Note: For creating SILVA ribosomal_RNA database
To create the SILVA ribosomal_RNA database, run the following Python program before running bowtie2-build. It converts the “U”s to “T”s in the FASTA sequences. Script link: modify_RNA_to_DNA.py
$ python -u modify_RNA_to_DNA.py input.fasta output.fa
How to Run
After downloading or generating your database file, you can start to remove contaminant reads.
As input, KneadData requires FASTQ files. It supports both single end and paired
end reads. KneadData uses either Bowtie2 (default) or BMTagger to identify the
contaminant reads.
To run KneadData in single end mode with BMTagger, add the option “--run-bmtagger” to the command. By default, this will create the same four files as running with Bowtie2. The only differences are that the contaminants file will have “bmtagger” in the name instead of “bowtie2” and the included $DATABASE name would differ.
If you wanted to use BMTagger and the BMTagger executable was located at
$HOME/bmtagger/bmtagger.sh, which is not in your $PATH, you would add the option “--bmtagger $HOME/bmtagger/bmtagger.sh” to the command.
If you wanted to select the basenames of the output files, you would add the option “--output-prefix $NAME”, replacing $NAME with the name you would like used.
Paired End Run
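To run KneadData in paired end mode with Bowtie2, run
$ kneaddata --input1 seq1.fastq --input2 seq2.fastq -db $DATABASE --output kneaddata_output
seq1.fastq: Your input FASTQ file, first mate
seq2.fastq: Your input FASTQ file, second mate
$DATABASE: Prefix for the KneadData database.
kneaddata_output: The folder to write the output files.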
The outputs depend on what happens during the quality filtering and trimming
part of the pipeline.
When performing quality filtering and trimming for paired end files, three things
can happen:
(1) Both reads in the pair pass.
(2) The read in the first mate passes, and the one in the second does not pass.
(3) The read in the second mate passes, and the one in the first does not pass.
The number of outputs is a function of the read quality.
KneadData + Bowtie2 (or BMTagger) Outputs: There can be up to 8 outputs per reference
database, plus up to 5 aggregate outputs.
Instead of single end reads, say you have paired end reads and you want to
separate the reads that came from bacterial mRNA, bacterial rRNA, and human RNA.
You have two databases, one prefixed bact_rrna_db and the other prefixed
human_rna_db, and your sequence files are seq1.fastq and seq2.fastq. To
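run with Bowtie2, execute
$ kneaddata --input1 seq1.fastq --input2 seq2.fastq -db bact_rrna_db -db human_rna_db --output seq_out
This will output files in the folder seq_out named: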
Files for just the bact_rrna_db database:
seq_kneaddata_paired_bact_rrna_db_bowtie2_contam_1.fastq: Reads from the first mate in
situation (1) above that were identified as belonging to the bact_rrna_db
database.
seq_kneaddata_paired_bact_rrna_db_bowtie2_contam_2.fastq: Reads from the second mate in
situation (1) above that were identified as belonging to the bact_rrna_db
database.
seq_kneaddata_paired_bact_rrna_db_bowtie2_clean_1.fastq: Reads from the first mate in
situation (1) above that were identified as NOT belonging to the
bact_rrna_db database.
seq_kneaddata_paired_bact_rrna_db_bowtie2_clean_2.fastq: Reads from the second mate in
situation (1) above that were identified as NOT belonging to the
bact_rrna_db database.
Depending on the input FASTQ, one or more of the following may be output:
seq_kneaddata_unmatched_1_bact_rrna_db_bowtie2_contam.fastq: Reads from the first mate in
situation (2) above that were identified as belonging to the bact_rrna_db
database.
seq_kneaddata_unmatched_1_bact_rrna_db_bowtie2_clean.fastq: Reads from the first mate in
situation (2) above that were identified as NOT belonging to the
bact_rrna_db database.
seq_kneaddata_unmatched_2_bact_rrna_db_bowtie2_contam.fastq: Reads from the second mate in
situation (3) above that were identified as belonging to the bact_rrna_db
database.
seq_kneaddata_unmatched_2_bact_rrna_db_bowtie2_clean.fastq: Reads from the second mate in
situation (3) above that were identified as NOT belonging to the
bact_rrna_db database.
Files for just the human_rna_db database:
seq_kneaddata_paired_human_rna_db_bowtie2_contam_1.fastq: Reads from the first mate in
situation (1) above that were identified as belonging to the human_rna_db
database.
seq_kneaddata_paired_human_rna_db_bowtie2_contam_2.fastq: Reads from the second mate in
situation (1) above that were identified as belonging to the human_rna_db
database.
seq_kneaddata_paired_human_rna_db_bowtie2_clean_1.fastq: Reads from the first mate in
situation (1) above that were identified as NOT belonging to the
human_rna_db database.
seq_kneaddata_paired_human_rna_db_bowtie2_clean_2.fastq: Reads from the second mate in
situation (1) above that were identified as NOT belonging to the
human_rna_db database.
Depending on the input FASTQ, one or more of the following may be output:
seq_kneaddata_unmatched_1_human_rna_db_bowtie2_contam.fastq: Reads from the first mate in
situation (2) above that were identified as belonging to the human_rna_db
database.
seq_kneaddata_unmatched_1_human_rna_db_bowtie2_clean.fastq: Reads from the first mate in
situation (2) above that were identified as NOT belonging to the
human_rna_db database.
seq_kneaddata_unmatched_2_human_rna_db_bowtie2_contam.fastq: Reads from the second mate in
situation (3) above that were identified as belonging to the human_rna_db
database.
seq_kneaddata_unmatched_2_human_rna_db_bowtie2_clean.fastq: Reads from the second mate in
situation (3) above that were identified as NOT belonging to the
human_rna_db database.
Note, the files named “*_clean.fastq” will only be written if running with the option “--store-temp-output”.
Aggregated files:
seq_kneaddata.log: Log file containing statistics about the run.
seq_kneaddata_paired_1.fastq: Reads from the first mate in situation (1) identified as
NOT belonging to any of the reference databases.
seq_kneaddata_paired_2.fastq: Reads from the second mate in situation (1) identified as
NOT belonging to any of the reference databases.
seq_kneaddata_unmatched_1.fastq: Reads from the first mate in situation (2) identified as
NOT belonging to any of the reference databases.
seq_kneaddata_unmatched_2.fastq: Reads from the second mate in situation (3) identified as
NOT belonging to any of the reference databases.
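The naming pattern above can be summarized as a small shell sketch that lists every file a paired end Bowtie2 run may produce for a given output prefix and set of databases: up to 8 per database, plus the 5 aggregate files. The function name expected_paired_outputs is hypothetical, and the names are derived only from the list above, not from the KneadData source. Which files actually appear depends on the input and on the “--store-temp-output” option.

```shell
# List the output file names that a paired-end Bowtie2 run may produce,
# following the naming pattern in the list above.
# $1 = output prefix (e.g. "seq_kneaddata"); remaining args = database names.
expected_paired_outputs() {
    prefix=$1
    shift
    for db in "$@"; do
        for kind in contam clean; do
            printf '%s_paired_%s_bowtie2_%s_1.fastq\n' "$prefix" "$db" "$kind"
            printf '%s_paired_%s_bowtie2_%s_2.fastq\n' "$prefix" "$db" "$kind"
            printf '%s_unmatched_1_%s_bowtie2_%s.fastq\n' "$prefix" "$db" "$kind"
            printf '%s_unmatched_2_%s_bowtie2_%s.fastq\n' "$prefix" "$db" "$kind"
        done
    done
    # Aggregate outputs shared across all databases.
    printf '%s.log\n' "$prefix"
    printf '%s_paired_1.fastq\n' "$prefix"
    printf '%s_paired_2.fastq\n' "$prefix"
    printf '%s_unmatched_1.fastq\n' "$prefix"
    printf '%s_unmatched_2.fastq\n' "$prefix"
}

expected_paired_outputs seq_kneaddata bact_rrna_db human_rna_db
```

A listing like this can be compared against the contents of the output folder to see which of the optional files were written for a particular run.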
Demo Run
The examples folder contains a demo input file. This file contains single end reads in FASTQ format.
Kneaddata will use “NexteraPE” adapters provided by Trimmomatic to trim the adapter contents by default.
The other available options are: ["NexteraPE", "TruSeq2", "TruSeq3", "none"]. Based on the source of the sequencer and the FASTQC report, it is highly recommended
to choose the correct sequencer source to ensure the removal of adapter contents by Kneaddata.
Example: Trimming adapter sequence using TruSeq3 sequencer adapters in the workflow:
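As a hedged sketch of such a run, the snippet below validates the chosen adapter set against the four values listed above and then prints a candidate command. The option name --sequencer-source is an assumption that should be confirmed with kneaddata --help. The helper is_valid_sequencer_source is hypothetical, and demo.fastq, database_folder, and kneaddata_output are placeholder names.

```shell
# Succeed only if $1 is one of the adapter sets named in this section.
# is_valid_sequencer_source is a hypothetical helper, not part of KneadData.
is_valid_sequencer_source() {
    case "$1" in
        NexteraPE|TruSeq2|TruSeq3|none) return 0 ;;
        *) return 1 ;;
    esac
}

source_name="TruSeq3"
if is_valid_sequencer_source "$source_name"; then
    # Printed rather than executed so the sketch runs without KneadData installed.
    # --sequencer-source is an assumed flag name; check `kneaddata --help`.
    echo "kneaddata --unpaired demo.fastq --reference-db database_folder --output kneaddata_output --sequencer-source $source_name"
else
    echo "Unknown sequencer source: $source_name" >&2
fi
```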
When using --bypass-trim, Kneaddata expects input files to follow its post-trim naming convention (e.g., *.trimmed.fastq). If you supply input.fastq, the run may crash with an unclear error.
Workaround: Rename your input to match the expected format, e.g.:
mv input.fastq input.trimmed.fastq
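When several inputs need the same treatment, the rename can be wrapped in a small helper. This is a sketch that assumes the expected name is formed by replacing the .fastq suffix with .trimmed.fastq, as in the example above. The function name trimmed_name is hypothetical.

```shell
# Derive the post-trim style name for a FASTQ file:
#   input.fastq -> input.trimmed.fastq
# Names already ending in .trimmed.fastq are returned unchanged; any other
# extension is rejected with a non-zero status.
trimmed_name() {
    case "$1" in
        *.trimmed.fastq) printf '%s\n' "$1" ;;
        *.fastq) printf '%s\n' "${1%.fastq}.trimmed.fastq" ;;
        *) return 1 ;;
    esac
}

fastq="input.fastq"
# Only rename when the file is actually present.
if [ -f "$fastq" ]; then
    mv "$fastq" "$(trimmed_name "$fastq")"
fi
```

If you would rather keep the original file name, copying or symlinking to the new name instead of using mv is an alternative.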
Trim Overrepresented/Repetitive sequences
It is highly recommended to use the --run-trim-repetitive flag for shotgun sequences (Metatranscriptomics-MTX, Metagenomics-MGX) to trim the overrepresented sequences if they are shown in FASTQC reports.
However, Kneaddata will not trim the overrepresented sequences by default, because amplicon sequences usually have a large number of repetitive reads and trimming them would deplete the read count.
Example: Trimming overrepresented sequences using the FASTQC reports:
If you want to specify additional arguments for Bowtie2 using the
--bowtie2-options flag, you will need to use the equals sign along with quotes. Add additional flags for each option.
NOTE: Manually specifying additional arguments will completely override the defaults.
Also more than one database can be provided for each run. The database argument can contain the folder that includes the database or the prefix of the database files.
For additional information, visit the KneadData Tutorial.
KneadData User Manual
Contents
Requirements
Optionally, BMTagger can be used instead of Bowtie2.
The executables for the required software packages should be installed in your $PATH. Alternatively, you can provide the location of the Bowtie2 install ($BOWTIE2_DIR) with the following KneadData option “--bowtie2 $BOWTIE2_DIR”.
Installation
Before installing KneadData, please install the Java Runtime Environment (JRE). First download the JRE for your platform. Then follow the instructions for your platform: Linux 64-bit or Mac OS. At the end of the installation, add the location of the java executable to your $PATH.
Download KneadData
You can download the latest KneadData release or the development version. The source contains example files. If installing with pip, it is optional to first download the KneadData source.
Option 1: Latest Release (Recommended)
Option 2: Development Version
Create a clone of the repository:
$ git clone https://github.com/biobakery/kneaddata.git
Note: Creating a clone of the repository requires Git to be installed.
Install KneadData
Install with pip
$ pip install kneaddata
If you see the message kneaddata: command not found when trying to run KneadData after installing with the “--user” option, add ‘$HOME/.local/bin/’ to your $PATH.
Install from source
$ cd kneaddata
$ python setup.py install
If you see the message kneaddata: command not found when trying to run KneadData after installing with the “--user” option, add ‘$HOME/.local/bin/’ to your $PATH.
Download the database
It is recommended that you download the human (Homo_sapiens_hg39_T2T_Bowtie2_v0.1.tar.gz - source ) reference database (approx. size = 3.6 GB). However, this step is not required if you are using your own custom reference database or if you will not be running with a reference database.
$ kneaddata_database --download human_genome bowtie2 $DIR
If you are running with bmtagger instead of bowtie2, then download the bmtagger database instead of the bowtie2 database with the following command.
$ kneaddata_database --download human_genome bmtagger $DIR
The human transcriptome (hg38) reference database is also available for download (approx. size = 254 MB).
$ kneaddata_database --download human_transcriptome bowtie2 $DIR
The SILVA Ribosomal RNA reference database is also available for download (approx. size = 11 GB).
$ kneaddata_database --download ribosomal_RNA bowtie2 $DIR
The mouse (C57BL) reference database is also available for download (approx. size = 3 GB).
$ kneaddata_database --download mouse_C57BL bowtie2 $DIR
The dog reference database (German Shepherd dog assembly) is also available for download (approximate size = ~2.5 Gb). This database is based on the genomic DNA sequences for the Canis lupus familiaris assembly version UU_Cfam_GSD_1.0 (accession GCF_011100685.1). This file includes the nucleotide sequences of the assembled chromosomes and unplaced scaffolds.
$ kneaddata_database --download dog_genome bowtie2 $DIR
The dog reference database (domestic dog) is also available for download (approx. size = 1.4 GB). This database is based on the genomic DNA sequences for the Canis familiaris (domestic dog) assembly version ROS_Cfam_1.0. This file includes the nucleotide sequences of the assembled chromosomes and unplaced scaffolds.
$ wget https://huttenhower.sph.harvard.edu/kneadData_databases/dog_genome.tar.gz
The cat reference database is available for download (approx. size = 3.7 GB). This database is based on the genomic DNA sequences for the Felis catus (domestic cat). This link includes the nucleotide sequences of the assembled chromosomes and unplaced scaffolds.
$ kneaddata_database --download cat_genome bowtie2 $DIR
Create a Custom Database
A reference database can be downloaded to use when running KneadData. Alternatively, you can create your own custom reference database.
Select Reference Sequences
First you must select reference sequences for the contamination you are trying to remove. Say you wish to filter reads from a particular “host.” Broadly defined, the host can be an organism, or a set of organisms, or just a set of sequences. Then, you simply must generate a reference database for KneadData from a FASTA file containing these sequences. Usually, researchers want to remove reads from the human genome, the human transcriptome, or ribosomal RNA. You can access some of these FASTA files using the resources below:
Ribosomal RNA: Silva provides a comprehensive database for ribosomal RNA sequences spanning all three domains of life (Bacteria, Archaea, and Eukarya).
Human Genome & Transcriptome: Information about the newest assembly of human genomic data can be found at the NCBI project page. USCS provides a convenient website to download this data.
Generating KneadData Databases
KneadData requires that your reference sequences (FASTA files) be indexed to form KneadData databases beforehand. This only needs to be done once per reference sequence.
For certain common databases, we provide indexed files. If you use these, you can skip the manual build steps below. Alternatively if you would like to bypass the reference alignment portion of the workflow, a database does not need to be provided when running KneadData.
To download the indexed human reference database, run the following command:
$ kneaddata_database --download human_genome bowtie2 $DIR
Creating a Bowtie2 Database
Simply run the bowtie2-build indexer included with Bowtie2 as follows:
$ bowtie2-build <reference> <db-name>
Where <reference> is the reference FASTA file, and <db-name> is the name you wish to call your Bowtie2 database. For more details, refer to the bowtie2-build documentation.
Note: Creating SILVA ribosomal_RNA Database
Creating the SILVA ribosomal_RNA database requires one additional step. Run the following python program before the bowtie2-build command, which converts the “U”s to “T”s in the fasta sequences.
Script link: modify_RNA_to_DNA.py
$ python -u modify_RNA_to_DNA.py input.fasta output.fa
Creating a BMTagger Database
KneadData includes kneaddata_build_database, an executable that will automatically generate these databases for BMTagger. Simply run
$ kneaddata_build_database reference.fasta
By default, this will generate the reference databases, whose names are prefixed with reference.fasta.
A note on PATH: The above command will fail if the tools in the BMTagger suite (specifically, bmtool and srprism) and the NCBI BLAST executables are not in your PATH. If this is the case, you can specify a path to these tools using the -b, -s, and -m options. Run
$ kneaddata_build_database --help
for more details.
Example Custom Database Build
Say you want to remove human reads from your metagenomic sequencing data. You downloaded the human genome in a file called Homo_sapiens.fasta. Then, you can generate the KneadData database by executing
$ bowtie2-build Homo_sapiens.fasta Homo_sapiens_db
for Bowtie2, or
$ kneaddata_build_database Homo_sapiens.fasta -o Homo_sapiens_db
for BMTagger. All of the required KneadData database files will have file names prefixed by Homo_sapiens_db and have various file extensions.
Note: For creating SILVA ribosomal_RNA database
To create the SILVA ribosomal_RNA database, run the following Python program before running bowtie2-build. It converts the “U”s to “T”s in the FASTA sequences.
Script link: modify_RNA_to_DNA.py
$ python -u modify_RNA_to_DNA.py input.fasta output.fa
How to Run
After downloading or generating your database file, you can start to remove contaminant reads. As input, KneadData requires FASTQ files. It supports both single end and paired end reads. KneadData uses either Bowtie2 (default) or BMTagger to identify the contaminant reads.
Single End Run
To run KneadData in single end mode, run
$ kneaddata --unpaired seq.fastq --reference-db $DATABASE --output kneaddata_output
This will create files in the folder kneaddata_output named
seq_kneaddata_$DATABASE_bowtie2_contam.fastq: FASTQ file containing reads that were identified as contaminants from the database (named $DATABASE).
seq_kneaddata.fastq: This file includes reads that were not in the reference database.
seq_kneaddata.trimmed.fastq: This file has trimmed reads.
seq_kneaddata.log
To run KneadData in single end mode with BMTagger, run
$ kneaddata --unpaired seq.fastq --reference-db $DATABASE --run-bmtagger
By default, this will create the same four files as running with bowtie2. The only differences are the contaminants file will have “bmtagger” in the name instead of “bowtie2” and the included $DATABASE name would differ.
If you wanted to use BMTagger and the BMTagger executable was located at $HOME/bmtagger/bmtagger.sh, which is not in your $PATH, you would add the option “--bmtagger $HOME/bmtagger/bmtagger.sh” to the command.
If you wanted to select the basenames of the output files, you would add the option “--output-prefix $NAME”, replacing $NAME with the name you would like used.
Paired End Run
To run KneadData in paired end mode with Bowtie2, run
$ kneaddata --input1 seq1.fastq --input2 seq2.fastq -db $DATABASE --output kneaddata_output
To run KneadData in paired end mode with BMTagger, run
$ kneaddata --input1 seq1.fastq --input2 seq2.fastq -db $DATABASE --run-bmtagger --output kneaddata_output
seq1.fastq: Your input FASTQ file, first mate
seq2.fastq: Your input FASTQ file, second mate
$DATABASE: Prefix for the KneadData database.
kneaddata_output: The folder to write the output files.
The outputs depend on what happens during the quality filtering and trimming part of the pipeline.
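When performing quality filtering and trimming for paired end files, three things can happen:
(1) Both reads in the pair pass.
(2) The read in the first mate passes, and the one in the second does not pass.
(3) The read in the second mate passes, and the one in the first does not pass.
The number of outputs is a function of the read quality.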
KneadData + Bowtie2 (or BMTagger) Outputs: There can be up to 8 outputs per reference database, plus up to 5 aggregate outputs.
Instead of single end reads, say you have paired end reads and you want to separate the reads that came from bacterial mRNA, bacterial rRNA, and human RNA. You have two databases, one prefixed bact_rrna_db and the other prefixed human_rna_db, and your sequence files are seq1.fastq and seq2.fastq. To run with Bowtie2, execute
$ kneaddata --input1 seq1.fastq --input2 seq2.fastq -db bact_rrna_db -db human_rna_db --output seq_out
This will output files in the folder seq_out named:
Files for just the bact_rrna_db database:
seq_kneaddata_paired_bact_rrna_db_bowtie2_contam_1.fastq: Reads from the first mate in situation (1) above that were identified as belonging to the bact_rrna_db database.
seq_kneaddata_paired_bact_rrna_db_bowtie2_contam_2.fastq: Reads from the second mate in situation (1) above that were identified as belonging to the bact_rrna_db database.
seq_kneaddata_paired_bact_rrna_db_bowtie2_clean_1.fastq: Reads from the first mate in situation (1) above that were identified as NOT belonging to the bact_rrna_db database.
seq_kneaddata_paired_bact_rrna_db_bowtie2_clean_2.fastq: Reads from the second mate in situation (1) above that were identified as NOT belonging to the bact_rrna_db database.
Depending on the input FASTQ, one or more of the following may be output:
seq_kneaddata_unmatched_1_bact_rrna_db_bowtie2_contam.fastq: Reads from the first mate in situation (2) above that were identified as belonging to the bact_rrna_db database.
seq_kneaddata_unmatched_1_bact_rrna_db_bowtie2_clean.fastq: Reads from the first mate in situation (2) above that were identified as NOT belonging to the bact_rrna_db database.
seq_kneaddata_unmatched_2_bact_rrna_db_bowtie2_contam.fastq: Reads from the second mate in situation (3) above that were identified as belonging to the bact_rrna_db database.
seq_kneaddata_unmatched_2_bact_rrna_db_bowtie2_clean.fastq: Reads from the second mate in situation (3) above that were identified as NOT belonging to the bact_rrna_db database.
Files for just the human_rna_db database:
seq_kneaddata_paired_human_rna_db_bowtie2_contam_1.fastq: Reads from the first mate in situation (1) above that were identified as belonging to the human_rna_db database.
seq_kneaddata_paired_human_rna_db_bowtie2_contam_2.fastq: Reads from the second mate in situation (1) above that were identified as belonging to the human_rna_db database.
seq_kneaddata_paired_human_rna_db_bowtie2_clean_1.fastq: Reads from the first mate in situation (1) above that were identified as NOT belonging to the human_rna_db database.
seq_kneaddata_paired_human_rna_db_bowtie2_clean_2.fastq: Reads from the second mate in situation (1) above that were identified as NOT belonging to the human_rna_db database.
Depending on the input FASTQ, one or more of the following may be output:
seq_kneaddata_unmatched_1_human_rna_db_bowtie2_contam.fastq: Reads from the first mate in situation (2) above that were identified as belonging to the human_rna_db database.
seq_kneaddata_unmatched_1_human_rna_db_bowtie2_clean.fastq: Reads from the first mate in situation (2) above that were identified as NOT belonging to the human_rna_db database.
seq_kneaddata_unmatched_2_human_rna_db_bowtie2_contam.fastq: Reads from the second mate in situation (3) above that were identified as belonging to the human_rna_db database.
seq_kneaddata_unmatched_2_human_rna_db_bowtie2_clean.fastq: Reads from the second mate in situation (3) above that were identified as NOT belonging to the human_rna_db database.
Note, the files named “*_clean.fastq” will only be written if running with the option “--store-temp-output”.
Aggregated files:
seq_kneaddata.log: Log file containing statistics about the run.
seq_kneaddata_paired_1.fastq: Reads from the first mate in situation (1) identified as NOT belonging to any of the reference databases.
seq_kneaddata_paired_2.fastq: Reads from the second mate in situation (1) identified as NOT belonging to any of the reference databases.
seq_kneaddata_unmatched_1.fastq: Reads from the first mate in situation (2) identified as NOT belonging to any of the reference databases.
seq_kneaddata_unmatched_2.fastq: Reads from the second mate in situation (3) identified as NOT belonging to any of the reference databases.
Demo Run
The examples folder contains a demo input file. This file contains single end reads in FASTQ format.
$ kneaddata --unpaired examples/demo.fastq --reference-db examples/demo_db --output kneaddata_demo_output
This will create four output files:
kneaddata_demo_output/demo_kneaddata.fastq
kneaddata_demo_output/demo_kneaddata_demo_db_bowtie2_contam.fastq
kneaddata_demo_output/demo_kneaddata.log
kneaddata_demo_output/demo_kneaddata.trimmed.fastq
Sequencer Source for trimming Adapter Contents
Kneaddata will use “NexteraPE” adapters provided by Trimmomatic to trim the adapter contents by default. The other available options are: ["NexteraPE", "TruSeq2", "TruSeq3", "none"]. Based on the source of the sequencer and the FASTQC report, it is highly recommended to choose the correct sequencer source to ensure the removal of adapter contents by Kneaddata.
Example: Trimming adapter sequence using TruSeq3 sequencer adapters in the workflow:
Example: Skipping adapter trimming in the workflow:
--bypass-trim option
When using --bypass-trim, Kneaddata expects input files to follow its post-trim naming convention (e.g., *.trimmed.fastq). If you supply input.fastq, the run may crash with an unclear error. Workaround: Rename your input to match the expected format, e.g.:
mv input.fastq input.trimmed.fastq
Trim Overrepresented/Repetitive sequences
It is highly recommended to use the --run-trim-repetitive flag for shotgun sequences (Metatranscriptomics-MTX, Metagenomics-MGX) to trim the overrepresented sequences if they are shown in FASTQC reports.
However, Kneaddata will not trim the overrepresented sequences by default, because amplicon sequences usually have a large number of repetitive reads and trimming them would deplete the read count.
Example: Trimming overrepresented sequences using the FASTQC reports:
Example: Trimming overrepresented sequences and TruSeq3 adapters:
Additional Arguments
If you want to specify additional arguments for Bowtie2 using the --bowtie2-options flag, you will need to use the equals sign along with quotes. Add additional flags for each option.
For example:
$ kneaddata --unpaired demo.fastq --output kneaddata_output --reference-db database_folder --bowtie2-options="--very-fast" --bowtie2-options="-p 2"
A similar approach is used to specify additional arguments for Trimmomatic:
$ kneaddata --unpaired demo.fastq --output kneaddata_output --reference-db database_folder --trimmomatic-options="LEADING:3" --trimmomatic-options="TRAILING:3"
NOTE: Manually specifying additional arguments will completely override the defaults.
Also, more than one database can be provided for each run. The database argument can contain the folder that includes the database or the prefix of the database files.
For example:
$ kneaddata --unpaired demo.fastq --output kneaddata_output --reference-db database_folder --reference-db database_folder2/demo
Contributions
Thanks go to these wonderful people:
Complete Option List
All options can be accessed with
$ kneaddata --help