Squeegee
Contaminant sequences in metagenomic samples can affect the interpretation of findings reported in microbiome studies, especially in low-biomass environments. Squeegee is a de novo computational contamination detection tool for metagenomic samples. It is based on the hypothesis that contamination from DNA extraction kits or sampling lab environments leaves taxonomic “bread crumbs” across multiple distinct sample types, which allows microbial contaminants to be detected when negative controls are unavailable.
System requirements
Squeegee is supported on Linux and has been tested on Ubuntu 18.04.5 LTS. The machine must have enough RAM to load the classification database for Kraken. A standard database containing bacterial, archaeal, and viral genomes from NCBI RefSeq takes more than 300 Gb.
Required tools and packages
The following are the required dependencies for Squeegee
Installation
The easiest way to install Squeegee is through conda. It is highly recommended to install the tool in a clean conda environment; to create one, please follow the conda user guide. Alternatively, install the required dependencies and then clone the repository. Installation with conda install typically takes no longer than 5 minutes.
conda install -c bioconda squeegee
Building a Kraken database
Before running Squeegee, a Kraken database must be built. For details on building a Kraken database, please check the Kraken manual. It is highly recommended to build the database with bacterial, viral, and archaeal sequences from the RefSeq database. Please do not delete the original sequences or the mapping file used during the database building process, because Squeegee queries those sequences to extract the reference sequences used in the alignment step.
Input Sequencing Data
The input data for Squeegee should include multiple metagenomic samples that were collected from distinct environments and processed in the same lab or with the same reagents (such as the same DNA extraction kit), so that the source of contamination is consistent across all samples. The current version of the software only supports Illumina paired-end reads. Support for single-end and long-read data will be added in the near future.
Required Metadata
The user must provide a metadata file in tab-separated text format. The metadata file has four columns: the first column is the sample ID, the second column is the user-defined sample type, and the third and fourth columns are the absolute paths to the first and second read files of the paired-end sequencing data, in FASTQ format.
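As an illustration of the four-column layout described above, the following Python sketch writes a metadata file and checks it for common mistakes. The sample IDs, sample types, file paths, and the helper names write_metadata and check_metadata are hypothetical, and the sketch assumes the file has no header row, which the description above does not specify.

```python
import csv
import os
import tempfile


def write_metadata(rows, path):
    """Write (sample_id, sample_type, reads_1, reads_2) rows as tab-separated text."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)


def check_metadata(path):
    """Return a list of problems found in a four-column metadata file."""
    problems = []
    with open(path, newline="") as handle:
        for number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            if len(row) != 4:
                problems.append(f"line {number}: expected 4 columns, found {len(row)}")
                continue
            # Columns 3 and 4 must be absolute paths to the paired FASTQ files.
            for reads in row[2:]:
                if not os.path.isabs(reads):
                    problems.append(f"line {number}: path is not absolute: {reads}")
    return problems


# Hypothetical samples, for illustration only (no header row is assumed).
rows = [
    ("sample_1", "soil", "/data/demo/input_data/sample_1_R1.fastq",
     "/data/demo/input_data/sample_1_R2.fastq"),
    ("sample_2", "water", "/data/demo/input_data/sample_2_R1.fastq",
     "input_data/sample_2_R2.fastq"),  # relative path, flagged by the check
]
metadata_path = os.path.join(tempfile.mkdtemp(), "metadata.txt")
write_metadata(rows, metadata_path)
problems = check_metadata(metadata_path)
print(problems)
```

Running a check like this before starting an analysis catches relative paths and missing columns early, before the database is loaded.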
Running Squeegee
Use the following command to run Squeegee:
The required arguments are the metadata file, the directory of the Kraken database, and a user-specified output directory for the analysis.
Parameter Settings
The parameter settings affect the precision and recall of Squeegee. Based on a basic understanding of the samples, the user can control how readily a taxon is recruited as a candidate contaminant by adjusting the minimum prevalence threshold (default: 0.6).
If the samples being processed have similar microbial communities, increasing the minimum prevalence threshold reduces the number of false positives caused by shared true community members.
Lowering the minimum prevalence threshold allows the program to consider more candidate contaminants, which can increase recall but also increases the run time. The minimum read support, minimum abundance, and minimum alignment coverage thresholds together determine how strictly a taxon is considered present. Depending on the sequencing technology, 5% or more of the reads may be misclassified by the taxonomic classifier, even at the genus level. Increasing these thresholds gives more confidence that a taxon is truly present. On the other hand, when contaminant species are low in abundance, setting these thresholds too high can increase false negatives.
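The interaction between a presence threshold and the prevalence threshold can be illustrated with a small Python sketch. This is not Squeegee's implementation: it assumes that prevalence is the fraction of samples in which a taxon passes a read support cutoff, and the function name, the min_reads default of 10, and the read counts are all hypothetical. Only the 0.6 prevalence default comes from the text above.

```python
def candidate_contaminants(read_counts, min_reads=10, min_prevalence=0.6):
    """Return {taxon: prevalence} for taxa that pass both thresholds.

    read_counts maps sample ID -> {taxon: number of classified reads}.
    A taxon counts as present in a sample only if it has at least
    min_reads reads (hypothetical default); prevalence is assumed to be
    the fraction of samples in which the taxon is present.
    """
    n_samples = len(read_counts)
    present_in = {}
    for counts in read_counts.values():
        for taxon, reads in counts.items():
            if reads >= min_reads:
                present_in[taxon] = present_in.get(taxon, 0) + 1
    return {
        taxon: n / n_samples
        for taxon, n in present_in.items()
        if n / n_samples >= min_prevalence
    }


# Hypothetical read counts for three samples of distinct types.
read_counts = {
    "soil_1": {"taxon_A": 500, "taxon_B": 40, "taxon_C": 900, "taxon_D": 2},
    "water_1": {"taxon_A": 300, "taxon_B": 25, "taxon_D": 3},
    "skin_1": {"taxon_A": 450, "taxon_D": 1},
}

default_run = candidate_contaminants(read_counts)
strict_run = candidate_contaminants(read_counts, min_prevalence=0.9)
print(sorted(default_run))
print(sorted(strict_run))
```

In this toy input, taxon_D appears in every sample but never reaches the read support cutoff, so it is treated as classification noise; raising the prevalence threshold then drops taxon_B, which is present in only two of the three samples.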
Output Format
Squeegee writes its output to the user-specified directory. The output includes the Kraken classification output and report for each sample, the k-mer sketches generated for each sample, and the alignment files from mapping reads to the candidate contaminants. In final_predictions.txt, the predicted contaminants are listed in the following format:
The first line of the file lists the sample types provided by the user, separated by slashes. The next line is the header of the output. The rest of the file lists the predicted contaminants, one per line, with the following fields: taxonomic ID, taxonomic name, combined score, prevalence score, alignment score, Mash score, sample type prevalence (one score per sample type, separated by slashes), and sample type coverage (one score per sample type, separated by slashes). A higher score indicates that the species is more likely to be a contaminant.
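A file with this layout could be read with a short Python sketch like the one below. It assumes that the fields on each line are tab-separated and appear in the order listed above, which the description does not state explicitly; the function name, the header labels, and the example values are hypothetical placeholders, not real Squeegee output.

```python
def parse_predictions(text):
    """Parse text laid out like final_predictions.txt.

    Assumes tab-separated fields in the order described above
    (an assumption; check it against a real output file).
    """
    lines = [line for line in text.splitlines() if line.strip()]
    sample_types = lines[0].strip().split("/")  # line 1: sample types
    records = []
    for line in lines[2:]:  # line 2 is the header, so skip it
        tax_id, name, combined, prevalence, align, mash, type_prev, type_cov = (
            line.split("\t")
        )
        records.append({
            "tax_id": tax_id,
            "name": name,
            "combined_score": float(combined),
            "prevalence_score": float(prevalence),
            "align_score": float(align),
            "mash_score": float(mash),
            # Slash-separated per-type values are paired with the sample types.
            "type_prevalence": dict(zip(sample_types, map(float, type_prev.split("/")))),
            "type_coverage": dict(zip(sample_types, map(float, type_cov.split("/")))),
        })
    return sample_types, records


# Placeholder content for illustration only; values are not real results.
example = (
    "soil/water/skin\n"
    "tax_id\tname\tcombined\tprevalence\talign\tmash\ttype_prevalence\ttype_coverage\n"
    "0\tExample taxon\t0.9\t1.0\t0.8\t0.7\t1.0/1.0/1.0\t0.5/0.25/0.75\n"
)
sample_types, records = parse_predictions(example)
print(sample_types)
print(records[0]["type_coverage"])
```

Keying the per-type scores by sample type makes it easier to see which sample types drive a prediction.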
Testing Squeegee
To test the software, please download the toy input, which includes 3 simulated metagenomic samples, the metadata file, and the expected output, from here. Once the download is complete, put the input_data directory under a directory called demo (to match the metadata), then run the following command:
To run the software with multiple threads (for example, 20 threads), use the following command:
Because the toy dataset contains only a limited number of simulated Illumina paired-end reads, the min-align parameter is set to 0.005, which is much lower than the default setting. The software should output multiple files, including the Kraken classification results and reports, alignment files, and several text reports. At the end, Squeegee outputs a file named final_predictions.txt, indicating that Methylobacterium sp. 17Sr1-43 is predicted as a contaminant species in this dataset. Processing this demo dataset takes about 15 minutes with a single thread.