Computer Science and Engineering Department, University of California, Riverside, CA 92521
Botany and Plant Sciences Department, University of California, Riverside, CA 92521
Publications
Please cite the following peer-reviewed publications when using CLARK:
Ounit, R., Wanamaker, S., Close, T. J., & Lonardi, S. (2015). CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics, 16(1), 236.
Ounit, R., & Lonardi, S. (2016). Higher classification sensitivity of short metagenomic reads with CLARK-S. Bioinformatics, 32(24), 3823-3825.
About
CLARK (CLAssifier based on Reduced K-mers) addresses the computational challenges of DNA sequence classification in molecular biology, genomics, metagenomics, and genetics. It uses discriminative k-mers for supervised sequence classification, providing a confidence score for each assignment. CLARK supports three classifiers:
CLARK (default): Designed for powerful workstations and large databases, using exact k-mer matching.
CLARK-l: A memory-efficient version for small metagenomes and limited-memory workstations, also using exact k-mer matching.
CLARK-S: Uses spaced k-mers for higher sensitivity, requiring more RAM.
General Notes
CLARK was introduced at the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB) in 2014.
CLARK-S was presented at the 14th Workshop on Algorithms in Bioinformatics in 2015.
License
CLARK is distributed under the GNU General Public License (GPL) v3. It is free software, available without any warranty. See the GNU General Public License for more details: http://www.gnu.org/licenses/
Releases
Version 1.3.0 (May 16, 2024)
Fixed error when processing bacterial/viral genomes during database construction.
Update reference for human genome.
Extend list of references to download for plasmid/plastids.
Code improvements and refactoring included.
Previous Versions
Version 1.2.6.1 (May 11, 2019): Various improvements and bug fixes.
Version 1.2.6 (February 21, 2019): New features and performance enhancements.
Version 1.2.5.1 (September 20, 2018): Added script for k-mer distribution and support for long reads.
Version 1.2.5 (February 13, 2018): Extended fungi database and added plasmid, plastid, and protozoa databases.
Version 1.2.4 (November 25, 2017): Introduced fungi database for hospital-acquired infections.
Version 1.2.3.2 (August 21, 2017): Improved summary script and support for paired-end reads in Fasta format.
Version 1.2.3.1 (February 27, 2017): Added summary script and updated contact information.
Version 1.2.3 (April 7, 2016): Finalized CLARK-S and added scripts for custom databases.
Version 1.2.2-b (December 11, 2015): Improved speed and RAM usage for full and spaced modes.
Version 1.2.1-b (June 15, 2015): Introduced discriminative spaced k-mers.
Version 1.1.3 (June 3, 2015): Extended features and simplified output.
Version 1.1.2 (April 22, 2015): Added scripts for abundance estimation and gzip support.
Version 1.1.1 (April 15, 2015): Improved efficiency for loading databases.
Version 1.1 (February 20, 2015): Improved storage efficiency and multithreaded database loading.
Version 1.0 (September 1, 2014): Initial release.
Compatibility Between Versions
Databases created with version 1.0 are not compatible with v1.1 or newer. Rebuild the database using v1.1 or newer.
Software & System Requirements
External Tools & C++ Compiler Version
CLARK requires a 64-bit OS (Linux or Mac) and GNU GCC 4.4 or higher.
Multithreading operations use the OpenMP libraries.
Memory Requirements
CLARK can be RAM-intensive. Detailed memory usage for different versions and configurations is provided below:
CLARK (default): 58 GB RAM for loading the bacterial database, 156 GB for building it.
CLARK-l: Designed for 4 GB RAM laptops, suitable for small metagenomes.
CLARK-S: Up to 101 GB RAM for spaced k-mer classification.
Filter by abundance (e.g., >2%): ./estimate_abundance.sh -F ./result.csv -D <DIR_DB/> -a 2
Results Format
Full mode:<Object_ID>,<hit count in target 1>,...,<hit count in target N>,<Length of object>,<Gamma>,<first assignment>,<hit count of first>,<second assignment>,<hit count of second>,<confidence score>
Default/Express mode:<Object_ID>,<Length of object>,<1st_assignment>
This README provides a comprehensive guide to installing, using, and understanding CLARK. It ensures clarity and ease of use for all users. For more detailed information and examples, please refer to the full README file included with the CLARK distribution.
CLARK: Fast and Accurate Sequence Classifier
Authors
Publications
Please cite the following peer-reviewed publications when using CLARK:
About
CLARK (CLAssifier based on Reduced K-mers) addresses the computational challenges of DNA sequence classification in molecular biology, genomics, metagenomics, and genetics. It uses discriminative k-mers for supervised sequence classification, providing a confidence score for each assignment. CLARK supports three classifiers:
General Notes
License
CLARK is distributed under the GNU General Public License (GPL) v3. It is free software, available without any warranty. See the GNU General Public License for more details: http://www.gnu.org/licenses/
Releases
Version 1.3.0 (May 16, 2024)
Previous Versions
Compatibility Between Versions
Software & System Requirements
External Tools & C++ Compiler Version
Memory Requirements
CLARK can be RAM-intensive. Detailed memory usage for different versions and configurations is provided below:
Installation
tar -xvf CLARKV1.3.0.tar.gz./install.shUsage
Scripts
Basic Example
./set_targets.sh <DIR_DB/> bacteria./classify_metagenome.sh -O ./sample.fa -R ./resultHow to Choose the K-mer Length
Manual & Options
A detailed description of all command-line options and parameters is provided below.
Definitions of Parameters
Examples
./CLARK -k 31 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 0./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -O ./objects.fa -R ./results -m 2 -n 12./CLARK -k 20 -T ./targets_addresses.txt -D ./DBD/ -P ./sample1.fastq ./sample2.fastq -R paired.resultsClassification of Metagenomic Samples
Scripts provided for metagenomic classification:
Setting Targets
<DIR_DB/>)../set_targets.sh <DIR_DB/> bacteria./set_targets.sh <DIR_DB/> bacteria viruses humancustom`
Running the Classification
Example commands:
./classify_metagenome.sh -O ./sample.fa -R ./result./classify_metagenome.sh -O ./sample.fa -R ./result -k 20./classify_metagenome.sh -O ./sample.fa -R ./result -m 0./classify_metagenome.sh -O ./samples.txt -R ./samples.txt -m 0./classify_metagenome.sh -O ./samples.R.txt ./samples.L.txt -R ./samples.R.txt -m 0./classify_metagenome.sh -O ./sample.fa -R ./result -m 2 -n 8./classify_metagenome.sh -O ./sample.fa.gz -R ./result -m 0 -n 8 --gzipped./classify_metagenome.sh -P ./sample1.fastq ./sample2.fastq -R ./result --light./classify_metagenome.sh -P ./sample1.fastq ./sample2.fastq -R ./result --spaced./classify_metagenome.sh -O ./sample.fa -R ./result -m 0 -n 8 --spaced./classify_metagenome.sh -O ./sample.fa.gz -R ./result -m 2 -n 8 --spaced./classify_metagenome.sh -O ./sample.fa -R ./result --spaced -s 2Abundance Estimation
Example commands:
./estimate_abundance.sh -F ./result.csv -D <DIR_DB/>./estimate_abundance.sh -F ./result.csv -D <DIR_DB/> --highconfidence./estimate_abundance.sh -F ./result.csv -D <DIR_DB/> -c 0.80./estimate_abundance.sh -F ./result.csv -D <DIR_DB/> -g 0.03./estimate_abundance.sh -F ./result.csv -D <DIR_DB/> -g 0.03 --mpa./estimate_abundance.sh -F ./result.csv -D <DIR_DB/> -a 2Results Format
<Object_ID>,<hit count in target 1>,...,<hit count in target N>,<Length of object>,<Gamma>,<first assignment>,<hit count of first>,<second assignment>,<hit count of second>,<confidence score><Object_ID>,<Length of object>,<1st_assignment>COMMUNITY OF CLARK USERS
For discussions, suggestions, or questions please join the Google groups of the CLARK users at: https://groups.google.com/g/clarkusers
This README provides a comprehensive guide to installing, using, and understanding CLARK. It ensures clarity and ease of use for all users. For more detailed information and examples, please refer to the full README file included with the CLARK distribution.