This repository is home to CALITAS, a CRISPR-Cas-aware ALigner for In silico off-TArget Search. CALITAS implements a customized gapped alignment of guide sequences to genomes and other reference sequences, returning consistent and non-redundant alignments.
Overview
CALITAS is a suite of bioinformatic tools for enumerating candidate off-target sites for CRISPR guide sequences and standardizing their alignment. It features tools for searching an entire genome for candidate off-target sites, as well as for aligning guides to specific sequences or locations in a genome.
Key features of CALITAS include:
Detection of all candidate off-target sites up to the requested number of mismatches and gaps
Elimination of redundant alignments resulting in a single best or canonical alignment per locus
Searches with multiple PAM sequences and/or PAM-less searching
Integration of known variants in VCF format into genome-wide off-target searches
Customizable scoring system for weighting mismatches vs. gaps and differences in the protospacer vs. PAM
CALITAS is not intended to predict active off-target sites, but rather to enumerate candidate off-target sites for further investigation.
Getting CALITAS
Releases
Binary releases of CALITAS are available from the releases page on GitHub. The downloadable JAR files contain CALITAS and all dependencies, and need only a Java RunTime version 8 or higher installed. Once downloaded CALITAS can be run as follows to produce usage information:
java -Xmx8g -jar calitas.jar
Bioconda
Relases of CALITAS are also available via Bioconda. To use these releases you will first need to install the conda package manager - Miniconda is recommended if you do not already use conda. CALITAS can then be installed with:
conda install -c bioconda calitas
The conda release comes with a small helper script and can be run simply as calitas.
Building from Source
Both release and development versions of the code can be built from source. Builds are performed with sbt. Once sbt is installed run:
sbt clean assembly
This will build a JAR equivalent to a release JAR at calitas/target/scala-2.12/calitas.jar
Publication
A CALITAS manuscript is currently in review. This section will be updated with a reference to the publication when it is available.
Usage
CALITAS has four available sub-commands:
AlignToReference performs glocal alignment of query sequence to a window on the reference
PairwiseAlignSequences performs pairwise alignment of sequences
PrepareVcf prepares a VCF for optimal use by SearchReference
SearchReference searches a FASTA file for alignments of a guide+PAM(s)
It should be noted that commands which use a genome or reference FASTA file require that the FASTA file have both an index and a sequence dictionary. These can be generated using samtools as follows:
The following is an example of invoking SearchReference to find candidate off target sequences in the HG38 genome for a single guide and PAM (note that location and sequence of the PAM is indicated by providing guide sequence in upper case and PAM sequence in lower case):
The last three parameters are optional and replicate the defaults. Additional options are available; detailed usage including all available parameters can be obtained by running calitas SearchReference.
The following is an example of running AlignToReference to produce standarized alignments at locations where guide(s) are known to align. The invocation will produce the single best alignment per query sequence and target location:
With the following being an example of the tab-delimited input file for AlignToReference:
id query chrom position
1 CTTGCCCCACAGGGCAGTAAnrg chr1 13358
2 CTTGCCCCACAGGGCAGTAAnrg chr1 510578
3 CTTGCCCCACAGGGCAGTAAnrg chr1 844033
The output of both SearchReference and AlignToReference is a tab-delimited text file with one row per candidate off-target site including the following columns:
column name
description
guide_id
Name/ID of guide.
unpadded_guide_sequence
The sequence of the guide used, unpadded.
genome_build
The assembly name of the searched genome (e.g. HG38).
chromosome
Chromosome for target sequence alignment (eg: chr3).
coordinate_start
Start of the unpadded target sequence in the genome, 0-based open ended, excluding PAM.
coordinate_end
End of the unpadded target sequence in the genome, 0-based open ended, excluding PAM.
strand
Either + or -. The reported strand is the strand of the genome which matches the guide sequence. E.g. if strand is reported as + this means the guide resembles the sequence on the top strand of the genome, and will bind to the bottom strand of the genome.
unpadded_target_sequence
The unpadded target sequence (as DNA) as found in the genome, without gaps/bulges, excluding PAM. Reported sequence matches the reported strand (i.e. - strand hits will report the reverse complement of the genomic sequence).
ten_bases_5_prime
The 10 bases from the reference genome immediately 5’ of the off-target location (coordinate_start/coordinate_end), respecting strand.
ten_bases_3_prime
The 10 bases from the reference genome immediately 3’ of the off-target location (coordinate_start/coordinate_end), respecting strand.
pam_used
PAM used in the alignment (eg: nrg).
variant_id
When searching using a VCF, a semi-colon separated list of variant IDs (e.g. rs1234;rs2345) that have non-reference alleles present in the off-target alignment. May be empty when no variants are present.
variant_description
When searching using a VCF, a semi-colon separated list of variant descriptions in the format id:pos:ref>alt:af where pos is the position within the alignment and af is the allele frequency of the alternate allele.
variant_vcf
When searching using a VCF, a string composed of filname of the VCF followed by a colon (:) and then the MD5 of the VCF.
allele_frequency
When searching using a VCF, the minimum allele frequency of any variant included in the target alignment.
score
Alignment score (including PAM).
guide_mm
Mismatches in the guide region (excluding PAM).
guide_gaps
Gaps in the guide region (excluding PAM).
guide_mm_plus_gaps
Total gaps and mismatches in the guide region (excluding PAM).
pam_mm
Mismatches in the PAM region.
total_mm_plus_gaps
Total count of mismatches and gaps across the both guide and PAM regions.
padded_guide
Guide + PAM sequence with padding for mismatches and bulges/gaps.
padded_alignment
Visual representation of the guide-target alignment: `
padded_target
Target sequence, including PAM, with padding for mismatches and bulges/gaps.
padded_extra_8_bases_5_prime
an additional 8 bases on the 5’ side of the padded_target.
padded_extra_8_bases_3_prime
an additional 8 bases on the 3’ side of the padded_target.
cigar
Cigar representation of guide sequence alignment.
unpadded_guide_sequence_length
Length of guide sequence not including PAM, nor gaps/bulges.
unpadded_target_sequence_length
Length of target sequence not including PAM, nor gaps/bulges.
aligner
Aligner name, e.g. CALITAS:SearchReference.
aligner_version
Version number of the CALITAS software used to produce the alignments.
aligner_search_pam
Comma-separated list of pams used during the search.
aligner_other_parameters
A semicolon-separated list of parameters provided when running CALITAS.
time_stamp
A date and time stamp for when the alignment run was started (UTC) in this format: Wed Jan 6 16:58:29 UTC 2021.
Note on coordinates produced by CALITAS
All genome coordinates in CALITAS output files are 0-based open ended, the same systemm used in BED files. I.e. the first base of a chromosome or sequence is represented by 0, and when describing a region on chromosome we specify the first base included in the interval as the start and the base after the interval as the end. E.g. the first 10bp on a chromosome would be represented as 0-10.
CALITAS
This repository is home to CALITAS, a CRISPR-Cas-aware ALigner for In silico off-TArget Search. CALITAS implements a customized gapped alignment of guide sequences to genomes and other reference sequences, returning consistent and non-redundant alignments.
Overview
CALITAS is a suite of bioinformatic tools for enumerating candidate off-target sites for CRISPR guide sequences and standardizing their alignment. It features tools for searching an entire genome for candidate off-target sites, as well as for aligning guides to specific sequences or locations in a genome.
Key features of CALITAS include:
CALITAS is not intended to predict active off-target sites, but rather to enumerate candidate off-target sites for further investigation.
Getting CALITAS
Releases
Binary releases of CALITAS are available from the releases page on GitHub. The downloadable JAR files contain CALITAS and all dependencies, and need only a Java RunTime version 8 or higher installed. Once downloaded CALITAS can be run as follows to produce usage information:
Bioconda
Relases of CALITAS are also available via Bioconda. To use these releases you will first need to install the conda package manager - Miniconda is recommended if you do not already use conda. CALITAS can then be installed with:
The conda release comes with a small helper script and can be run simply as
calitas.Building from Source
Both release and development versions of the code can be built from source. Builds are performed with sbt. Once sbt is installed run:
This will build a JAR equivalent to a release JAR at
calitas/target/scala-2.12/calitas.jarPublication
A CALITAS manuscript is currently in review. This section will be updated with a reference to the publication when it is available.
Usage
CALITAS has four available sub-commands:
It should be noted that commands which use a genome or reference FASTA file require that the FASTA file have both an index and a sequence dictionary. These can be generated using samtools as follows:
The following is an example of invoking
SearchReferenceto find candidate off target sequences in the HG38 genome for a single guide and PAM (note that location and sequence of the PAM is indicated by providing guide sequence in upper case and PAM sequence in lower case):The last three parameters are optional and replicate the defaults. Additional options are available; detailed usage including all available parameters can be obtained by running
calitas SearchReference.The following is an example of running
AlignToReferenceto produce standarized alignments at locations where guide(s) are known to align. The invocation will produce the single best alignment per query sequence and target location:With the following being an example of the tab-delimited input file for
AlignToReference:The output of both
SearchReferenceandAlignToReferenceis a tab-delimited text file with one row per candidate off-target site including the following columns:guide_idunpadded_guide_sequencegenome_buildHG38).chromosomechr3).coordinate_startcoordinate_endstrand+or-. The reported strand is the strand of the genome which matches the guide sequence. E.g. if strand is reported as+this means the guide resembles the sequence on the top strand of the genome, and will bind to the bottom strand of the genome.unpadded_target_sequencestrand(i.e.-strand hits will report the reverse complement of the genomic sequence).ten_bases_5_primecoordinate_start/coordinate_end), respectingstrand.ten_bases_3_primecoordinate_start/coordinate_end), respectingstrand.pam_usednrg).variant_idrs1234;rs2345) that have non-reference alleles present in the off-target alignment. May be empty when no variants are present.variant_descriptionid:pos:ref>alt:afwhere pos is the position within the alignment and af is the allele frequency of the alternate allele.variant_vcf:) and then the MD5 of the VCF.allele_frequencyscoreguide_mmguide_gapsguide_mm_plus_gapspam_mmtotal_mm_plus_gapspadded_guidepadded_alignmentpadded_targetpadded_extra_8_bases_5_primepadded_extra_8_bases_3_primecigarunpadded_guide_sequence_lengthunpadded_target_sequence_lengthalignerCALITAS:SearchReference.aligner_versionaligner_search_pamaligner_other_parameterstime_stampNote on coordinates produced by CALITAS
All genome coordinates in CALITAS output files are 0-based open ended, the same systemm used in BED files. I.e. the first base of a chromosome or sequence is represented by 0, and when describing a region on chromosome we specify the first base included in the interval as the start and the base after the interval as the end. E.g. the first 10bp on a chromosome would be represented as
0-10.