assembly-scan reads an assembly in FASTA format and outputs summary statistics
in TSV or JSON format
assembly-scan
I wanted a quick method to output simple summary statistics of an input assembly
in TSV or JSON format. There are alternatives, including assemblathon-stats.pl
and assembly-stats, but they didn’t output what I wanted. Thus assembly-scan was
born.
While I will always recommend using the Bioconda installation, the only dependency
assembly-scan has is Python >=3.7. So, if you have that already you can use the
script directly.
git clone git@github.com:rpetit3/assembly-scan.git
cd assembly-scan
python3 bin/assembly-scan YOUR_ASSEMBLY.fasta
From there you can decide to add it to your PATH or not. But, again, I recommend
just going the Bioconda route.
Usage
assembly-scan requires an assembly, gzip compressed or uncompressed, as input.
Usage
usage: assembly-scan [-h] [--json] [--transpose] [--prefix PREFIX] [--version] ASSEMBLY

Generate statistics for a given assembly.

positional arguments:
  ASSEMBLY         FASTA file to read (gzip or uncompressed)

options:
  -h, --help       show this help message and exit
  --json           Print output in a JSON format
  --transpose      Print output in a transposed tab-delimited format
  --prefix PREFIX  ID to use for output (Default: basename of assembly)
  --version        show program's version number and exit
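For readers curious how a command line like this maps onto Python, here is a minimal argparse sketch that mirrors the help text above. It is not the tool's actual source: the function names build_parser and resolve_prefix are made up for illustration, the version string is a placeholder, and whether the real tool keeps the file extension in the default prefix is an assumption.

```python
import argparse
import os


def build_parser():
    """Build a parser that mirrors the help text above (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="assembly-scan",
        description="Generate statistics for a given assembly.",
    )
    parser.add_argument("assembly", metavar="ASSEMBLY",
                        help="FASTA file to read (gzip or uncompressed)")
    parser.add_argument("--json", action="store_true",
                        help="Print output in a JSON format")
    parser.add_argument("--transpose", action="store_true",
                        help="Print output in a transposed tab-delimited format")
    parser.add_argument("--prefix", metavar="PREFIX", default=None,
                        help="ID to use for output (Default: basename of assembly)")
    # Placeholder version string; the real version is not stated here.
    parser.add_argument("--version", action="version", version="%(prog)s 0.0.0")
    return parser


def resolve_prefix(args):
    """Fall back to the assembly's basename when --prefix is not given.

    Assumption: the basename is used as-is, extension included.
    """
    return args.prefix if args.prefix else os.path.basename(args.assembly)
```

With this sketch, parsing `--prefix sample1 my.fasta.gz` yields the prefix `sample1`, while omitting `--prefix` falls back to `my.fasta.gz`.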
Example Usage
Many FASTA files are available in the test directory. These include an uncompressed
complete phiX174 genome and a compressed Staphylococcus aureus assembly. This script
reads the input and outputs summary statistics in tab-delimited format to STDOUT.
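Because the results go to STDOUT, the tool is easy to call from another script. The sketch below is one way that might look, assuming assembly-scan is on your PATH and that --json prints a single JSON document to STDOUT (the second point is an assumption, not something confirmed here). The helper names build_command and scan_assembly are hypothetical.

```python
import json
import subprocess


def build_command(assembly, prefix=None, executable="assembly-scan"):
    """Assemble the argument list using only flags shown in the usage text."""
    cmd = [executable, "--json"]
    if prefix is not None:
        cmd += ["--prefix", prefix]
    cmd.append(assembly)
    return cmd


def scan_assembly(assembly, prefix=None):
    """Run the tool and parse its STDOUT.

    Assumption: --json emits one JSON document on STDOUT.
    """
    result = subprocess.run(
        build_command(assembly, prefix),
        check=True,           # raise if the tool exits with an error
        capture_output=True,  # collect STDOUT instead of printing it
        text=True,
    )
    return json.loads(result.stdout)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems with file names that contain spaces.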
Uncompressed
By default, assembly-scan outputs the results in tab-delimited format. For this
example the --transpose option has been used, because the transposed output is
easier to look at in the README.
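To make the difference between the two layouts concrete, here is a small sketch of what transposing means: a header row and a value row become one "name, tab, value" line per column. This is an illustration only, not the tool's code. The function name transpose_tsv is hypothetical, and the values in the usage note are dummy numbers, not real results.

```python
from typing import Sequence


def transpose_tsv(header: Sequence[str], values: Sequence[str]) -> str:
    """Turn one header row and one value row into one line per column."""
    if len(header) != len(values):
        raise ValueError("header and values must have the same length")
    return "\n".join(f"{name}\t{value}" for name, value in zip(header, values))
```

For example, `transpose_tsv(["contigs_greater_1k", "contig_percent_n"], ["3", "0.00"])` returns two lines, each holding a column name and its (dummy) value separated by a tab.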
Output Columns
Number of contigs with non-A,T,G,C, or N characters
contig_percent_a
Percent of A nucleotides in contigs
contig_percent_c
Percent of C nucleotides in contigs
contig_percent_g
Percent of G nucleotides in contigs
contig_percent_t
Percent of T nucleotides in contigs
contig_percent_n
Percent of N nucleotides in contigs
contig_non_acgtn
Percent of non-A,T,G,C, or N nucleotides in contigs
contigs_greater_1m
Number of contigs greater than 1,000,000 bp
contigs_greater_100k
Number of contigs greater than 100,000 bp
contigs_greater_10k
Number of contigs greater than 10,000 bp
contigs_greater_1k
Number of contigs greater than 1,000 bp
percent_contigs_greater_1m
Percent of contigs greater than 1,000,000 bp
percent_contigs_greater_100k
Percent of contigs greater than 100,000 bp
percent_contigs_greater_10k
Percent of contigs greater than 10,000 bp
percent_contigs_greater_1k
Percent of contigs greater than 1,000 bp
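The composition and size columns above can be sketched in a few lines of Python. This is a rough illustration of the definitions, not assembly-scan's implementation. It assumes that "greater than" means strictly greater, that lowercase bases count the same as uppercase, and that percentages are unrounded values on a 0 to 100 scale. The function name composition_and_size_stats is hypothetical.

```python
from collections import Counter

# Size cutoffs in bp, keyed by the suffix used in the column names.
THRESHOLDS = {"1m": 1_000_000, "100k": 100_000, "10k": 10_000, "1k": 1_000}


def composition_and_size_stats(contigs):
    """Compute the percent and threshold columns for a list of contig strings."""
    total_contigs = len(contigs)
    total_bases = sum(len(seq) for seq in contigs)
    if total_contigs == 0 or total_bases == 0:
        raise ValueError("need at least one non-empty contig")

    # Count every character across all contigs (assumption: case-insensitive).
    counts = Counter()
    for seq in contigs:
        counts.update(seq.upper())

    stats = {}
    acgtn = 0
    for base in "ACGTN":
        acgtn += counts[base]
        stats[f"contig_percent_{base.lower()}"] = 100.0 * counts[base] / total_bases
    # Everything that is not A, C, G, T, or N.
    stats["contig_non_acgtn"] = 100.0 * (total_bases - acgtn) / total_bases

    for label, cutoff in THRESHOLDS.items():
        n = sum(1 for seq in contigs if len(seq) > cutoff)  # strictly greater
        stats[f"contigs_greater_{label}"] = n
        stats[f"percent_contigs_greater_{label}"] = 100.0 * n / total_contigs
    return stats
```

Note the two different denominators: the nucleotide percentages are relative to the total number of bases, while the percent_contigs_greater_* columns are relative to the total number of contigs.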
Naming
Originally this was named assembly-stats, but after a quick Google search (which I
didn’t do first; again, I really should do better!) I found another assembly-stats
from Sanger Pathogens. So I decided to rename it to assembly-scan, similar to my
fastq-scan tool, since this process is similar to the Scan ability found in some
video games, movies, TV, etc. In other words, it ‘scans’ an assembly and provides
the user with otherwise hidden information about the assembly.
Installation
Bioconda
assembly-scan is available on Bioconda.
gzip Compressed
assembly-scan includes a simple check (.gz extension) for gzip compressed assemblies. The example for this case also demonstrates the --json option output.
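An extension-based check like this is straightforward to write. The sketch below shows one way it could be done with Python's standard gzip module. It is an assumption about the general approach, not a copy of the tool's code, and the helper names open_assembly and read_contig_lengths are hypothetical.

```python
import gzip


def open_assembly(path):
    """Open a FASTA file as text, using gzip when the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_contig_lengths(path):
    """Return the length of each contig, in file order."""
    lengths = []
    current = None  # length of the contig being read, None before the first header
    with open_assembly(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths
```

The trade-off of checking only the file name is that a gzip file without the .gz extension would be read as plain text. Inspecting the first two bytes of the file for the gzip magic number is a common alternative.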