You may wonder why this tool even exists. Well, I tried to do the right
thing and use established tools like readseq and seqret from EMBOSS, but
they both mangled IDs containing | or . characters, and
there is no way to fix this behaviour. This resulted in inconsistencies
between the .gbk and .fna versions of files in my pipelines.
Then you may wonder why I didn’t use Bioperl or Biopython. Well, they are
heavyweight libraries, and actually very slow at parsing Genbank files.
This script uses only core Perl modules, has no other dependencies, and
runs very quickly.
It supports the following input formats:
Genbank flat file, typically .gb, .gbk, .gbff (starts with LOCUS)
EMBL flat file, typically .embl (starts with ID)
GFF with sequence, typically .gff, .gff3 (starts with ##gff)
FASTA DNA, typically .fasta, .fa, .fna, .ffn (starts with >)
FASTQ DNA, typically .fastq, .fq (starts with @)
CLUSTAL alignments, typically .clw, .clu (starts with CLUSTAL or MUSCLE)
STOCKHOLM alignments, typically .sth (starts with # STOCKHOLM)
GFA assembly graph, typically .gfa (starts with ^[A-Z]\t)
PDB protein data bank structure, typically .pdb (starts with ^HEADER)
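The "starts with" signatures in the list above suggest that a format can be guessed from the first line of a file alone. The following is a minimal sketch of that idea, not the tool's actual code: the function name sniff_format and the labels it prints are hypothetical, and it assumes the first line has already been read from an uncompressed file.

```shell
# Hypothetical helper: guess a sequence format from a file's first line.
# The patterns mirror the "starts with" signatures in the list above;
# this is an illustration, NOT code taken from any2fasta itself.
sniff_format() {
  tab=$(printf '\t')   # literal tab, used for the GFA signature
  case "$1" in
    LOCUS*)              echo genbank ;;
    ID*)                 echo embl ;;
    '##gff'*)            echo gff ;;
    '>'*)                echo fasta ;;
    '@'*)                echo fastq ;;
    CLUSTAL*|MUSCLE*)    echo clustal ;;
    '# STOCKHOLM'*)      echo stockholm ;;
    HEADER*)             echo pdb ;;
    [[:upper:]]"$tab"*)  echo gfa ;;      # one capital letter, then a tab
    *)                   echo unknown ;;
  esac
}
```

For an uncompressed file, calling sniff_format "$(head -n 1 ref.gbk)" would print genbank. A compressed file would have to be decompressed before its first line could be inspected this way.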
Installation
any2fasta has no dependencies except Perl 5.10
or higher. It only uses core modules, so no CPAN needed.
Conda
% conda install -c bioconda any2fasta
Direct script download
% cd /usr/local/bin # choose a folder in your $PATH
% wget https://raw.githubusercontent.com/tseemann/any2fasta/master/any2fasta
% chmod +x any2fasta
Github
% git clone https://github.com/tseemann/any2fasta.git
% cp any2fasta/any2fasta $HOME/.local/bin # choose a folder in your $PATH
Test Installation
Simple check
% ./any2fasta -v
any2fasta 1.0.2
% ./any2fasta -h
NAME
any2fasta 1.0.2
SYNOPSIS
Convert various sequence formats into FASTA
USAGE
any2fasta [options] file.{gb,fa,fq,gff,gfa,clw,sth}[.gz,bz2,zip] > output.fasta
OPTIONS
-h Print this help
-v Print version and exit
-q No output while running, only errors
-k Skip, don't die, on bad input files
-n Replace non-[AGTC] with 'N'
-l Lowercase the sequence
-u Uppercase the sequence
-g Include VERSION from GBK/EMBL files
-s Strip sequence descriptions (FASTA,FASTQ)
END
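As a rough picture of what the -n, -l, -u and -s options in the help text do to a FASTA record, the filters below imitate each one with awk. This is only a sketch under stated assumptions: the function names are hypothetical, lowercase a, c, g, t are assumed to survive -n unchanged, and headers are assumed to be untouched by the case options. None of it is taken from the any2fasta source.

```shell
# Hypothetical stand-ins for -n, -l, -u and -s, written as stdin-to-stdout
# filters over FASTA text. These are NOT taken from the any2fasta source.

# -n: replace anything that is not A, C, G, T or a '-' gap with N.
# Assumption: lowercase a, c, g, t are also kept as-is.
mask_non_acgt() {
  awk '/^>/ { print; next } { gsub(/[^ACGTacgt-]/, "N"); print }'
}

# -l / -u: change the case of sequence lines only (assumption: headers
# are left alone).
lowercase_seq() {
  awk '/^>/ { print; next } { print tolower($0) }'
}
uppercase_seq() {
  awk '/^>/ { print; next } { print toupper($0) }'
}

# -s: keep only the ID, dropping everything after the first whitespace.
strip_desc() {
  awk '/^>/ { sub(/[ \t].*$/, ""); print; next } { print }'
}
```

The filters can be chained in a pipeline, for example strip_desc followed by uppercase_seq, in the same way the real options can be combined on one command line.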
Extensive test
% bats $(dirname $(which any2fasta))/test/test.sh
✓ Script syntax check
✓ Version
...
✓ Multiple sequence with one bad one
✓ Allow skipping over bad files
29 tests, 0 failures, 2 skipped
Examples
% any2fasta ref.gbk > ref.fna
% any2fasta in.fasta > out.fasta # should behave like "cat"
% any2fasta prokka.gff > prokka.fna # only if GFF has FASTA appended
% any2fasta - < file.gb > file.fasta # '-' means stdin
% any2fasta genes.gff.gz > genes.ffn # automatically decompresses
% any2fasta 1.gb 2.fa.gz 3.gff.bz2 - > out.fa # multiple files and stdin
% any2fasta R1.fq.gz | bzip2 > R1.fa.bz2 # 'seqtk seq -A' is much faster
% any2fasta -q 23S.clw > 23S.aln # gaps '-' will be preserved
% any2fasta pfam4321.sth > pfam4321.aln # '.' gaps will become '-'
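Several of the examples above rely on compressed input being decompressed automatically. One common way to recognise a compressed file without trusting its extension is to look at its first few bytes. The sketch below shows that idea for the three formats named in the usage line. The function name detect_compression is hypothetical, and this is not a description of how any2fasta does it internally.

```shell
# Hypothetical helper: identify gzip, bzip2 or zip by their leading
# "magic" bytes. An illustration only, not any2fasta's implementation.
detect_compression() {
  # First four bytes of the file as a hex string, e.g. "1f8b0800"
  magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
  case "$magic" in
    1f8b*)     echo gzip ;;   # gzip:  0x1f 0x8b
    425a68*)   echo bzip2 ;;  # bzip2: "BZh"
    504b0304)  echo zip ;;    # zip:   "PK" 0x03 0x04 (local file header)
    *)         echo none ;;
  esac
}
```

Checking content rather than the file name means a gzip file that has lost its .gz suffix is still recognised.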
Options
-n replaces any characters that aren’t A,C,G,T with N (gaps preserved)
-l will lowercase all the letters
-u will uppercase all the letters
-q will prevent logging messages being printed
-k will warn about bad inputs and carry on, rather than stopping with an error
-g will append the version to the sequence ID
-s removes desc from >id desc in FASTA, FASTQ, GFF
Input files may be compressed with .gz, .bz2, .zip, plus any other formats supported by your installed version of Perl’s IO::Uncompress::AnyUncompress module.
Issues
Submit feedback to the Issue Tracker
License
GPL v3
Author
Torsten Seemann