Seqstr (pronounced as seq-string) is a lightweight tool to compile simple string input into long genomic sequences. It is designed to provide a concise and flexible way to specify long genomic sequences that can be used for downstream analysis. For example, it can be used by web servers to avoid transferring long genomic sequences. Seqstr is also a format specification, which can be implemented in different languages. We also provide a test suite for verifying an implementation.
Seqstr input can combine genomic interval coordinates, raw nucleotide sequences, and mutation specifications. A single sequence can be flexibly composed by concatenating multiple subsequences (for example, [hg38]chr7:5530575-5530625 -, @chr7 5530575 C T, @chr7 5530576 GC A;TTAAccggGGNaa;[hg38]chrX:1000000-1000017 +;TTAA;). Seqstr can also be used to express multiple named sequences for downstream applications.
Seqstr format
Genomic interval
Use [reference_genome]chr:start-end strand to specify a genomic interval. For example, [hg38]chr7:5530575-5530625 - extracts coordinates 5530575 up to, but not including, 5530625 from chromosome 7 of the hg38 reference genome and takes the reverse complement of that sequence.
[] specifies the reference genome (UCSC convention). If it is omitted, the default is hg38.
The chromosome name must be one of the chromosome names of the specified reference genome (UCSC convention) and is followed by :
Start and end coordinates are 0-based and connected by -. The start coordinate is inclusive and the end coordinate is exclusive.
strand takes only + or -; the default is +. It is separated from the end coordinate by a space.
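The interval rules above can be sketched as a small parser. This is only an illustration of the format, not the Seqstr implementation: the names Interval, INTERVAL_RE, and parse_interval are hypothetical, and the regular expression is one possible reading of the syntax.

```python
import re
from typing import NamedTuple


class Interval(NamedTuple):
    genome: str
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str


# Hypothetical pattern for "[genome]chr:start-end strand".
# The [genome] prefix and the strand are both optional.
INTERVAL_RE = re.compile(
    r"^\s*(?:\[(?P<genome>[^\]]+)\])?"
    r"(?P<chrom>[^:\s\[\]]+):(?P<start>\d+)-(?P<end>\d+)"
    r"(?:\s+(?P<strand>[+-]))?\s*$"
)


def parse_interval(text: str) -> Interval:
    """Parse one genomic interval, filling in the documented defaults."""
    m = INTERVAL_RE.match(text)
    if m is None:
        raise ValueError(f"not a genomic interval: {text!r}")
    return Interval(
        genome=m["genome"] or "hg38",   # default reference genome
        chrom=m["chrom"],
        start=int(m["start"]),
        end=int(m["end"]),
        strand=m["strand"] or "+",      # default strand
    )


iv = parse_interval("[hg38]chr7:5530575-5530625 -")
# Because the end coordinate is exclusive, the length is simply end - start.
print(iv, iv.end - iv.start)
```

With a half-open interval, the example covers 5530625 - 5530575 = 50 bases, and two adjacent intervals can share a boundary coordinate without overlapping.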
Composing and modifying sequences
To specify a sequence that differs from the reference genome sequence, Seqstr accepts input such as [hg38]chr7:5530575-5530625 -, @chr7 5530575 C T, @chr7 5530576 GC A;TTAAccggGGNaa;[hg38]chrX:1000000-1000017 +;TTAA;, which is explained below:
Compose subsequences
Seqstr can concatenate multiple subsequences connected by ;. Each subsequence can be either a genomic interval (for example, [hg38]chr7:5530575-5530625 -) or a raw sequence (for example, TTAAccggGGNaa). A directly specified sequence may contain any characters, such as special markings, and all of them are included in the output sequence. Because the purpose of Seqstr is to shorten the input, direct sequence specification is typically used in combination with a genomic interval.
; separates multiple sections of a sequence. The final output consists of all sections, in the order of the input string.
Raw sequence strings are allowed and appear unchanged in the output sequence.
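As a rough sketch of the composition rule, the following splits one Seqstr sequence into its sections and labels each as a genomic interval or a raw sequence. The function name split_sections and the pattern used to recognise an interval are assumptions made for illustration; the actual implementation may classify sections differently.

```python
import re

# Heuristic (an assumption): a section is a genomic interval if it starts
# with an optional [genome] followed by chrom:start-end.
INTERVAL_START = re.compile(r"^\s*(\[[^\]]+\])?[^:\s\[\]]+:\d+-\d+")


def split_sections(line):
    """Return (kind, text, modifiers) tuples in input order."""
    sections = []
    for part in line.split(";"):
        if not part.strip():
            continue  # a trailing ';' leaves an empty piece
        if INTERVAL_START.match(part):
            # Modifiers follow the interval, separated by ','.
            interval, *modifiers = [p.strip() for p in part.split(",")]
            sections.append(("interval", interval, modifiers))
        else:
            # Raw sequence text is kept exactly as written.
            sections.append(("raw", part, []))
    return sections


example = (
    "[hg38]chr7:5530575-5530625 -, @chr7 5530575 C T, @chr7 5530576 GC A;"
    "TTAAccggGGNaa;[hg38]chrX:1000000-1000017 +;TTAA;"
)
for section in split_sections(example):
    print(section)
```

The output sequence would then be the concatenation of the resolved sections in this same order.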
Sequence modifier
Any genomic interval can be modified by a mutation or variant specified with the syntax @chr position reference_allele alternative_allele. Multiple modifiers can be provided to introduce multiple mutations into the same sequence. Mutations are specified with respect to the original sequence coordinates and the + strand.
A mutation specification is separated from the genomic interval specification by ,
A mutation specification starts with @chromosome, followed by the coordinate at which reference_allele is changed to alternative_allele.
Multiple mutations are allowed and are separated by ,. Note that although a mutation may change the length of the sequence, all mutation coordinates are relative to the original reference genome sequence.
Overlapping mutations are not allowed.
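The modifier rules can be illustrated with a toy sequence. The sketch below makes several assumptions that the text does not settle: mutation positions are treated as 0-based like interval coordinates, a mismatching reference allele or a mutation outside the interval raises an error, and for a - strand interval the mutations are applied on the + strand before taking the reverse complement. The function names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement for A, C, G, T, N in either case."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def apply_mutations(ref_seq, start, mutations, strand="+"):
    """Apply (position, ref_allele, alt_allele) edits to ref_seq.

    ref_seq is the + strand sequence of an interval beginning at `start`.
    Every position refers to the original reference coordinates, so earlier
    edits never shift the coordinates of later ones.
    """
    pieces, cursor = [], 0  # cursor is an offset into ref_seq
    for pos, ref, alt in sorted(mutations):
        offset = pos - start
        if offset < 0 or offset + len(ref) > len(ref_seq):
            raise ValueError(f"mutation at {pos} falls outside the interval")
        if offset < cursor:
            raise ValueError("overlapping mutations are not allowed")
        if ref_seq[offset:offset + len(ref)].upper() != ref.upper():
            raise ValueError(f"reference allele mismatch at {pos}")
        pieces.append(ref_seq[cursor:offset])  # untouched reference bases
        pieces.append(alt)                     # replacement allele
        cursor = offset + len(ref)
    pieces.append(ref_seq[cursor:])
    mutated = "".join(pieces)
    return reverse_complement(mutated) if strand == "-" else mutated


# Toy 10-base "reference" starting at coordinate 100 (not real genome data).
toy = "CGCTACGTAC"
edits = [(100, "C", "T"), (101, "GC", "A")]
print(apply_mutations(toy, 100, edits))        # TATACGTAC
print(apply_mutations(toy, 100, edits, "-"))   # GTACGTATA
```

The second edit replaces two bases with one, so the result is one base shorter than the input, yet its position (101) is still given in the original coordinates.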
Multiple sequences
Seqstr supports specifying multiple sequences separated by \n or line breaks, and labeling each sequence with a name enclosed in <>. For example, an input of two lines that begin with <s1> and <s2> is parsed into an array of two sequences, with s1 and s2 as their names.
Seqstr interprets \n or a line break as the separator between sequences.
A string enclosed by <> at the beginning of a sequence is used as the name of that sequence. If no name is provided, Seqstr assigns i as the name by default, where i is the index of the sequence in input order, starting from 0.
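A minimal sketch of the naming rule is shown here. The function split_named_sequences is hypothetical, and two details are assumptions: a literal backslash-n in the input is treated the same as a real line break, and blank lines are skipped before default names are assigned.

```python
import re

# A name is the text inside <> at the very beginning of a sequence.
NAME_RE = re.compile(r"^\s*<([^>]*)>")


def split_named_sequences(text):
    """Return a list of (name, sequence_spec) pairs in input order."""
    text = text.replace("\\n", "\n")  # accept a literal "\n" as a separator
    lines = [ln for ln in text.splitlines() if ln.strip()]
    named = []
    for i, line in enumerate(lines):
        m = NAME_RE.match(line)
        if m:
            named.append((m.group(1), line[m.end():]))
        else:
            named.append((str(i), line))  # default name: index from 0
    return named


spec = "<s1>[hg38]chr7:5480600-5480620 -\n<s2>[hg38]chr7:44746680-44746700 +"
for name, body in split_named_sequences(spec):
    print(name, "->", body)
```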
Output format
The Python CLI outputs sequences in FASTA format.
The Python API outputs a list of SeqOutput objects, each holding a sequence name, a sequence, and an error message. For example, <s1>[hg38]chr7:5480600-5480620 -\n<s2>[hg38]chr7:44746680-44746700 + returns a list of two SeqOutput objects, one named s1 and one named s2.
We expect the Seqstr output to be an ordered list that can be accessed with an index. Each element contains a name and a sequence.
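To make the expected output shape concrete, the sketch below models an ordered list of named sequences and renders it as FASTA text. NamedSeq is a hypothetical stand-in and is not the SeqOutput class; the line width of 60 is likewise an assumption, since the text does not say how the CLI wraps lines.

```python
from typing import NamedTuple


class NamedSeq(NamedTuple):
    """Hypothetical record with the two fields every element must carry."""
    name: str
    seq: str


def to_fasta(records, width=60):
    """Render records as FASTA text, wrapping sequence lines at `width`."""
    lines = []
    for rec in records:
        lines.append(f">{rec.name}")
        for i in range(0, len(rec.seq), width):
            lines.append(rec.seq[i:i + width])
    return "\n".join(lines) + "\n"


records = [NamedSeq("s1", "ACGTACGT"), NamedSeq("s2", "TTAA")]
print(records[0].name, records[0].seq)  # elements are accessed by index
print(to_fasta(records, width=4), end="")
```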
Symbols and terms
Sequence : a complete individual sequence for downstream analysis, consisting of consecutive nucleotides
Subsequence : a segment of a sequence; the minimal unit of Seqstr
Sequence modifier : introduces a mutation into the sequence specified by a genomic interval
[] : specify the reference genome (UCSC convention)
: : separate chromosome and coordinate
- : means “from…to…” between two valid coordinates (0-based), or “reverse strand” in strand specification
+ : “forward strand” in strand specification
; : separate multiple subsequences
@ : initiate mutation specification
, : separate mutation specification
\n : separate multiple sequences
<> : enclose the name for sequence
Python implementation usage
CLI Usage
For command line usage, provide the path of the Seqstr input file, input_file, to seqstr.py; the sequences are retrieved and saved in FASTA format. The --download option specifies genome files to download so that sequences can be retrieved locally in future use. To download genome files without retrieving any sequences, run, for example, python seqstr.py --download hg38 without providing input_file. If local genome files are not found, seqstr.py queries the UCSC API instead. --dir sets the directory for downloaded genome files; the default is your working directory. During installation, you may also set this directory, which is stored in ~/.seqstr.config and is overwritten every time --dir is specified. --output sets the path and name of the output FASTA file. By default, the output is saved to input_file.fasta in the current working directory.
API Usage
Test
test.txt.fasta is generated after running the test command in a terminal. After running the test script, test.py, test passes is printed to stdout, with or without specific error messages.