目录

STRcount

Software tool to analyse STR loci from long read data. STRcount can count the number of repeats in a repeat expansion and give you the count in a tabular format for further downstream analysis.

Release notes

  • 0.1.0: Initial release with tools updated on Pypi and ready to use.

Dependencies

Developed and tested on Python 3.7.10. Dependencies include:

Installation instructions

Install GraphAligner

STRcount requires and uses GraphAligner as it’s alignment tool. To install this you could either install it using Anaconda:

Or you could install it from source using the instructions here

Installation using pip

Use the following command to install STRcount and all dependencies:

pip install STRcount

Installation from source

git clone https://github.com/sabiqali/strcount.git
cd strcount
python setup.py install

Installation to develop

To develop using STRcount, you will need to create a conda environment or python virtual environment, then perform the following steps:

git clone https://github.com/sabiqali/strcount.git
cd strcount
python -m pip install -r requirements.txt
python src/STRcount/STRcount.py -h

Config file format

The config file should be in the following format:

chr begin end name repeat prefix suffix
chr9 27573527 27573544 c9orf72 GGCCCC <150bp_left_flank> <150bp_right_flank>
This format is the same as the format used in STRique as found here.
The 150bp flanking region is suggested, longer flanks will require more time to execute, smaller flanks will take less time but the results may be less accurate.

Usage

If installed using pip or from source, you will be able to use it using STRcount else if you have installed to develop, you will be able to use it using python src/STRcount/STRcount.py

STRcount [-h] --reference REFERENCE --fastq FASTQ --config CONFIG
                  --output OUTPUT [--min-identity MIN_IDENTITY]
                  [--min-aligned-fraction MIN_ALIGNED_FRACTION]
                  [--write-non-spanned]
                  [--repeat_orientation REPEAT_ORIENTATION]
                  [--prefix_orientation PREFIX_ORIENTATION]
                  [--suffix_orientation SUFFIX_ORIENTATION]
                  [--cleanup CLEANUP] [--output_directory OUTPUT_DIRECTORY]
                  [--multiseed-DP MULTISEED_DP]
                  [--precise-clipping PRECISE_CLIPPING]

optional arguments:
 -h, --help            show this help message and exit
 --reference REFERENCE
                       the reference from which the STR Graph will be
                       generated
 --fastq FASTQ         the baseaclled reads in fastq format
 --config CONFIG       the config file
 --output OUTPUT       the output file
 --min-identity MIN_IDENTITY
                       only use reads with identity greater than this
 --min-aligned-fraction MIN_ALIGNED_FRACTION
                       require alignments cover this proportion of the query
                       sequence
 --write-non-spanned   do not require the reads to span the prefix/suffix
                       region
 --repeat_orientation REPEAT_ORIENTATION
                       the orientation of the repeat string. + or -
 --prefix_orientation PREFIX_ORIENTATION
                       the orientation of the prefix, + or -
 --suffix_orientation SUFFIX_ORIENTATION
                       the orientation of the suffix, + or -
 --cleanup CLEANUP     do you want to clean up the temporary file?
 --output_directory OUTPUT_DIRECTORY
                       the output directory for all output and temporary
                       files
 --multiseed-DP MULTISEED_DP
                       Aligner option
 --precise-clipping PRECISE_CLIPPING
                       Aligner option: use arg as the identity threshold for
                       a valid alignment.

Output

The output is in a .tsv format that will look something like this: | read_name | strand | spanned | count | align_score | identity | query_aligned_fraction | | — | — | — | — | — | — | — |

  • read_name: The name of the read that is currently being proccessed
  • strand: The strand on which the primary alignment has been detected
  • spanned: If set to 1, it means that the read spanned the repeat locus and the flanking sequence
  • count: The number of repeat motifs detected at the locus for that particular read
  • align_score: The alignment score as given by GraphAligner
  • identity: The percentage identity as given by GraphAligner
  • query_aligned_fraction: This signifies how much of the query sequence is covered by the alignment

Contact

Sabiq Chaudhary

License

MIT

关于

用于统计字符串中字符出现频率的工具

53.0 KB
邀请码
    Gitlink(确实开源)
  • 加入我们
  • 官网邮箱:gitlink@ccf.org.cn
  • QQ群
  • QQ群
  • 公众号
  • 公众号

版权所有:中国计算机学会技术支持:开源发展技术委员会
京ICP备13000930号-9 京公网安备 11010802032778号