
FastKit

FastKit is written primarily to perform routine, repetitive operations on FASTA and FASTQ files, most notably formatting. Its purpose is to pre-process files for use in other bioinformatics tools.

  • FastKit seeks to wrap established libraries (such as Biopython, SeqTK and FASTX-toolkit) rather than reinvent the wheel.
  • Output is written to stdout so that the consumer controls file names transparently and can pipe between fastkit subcommands and on to the target tool or file.
  • Each subcommand should infer datatypes from filenames and process accordingly, rather than having a separate tool for each datatype.
  • Error messages should be clear and specific enough to provide a good user experience.
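
The datatype-inference and stdout principles above can be sketched roughly as follows. This is an illustration only, not FastKit's actual code: the names `infer_datatype`, `emit` and `EXTENSIONS`, and the extension table itself, are assumptions.

```python
import sys
from pathlib import Path

# Hypothetical extension table; FastKit's real mapping may differ.
EXTENSIONS = {
    ".fasta": "fasta",
    ".fas": "fasta",
    ".fa": "fasta",
    ".fastq": "fastq",
    ".fq": "fastq",
}


def infer_datatype(filename):
    """Return 'fasta' or 'fastq' based on the final file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTENSIONS:
        # Name the offending file and the accepted extensions in the error.
        raise ValueError(
            f"Cannot infer datatype from '{filename}': "
            f"expected one of {', '.join(sorted(EXTENSIONS))}"
        )
    return EXTENSIONS[suffix]


def emit(lines, stream=sys.stdout):
    """Write processed lines to stdout so the caller controls redirection."""
    for line in lines:
        stream.write(line + "\n")
```

Because only the final suffix is inspected, a name such as `input.raw.fasta` would be treated as FASTA in this sketch.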

Example usage

fastkit format input.raw.fasta --strip-header-space > input.fasta
fastkit validate input.raw.fasta --dna  # Will raise an error if not valid IUPAC DNA

Testing in development

# Does not require pip install of fastkit
python fastkit/format.py --strip-header-space test/data/spaces.fas

Running tests

python -m unittest tests/*.py

# Or

./run_tests.sh

Available datatypes

  • FASTA

Available subcommands

  • format
  • validate

Format

usage: fastkit format [-h] [--strip-header-space] [--uppercase] filename

Reformat FASTA files in preparation for tool execution.

Available filters:
- Strip spaces from FASTA headers
- Convert sequence characters to uppercase

# TODO: filter escaped chars from Galaxy text input

positional arguments:
  filename              A filename to parse and correct.

options:
  -h, --help            show this help message and exit
  --strip-header-space  Strip spaces from title and replace with underscore
  --uppercase           Transform all sequence characters to uppercase
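
As a rough illustration of what the two filters do, here is a minimal sketch written against plain lines of text. It is not the FastKit implementation: the function name `format_fasta` is hypothetical, and it assumes every space in a header is replaced by an underscore.

```python
def format_fasta(lines, strip_header_space=False, uppercase=False):
    """Yield FASTA lines with the selected filters applied."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header line: optionally replace spaces with underscores.
            if strip_header_space:
                line = line.replace(" ", "_")
        elif uppercase:
            # Sequence line: optionally convert to uppercase.
            line = line.upper()
        yield line


records = [">seq1 sample A", "acgtn", "ttga"]
formatted = list(format_fasta(records, strip_header_space=True, uppercase=True))
```

With both filters enabled, the header in this example becomes `>seq1_sample_A` and the sequence lines become `ACGTN` and `TTGA`.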

Validate

usage: fastkit validate [-h] [--protein] [--dna] [--no-unknown] [--sequence-count SEQUENCE_COUNT] filename

Validate FASTA files in preparation for tool execution.

These functions should not alter contents but only raise exceptions or return
boolean values to communicate validity of data.

Available validators:
- dna
- protein
- no-unknown
- sequence-count

positional arguments:
  filename              A filename to parse and correct.

options:
  -h, --help            show this help message and exit
  --protein             Validate as IUPAC protein sequence
  --dna                 Validate as IUPAC DNA sequence
  --no-unknown          Prohibit unknown IUPAC characters (X/N) - requires --dna or --protein
  --sequence-count SEQUENCE_COUNT
                        [int] Maximum number of sequences that are permitted
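
The validators listed above could be sketched along these lines. This is an assumption-laden illustration rather than FastKit's own code: the function names, the exact alphabets (IUPAC nucleotide codes; the 20 standard amino acids plus X) and the case-insensitive comparison are all choices made for the example.

```python
# Assumed alphabets; FastKit's real alphabets may include further codes.
IUPAC_DNA = set("ACGTRYSWKMBDHVN")
IUPAC_PROTEIN = set("ACDEFGHIKLMNPQRSTVWYX")

# Alphabet and "unknown" character for each datatype.
ALPHABETS = {"dna": (IUPAC_DNA, "N"), "protein": (IUPAC_PROTEIN, "X")}


def read_sequences(lines):
    """Parse FASTA lines into a list of (title, sequence) tuples."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif not records:
            raise ValueError("Sequence data found before the first '>' header")
        else:
            records[-1][1] += line
    return [(title, seq) for title, seq in records]


def validate(lines, datatype="dna", no_unknown=False, sequence_count=None):
    """Return True if valid; raise ValueError describing the first problem."""
    records = read_sequences(lines)
    if sequence_count is not None and len(records) > sequence_count:
        raise ValueError(
            f"Found {len(records)} sequences; at most {sequence_count} are permitted"
        )
    alphabet, unknown = ALPHABETS[datatype]
    allowed = alphabet - {unknown} if no_unknown else alphabet
    for title, seq in records:
        # Compare case-insensitively (an assumption of this sketch).
        invalid = sorted(set(seq.upper()) - allowed)
        if invalid:
            raise ValueError(
                f"Sequence '{title}' contains invalid {datatype} "
                f"characters: {''.join(invalid)}"
            )
    return True
```

The input is never modified; problems are reported only through exceptions, in line with the description above.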

Adding a subcommand

  • Create new function(s) in fastkit/<new_command>.py
  • <new_command>.py must define a main callable; use format.py as an example
  • Import new_command in fastkit/__init__.py
  • Add new_command to fastkit.cli.SUBCOMMANDS
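
The steps above might look something like the following for a hypothetical `count` subcommand. Everything here is an assumption made for illustration: the `count` command does not exist in FastKit, the signature of `main` may not match the one in format.py, and the dictionary shown is only a stand-in, since the real structure of fastkit.cli.SUBCOMMANDS is not described in this README.

```python
import argparse


def count_sequences(lines):
    """Count FASTA records by counting header lines."""
    return sum(1 for line in lines if line.startswith(">"))


def main(argv=None):
    """Hypothetical entry point for a 'count' subcommand."""
    parser = argparse.ArgumentParser(
        prog="fastkit count",
        description="Count the sequences in a FASTA file.",
    )
    parser.add_argument("filename", help="A FASTA file to read.")
    args = parser.parse_args(argv)
    with open(args.filename) as handle:
        # Print to stdout, in keeping with the other subcommands.
        print(count_sequences(handle))


# Stand-in registry only; check fastkit/cli.py for the real structure.
SUBCOMMANDS = {"count": main}
```

Check format.py and fastkit/cli.py for the actual conventions before copying this layout.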

Pushing changes to bioconda

N.B. bioconda-bot will open automated pull requests to update fastkit when a new release is published on GitHub.

  • Publish a new release on GitHub
  • Fork bioconda/bioconda-recipes and make a branch for the new version
  • Update the version and sha256 in recipes/fastkit/meta.yaml to match the new release
  • Commit, push and make a pull request to bioconda/bioconda-recipes
  • Wait for it to be merged (you may need to ask bioconda-bot to add a label once it’s been approved)
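
For the sha256 step above, one way to compute the checksum of a downloaded release archive is sketched below. The helper name `sha256_of_file` and the archive filename in the usage note are hypothetical; only the standard-library `hashlib` module is used.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After downloading the release archive from GitHub, calling `sha256_of_file("fastkit-release.tar.gz")` (a placeholder filename) yields the value to paste into the sha256 field of recipes/fastkit/meta.yaml.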