from cdhit_reader import read_cdhit

input = "cluster.fa.clstr"
for cluster in read_cdhit(input):
    print(f"{cluster.name} refSequence={cluster.refname} size={len(cluster)}")
    for member in cluster.sequences:
        print(f" {member.name} ({member.length}) identity={member.identity}% {'(Reference sequence)' if member.is_ref else ''}")
Load all clusters into a list:
# Load all clusters to a list
clusters = read_cdhit(input).read_items()
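For orientation, the kind of data that the reader exposes (cluster name, members, lengths, identity, reference flag) can be sketched with a small standalone parser for the .clstr text format. This is only an illustration under assumptions, not the module's implementation: `parse_clstr`, `Cluster`, `Member` and the sample records are hypothetical names and data, and the regular expression covers the common CD-HIT line layouts only.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Optional

# Member lines look like "0<TAB>120aa, >seqA... *" (reference sequence)
# or "1<TAB>118aa, >seqB... at 97.46%" (non-reference member).
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+(\S+))\s*$")


@dataclass
class Member:  # hypothetical container, not the module's class
    name: str
    length: int
    is_ref: bool
    identity: Optional[float]


@dataclass
class Cluster:  # hypothetical container, not the module's class
    name: str
    sequences: List[Member] = field(default_factory=list)


def parse_clstr(lines: Iterable[str]) -> Iterator[Cluster]:
    """Yield one Cluster per ">Cluster N" block."""
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if current is not None:
                yield current
            current = Cluster(name=line[1:].strip())
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            continue  # skip blank or unrecognized lines
        length, name, tail, ident = match.groups()
        is_ref = tail == "*"
        # Identity may be prefixed by strand or coordinates, e.g. "+/98.00%".
        identity = None if is_ref else float(ident.split("/")[-1].rstrip("%"))
        current.sequences.append(Member(name, int(length), is_ref, identity))
    if current is not None:
        yield current


# Made-up sample records, for illustration only.
sample = ">Cluster 0\n0\t120aa, >seqA... *\n1\t118aa, >seqB... at 97.46%\n>Cluster 1\n0\t95aa, >seqC... *\n"

clusters = list(parse_clstr(sample.splitlines()))
for cluster in clusters:
    print(cluster.name, len(cluster.sequences))
```

In real use, read_cdhit from the module should be preferred; the sketch only shows where values such as the member length and identity come from in the file.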
Read FASTA file
import os
import cdhit_reader

# fileName is the path to your FASTA file
if os.path.exists(fileName):
    for seq in cdhit_reader.read_fasta(fileName, line_len=60):
        print(seq)  # wrapped at 60 chars per line; use line_len=0 to disable wrapping
        # to access individual attributes:
        # print(">" + seq.name + " " + seq.comment + "\n" + seq.sequence)
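The effect of line_len on printed records can be illustrated with a small standalone sketch. The helpers `wrap_sequence` and `format_fasta` are hypothetical and are not part of the module; they only mimic the described behaviour (wrap at a fixed width, with 0 disabling wrapping), and the record itself is made up.

```python
def wrap_sequence(sequence: str, line_len: int = 60) -> str:
    """Split a sequence into lines of at most line_len characters.

    A line_len of 0 (or less) leaves the sequence on a single line.
    """
    if line_len <= 0:
        return sequence
    chunks = (sequence[i:i + line_len] for i in range(0, len(sequence), line_len))
    return "\n".join(chunks)


def format_fasta(name: str, comment: str, sequence: str, line_len: int = 60) -> str:
    """Build a FASTA record: a '>' header line followed by the wrapped sequence."""
    header = ">" + name + (" " + comment if comment else "")
    return header + "\n" + wrap_sequence(sequence, line_len)


# A made-up 160-character sequence, wrapped at 60 characters per line.
record = format_fasta("seq1", "demo record", "ACGT" * 40, line_len=60)
print(record)
```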
The module ships a demo program called cdhit-reader.py.
Compare two fasta files
This requires cd-hit to be installed and available in the system path.
cdhit-compare compares two FASTA files and prints the sequences that are common to both, those present in only
one of the files, and those that are redundant.
input1 BJJOHBJ_00007
input2 BJJOHBJ_00007
input2 BJJOHBJ_00002
both BJJOHBJ_00003:BJJOHBJ_00003
both BJJOHBJ_00005:BJJOHBJ_00005
both BJJOHBJ_00004:BJJOHBJ_00004
multi input1#IBJJOHBJ_00006,input1#BBJJOHBJ_000B6,input1#CBJJOHBJ_000C6,input2#IBJJOHBJ_00006,input2#BBJJOHBJ_000B6,input2#CBJJOHBJ_000C6
dupl_input1 BJJOHBJ_00001:BJJOHBJ_000F
where records starting with file1 or file2 (shown as input1 and input2 in the output above) are present in only one of the files,
records starting with both are present in both files (one per file),
records starting with dupl, such as dupl_input1, are duplicates (two in one of the files),
and records starting with multi are present multiple times in at least one of the datasets.
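Since each report line consists of a label followed by the record identifiers, the report is easy to post-process. The sketch below groups the sample output by label; it assumes the two columns are separated by whitespace (a tab or spaces), and `group_by_label` is a hypothetical helper, not part of the module.

```python
from collections import defaultdict

# The sample report shown in this section, one record per line.
report = """\
input1 BJJOHBJ_00007
input2 BJJOHBJ_00007
input2 BJJOHBJ_00002
both BJJOHBJ_00003:BJJOHBJ_00003
both BJJOHBJ_00005:BJJOHBJ_00005
both BJJOHBJ_00004:BJJOHBJ_00004
multi input1#IBJJOHBJ_00006,input1#BBJJOHBJ_000B6,input1#CBJJOHBJ_000C6,input2#IBJJOHBJ_00006,input2#BBJJOHBJ_000B6,input2#CBJJOHBJ_000C6
dupl_input1 BJJOHBJ_00001:BJJOHBJ_000F
"""


def group_by_label(report_text: str) -> dict:
    """Map each label (first column) to the list of its record fields."""
    groups = defaultdict(list)
    for line in report_text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        label, ids = parts
        groups[label].append(ids)
    return dict(groups)


groups = group_by_label(report)
for label, records in groups.items():
    print(label, len(records))
```

The same grouping can be used, for example, to extract only the identifiers unique to one input.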
cdhit-parser
CD-HIT file reader.
Read CD-HIT .clstr file
Basic usage
Install
or via Miniconda, which will also install cd-hit
Demo applications
Cluster stats
Author
License
This project is licensed under the MIT License.
Acknowledgments
This module was based on fasta_reader by Danilo Horta.