ENH Add ‘hybrid’ as synonym for ‘long_read’ in --sequencing-type
Accept ‘hybrid’ as an alias for ‘long_read’ in the --sequencing-type argument, since hybrid assemblies should use the long-read pipeline. Update help text, FAQ, ChangeLog, and whatsnew documentation accordingly.
Closes #216
Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com
SemiBin: Metagenomic Binning Using Siamese Neural Networks for short and long reads
CONTACT US: Please use GitHub issues for bug reports and the SemiBin users mailing-list for more open-ended discussions or questions.
If you use this software in a publication please cite:
The self-supervised approach and the algorithms used for long-read datasets (as well as their benchmarking) are described in
Basic usage of SemiBin
A tutorial on running SemiBin from scratch can be found in the SemiBin tutorial.
Install SemiBin with conda. This will install the `SemiBin2` command in your environment.
The inputs to SemiBin are contigs (assembled from the reads) and BAM files (reads mapping to the contigs). In the docs you can see how to generate the inputs starting with a metagenome.
For single-sample binning (for example, human gut samples), use the `single_easy_bin` subcommand. If you are using contigs from long reads, add the `--sequencing-type=long_read` argument.
For multi-sample binning, use the `multi_easy_bin` subcommand.
The output includes the bins in the `output_bins` directory (including the bin.*.fa and recluster.*.fa files).
Please find more options and details below and read the docs.
Advanced Installation
SemiBin runs (and is continuously tested) on Python 3.8-3.13
pixi
The current recommended way to install SemiBin with GPU support is to use pixi. Pixi will use the packages from conda-forge and bioconda to install SemiBin and its dependencies. See the docs for more details, but the basic idea is to create a `pixi.toml` file that lists SemiBin and its dependencies, including `pytorch-gpu` and `cuda` lines. This will install SemiBin with GPU support, but it does require a CUDA-compatible GPU. Alternatively, you can install SemiBin in CPU-only mode by removing the `pytorch-gpu` and `cuda` lines.
Source
You will need the following dependencies:
The easiest way to install the dependencies is with conda:
Once the dependencies are installed, you can install SemiBin by running:
Optional extra dependencies:
Examples of binning
SemiBin supports single-sample, co-assembly, and multi-sample binning. Here we show the easy modes as examples. For the details and examples of every SemiBin subcommand, please read the docs.
Binning assemblies from long reads
Since version 1.4, SemiBin provides a new algorithm (an ensemble-based DBSCAN algorithm) for binning assemblies from long reads. To use it, you can use the `bin_long` subcommand or pass the option `--sequencing-type=long_read` to the `single_easy_bin` or `multi_easy_bin` subcommands.
Easy single/co-assembly binning mode
Single sample and co-assembly are handled the same way by SemiBin.
You will need the following inputs:
- A contig file (`contig.fa` in the examples)
- BAM file(s) of reads mapped to the contigs (`mapped_reads.sorted.bam` in the examples)
The `single_easy_bin` command can be used to produce results in a single step.
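As a minimal sketch, the single-step invocation described above could be assembled from Python as follows. The `-i`, `-b`, `-o`, `--environment`, and `--sequencing-type` options are SemiBin2 command-line options. The helper name `build_single_easy_bin` and the `output` directory name are hypothetical, chosen only for illustration. The helper only builds the command and does not run it.

```python
import shlex


def build_single_easy_bin(contigs, bams, outdir, environment=None, sequencing_type=None):
    """Return the argument list for a `SemiBin2 single_easy_bin` run.

    Hypothetical helper for illustration: it only assembles the command
    line and does not execute SemiBin2.
    """
    cmd = ["SemiBin2", "single_easy_bin", "-i", contigs, "-b", *bams, "-o", outdir]
    if environment is not None:
        # Use a pretrained model for this environment (for example human_gut).
        cmd += ["--environment", environment]
    if sequencing_type is not None:
        # Pass long_read for assemblies built from long reads.
        cmd.append(f"--sequencing-type={sequencing_type}")
    return cmd


# File names follow the ones used in this README; "output" is an assumed directory name.
cmd = build_single_easy_bin(
    "contig.fa", ["mapped_reads.sorted.bam"], "output", environment="human_gut"
)
print(shlex.join(cmd))
```

To actually run the binning, the resulting list could be passed to `subprocess.run(cmd, check=True)` in an environment where `SemiBin2` is installed.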
Alternatively, you can train a new model for that sample by not passing in the `--environment` flag.
The following environments are supported:
- `human_gut`
- `dog_gut`
- `ocean`
- `soil`
- `cat_gut`
- `human_oral`
- `mouse_gut`
- `pig_gut`
- `built_environment`
- `wastewater`
- `chicken_caecum` (Contributed by Florian Plaza Oñate)
- `global`
The `global` environment can be used if none of the others is appropriate. Note that training a new model can take a lot of time and disk space. Some patience will be required. If you have a lot of samples from the same environment, you can also train a new model from them and reuse it.
Easy multi-samples binning mode
The `multi_easy_bin` command can be used in multi-sample binning mode. It requires a combined contig FASTA file and BAM files obtained by mapping each sample (individually) to that combined file.
For every contig, the name has the format `<sample_name>:<contig_name>`, where `:` is the default separator (it can be changed with the `--separator` argument). For example, contig `Contig_1` from sample `S1` is named `S1:Contig_1`. NOTE: Make sure the sample names are unique and the separator does not introduce confusion when splitting.
SemiBin provides the `concatenate_fasta` subcommand to generate the combined contig file.
If either the sample or the contig names use the default separator (`:`), you will need to change it with the `--separator`, `-s` argument.
After mapping samples (individually) to the combined FASTA file, you can get the results with a single `multi_easy_bin` command.
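The naming convention above can be illustrated with a short Python sketch. This is not SemiBin's own implementation. The function name `combine_contig_names` and the sequences are made up for illustration. It builds `<sample_name><separator><contig_name>` names and rejects any name that already contains the separator, since such a name could not be split unambiguously.

```python
def combine_contig_names(samples, separator=":"):
    """Yield (combined_name, sequence) pairs for a combined FASTA file.

    `samples` maps each sample name to a dict of {contig_name: sequence}.
    Using a dict keyed by sample name guarantees that sample names are unique.
    Raises ValueError (during iteration) if a sample or contig name contains
    the separator.
    """
    for sample_name, contigs in samples.items():
        if separator in sample_name:
            raise ValueError(f"sample name {sample_name!r} contains separator {separator!r}")
        for contig_name, sequence in contigs.items():
            if separator in contig_name:
                raise ValueError(f"contig name {contig_name!r} contains separator {separator!r}")
            yield f"{sample_name}{separator}{contig_name}", sequence


# Made-up sequences, for illustration only.
samples = {
    "S1": {"Contig_1": "AGATAATAAAGATAATAATA", "Contig_2": "CGAATTTATCTCAAGAACAAGAAAA"},
    "S2": {"Contig_1": "AATGATATAATACTTAATA"},
}
for name, seq in combine_contig_names(samples):
    print(f">{name}\n{seq}")
```

If a name legitimately contains `:`, a different separator can be chosen in the sketch (for example `separator="|"`), mirroring what the `--separator` argument does on the command line.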
Running with abundance information from strobealign-aemb
Strobealign-aemb is a fast abundance estimation method for metagenomic binning. Because strobealign-aemb cannot provide mapping information for every position of the contig, SemiBin2 cannot be run with strobealign-aemb in binning modes that use fewer than 5 samples and need to split the contigs to generate the must-link constraints.
Output
The output folder will contain:
By default, bins are in the `output_bins` directory.
For more details about the output, read the docs.