VirusRecom: Detecting recombination of viral lineages using information theory
1. Download and install
VirusRecom is developed based on Python 3, and you can get and install the VirusRecom in a variety of ways.
1.1. pip method (recommend)
virusrecom has been distributed to the standard library of PyPI (https://pypi.org/project/virusrecom/), and the latest version can be easily installed by the tool pip.
Firstly, download Python3 (https://www.python.org/), and install Python3 and pip tool, then,
# (1) add bioconda origin
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
# (2) install virusrecom
## (i) create a separate environment for virusrecom (recommend)
conda create -n virusrecom_env python=3.7 # python >=3.5 but != 3.8
conda activate virusrecom_env
conda install virusrecom # or "conda install bioconda::virusrecom"
## (ii) or installation without creating separate environment (slow)
conda install virusrecom # or "conda install bioconda::virusrecom"
# (3) view the help documentation
virusrecom -h
1.3. Or local installation
In addition to the pip and conda methods, you can also install virusrecom manually using the file setup.py.
Firstly, download this repository, then, run:
python setup.py install
virusrecom -h
1.4. Or directly run the source code
virusrecom can also be run using the source code without installation. First, download this repository, then, install the required python environment of virusrecom:
finally, run virusrecom by the file main.py. Please view the help documentation by python main.py -h.
1.5. Or use the binary files
For the two earlier release packages (versions v1.0 and v1.1), you can also directly run the binary files of virusrecom without installation. The binary files are provided at https://github.com/ZhijianZhou01/virusrecom/releases.
In general, the executable file of virusrecom is located at the main folder. Then, running the virusrecom.exe (windows system) or virusrecom (Linux or MacOS system) to start. If you could not get permission to run virusrecom on Linux system or MacOS system, you could change permissions by chmod -R 775 Directory or chmod -R 777 Directory.
2. Getting help
virusrecom is a command-line-interface program, users can get help documentation of the software by entering virusrecom -h or virusrecom --help .
Tip: Compared with virusrecom v1.0, v1.1~1.3 has optimized the parameters of the input file. Besides, since v1.4, virusrecom has updated the input method of lineage mark.
The simple help documentation of virusrecom v1.4.0 is as follows.
Parameter
Description
-h, –help
Show this help message and exit.
-a ALIGNMENT
Aligned sequence file (*.fasta). Note, each sequence name requires containing lineage mark.
-ua UNALIGNMENT
Unaligned (non-alignment) sequence file (*.fasta). Note, each sequence name requires containing lineage mark.
-at ALIGN_TOOL
Program used for multiple sequence alignments (MSA).
-iwic INPUT_WIC
Using the already obtained WIC values of reference lineages directly by a *.csv input-file.
-q QUERY
Name of query lineage (usually potential recombinant), such as ‘-q xxxx’. Besides, ‘-q auto’ can scan all lineages as potential recombinant in turn.
-map MAPPING
The table of sequence-to-lineage mapping.
-g GAP
Reserve sites containing gaps(-) in analyses? ‘-g y’ means to reserve, and ‘-g n’ means to delete.
-m METHOD
Method for site scanning. ‘-m p’ uses polymorphic sites only, ‘-m a’ uses all the sites.
-w WINDOW
Number of nucleotides sites per sliding window. Note: if the ‘-m p’ has been used, -w refers to the number of polymorphic sites per windows.
-s STEP
Step size of the sliding window. Note: if the ‘-m p’ has been used, -s refers to the number of polymorphic sites per jump.
-mr MAX_REGION
The maximum allowed recombination region. Note: if the ‘-m p’ method has been used, it refers the maximum number of polymorphic sites contained in a recombinant region.
-cp PERCENTAGE
The cutoff threshold of proportion (cp, default: 0.9) used for searching recombination regions when mWIC/EIC >= cp, the maximum value of cp is 1.
-cu CUMULATIVE
Simply using the max cumulative WIC of all sites to identify the major parent. Off by default. If required, specify ‘-cu y.
-b BREAKPOINT
Possible breakpoint scan of recombination. ‘-b y’ means yes, ‘-b n’ means no. Note: this option only takes effect when ‘-m p’ has been specified.
-bw BREAKWIN
The window size (default: 200) used for breakpoint scan. The step size is fixed at 1. Note: this option only takes effect when ‘-m p -b y’ has been specified.
-t THREAD
Number of threads (or cores) for calculations, default: 4.
-y Y_START
Starting value (default: 0) of the Y-axis in plot diagram.
-le LEGEND
The location of the legend, the default is adaptive. ‘-le r’ indicates placed on the right.
-owic ONLY_WIC
Only calculate site WIC value. Off by default. If required, please specify ‘-owic y’.
-e ENGRAVE
Engraves file name to sequence names in batches. By specifying a directory containing one or multiple sequence files (*.fasta).
-en EXPORT_NAME
Export all sequence name of a *.fasta file.
-o OUTDIR
Output directory to store all results.
–block BLOCK_SIZE
Specifies the maximum number of sites per sub-block, different sub-blocks in sequence file will be sequentially loaded to calculate WIC. Default: 40000 sites.
In this demonstration, the test data is from the the recombination_test_data_v1.4.zip provided in the directory example.
3.1. Aligned input-sequences
If the input sequence-data has been aligned, and it should be loaded via the -a parameter. Multiple sequence alignments (MSA) can be pre-completed by many programs, this is not introduced. Now, let’s focus on the directory aligned_input_sequences in the file recombination_test_data_v1.1.zip.
(1) An aligned sequence-file named alignment_lineages_data.fasta, which including multiple sequences from the query lineage and other reference lineages.
(2) A text-file named sequence_lineage_mapping.txt, which including all the sequence names and lineage names. For example,
Note, the sequence name in sequence_lineage_mapping.txt needs to be consistent with the file alignment_lineages_data.fasta. The name of each lineage in the file sequence_lineage_mapping.txt should be unique, otherwise, there will be duplicate matches in subsequent analysis. The sequence name and the lineage name are separated by a tab.
Before running the command of VirusRecom, let’s think about the search strategy for recombination events. Firstly, we use only polymorphic sites considering that sequences from these lineages are highly similar, which means that the parameter -m p needs to be specified. Secondly, we do not consider gap-containing sites in this test and use the parameter -g n. Instead, if you consider these gap sites, you need to use the parameter -g y. Next, in the first run, let’s try first with a window size of 100 and a step size of 20. Of note the value of “size” at this time represents the number of polymorphic sites because the -m p parameter has been specified. For the two parameters -cp and -mr, we use the default value of 0.9 and 1000 in this test. Finally, we specify a folder to save the results by parameter -o.
Then, switch the current directory to aligned_input_sequences, and run the following command (an example) to detect recombination events in query lineage:
virusrecom -a alignment_lineages_data.fasta -q query_recombinant -map sequence_lineage_mapping.txt -g n -m p -w 100 -s 20 -o outdir
Note: (1) if the current directory is not switched to aligned_input_sequences, the file and directory path in command need the absolute paths instead of relative paths.
(2) the string “query_recombinant” in command is the name of query lineage in the file sequence_lineage_mapping.txt.
After the run is complete, in the directory outdir, there are three subdirectories and two aggregated reports:
(1) In the directory WICs_of_sites, the file *_site_WIC.csv and the file *_site_WIC_from_lineages.pdf are used to record the WIC value of each site.
(2) In the directory WICs_of_slide_window, the file *_mWIC_from_lineages.csv and the file *_mWIC_from_lineages.pdf are used to record the mean WIC of each sliding window.
The user can fine-tune the window size and step size according to the density of points in the generated graph. In general, very dense points means that the noise is too high and the window size can be increased appropriately in next scan.
In addition to the three sub-directories above, VirusRecom provides two summary files. The file Possible_recombination_event_conciseness.txt only retains results of recombination events with p-values less than 0.05.
Possible major parent: reference_lineage_1(global mWIC: 1.8976186779157704)
Other possible parents and significant recombination regions (p<0.05):
reference_lineage_2 7237 to 11539(mWIC: 1.9553354371515168), p_value: 7.831109305531908e-06
Significance test of recombinant regions using Mann-Whitney-U test with two-tailed probabilities, p-value less than 0.05 indicates a significant difference.
In this output report, the major parent of query lineage was reference_lineage_1 and the minor parent was reference_lineage_2, and the recombination region was site 7237 to 11539 and the p-value was 7.83e-06. The identified recombination event was relatively close to the actual (from site 7333 to 11473 in the genome), and the error of the recombination boundary is also acceptable.
In fact, Possible_recombination_event_conciseness.txt is interpretations of the recombination information contained in *_mWIC_from_lineages.pdf. Although VirusRecom shows a good balance between precision and recall in simulated data, false positive or false negatives sometimes occur. Therefore, for the identification results from VirusRecom, users can make own judgment. Tip: recombination events with p-values over 0.001 are less reliable.
If -b y is specified, then VirusRecom will perform the search of recombination breakpoint and plot. For example:
virusrecom -a alignment_lineages_data.fasta -q query_recombinant -map sequence_lineage_mapping.txt -g n -m p -w 100 -s 20 -b y -bw 200 -o outdir
Tip: (1) -b y only takes effect when -m p has been specified.
(2) the step size of breakpoint search is fixed to 1.
The negative logarithm of p-value in each site is in the file *_-lg(p-value)_for_potential_breakpoint.pdf and the file *_-lg(p-value)_for_potential_breakpoint.xlsx.
The highest peak (the highest −lgP value) indicated the possible recombination breakpoint.
3.2. Unaligned input-sequences
VirusRecom can also handle unaligned input-sequences. In this case, multiple sequence alignment is performed by calling external program. In the latest version, mafft, muscle, and clustal-omega is supported. It is worth mentioning that VirusRecom call them from the system path, so they need to be installed on the machine beforehand.
For the example data in directory unaligned_input_sequences, run the following command:
virusrecom -ua unalignment_lineages_data.fas -at mafft -q query_recombinant -map sequence_lineage_mapping.txt -g n -m p -w 100 -s 20 -o outdir
Note: (1) -at mafft means to call mafft in the system path, and the alignment strategy is auto. Besides, using -at muscle to call muscle and using -at clustalo to call clustal-omega.
(2) the string query_recombinant in command is the corresponding mark of query lineage in the file sequence_lineage_mapping.txt.
The interpretation of the output result is consistent with section 3.1.
3.3. Non-lineage data
In VirusRecom, each lineage is allowed to contain only one single sequence. Under this condition, mWIC value of the fragment is essentially a multiple of shared identity. If -g n is used in the calculation, the mWIC is twice as large as shared identity. If -g y is used in the calculation, the mWIC is log25 as large as shared identity.
Of noted, for recombination analysis without lineage data, the additional feature is only recommended for non-highly similar sequences and the user can use it to draw an identity point map.
The test data is in directory non_lineage_data of the file recombination_test_data_v1.4.zip.
The Delta-CoV HNU1-1 is a known recombinant from SpCoV HKU17-USA and ThCoV HKU12, and the break points were identified at genome positions nt 21017 and 25056, which is jointly identified and confirmed by RDP3 and Simplot by Wang et al., 2022.
Considering that they are not highly similar sequences, we use all sites (-m a) in the alignment. Then, we use a larger window value, and run following command:
virusrecom -a alns.fasta -q HNU1-1 -map sequence_lineage_mapping.txt -g n -m a -w 800 -s 100 -cp 0.7 -mr 6000 -le r -o output
The mWIC from reference lineages is as follows:
Note, because each “lineage” contains only one sequence and -g n is used in the example, the mWIC in the picture is actually twice the size of “sequence identity”.
The possible recombination event identified by VirusRecom is as follows:
Possible major parent: HKU17-USA(global mWIC: 1.5914816042426252)
Other possible parents and significant recombination regions (p<0.05):
HKU12 20720 to 25297(mWIC: 1.8039433490697028), p_value: 2.783880536189705e-204
The possible major parent of HNU1-1 is HKU17-USA and minor parent is HKU12, and the recombination region is about 20720-25297 nt in the alignment.
4. Common questions
4.1. Data preparation
Different software has different application scenarios and requirements for input data. VirusRecom was initially designed to study the recombination of the prevalent SARS-CoV-2, and aimed at detecting recombination events among highly similar virus strains based on lineage data. However, the criteria for dividing lineages vary widely among viruses. For those without fine-grained lineage classifications like SARS-CoV-2, in some cases, lineages could be further divided prior to analysis. Besides, if -g n is specified, VirusRecom will delete all gap sites, thus, very short sequences with excessive divergence are not recommended. For some lineages with high internal divergence, may not be appropriate to estimate in lineages manner in VirusRecom, and non-lineage-based approaches might be more appropriate.
4.2. Default values of parameter
For the value of a parameter, if not specified, the software uses the default value.
However, the default value is not suitable for all data. In addition to window size (-w) and step size (-s) of sliding window, values of -cp and -mr also require users to adjust based on the data.
When VirusRecom runs, the value of each parameter is printed printed on the screen and you can check them. What is more, users should try different values in multiple runs, which will effectively reduce false positives and false negatives.
4.3. How to mark lineage in sequence name?
For virusrecom v1.1~1.3, users can easily get it done via -e parameter. The -e parameter can engrave file-name to sequence names in batches. The example is as follows:
virusrecom -e input_directory -o outdir
Tip: The directory input_directory can contain multiple fasta files, and each fasta file can contain multiple sequences. After the running, finally, each sequence name will contain its file-name.
Therefore, if the file-name of fasta file is a lineage name, the lineage name can be written into the sequence name in batches.
4.4. How to change the color scheme in an image?
If you own programming skills, you can directly modify the order of the colors in the plt_corlor_list.py file. If not, you can use output matrix provided by VirusRecom, and they are usually suffixed with .xlsx or .csv.
5. Citation
Zhou ZJ, Yang CH, Ye SB, Yu XW, Qiu Y, Ge XY. VirusRecom: an information-theory-based method for recombination detection of viral lineages and its application on SARS-CoV-2. Brief Bioinform. 2023 Jan 19;24(1):bbac513. doi: 10.1093/bib/bbac513. PMID: 36567622.
VirusRecom: Detecting recombination of viral lineages using information theory
1. Download and install
VirusRecom is developed based on
Python 3, and you can get and install the VirusRecom in a variety of ways.1.1. pip method (recommend)
virusrecom has been distributed to the standard library of PyPI (https://pypi.org/project/virusrecom/), and the latest version can be easily installed by the tool
pip.Firstly, download
Python3(https://www.python.org/), and installPython3andpiptool, then,1.2. Or conda method
virusrecom has been distributed to bioconda (https://anaconda.org/bioconda/virusrecom), and the latest version can be installed using the tool
conda.1.3. Or local installation
In addition to the
pipandcondamethods, you can also install virusrecom manually using the filesetup.py.Firstly, download this repository, then, run:
1.4. Or directly run the source code
virusrecom can also be run using the source code without installation. First, download this repository, then, install the required python environment of virusrecom:
finally, run virusrecom by the file
main.py. Please view the help documentation bypython main.py -h.1.5. Or use the binary files
For the two earlier release packages (versions v1.0 and v1.1), you can also directly run the binary files of virusrecom without installation. The binary files are provided at https://github.com/ZhijianZhou01/virusrecom/releases.
In general, the executable file of virusrecom is located at the
mainfolder. Then, running thevirusrecom.exe(windows system) orvirusrecom(Linux or MacOS system) to start. If you could not get permission to run virusrecom on Linux system or MacOS system, you could change permissions bychmod -R 775 Directoryorchmod -R 777 Directory.2. Getting help
virusrecom is a command-line-interface program, users can get help documentation of the software by entering
virusrecom -horvirusrecom --help.Tip: Compared with virusrecom v1.0, v1.1~1.3 has optimized the parameters of the input file. Besides, since v1.4, virusrecom has updated the input method of lineage mark.
The simple help documentation of virusrecom v1.4.0 is as follows.
For more information about the algorithm of virusrecom, please refer to the publication of virusrecom.
3. Example of usage
The sequences data for test in the documentation was stored at https://github.com/ZhijianZhou01/virusrecom/tree/main/example.
In this demonstration, the test data is from the the
recombination_test_data_v1.4.zipprovided in the directoryexample.3.1. Aligned input-sequences
If the input sequence-data has been aligned, and it should be loaded via the
-aparameter. Multiple sequence alignments (MSA) can be pre-completed by many programs, this is not introduced. Now, let’s focus on the directoryaligned_input_sequencesin the filerecombination_test_data_v1.1.zip.(1) An aligned sequence-file named
alignment_lineages_data.fasta, which including multiple sequences from the query lineage and other reference lineages.(2) A text-file named
sequence_lineage_mapping.txt, which including all the sequence names and lineage names. For example,Note, the sequence name in
sequence_lineage_mapping.txtneeds to be consistent with the filealignment_lineages_data.fasta. The name of each lineage in the filesequence_lineage_mapping.txtshould be unique, otherwise, there will be duplicate matches in subsequent analysis. The sequence name and the lineage name are separated by a tab.Before running the command of VirusRecom, let’s think about the search strategy for recombination events. Firstly, we use only polymorphic sites considering that sequences from these lineages are highly similar, which means that the parameter
-m pneeds to be specified. Secondly, we do not consider gap-containing sites in this test and use the parameter-g n. Instead, if you consider these gap sites, you need to use the parameter-g y. Next, in the first run, let’s try first with a window size of 100 and a step size of 20. Of note the value of “size” at this time represents the number of polymorphic sites because the-m pparameter has been specified. For the two parameters-cpand-mr, we use the default value of 0.9 and 1000 in this test. Finally, we specify a folder to save the results by parameter-o.Then, switch the current directory to
aligned_input_sequences, and run the following command (an example) to detect recombination events in query lineage:Note: (1) if the current directory is not switched to
aligned_input_sequences, the file and directory path in command need the absolute paths instead of relative paths. (2) the string “query_recombinant” in command is the name of query lineage in the filesequence_lineage_mapping.txt.After the run is complete, in the directory
outdir, there are three subdirectories and two aggregated reports:(1) In the directory
WICs_of_sites, the file*_site_WIC.csvand the file*_site_WIC_from_lineages.pdfare used to record the WIC value of each site.(2) In the directory
WICs_of_slide_window, the file*_mWIC_from_lineages.csvand the file*_mWIC_from_lineages.pdfare used to record the mean WIC of each sliding window.The user can fine-tune the window size and step size according to the density of points in the generated graph. In general, very dense points means that the noise is too high and the window size can be increased appropriately in next scan.
In addition to the three sub-directories above, VirusRecom provides two summary files. The file
Possible_recombination_event_conciseness.txtonly retains results of recombination events with p-values less than 0.05.In this output report, the major parent of query lineage was
reference_lineage_1and the minor parent wasreference_lineage_2, and the recombination region was site 7237 to 11539 and the p-value was 7.83e-06. The identified recombination event was relatively close to the actual (from site 7333 to 11473 in the genome), and the error of the recombination boundary is also acceptable.In fact,
Possible_recombination_event_conciseness.txtis interpretations of the recombination information contained in*_mWIC_from_lineages.pdf. Although VirusRecom shows a good balance between precision and recall in simulated data, false positive or false negatives sometimes occur. Therefore, for the identification results from VirusRecom, users can make own judgment. Tip: recombination events with p-values over 0.001 are less reliable.If
-b yis specified, then VirusRecom will perform the search of recombination breakpoint and plot. For example:Tip: (1)
-b yonly takes effect when-m phas been specified. (2) the step size of breakpoint search is fixed to 1.The negative logarithm of p-value in each site is in the file
*_-lg(p-value)_for_potential_breakpoint.pdfand the file*_-lg(p-value)_for_potential_breakpoint.xlsx.The highest peak (the highest −lgP value) indicated the possible recombination breakpoint.
3.2. Unaligned input-sequences
VirusRecom can also handle unaligned input-sequences. In this case, multiple sequence alignment is performed by calling external program. In the latest version, mafft, muscle, and clustal-omega is supported. It is worth mentioning that VirusRecom call them from the system path, so they need to be installed on the machine beforehand.
For the example data in directory
unaligned_input_sequences, run the following command:Note: (1)
-at mafftmeans to call mafft in the system path, and the alignment strategy is auto. Besides, using-at muscleto call muscle and using-at clustaloto call clustal-omega. (2) the stringquery_recombinantin command is the corresponding mark of query lineage in the filesequence_lineage_mapping.txt.The interpretation of the output result is consistent with section 3.1.
3.3. Non-lineage data
In VirusRecom, each lineage is allowed to contain only one single sequence. Under this condition, mWIC value of the fragment is essentially a multiple of shared identity. If -g n is used in the calculation, the mWIC is twice as large as shared identity. If
-g yis used in the calculation, the mWIC is log25 as large as shared identity.Of noted, for recombination analysis without lineage data, the additional feature is only recommended for non-highly similar sequences and the user can use it to draw an identity point map.
The test data is in directory
non_lineage_dataof the filerecombination_test_data_v1.4.zip.The Delta-CoV HNU1-1 is a known recombinant from SpCoV HKU17-USA and ThCoV HKU12, and the break points were identified at genome positions nt 21017 and 25056, which is jointly identified and confirmed by RDP3 and Simplot by Wang et al., 2022.
Considering that they are not highly similar sequences, we use all sites (
-m a) in the alignment. Then, we use a larger window value, and run following command:The mWIC from reference lineages is as follows:
Note, because each “lineage” contains only one sequence and
-g nis used in the example, the mWIC in the picture is actually twice the size of “sequence identity”.The possible recombination event identified by VirusRecom is as follows:
The possible major parent of HNU1-1 is HKU17-USA and minor parent is HKU12, and the recombination region is about 20720-25297 nt in the alignment.
4. Common questions
4.1. Data preparation
Different software has different application scenarios and requirements for input data. VirusRecom was initially designed to study the recombination of the prevalent SARS-CoV-2, and aimed at detecting recombination events among highly similar virus strains based on lineage data. However, the criteria for dividing lineages vary widely among viruses. For those without fine-grained lineage classifications like SARS-CoV-2, in some cases, lineages could be further divided prior to analysis. Besides, if
-g nis specified, VirusRecom will delete all gap sites, thus, very short sequences with excessive divergence are not recommended. For some lineages with high internal divergence, may not be appropriate to estimate in lineages manner in VirusRecom, and non-lineage-based approaches might be more appropriate.4.2. Default values of parameter
For the value of a parameter, if not specified, the software uses the default value. However, the default value is not suitable for all data. In addition to window size (
-w) and step size (-s) of sliding window, values of-cpand-mralso require users to adjust based on the data.When VirusRecom runs, the value of each parameter is printed printed on the screen and you can check them. What is more, users should try different values in multiple runs, which will effectively reduce false positives and false negatives.
4.3. How to mark lineage in sequence name?
For virusrecom v1.1~1.3, users can easily get it done via
-eparameter. The-eparameter can engrave file-name to sequence names in batches. The example is as follows:Tip: The directory
input_directorycan contain multiple fasta files, and each fasta file can contain multiple sequences. After the running, finally, each sequence name will contain its file-name.Therefore, if the file-name of fasta file is a lineage name, the lineage name can be written into the sequence name in batches.
4.4. How to change the color scheme in an image?
If you own programming skills, you can directly modify the order of the colors in the
plt_corlor_list.pyfile. If not, you can use output matrix provided by VirusRecom, and they are usually suffixed with.xlsxor.csv.5. Citation
Zhou ZJ, Yang CH, Ye SB, Yu XW, Qiu Y, Ge XY. VirusRecom: an information-theory-based method for recombination detection of viral lineages and its application on SARS-CoV-2. Brief Bioinform. 2023 Jan 19;24(1):bbac513. doi: 10.1093/bib/bbac513. PMID: 36567622.