Python implementation of the Integrated Probabilistic Annotation (IPA): a Bayesian annotation method for LC/MS data integrating biochemical relations, isotope patterns and adduct formation.
Installation
ipaPy2 requires Python 3.9 or higher
Install via pip (recommended)
pip install ipaPy2
Install via bioconda
conda install -c bioconda ipaPy2
Compiling from source (macOS)
Create a folder in which you want to put the library
mkdir IPA
cd IPA
Download the library. If Homebrew is not installed on your machine, you can install it from https://brew.sh
brew install git
git clone https://github.com/francescodc87/ipaPy2
cd ipaPy2
Create and activate a virtual environment for the folder, install the necessary libraries and build the package.
The wheel file will be stored in the dist/ folder. You can then install the library in a new terminal as follows
pip install /path/to/wheelfile.whl
Databases
One of the most powerful features of the IPA method is that it can integrate the knowledge gained from previous experiments into the annotation process. There are three files that are used as the IPA database:
1. Adducts file (required)
The ipaPy2 library requires a file containing all the information needed for the computation of the adducts. An adducts.csv file is provided with the package here. The file contains the most common adducts; if any exotic adduct (or in-source fragment) needs to be considered, the user must modify the file accordingly. The format required for the adducts file is shown below.
import pandas as pd
import numpy as np
adducts = pd.read_csv('DB/adducts.csv')
adducts.head()
    name          calc  Charge  Mult       Mass  Ion_mode Formula_add Formula_ded  Multi
0    M+H    M+1.007276       1     1   1.007276  positive          H1       FALSE      1
1  M+NH4   M+18.033823       1     1  18.033823  positive        N1H4       FALSE      1
2   M+Na   M+22.989218       1     1  22.989218  positive         Na1       FALSE      1
3    M+K   M+38.963158       1     1  38.963158  positive          K1       FALSE      1
4     M+  M-0.00054858       1     1  -0.000549  positive       FALSE       FALSE      1
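To illustrate how the columns of the adducts file combine, here is a minimal sketch (not ipaPy2 code) computing the theoretical m/z of an ion from a neutral monoisotopic mass, assuming the usual convention m/z = (Mult · M + Mass) / |Charge|; the glucose mass used below is the standard monoisotopic value, not taken from the package:

```python
# Minimal illustration (not ipaPy2 code) of how the adducts-file columns
# may combine, assuming m/z = (Mult * M + Mass) / |Charge|.

def adduct_mz(neutral_mass, mult, mass_shift, charge):
    """Theoretical m/z of an ion given the adduct parameters."""
    return (mult * neutral_mass + mass_shift) / abs(charge)

# D-Glucose (C6H12O6), monoisotopic mass 180.063388
glucose = 180.063388
print(round(adduct_mz(glucose, 1, 1.007276, 1), 6))   # M+H
print(round(adduct_mz(glucose, 1, 22.989218, 1), 6))  # M+Na
print(round(adduct_mz(glucose, 2, 1.007276, 1), 6))   # 2M+H (Mult = 2)
```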
2. MS1 database file (required)
The IPA method requires a pandas dataframe containing the database against which the annotation is performed.
This dataframe must contain the following columns in this exact order (optional columns can have empty fields):
id: unique id of the database entry (e.g., ‘C00031’) - necessary
name: compound name (e.g., ‘D-Glucose’) - necessary
formula: chemical formula (e.g., ‘C6H12O6’) - necessary
inchi: inchi string - optional
smiles: smiles string - optional
RT: if known, retention time range (in seconds) where this compound is expected to elute (e.g., ‘30;60’) - optional
adductsPos: list of adducts that should be considered in positive mode for this entry (e.g., 'M+Na;M+H;M+') - necessary
adductsNeg: list of adducts that should be considered in negative mode for this entry (e.g., 'M-H;M-2H') - necessary
description: comments on the entry - optional
pk: previous knowledge on the likelihood of this compound being present in the sample analysed. The value must be between 1 (compound highly likely to be present in the sample) and 0 (compound cannot be present in the sample).
MS2: id for the MS2 database entries related to this compound - optional
reactions: list of reaction ids involving this compound (e.g., ‘R00010 R00015 R00028’). If required, these can be used to find possible biochemical connections - optional
The column names must be the ones reported here.
While users are strongly advised to build their own ad-hoc database to match their specific instrument setup and sample types, here you can find a relatively big example database.
DB = pd.read_csv('DB/IPA_MS1.csv')
DB.head()
       id   name        formula                                              inchi smiles   RT          adductsPos          adductsNeg description  pk                     MS2                                          reactions
0  C00002    ATP  C10H16N5O13P3  InChI=1S/C10H16N5O13P3/c11-8-5-9(13-2-12-8)15(...    NaN  NaN  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1   EMBL-MCF_spec365637_1  R00002 R00076 R00085 R00086 R00087 R00088 R000...
1  C00003   NAD+  C21H28N7O14P2  InChI=1S/C21H27N7O14P2/c22-17-12-19(25-7-24-17...    NaN  NaN  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1   EMBL-MCF_specxxxxx_10  R00023 R00090 R00091 R00092 R00093 R00094 R000...
2  C00004   NADH  C21H29N7O14P2  InChI=1S/C21H29N7O14P2/c22-17-12-19(25-7-24-17...    NaN  NaN  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1                     NaN  R00023 R00090 R00091 R00092 R00093 R00094 R000...
3  C00005  NADPH  C21H30N7O17P3  InChI=1S/C21H30N7O17P3/c22-17-12-19(25-7-24-17...    NaN  NaN  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1                     NaN  R00105 R00106 R00107 R00108 R00109 R00111 R001...
4  C00006  NADP+  C21H29N7O17P3  InChI=1S/C21H28N7O17P3/c22-17-12-19(25-7-24-17...    NaN  NaN  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1  EMBL-MCF_specxxxxxx_45  R00104 R00106 R00107 R00108 R00109 R00111 R001...
This example database was obtained by combining the KEGG database, the Natural Products Atlas database and the MoNa database (only compounds having at least one fragmentation spectrum acquired with a QExactive).
For each entry, only a handful of the most common adducts are considered.
To fully exploit the IPA method, it is strongly recommended to constantly update the database as new knowledge is gained from each experiment. Providing a retention time window for compounds previously detected with the analytical system at hand is particularly useful.
For the sake of the example in this tutorial, a reduced example database is also provided.
DB = pd.read_csv('DB/DB_test_pos.csv')
DB.head()
       id             name   formula                                              inchi smiles       RT          adductsPos          adductsNeg description  pk                   MS2                                          reactions
0  C00079  L-Phenylalanine  C9H11NO2  InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-...    NaN  120;160  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1            UA005501_1  R00686 R00688 R00689 R00690 R00691 R00692 R006...
1  C00082       L-Tyrosine  C9H11NO3  InChI=1S/C9H11NO3/c10-8(9(12)13)5-6-1-3-7(11)4...    NaN    50;90  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1            UA005601_1  R00031 R00728 R00729 R00730 R00731 R00732 R007...
2  C00114          Choline   C5H14NO  InChI=1S/C5H14NO/c1-6(2,3)4-5-7/h7H,4-5H2,1-3H...    NaN      NaN  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1                   NaN  R01021 R01022 R01023 R01025 R01026 R01027 R010...
3  C00123        L-Leucine  C6H13NO2  InChI=1S/C6H13NO2/c1-4(2)3-5(7)6(8)9/h4-5H,3,7...    NaN   70;110  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1                   NaN  R01088 R01089 R01090 R01091 R02552 R03657 R084...
4  C00148        L-Proline   C5H9NO2  InChI=1S/C5H9NO2/c7-5(8)4-2-1-3-6-4/h4,6H,1-3H...    NaN    35;55  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1  EMBL-MCF_specxxxxx_7  R00135 R00671 R01246 R01248 R01249 R01251 R012...
3. MS2 database file (only required if MS2 data is available)
This new implementation of the IPA method also allows the user to include MS2 data in the annotation pipeline.
In order to exploit this functionality an MS2 spectra database must be provided.
The MS2 database must be provided as a pandas dataframe including the following columns in this exact order:
compound_id: unique id for each compound; it must match the ids used in the MS1 database - necessary
id: unique id for the single entry (i.e., spectra) of the database - necessary
name: compound name (e.g., ‘D-Glucose’) - necessary
formula: chemical formula (e.g., ‘C6H12O6’) - necessary
inchi: inchi string - optional
precursorType: the adduct form of the precursor ion (e.g., ‘M+H’) - necessary
instrument: the type of instrument the spectrum was acquired with - optional
collision.energy: the collision energy level used to acquire the spectrum (e.g., ‘15’) - necessary
spectrum: the actual spectrum in the form of a string in the following format ‘mz1:Int1 mz2:Int2 mz3:Int3 …’
The user must use an MS2 database specific to the instrument used to acquire the data.
The MS2 database found here contains all the MS2 spectra in the MoNa database acquired with a QExactive. This is a relatively big file; for the sake of this tutorial, a drastically reduced version of it has been included within this repository and can be found here.
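The 'mz1:Int1 mz2:Int2 mz3:Int3 …' spectrum encoding used by the MS2 database (and by the MS2 input data described later) can be parsed with a few lines of standard Python; a minimal sketch, not part of the library:

```python
def parse_spectrum(spectrum: str):
    """Parse a 'mz1:Int1 mz2:Int2 ...' string into (mz, intensity) pairs."""
    pairs = []
    for peak in spectrum.split():
        mz, intensity = peak.split(':')
        pairs.append((float(mz), float(intensity)))
    return pairs

# Example spectrum string (made-up peaks)
print(parse_spectrum('70.0651:100.0 116.0706:35.2 90.0550:12.8'))
```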
Before using the ipaPy2 package, the processed data coming from an untargeted metabolomics experiment must be properly prepared.
1. MS1 data
The data must be organized in a pandas dataframe containing the following columns:
ids: a unique numeric id for each mass spectrometry feature
rel.ids: relation ids. Features must be clustered based on correlation/peak shape/retention time. Features in the same cluster are likely to come from the same metabolite.
mzs: mass-to-charge ratios, usually the average across different samples.
RTs: retention times in seconds, usually the average across different samples.
Int: representative (e.g., maximum or average) intensity detected for each feature across samples (either peak area or peak intensity)
The clustering of the features is a necessary step and must be performed before running the IPA method. For this step, the use of widely used data processing software such as mzMatch and CAMERA is recommended.
Nevertheless, the ipaPy2 library provides a function (clusterFeatures()) able to perform such step, starting from a dataframe containing the measured intensities across several samples (at least 3 samples, the more samples the better).
Such a dataframe should contain the ids, mzs and RTs columns, followed by one intensity column per sample (see the help of clusterFeatures() below).
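The intuition behind the clustering criterion can be shown with a toy stdlib computation (illustrative only; the feature intensities below are made up, and Pearson correlation is an assumed similarity measure): co-eluting features from the same metabolite should show strongly correlated intensities across samples, above the Cthr threshold.

```python
# Toy intensities for two features across five samples (made-up numbers).
# Features from the same metabolite should correlate strongly across samples.
feat_a = [1.0e9, 2.1e9, 0.8e9, 3.0e9, 1.5e9]
feat_b = [1.2e8, 2.5e8, 0.9e8, 3.4e8, 1.7e8]  # e.g., its 13C isotopologue

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(round(pearson(feat_a, feat_b), 3))  # close to 1 -> same cluster (Cthr=0.8)
```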
from ipaPy2 import ipa
df=ipa.clusterFeatures(df2)
Clustering features ....
0.0 seconds elapsed
All information about the function can be found in its help:
help(ipa.clusterFeatures)
Help on function clusterFeatures in module ipaPy2.ipa:
clusterFeatures(df, Cthr=0.8, RTwin=1, Intmode='max')
Clustering MS1 features based on correlation across samples.
Parameters
----------
df: pandas dataframe with the following columns:
-ids: a unique id for each feature
-mzs: mass-to-charge ratios, usually the average across different
samples.
-RTs: retention times in seconds, usually the average across different
samples.
-Intensities: for each sample, a column reporting the detected
intensities in each sample.
Cthr: Default value 0.8. Minimum correlation allowed in each cluster
RTwin: Default value 1. Maximum difference in RT time between features in
the same cluster
Intmode: Defines how the representative intensity of each feature is
computed. If 'max' (default) the maximum across samples is used.
If 'ave' the average across samples is computed
Returns
-------
df: pandas dataframe in correct format to be used as an input of the
map_isotope_patterns() function
After running, this function returns a pandas dataframe in the correct format for the ipaPy2 package
df.head()
    ids  rel.ids         mzs        RTs           Int
0     1        0  116.070544  45.770423  2.170017e+09
1    88        0  117.073678  45.787586  1.256520e+08
2   501        0  231.133673  46.183948  2.519223e+07
3  4429        0  232.136923  46.176715  2.635594e+06
4     2        1  104.106830  40.843309  1.889172e+09
2. MS2 data
If fragmentation data was acquired during the experiment, it can be included in the IPA annotation process.
To do so, the data must be organized in a pandas dataframe containing the following columns, in this exact order:
id: a unique id for each feature for which the MS2 spectrum was acquired (same as in the MS1 data)
spectrum: string containing the spectrum information in the following format ‘mz1:Int1 mz2:Int2 mz3:Int3 …’
ev: collision energy used to acquire the fragmentation spectrum
The Integrated Probabilistic Annotation (IPA) method can be applied in different situations, and the ipaPy2 package allows the users to tailor the IPA pipeline around their specific needs.
This brief tutorial describes the most common scenarios the IPA method can be applied to.
1. Mapping isotope patterns
The first step of the IPA pipeline consists in mapping the isotope patterns within the dataset considered. This is achieved through the map_isotope_patterns() function. The help of this function provides a detailed description:
help(ipa.map_isotope_patterns)
Help on function map_isotope_patterns in module ipaPy2.ipa:
map_isotope_patterns(df, isoDiff=1, ppm=100, ionisation=1, MinIsoRatio=0.5)
mapping isotope patterns in MS1 data.
Parameters
----------
df : pandas dataframe (necessary)
A dataframe containing the MS1 data including the following columns:
-ids: a unique id for each feature
-rel.ids: relation ids. In a previous step of the data processing
pipeline, features are clustered based on peak shape
similarity/retention time. Features in the same
cluster are likely to come from the same metabolite.
All isotope patterns must be in the same rel.id
cluster.
-mzs: mass-to-charge ratios, usually the average across
different samples.
-RTs: retention times in seconds, usually the average across
different samples.
-Ints: representative (e.g., maximum or average) intensity detected
for each feature across samples (either peak area or peak
intensity)
isoDiff : Default value 1. Difference between isotopes of charge 1, does
not need to be exact
ppm: Default value 100. Maximum ppm value allowed between 2 isotopes.
It is very high on purpose
ionisation: Default value 1. positive = 1, negative = -1
MinIsoRatio: minimum intensity ratio, expressed as a percentage
    (Default value 0.5). Only isotopes with intensity higher than
    MinIsoRatio% of the main isotope are considered.
Returns
-------
df: the main input is modified by adding and populating the following
columns
- relationship: the possible values are:
* bp: basepeak, most intense peak within each rel id
* bp|isotope: isotope of the basepeak
* potential bp: most intense peak within each isotope
pattern (excluding the basepeak)
* potential bp|isotope: isotope of one potential bp
- isotope pattern: feature used to cluster the different isotope
patterns within the same relation id
- charge: predicted charge based on the isotope pattern (1,2,3,4,5 or
-1,-2,-3,-4,-5 are the only values allowed)
For the sake of this tutorial, the small dataset example introduced above is considered.
ipa.map_isotope_patterns(df,ionisation=1)
mapping isotope patterns ....
0.1 seconds elapsed
Once finished, this function modifies the pandas dataframe provided as input annotating all isotope patterns.
df.head()
    ids  rel.ids         mzs        RTs           Int          relationship  isotope pattern  charge
0     1        0  116.070544  45.770423  2.170017e+09                    bp                0       1
1    88        0  117.073678  45.787586  1.256520e+08            bp|isotope                0       1
2   501        0  231.133673  46.183948  2.519223e+07          potential bp                1       1
3  4429        0  232.136923  46.176715  2.635594e+06  potential bp|isotope                1       1
4     2        1  104.106830  40.843309  1.889172e+09                    bp                0       1
Some data processing pipelines already include an isotope mapping function, and the user can rely on it as long as the data is organised in the correct format.
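The charge prediction reported above can be sketched in a few lines of plain Python (an illustration, not the library's implementation): for an ion of charge z, consecutive isotope peaks are separated by roughly isoDiff/z in m/z, so the charge can be inferred from the observed spacing.

```python
def predict_charge(mz_main, mz_iso, iso_diff=1.0033):
    """Infer charge from the m/z spacing between an ion and its isotope.

    For charge z, isotopes are spaced by ~iso_diff / z in m/z. A rough
    sketch, not ipaPy2's implementation (the docstring above notes that
    the isotope spacing does not need to be exact).
    """
    spacing = abs(mz_iso - mz_main)
    return round(iso_diff / spacing)

# Features 1 and 88 from the example dataset above: spacing ~1.003 -> charge 1
print(predict_charge(116.070544, 117.073678))
# A doubly charged ion would show ~0.5 spacing (made-up m/z pair):
print(predict_charge(331.165233, 331.666883))
```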
2. Compute all adducts
The second step of the pipeline consists in the calculation of all the possible adducts that could be formed by the compounds included in the database.
This is done by the compute_all_adducts() function, which comes with a very detailed help:
help(ipa.compute_all_adducts)
Help on function compute_all_adducts in module ipaPy2.ipa:
compute_all_adducts(adductsAll, DB, ionisation=1, ncores=1)
compute all adducts table based on the information present in the database
Parameters
----------
adductsAll : pandas dataframe (necessary)
Dataframe containing information on all possible
adducts. The file must be in the same format as the example
provided in the DB/adducts.csv
DB : pandas dataframe (necessary)
Dataframe containing the database against which the annotation is
performed. The DB must contain the following columns in this exact
order (optional fields can contain None):
- id: unique id of the database entry (e.g., 'C00031') - necessary
- name: compound name (e.g., 'D-Glucose') - necessary
- formula: chemical formula (e.g., 'C6H12O6') - necessary
- inchi: inchi string - optional
- smiles: smiles string - optional
- RT: if known, retention time range (in seconds) where this
compound is expected to elute (e.g., '30;60') - optional
- adductsPos: list of adducts that should be considered in
positive mode for this entry (e.g.,'M+Na;M+H;M+')
- adductsNeg: list of adducts that should be considered in
negative mode for this entry (e.g.,'M-H;M-2H')
- description: comments on the entry - optional
- pk: previous knowledge on the likelihood of this compound to be
present in the sample analysed. The value has to be between
1 (compound likely to be present in the sample) and 0
(compound cannot be present in the sample).
- MS2: id for the MS2 database entries related to this compound
(optional)
- reactions: list of reactions ids involving this compound
(e.g., 'R00010 R00015 R00028')-optional
ionisation : Default value 1. positive = 1, negative = -1
ncores : default value 1. Number of cores used
Returns
-------
allAdds: pandas dataframe containing the information on all the possible
adducts given the database.
Depending on the size of the database used (i.e., the number of compounds included), this step can become rather time-consuming, and the use of multiple cores should be considered.
In the context of this tutorial, the heavily reduced example database introduced before is considered.
If the same database is used for subsequent experiments without introducing new information, it is recommended to save the results of this function to a .csv file; in that case, this step only needs to be repeated when the database changes.
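Conceptually, compute_all_adducts() expands every database entry into the list of adduct forms declared in its adductsPos (or adductsNeg) field. A rough stdlib illustration of that expansion (not the library code; the entries, masses and adduct rules below are simplified assumptions):

```python
# Toy illustration of the (compound x adduct) expansion performed by
# compute_all_adducts(); not the library code. Monoisotopic masses are
# standard values, entered by hand for this sketch.
compounds = [
    {'id': 'C00079', 'name': 'L-Phenylalanine', 'M': 165.078979,
     'adductsPos': 'M+H;M+Na;2M+H'},
    {'id': 'C00148', 'name': 'L-Proline', 'M': 115.063329,
     'adductsPos': 'M+H;M+Na'},
]
adduct_rules = {  # name -> (Mult, mass shift, Charge), as in adducts.csv
    'M+H':  (1, 1.007276, 1),
    'M+Na': (1, 22.989218, 1),
    '2M+H': (2, 1.007276, 1),
}

all_adds = []
for c in compounds:
    for name in c['adductsPos'].split(';'):
        mult, shift, charge = adduct_rules[name]
        mz = (mult * c['M'] + shift) / abs(charge)
        all_adds.append((c['id'], name, round(mz, 6)))

for row in all_adds:
    print(row)
```

Note how L-Proline M+H comes out at ~116.0706, matching the theoretical m/z shown in the annotation tables later in this tutorial.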
3. Annotation based on MS1 information
At this point, the actual annotation process can start. If no fragmentation data is available, the MS1annotation() function should be used. This function annotates the dataset using the MS1 data and the information stored in the database. A detailed description of the function can be accessed through the help:
help(ipa.MS1annotation)
Help on function MS1annotation in module ipaPy2.ipa:
MS1annotation(df, allAdds, ppm, me=0.000548579909065, ratiosd=0.9, ppmunk=None, ratiounk=None, ppmthr=None, pRTNone=None, pRTout=None, ncores=1)
Annotation of the dataset based on the MS1 information. Prior probabilities
are based on mass only, while post probabilities are based on mass, RT,
previous knowledge and isotope patterns.
Parameters
----------
df: pandas dataframe containing the MS1 data. It should be the output of the
function ipa.map_isotope_patterns()
allAdds: pandas dataframe containing the information on all the possible
adducts given the database. It should be the output of either
ipa.compute_all_adducts() or ipa.compute_all_adducts_Parallel()
ppm: accuracy of the MS instrument used
me: accurate mass of the electron. Default 5.48579909065e-04
ratiosd: default 0.9. It represents the acceptable ratio between predicted
intensity and observed intensity of isotopes. It is used to compute
the shape parameters of the lognormal distribution used to
calculate the isotope pattern scores as sqrt(1/ratiosd)
ppmunk: ppm associated to the 'unknown' annotation. If not provided equal
to ppm.
ratiounk: isotope ratio associated to the 'unknown' annotation. If not
provided equal to 0.5
ppmthr: Maximum ppm possible for the annotations. If not provided equal to
2*ppm
pRTNone: Multiplicative factor for the RT if no RTrange present in the
database. If not provided equal to 0.8
pRTout: Multiplicative factor for the RT if measured RT is outside the
RTrange present in the database. If not provided equal to 0.4
ncores: default value 1. Number of cores used
Returns
-------
annotations: a dictionary containing all the possible annotations for the
measured features. The keys of the dictionary are the unique
ids for the features present in df. For each feature, the
annotations are summarized in a pandas dataframe.
annotating based on MS1 information....
0.4 seconds elapsed
This function returns all the possible annotations for all the mass spectrometry features (excluding the ones previously identified as isotopes). The annotations are provided in the form of a dictionary. The keys of the dictionary are the unique ids for the features present in df.
For each feature, all possible annotations are summarised in a dataframe including the following information:
id: Unique id associated with the compound as reported in the database
name: Name of the compound
formula: Chemical formula of the putative annotation
adduct: Adduct type
mz: Theoretical m/z associated with the specific ion
charge: Theoretical charge of the ion
RT range: Retention time range reported in the database for the specific compound
ppm: mass accuracy
isotope pattern score: Score representing how similar the measured and theoretical isotope patterns are
fragmentation pattern score: Cosine similarity. Empty in this case as no MS2 data was provided
prior: Probabilities associated with each possible annotation computed by only considering the mz values (i.e., only considering ppm)
post: Probabilities associated with each possible annotation computed by integrating all the additional information available: retention time range, ppm, isotope pattern score and prior knowledge.
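The ppm column above can be reproduced by hand with a one-line formula; a sketch, assuming the sign convention is observed minus theoretical:

```python
def ppm_error(mz_observed, mz_theoretical):
    """Mass accuracy in parts per million.

    Sign convention assumed here: observed minus theoretical.
    """
    return (mz_observed - mz_theoretical) / mz_theoretical * 1e6

# Feature id=1 (observed m/z 116.0705438) vs the L-Proline M+H ion (116.070605):
print(round(ppm_error(116.0705438, 116.070605), 3))
```

The value obtained (about -0.53 ppm) is close to but not identical to the tabulated -0.523247, so the library presumably applies a slightly different convention (e.g., an electron-mass correction).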
As an example, the possible annotations for the feature associated with id=1 (m/z=116.0705438, RT=45.77) are shown below:
annotations[1]
        id                 name   formula adduct         m/z charge RT range       ppm  isotope pattern score fragmentation pattern score     prior      post
0   C00148            L-Proline  C5H10NO2    M+H  116.070605      1    35;55 -0.523247               0.331946                        None  0.318084  0.454248
1   C00763            D-Proline  C5H10NO2    M+H  116.070605      1     None -0.523247               0.331946                        None  0.318084  0.363398
2   C18170  3-Acetamidopropanal  C5H10NO2    M+H  116.070605      1  500;560 -0.523247               0.331946                        None  0.318084  0.181699
3  Unknown              Unknown      None   None        None   None     None  3.000000               0.004161                        None  0.045748  0.000655
It should be noted that in this example, the prior probabilities associated with L-Proline M+H, D-Proline M+H and 3-Acetamidopropanal M+H are exactly the same. This is because all three ions have exactly the same theoretical mass.
However, the post probabilities are different. This is because the retention time associated with this feature is within the retention range reported in the database for L-Proline and outside the one reported for 3-Acetamidopropanal.
An expert in LC/MS-based metabolomics would argue that, with most chromatographic columns, stereoisomers such as L- and D-Proline share the same RT range. While this is likely to be correct, it must be noted that the IPA method can only use the information present in the database. When populating it, we opted for a more agnostic approach and only included RT ranges for compounds that were actually detected as standards with our experimental setting. If the user wants to include the notion that 'stereoisomers share the same RT ranges', they should simply add this information to the database.
Here is another example:
annotations[999]
        id                    name     formula adduct         m/z charge RT range       ppm  isotope pattern score fragmentation pattern score     prior      post
0   C00079         L-Phenylalanine  C18H23N2O4   2M+H  331.165233      1  120;160 -0.941814               0.472049                        None  0.240106  0.550778
1   C02265         D-Phenylalanine  C18H23N2O4   2M+H  331.165233      1     None -0.941814               0.472049                        None  0.240106  0.440622
2  Unknown                 Unknown        None   None        None   None     None  3.000000               0.055901                        None  0.039575  0.008600
3   C03263  Coproporphyrinogen III  C36H46N4O8   M+2H  331.165233      2     None -0.941814               0.000000                        None  0.240106  0.000000
4   C05768    Coproporphyrinogen I  C36H46N4O8   M+2H  331.165233      2     None -0.941814               0.000000                        None  0.240106  0.000000
Also in this case, the prior probabilities associated with the four ions are exactly the same, since all the ions have the same theoretical mass-to-charge ratio. However, the posterior probabilities are significantly different.
Two of these ions (Coproporphyrinogen III M+2H and Coproporphyrinogen I M+2H) have charge +2, while the other two possible annotations have charge +1. The observed isotope pattern is consistent with an ion of charge +1 (i.e., difference between isotopes ≈ 1), and this is reflected in the isotope pattern scores and consequently in the posterior probabilities. Moreover, the retention time associated with this feature is within the range reported for L-Phenylalanine in the database. Therefore, the posterior probability associated with L-Phenylalanine 2M+H is the highest.
4. Annotation based on MS1 and MS2 information
As already mentioned above, fragmentation data can be included in the annotation process by using the MSMSannotation() function. A detailed description of the function can be accessed through the help:
help(ipa.MSMSannotation)
Help on function MSMSannotation in module ipaPy2.ipa:
MSMSannotation(df, dfMS2, allAdds, DBMS2, ppm, me=0.000548579909065, ratiosd=0.9, ppmunk=None, ratiounk=None, ppmthr=None, pRTNone=None, pRTout=None, mzdCS=0, ppmCS=10, CSunk=0.7, evfilt=False, ncores=1)
Annotation of the dataset based on the MS1 and MS2 information. Prior
probabilities are based on mass only, while post probabilities are based
on mass, RT, previous knowledge and isotope patterns.
Parameters
----------
df: pandas dataframe containing the MS1 data. It should be the output of the
function ipa.map_isotope_patterns()
dfMS2: pandas dataframe containing the MS2 data. It must contain 3 columns
-id: an unique id for each feature for which the MS2 spectrum was
acquired (same as in df)
-spectrum: string containing the spectrum information in the following
format 'mz1:Int1 mz2:Int2 mz3:Int3 ...'
-ev: collision energy used to acquire the fragmentation spectrum
allAdds: pandas dataframe containing the information on all the possible
adducts given the database. It should be the output of either
ipa.compute_all_adducts() or ipa.compute_all_adducts_Parallel()
DBMS2: pandas dataframe containing the database containing the MS2
information
ppm: accuracy of the MS instrument used
me: accurate mass of the electron. Default 5.48579909065e-04
ratiosd: default 0.9. It represents the acceptable ratio between predicted
intensity and observed intensity of isotopes. it is used to compute
the shape parameters of the lognormal distribution used to
calculate the isotope pattern scores as sqrt(1/ratiosd)
ppmunk: ppm associated to the 'unknown' annotation. If not provided equal
to ppm.
ratiounk: isotope ratio associated to the 'unknown' annotation. If not
provided equal to 0.5
ppmthr: Maximum ppm possible for the annotations. If not provided equal to
2*ppm
pRTNone: Multiplicative factor for the RT if no RTrange present in the
database. If not provided equal to 0.8
pRTout: Multiplicative factor for the RT if measured RT is outside the
RTrange present in the database. If not provided equal to 0.4
mzdCS: maximum mz difference allowed when computing cosine similarity
scores. If one wants to use this parameter instead of ppmCS, this
must be set to 0. Default 0.
ppmCS: maximum ppm allowed when computing cosine similarity scores.
If one wants to use this parameter instead of mzdCS, this must be
set to 0. Default 10.
CSunk: cosine similarity score associated with the 'unknown' annotation.
Default 0.7
evfilt: Default value False. If True, only spectra acquired with the same
    collision energy are considered.
ncores: default value 1. Number of cores used
Returns
-------
annotations: a dictionary containing all the possible annotations for the
measured features. The keys of the dictionary are the unique
ids for the features present in df. For each feature, the
annotations are summarized in a pandas dataframe.
The line below integrates the fragmentation data and the fragmentation database introduced above into the annotation process. The role of the CSunk ("cosine unknown") parameter should be briefly discussed here. In most cases, the fragmentation database contains fragmentation spectra only for a subset of the compounds in the database. Therefore, when considering a feature for which the fragmentation spectrum was acquired, it is often the case that the cosine similarity can only be computed for a subset of the possible annotations. The CSunk value is then assigned to all the other possible annotations for comparison.
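The cosine similarity underlying the fragmentation pattern score can be sketched as follows (a simplified stdlib version, not ipaPy2's implementation: peaks are greedily matched within a fixed m/z window, playing the role of mzdCS, whereas the library also supports a ppm-based tolerance via ppmCS):

```python
import math

def parse_spectrum(spectrum):
    """Parse a 'mz:Int mz:Int ...' string into (mz, intensity) pairs."""
    return [tuple(map(float, p.split(':'))) for p in spectrum.split()]

def cosine_similarity(spec1, spec2, mzd=0.01):
    """Simplified cosine similarity between two 'mz:Int' spectra.

    Peaks are greedily matched within an absolute m/z window (mzd);
    a sketch, not ipaPy2's implementation.
    """
    p1, p2 = parse_spectrum(spec1), parse_spectrum(spec2)
    used = set()
    dot = 0.0
    for mz1, i1 in p1:
        for j, (mz2, i2) in enumerate(p2):
            if j not in used and abs(mz1 - mz2) <= mzd:
                dot += i1 * i2
                used.add(j)
                break
    n1 = math.sqrt(sum(i * i for _, i in p1))
    n2 = math.sqrt(sum(i * i for _, i in p2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

s = '70.065:100 116.071:35 90.055:13'  # made-up spectrum
print(round(cosine_similarity(s, s), 3))  # identical spectra -> 1.0
```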
annotating based on MS1 and MS2 information....
0.7 seconds elapsed
The output of this function has the same structure as that of the MS1annotation() function, but it also includes the fragmentation pattern scores where fragmentation data is available.
As an example, the possible annotations for the feature associated with id=1 are shown below:
annotations[1]
        id                 name   formula adduct         m/z charge RT range       ppm  isotope pattern score fragmentation pattern score     prior      post
0   C00148            L-Proline  C5H10NO2    M+H  116.070605      1    35;55 -0.523247               0.331946                    0.999759  0.318084  0.543121
1   C00763            D-Proline  C5H10NO2    M+H  116.070605      1     None -0.523247               0.331946                    0.700000  0.318084  0.304221
2   C18170  3-Acetamidopropanal  C5H10NO2    M+H  116.070605      1  500;560 -0.523247               0.331946                    0.700000  0.318084  0.152110
3  Unknown              Unknown      None   None        None   None     None  3.000000               0.004161                    0.700000  0.045748  0.000548
In this case, the cosine similarity score for the annotation L-Proline M+H is very high, therefore the posterior probability associated with it is higher than the one obtained without considering the MS2 data.
Here is another example, for a feature having a very similar mass-to-charge ratio (id=90, m/z=116.0705223, RT=63.45).
annotations[90]
        id                 name   formula adduct         m/z charge RT range       ppm  isotope pattern score fragmentation pattern score     prior      post
0   C00763            D-Proline  C5H10NO2    M+H  116.070605      1     None -0.708479                   None                    0.700000  0.317329  0.480821
1   C18170  3-Acetamidopropanal  C5H10NO2    M+H  116.070605      1  500;560 -0.708479                   None                    0.700000  0.317329  0.240410
2   C00148            L-Proline  C5H10NO2    M+H  116.070605      1    35;55 -0.708479                   None                    0.599860  0.317329  0.206018
3  Unknown              Unknown      None   None        None   None     None  3.000000                   None                    0.700000  0.048013  0.072751
In this case, the cosine similarity score for the annotation L-Proline M+H is not very high. Moreover, the retention time assigned to this feature is outside both retention time ranges reported in the database (for L-Proline and for 3-Acetamidopropanal). Therefore, the most likely annotation for this feature is D-Proline M+H, the only annotation not directly penalised by the available evidence.
It should be noted that the fragmentation pattern score has a rather weak effect on the posterior probability associated with L-Proline, given how close it is to the fragmentation pattern score assigned to annotations without MS2 information in the database (CSunk=0.7). The main reason the D-Proline annotation appears to be the most likely is that the retention time associated with this feature (63.45 s) is outside the retention time ranges associated with L-Proline and 3-Acetamidopropanal.
Up to this point, the putative annotations and the associated probabilities computed for each feature are independent of each other. However, the IPA method can update these probabilities by considering the possible relationships between annotations.
For example, the Gibbs_sampler_add() function uses a Gibbs sampler to estimate the posterior probabilities obtained by considering all possible adduct connections.
The help() provides a detailed description of this function:
help(ipa.Gibbs_sampler_add)
Help on function Gibbs_sampler_add in module ipaPy2.ipa:
Gibbs_sampler_add(df, annotations, noits=100, burn=None, delta_add=1, all_out=False, zs=None)
Gibbs sampler considering only adduct connections. The function computes
the posterior probabilities of the annotations considering the adducts
connections.
Parameters
----------
df: pandas dataframe containing the MS1 data. It should be the output of the
function ipa.map_isotope_patterns()
annotations: a dictionary containing all the possible annotations for the
measured features. The keys of the dictionary are the unique
ids for the features present in df. For each feature, the
annotations are summarized in a pandas dataframe. Output of
functions MS1annotation(), MS1annotation_Parallel(),
MSMSannotation() or MSMSannotation_Parallel
noits: number of iterations of the Gibbs sampler to be run
burn: number of iterations to be ignored when computing posterior
    probabilities. If None, it is set to 10% of the total iterations
delta_add: parameter used when computing the conditional priors. The
parameter must be positive. The smaller the parameter the more
weight the adducts connections have on the posterior
probabilities. Default 1.
all_out: logical value. If true the list of assignments found in each
iteration is returned by the function. Default False.
zs: list of assignments computed in a previous run of the Gibbs sampler.
Optional, default None.
Returns
-------
annotations: the function modifies the annotations dictionary by adding 2
columns to each entry. One named 'post Gibbs' contains the
posterior probabilities computed. The other is called
'chi-square pval' containing the p-value from a chi-squared
test comparing the 'post' with the 'post Gibbs' probabilities.
zs: optional, if all_out==True, the function returns the full list of
assignments computed. This allows restarting the sampler from where
a previous run finished.
The function modifies the annotations dictionary by adding two additional columns to each dataframe:
post Gibbs: posterior probabilities obtained from the Gibbs sampler.
chi-square pval: to assess whether the posterior probabilities obtained are statistically different from the priors, a chi-square test is used. The resulting p-value is reported in this column.
If all_out=True, the function also returns the full list of assignments computed. If provided as an input to the Gibbs sampler, it allows restarting the sampler from where the previous run finished.
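The idea behind the sampler can be shown with a deliberately minimal sketch. This is NOT the ipaPy2 implementation: the two features, the compounds X/Y/Z and the flat priors are invented for illustration. Two clustered features that support the same compound through different adducts reinforce each other's annotation at every iteration.

```python
# Toy illustration (not the ipaPy2 code) of a Gibbs sampler over annotations.
# Two co-eluting features can both be explained by compound 'X' via different
# adducts, so sampling each feature conditional on the other favours 'X'.
import random

random.seed(0)

# Candidate annotations per feature: (compound, adduct), with flat priors.
candidates = {
    "F1": [("X", "M+H"), ("Y", "M+H")],
    "F2": [("X", "M+Na"), ("Z", "M+H")],
}
priors = {"F1": [0.5, 0.5], "F2": [0.5, 0.5]}
delta_add = 0.1          # smaller delta -> adduct agreement weighs more
noits, burn = 5000, 500  # iterations and burn-in, as in the function above

z = {f: 0 for f in candidates}            # current assignment per feature
counts = {f: [0, 0] for f in candidates}  # post-burn-in tallies

for it in range(noits):
    for f in candidates:
        # compounds currently assigned to the other clustered feature(s)
        others = [candidates[g][z[g]][0] for g in candidates if g != f]
        # conditional prior: boost candidates whose compound is already
        # assigned elsewhere in the cluster (as a different adduct)
        w = [priors[f][i] * (others.count(comp) + delta_add)
             for i, (comp, _) in enumerate(candidates[f])]
        z[f] = random.choices([0, 1], weights=w)[0]
        if it >= burn:
            counts[f][z[f]] += 1

post = {f: [c / sum(counts[f]) for c in counts[f]] for f in counts}
print(post)  # P(F1=X) and P(F2=X) end up well above the 0.5 prior
```

With a small delta_add the agreement between the two features dominates the conditional prior, so the posterior for compound X rises well above its prior, which is the qualitative behaviour illustrated by the example below.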
As an example, the possible annotations for the feature associated with the id 501 are shown below.
annotations[501]
| | id | name | formula | adduct | m/z | charge | RT range | ppm | isotope pattern score | fragmentation pattern score | prior | post | post Gibbs | chi-square pval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00148 | L-Proline | C10H19N2O4 | 2M+H | 231.133933 | 1 | 35;55 | -1.124747 | 0.328045 | None | 0.314538 | 0.453117 | 0.662075 | 4.087367e-186 |
| 1 | C00763 | D-Proline | C10H19N2O4 | 2M+H | 231.133933 | 1 | None | -1.124747 | 0.328045 | None | 0.314538 | 0.362494 | 0.266163 | 4.087367e-186 |
| 2 | C18170 | 3-Acetamidopropanal | C10H19N2O4 | 2M+H | 231.133933 | 1 | 500;560 | -1.124747 | 0.328045 | None | 0.314538 | 0.181247 | 0.071540 | 4.087367e-186 |
| 3 | Unknown | Unknown | None | None | None | None | None | 3.000000 | 0.015864 | None | 0.056386 | 0.003143 | 0.000222 | 4.087367e-186 |
This feature is clustered with feature id=1, the most likely annotation of which is L-Proline M+H. As expected, after considering the adduct connections, the 'post Gibbs' probability associated with L-Proline 2M+H is significantly higher than the alternatives.
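As a quick sanity check, the 2M+H m/z reported for this feature can be reproduced from the monoisotopic mass of proline (C5H9NO2), following the adducts-file convention shown earlier (m/z = Mult * M + Mass, divided by the charge). The element masses below are the standard monoisotopic values.

```python
# Reproduce the 2M+H m/z of proline from first principles.
mono = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.9949146196}

# Monoisotopic mass of proline, C5H9NO2
M = 5 * mono["C"] + 9 * mono["H"] + mono["N"] + 2 * mono["O"]

# 2M+H adduct: Mult = 2, Mass = 1.007276 (proton), Charge = 1
mz = (2 * M + 1.007276) / 1
print(round(mz, 6))  # -> 231.133933, matching the m/z column above
```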
6. Computing posterior probabilities integrating biochemical connections
The IPA method can also update the probabilities associated with each possible annotation by considering all possible biochemical connections.
Before doing so, it is necessary to provide a pandas dataframe reporting which compounds can be considered biochemically related.
The function Compute_Bio() can be used to compute such a dataframe.
The help() provides a detailed description of this function:
help(ipa.Compute_Bio)
Help on function Compute_Bio in module ipaPy2.ipa:
Compute_Bio(DB, annotations=None, mode='reactions', connections=['C3H5NO', 'C6H12N4O', 'C4H6N2O2', 'C4H5NO3', 'C3H5NOS', 'C6H10N2O3S2', 'C5H7NO3', 'C5H8N2O2', 'C2H3NO', 'C6H7N3O', 'C6H11NO', 'C6H11NO', 'C6H12N2O', 'C5H9NOS', 'C9H9NO', 'C5H7NO', 'C3H5NO2', 'C4H7NO2', 'C11H10N2O', 'C9H9NO2', 'C5H9NO', 'C4H4O2', 'C3H5O', 'C10H12N5O6P', 'C10H15N2O3S', 'C10H14N2O2S', 'CH2ON', 'C21H34N7O16P3S', 'C21H33N7O15P3S', 'C10H15N3O5S', 'C5H7', 'C3H2O3', 'C16H30O', 'C8H8NO5P', 'CH3N2O', 'C5H4N5', 'C10H11N5O3', 'C10H13N5O9P2', 'C10H12N5O6P', 'C9H13N3O10P2', 'C9H12N3O7P', 'C4H4N3O', 'C10H13N5O10P2', 'C10H12N5O7P', 'C5H4N5O', 'C10H11N5O4', 'C10H14N2O10P2', 'C10H12N2O4', 'C5H5N2O2', 'C10H13N2O7P', 'C9H12N2O11P2', 'C9H11N2O8P', 'C4H3N2O2', 'C9H10N2O5', 'C2H3O2', 'C2H2O', 'C2H2', 'CO2', 'CHO2', 'H2O', 'H3O6P2', 'C2H4', 'CO', 'C2O2', 'H2', 'O', 'P', 'C2H2O', 'CH2', 'HPO3', 'NH2', 'PP', 'NH', 'SO3', 'N', 'C6H10O5', 'C6H10O6', 'C5H8O4', 'C12H20O11', 'C6H11O8P', 'C6H8O6', 'C6H10O5', 'C18H30O15'], ncores=1)
Compute matrix of biochemical connections. Either based on a list of
possible connections in the form of a list of formulas or based on the
reactions present in the database.
Parameters
----------
DB: pandas dataframe containing the database against which the annotation
is performed. The DB must contain the following columns in this exact
order (optional fields can contain None):
- id: unique id of the database entry (e.g., 'C00031') - necessary
- name: compound name (e.g., 'D-Glucose') - necessary
- formula: chemical formula (e.g., 'C6H12O6') - necessary
- inchi: inchi string - optional
- smiles: smiles string - optional
- RT: if known, retention time range (in seconds) where this compound
is expected to elute (e.g., '30;60') - optional
- adductsPos: list of adducts that should be considered in positive mode
for this entry (e.g.,'M+Na;M+H;M+') - necessary
- adductsNeg: list of adducts that should be considered in negative
mode for this entry (e.g.,'M-H;M-2H') - necessary
- description: comments on the entry - optional
- pk: previous knowledge on the likelihood of this compound to be
present in the sample analysed. The value has to be between 1
(compound likely to be present in the sample) and 0 (compound
cannot be present in the sample).
- MS2: id for the MS2 database entries related to this compound
(optional)
- reactions: list of reactions ids involving this compound
(e.g., 'R00010 R00015 R00028')-optional, but necessary if
mode='reactions'.
annotations: If equal to None (default) all entries in the DB are considered
(used to pre-compute the Bio matrix), alternatively it should be
a dictionary containing all the possible annotations for the
measured features. The keys of the dictionary are the unique ids
for the features present in df. For each feature, the
annotations are summarized in a pandas dataframe. Output of
functions MS1annotation(), MS1annotation_Parallel(),
MSMSannotation() or MSMSannotation_Parallel. In this case
only the entries currently considered as possible annotations
are used.
mode: either 'reactions' (connections are computed based on the reactions
present in the database) or 'connections' (connections are computed
based on the list of connections provided). Default 'reactions'.
connections: list of possible connections between compounds defined as
formulas. Only necessary if mode='connections'. A list of
common biotransformations is provided as default.
ncores: default value 1. Number of cores used
Returns
-------
Bio: dataframe containing all the possible connections computed.
According to the value assigned to the ‘mode’ parameter, the function can compute all possible biochemical connections in two ways.
If mode=’reactions’, the function connects the compounds that share the same reaction id(s) according to what is reported in the database.
Bio = ipa.Compute_Bio(DB,annotations,mode='reactions')
Bio
computing all possible biochemical connections
considering the reactions stored in the database ...
0.0 seconds elapsed
| | 0 | 1 |
|---|---|---|
| 0 | C00082 | C00079 |
| 1 | C00082 | C04368 |
| 2 | C21092 | C00407 |
| 3 | C02265 | C00079 |
| 4 | C00123 | C02486 |
| 5 | C00763 | C00431 |
| 6 | C00079 | C20807 |
| 7 | C00407 | C00183 |
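The reaction-based pairing can be sketched independently of the library. This toy version (not the ipaPy2 code, and the reaction ids attached to each compound here are hypothetical) connects any two entries that share at least one reaction id:

```python
# Toy sketch of Compute_Bio(mode='reactions'): connect database entries
# that share at least one reaction id.
from itertools import combinations

# Hypothetical DB entries: (compound id, space-separated reaction ids)
db = [
    ("C00082", "R00031 R00732"),
    ("C00079", "R00031"),
    ("C04368", "R00732"),
    ("C00407", "R01214"),
]

bio = [
    (id1, id2)
    for (id1, r1), (id2, r2) in combinations(db, 2)
    if set(r1.split()) & set(r2.split())  # shared reaction id -> connected
]
print(bio)  # -> [('C00082', 'C00079'), ('C00082', 'C04368')]
```

In ipaPy2 the result is returned as a two-column pandas dataframe rather than a list of tuples, but the pairing logic is the same.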
If mode=’connections’, the function computes the ‘chemical formula difference’ for each pair of compounds considered. If the difference is included in the list of connections, the two compounds are considered connected.
A list of connections is provided as default, but it can be modified.
Bio = ipa.Compute_Bio(DB,annotations,mode='connections')
Bio
computing all possible biochemical connections
considering the provided connections ...
3.1 seconds elapsed
| | 0 | 1 |
|---|---|---|
| 0 | C04282 | C05131 |
| 1 | C04282 | C22140 |
| 2 | C04282 | C16744 |
| 3 | C01879 | C05131 |
| 4 | C01879 | C22140 |
| 5 | C01879 | C16744 |
| 6 | C01877 | C05131 |
| 7 | C01877 | C22140 |
| 8 | C01877 | C16744 |
| 9 | C05131 | C02237 |
| 10 | C05131 | C04281 |
| 11 | C05131 | C22141 |
| 12 | C02237 | C22140 |
| 13 | C02237 | C16744 |
| 14 | C22140 | C04281 |
| 15 | C22140 | C22141 |
| 16 | C16744 | C04281 |
| 17 | C16744 | C22141 |
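The formula-difference check can be sketched as follows. This is a simplified illustration, not the library code: it assumes the smaller formula is element-wise contained in the larger one, and uses only a small subset of the default biotransformation list.

```python
# Toy sketch of Compute_Bio(mode='connections'): two compounds are
# connected when their chemical formula difference matches one of the
# allowed biotransformations.
import re
from collections import Counter

def parse_formula(formula):
    """'C6H12O6' -> Counter({'C': 6, 'H': 12, 'O': 6})."""
    counts = Counter()
    for elem, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[elem] += int(num) if num else 1
    return counts

def connected(f1, f2, transformations):
    """True if the formula difference equals one of the transformations."""
    c1, c2 = parse_formula(f1), parse_formula(f2)
    # subtract the smaller formula from the larger one
    diff = c1 - c2 if sum(c1.values()) >= sum(c2.values()) else c2 - c1
    return any(diff == parse_formula(t) for t in transformations)

transformations = ["H2O", "C2H2O", "CO2"]  # subset of the default list
print(connected("C6H12O6", "C6H10O5", transformations))  # differ by H2O -> True
print(connected("C6H12O6", "C5H10O5", transformations))  # differ by CH2O -> False
```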
Depending on the size of the database and the dataset, computing all possible biochemical connections can be extremely computationally demanding and can drastically increase the time needed for the annotation. For this reason, a precomputed list of biochemical connections based on the database provided (computed with either the 'reactions' or the 'connections' mode) is included in the library and can be used directly, without the need to compute the biochemical connections.
Bio = pd.read_csv('DB/allBIO_reactions.csv')
The list of connections computed with mode=’connections’ needs to be unzipped first.
import zipfile
with zipfile.ZipFile("DB/allBio_connections.csv.zip", "r") as zip_ref:
    zip_ref.extractall("DB/")
Bio = pd.read_csv('DB/allBio_connections.csv')
Alternatively, the user can define their own biochemical connections.
For example:
L-Proline C00148
L-Valine C00183
L-Phenylalanine C00079
L-Leucine C00123
5-Oxoproline C01879
Betaine C00719
Hordatine A C08307
L-Tyrosine C00082
D-Proline C00763
D-Phenylalanine C02265
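A minimal hand-built Bio dataframe connecting some of the compounds listed above might look like the sketch below. The pairs chosen here are illustrative (the L/D-proline pair reflects the racemisation mentioned later in this tutorial); which compounds to connect is entirely the user's choice.

```python
# Hand-built Bio dataframe: two columns of connected compound ids,
# the same shape as the output of Compute_Bio().
import pandas as pd

Bio = pd.DataFrame([
    ["C00148", "C00763"],  # L-Proline <-> D-Proline (racemisation)
    ["C00079", "C02265"],  # L-Phenylalanine <-> D-Phenylalanine
    ["C00082", "C00079"],  # L-Tyrosine <-> L-Phenylalanine
])
print(Bio)
```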
As an example, the possible annotations for the feature associated with the id 992 are shown below.
annotations[992]
| | id | name | formula | adduct | m/z | charge | RT range | ppm | isotope pattern score | fragmentation pattern score | prior | post | post Gibbs | chi-square pval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00763 | D-Proline | C5H10NO2 | M+H | 116.070605 | 1 | None | -0.65851 | None | 0.7 | 0.317559 | 0.531683 | 0.770667 | 0.0 |
| 2 | C00148 | L-Proline | C5H10NO2 | M+H | 116.070605 | 1 | 35;55 | -0.65851 | None | 0.324512 | 0.317559 | 0.123241 | 0.180889 | 0.0 |
| 1 | C18170 | 3-Acetamidopropanal | C5H10NO2 | M+H | 116.070605 | 1 | 500;560 | -0.65851 | None | 0.7 | 0.317559 | 0.265841 | 0.035556 | 0.0 |
| 3 | Unknown | Unknown | None | None | None | None | None | 3.00000 | None | 0.7 | 0.047324 | 0.079234 | 0.012889 | 0.0 |
The probability associated with the D-Proline M+H is significantly higher after considering the biochemical connections. This is because D-Proline is biochemically connected to L-Proline (by proline racemase), and the most likely annotation for the feature id=1 is L-Proline M+H (>50%).
7. Computing posterior probabilities integrating both adducts and biochemical connections
It is also possible to run the Gibbs sampler considering biochemical and adduct connections at the same time.
To do so, one can use the function Gibbs_sampler_bio_add().
The help() provides a detailed explanation of the function.
help(ipa.Gibbs_sampler_bio_add)
Help on function Gibbs_sampler_bio_add in module ipaPy2.ipa:
Gibbs_sampler_bio_add(df, annotations, Bio, noits=100, burn=None, delta_bio=1, delta_add=1, all_out=False, zs=None)
Gibbs sampler considering both biochemical and adducts connections. The
function computes the posterior probabilities of the annotations
considering the possible biochemical connections reported in Bio and the
possible adducts connection.
Parameters
----------
df: pandas dataframe containing the MS1 data. It should be the output of the
function ipa.map_isotope_patterns()
annotations: a dictionary containing all the possible annotations for the
measured features. The keys of the dictionary are the unique
ids for the features present in df. For each feature, the
annotations are summarized in a pandas dataframe. Output of
functions MS1annotation(), MS1annotation_Parallel(),
MSMSannotation() or MSMSannotation_Parallel
Bio: dataframe (2 columns), reporting all the possible connections between
compounds. It uses the unique ids from the database. It could be the
output of Compute_Bio() or Compute_Bio_Parallel().
noits: number of iterations of the Gibbs sampler to be run
burn: number of iterations to be ignored when computing posterior
probabilities. If None, is set to 10% of total iterations
delta_bio: parameter used when computing the conditional priors.
The parameter must be positive. The smaller the parameter the
more weight the biochemical connections have on the posterior
probabilities. Default 1.
delta_add: parameter used when computing the conditional priors. The
parameter must be positive. The smaller the parameter the more
weight the adducts connections have on the posterior
probabilities. Default 1.
all_out: logical value. If true the list of assignments found in each
iteration is returned by the function. Default False.
zs: list of assignments computed in a previous run of the Gibbs sampler.
Optional, default None.
Returns
-------
annotations: the function modifies the annotations dictionary by adding 2
columns to each entry. One named 'post Gibbs' contains the
posterior probabilities computed. The other is called
'chi-square pval' containing the p-value from a chi-squared
test comparing the 'post' with the 'post Gibbs' probabilities.
zs: optional, if all_out==True, the function returns the full list of
assignments computed. This allows restarting the sampler from where
a previous run finished.
computing posterior probabilities including biochemical and adducts connections
initialising sampler ...
Gibbs Sampler Progress Bar: 100%|██████████| 5000/5000 [00:24<00:00, 204.74it/s]
parsing results ...
Done - 24.5 seconds elapsed
annotations[1]
| | id | name | formula | adduct | m/z | charge | RT range | ppm | isotope pattern score | fragmentation pattern score | prior | post | post Gibbs | chi-square pval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00148 | L-Proline | C5H10NO2 | M+H | 116.070605 | 1 | 35;55 | -0.523247 | 0.331946 | 0.999759 | 0.318084 | 0.543121 | 0.783111 | 1.468730e-271 |
| 1 | C00763 | D-Proline | C5H10NO2 | M+H | 116.070605 | 1 | None | -0.523247 | 0.331946 | 0.7 | 0.318084 | 0.304221 | 0.212444 | 1.468730e-271 |
| 2 | C18170 | 3-Acetamidopropanal | C5H10NO2 | M+H | 116.070605 | 1 | 500;560 | -0.523247 | 0.331946 | 0.7 | 0.318084 | 0.15211 | 0.004444 | 1.468730e-271 |
| 3 | Unknown | Unknown | None | None | None | None | None | 3.000000 | 0.004161 | 0.7 | 0.045748 | 0.000548 | 0.000000 | 1.468730e-271 |
8. Running the whole pipeline with a single function
Finally, the ipaPy2 library also includes a wrapper function that allows running the whole IPA pipeline in one step.
A detailed description of the function can be accessed with the help.
help(ipa.simpleIPA)
Help on function simpleIPA in module ipaPy2.ipa:
simpleIPA(df, ionisation, DB, adductsAll, ppm, dfMS2=None, DBMS2=None, noits=100, burn=None, delta_add=None, delta_bio=None, Bio=None, mode='reactions', CSunk=0.5, isodiff=1, ppmiso=100, ncores=1, me=0.000548579909065, ratiosd=0.9, ppmunk=None, ratiounk=None, ppmthr=None, pRTNone=None, pRTout=None, mzdCS=0, ppmCS=10, evfilt=False, connections=['C3H5NO', 'C6H12N4O', 'C4H6N2O2', 'C4H5NO3', 'C3H5NOS', 'C6H10N2O3S2', 'C5H7NO3', 'C5H8N2O2', 'C2H3NO', 'C6H7N3O', 'C6H11NO', 'C6H11NO', 'C6H12N2O', 'C5H9NOS', 'C9H9NO', 'C5H7NO', 'C3H5NO2', 'C4H7NO2', 'C11H10N2O', 'C9H9NO2', 'C5H9NO', 'C4H4O2', 'C3H5O', 'C10H12N5O6P', 'C10H15N2O3S', 'C10H14N2O2S', 'CH2ON', 'C21H34N7O16P3S', 'C21H33N7O15P3S', 'C10H15N3O5S', 'C5H7', 'C3H2O3', 'C16H30O', 'C8H8NO5P', 'CH3N2O', 'C5H4N5', 'C10H11N5O3', 'C10H13N5O9P2', 'C10H12N5O6P', 'C9H13N3O10P2', 'C9H12N3O7P', 'C4H4N3O', 'C10H13N5O10P2', 'C10H12N5O7P', 'C5H4N5O', 'C10H11N5O4', 'C10H14N2O10P2', 'C10H12N2O4', 'C5H5N2O2', 'C10H13N2O7P', 'C9H12N2O11P2', 'C9H11N2O8P', 'C4H3N2O2', 'C9H10N2O5', 'C2H3O2', 'C2H2O', 'C2H2', 'CO2', 'CHO2', 'H2O', 'H3O6P2', 'C2H4', 'CO', 'C2O2', 'H2', 'O', 'P', 'C2H2O', 'CH2', 'HPO3', 'NH2', 'PP', 'NH', 'SO3', 'N', 'C6H10O5', 'C6H10O6', 'C5H8O4', 'C12H20O11', 'C6H11O8P', 'C6H8O6', 'C6H10O5', 'C18H30O15'])
Wrapper function performing the whole IPA pipeline.
Parameters
----------
df: pandas dataframe containing the MS1 data. It should be the output of the
function ipa.map_isotope_patterns()
DB: pandas dataframe containing the database against which the annotation
is performed. The DB must contain the following columns in this exact
order (optional fields can contain None):
- id: unique id of the database entry (e.g., 'C00031') - necessary
- name: compound name (e.g., 'D-Glucose') - necessary
- formula: chemical formula (e.g., 'C6H12O6') - necessary
- inchi: inchi string - optional
- smiles: smiles string - optional
- RT: if known, retention time range (in seconds) where this compound
is expected to elute (e.g., '30;60') - optional
- adductsPos: list of adducts that should be considered in positive mode
for this entry (e.g.,'M+Na;M+H;M+') - necessary
- adductsNeg: list of adducts that should be considered in negative
mode for this entry (e.g.,'M-H;M-2H') - necessary
- description: comments on the entry - optional
- pk: previous knowledge on the likelihood of this compound to be
present in the sample analysed. The value has to be between 1
(compound likely to be present in the sample) and 0 (compound
cannot be present in the sample).
- MS2: id for the MS2 database entries related to this compound
(optional)
- reactions: list of reactions ids involving this compound
(e.g., 'R00010 R00015 R00028')-optional, but necessary if
mode='reactions'.
adductsAll: a dataframe containing information on all possible adducts.
ppm: accuracy of the MS instrument used
dfMS2: pandas dataframe containing the MS2 data (optional). It must contain
3 columns:
-id: an unique id for each feature for which the MS2 spectrum
was acquired (same as in df)
-spectrum: string containing the spectrum information in the
following format 'mz1:Int1 mz2:Int2 mz3:Int3 ...'
-ev: collision energy used to acquire the fragmentation
spectrum
DBMS2: pandas dataframe containing the database containing the MS2
information (optional)
evfilt: Default value False. If true, only spectra acquired with the same
collision energy are considered.
noits: number of iterations of the Gibbs sampler to be run
burn: number of iterations to be ignored when computing posterior
probabilities. If None, is set to 10% of total iterations
delta_bio: parameter used when computing the conditional priors.
The parameter must be positive. The smaller the parameter the
more weight the biochemical connections have on the posterior
probabilities. Default 1.
delta_add: parameter used when computing the conditional priors. The
parameter must be positive. The smaller the parameter the more
weight the adducts connections have on the posterior
probabilities. Default 1.
Bio: dataframe (2 columns), reporting all the possible connections between
compounds. It uses the unique ids from the database. It could be the
output of Compute_Bio() or Compute_Bio_Parallel().
mode: either 'reactions' (connections are computed based on the reactions
present in the database) or 'connections' (connections are computed
based on the list of connections provided). Default 'reactions'.
CSunk: cosine similarity score associated with the 'unknown' annotation.
Default 0.7
isoDiff: Default value 1. Difference between isotopes of charge 1, does not
need to be exact
ppmiso: Default value 100. Maximum ppm value allowed between 2 isotopes.
It is very high on purpose
ncores: default value 1. Number of cores used
me: accurate mass of the electron. Default 5.48579909065e-04
ratiosd: default 0.9. It represents the acceptable ratio between predicted
intensity and observed intensity of isotopes. it is used to compute
the shape parameters of the lognormal distribution used to
calculate the isotope pattern scores as sqrt(1/ratiosd)
ppmunk: ppm associated to the 'unknown' annotation. If not provided equal
to ppm.
ratiounk: isotope ratio associated to the 'unknown' annotation. If not
provided equal to 0.5
ppmthr: Maximum ppm possible for the annotations. If not provided equal to
2*ppm
pRTNone: Multiplicative factor for the RT if no RTrange present in the
database. If not provided equal to 0.8
pRTout: Multiplicative factor for the RT if measured RT is outside the
RTrange present in the database. If not provided equal to 0.4
mzdCS: maximum mz difference allowed when computing cosine similarity
scores. If one wants to use this parameter instead of ppmCS, this
must be set to 0. Default 0.
ppmCS: maximum ppm allowed when computing cosine similarity scores.
If one wants to use this parameter instead of mzdCS, this must be
set to 0. Default 10.
connections: list of possible connections between compounds defined as
formulas. Only necessary if mode='connections'. A list of
common biotransformations is provided as default.
Output:
annotations: a dictionary containing all the possible annotations for the measured features. The keys of the dictionary are the
unique ids for the features present in df. For each feature, the annotations are summarized in a pandas dataframe.
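The spectrum string format required for dfMS2 ('mz1:Int1 mz2:Int2 ...') and the role of the ppmCS parameter can be illustrated with a toy cosine-similarity sketch. This is not the library's implementation, just the standard dot-product similarity with ppm-based peak matching:

```python
# Toy cosine similarity between two spectra in the dfMS2 string format,
# matching peaks within a ppm tolerance (as the ppmCS parameter does).
import math

def parse_spectrum(s):
    """'70.065:100 116.070:35' -> [(70.065, 100.0), (116.070, 35.0)]."""
    return [(float(mz), float(i)) for mz, i in (p.split(":") for p in s.split())]

def cosine_similarity(s1, s2, ppm=10):
    p1, p2 = parse_spectrum(s1), parse_spectrum(s2)
    dot = 0.0
    for mz1, i1 in p1:
        for mz2, i2 in p2:
            if abs(mz1 - mz2) / mz1 * 1e6 <= ppm:  # peaks match within ppm
                dot += i1 * i2
    n1 = math.sqrt(sum(i * i for _, i in p1))
    n2 = math.sqrt(sum(i * i for _, i in p2))
    return dot / (n1 * n2)

s = "70.065:100 116.070:35"
print(round(cosine_similarity(s, s), 3))  # identical spectra -> 1.0
```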
Depending on the parameters passed to the function, the end result will be different.
For example, running the pipeline with both the MS1 and MS2 data while considering only the adduct connections in the Gibbs sampler produces output like the following:
isotopes already mapped
computing all adducts ....
0.1 seconds elapsed
annotating based on MS1 information....
0.4 seconds elapsed
computing posterior probabilities including adducts connections
initialising sampler ...
Gibbs Sampler Progress Bar: 100%|██████████| 5000/5000 [00:19<00:00, 255.13it/s]
parsing results ...
Done - 19.7 seconds elapsed
Or, if one wants to use both the MS1 and MS2 data and consider both adducts and biochemical connections in the Gibbs sampler, the Bio dataframe together with both delta_bio and delta_add should also be passed to simpleIPA.
ipaPy2
Python implementation of the Integrated Probabilistic Annotation (IPA) - A Bayesian annotation method for LC/MS data integrating biochemical relations, isotope patterns and adduct formation.
Installation
ipaPy2 requires Python 3.9 or higher
Install via pip (recommended)
Install via bioconda
Compiling from source (macOS)
Compiling from source (Linux)
Compiling from source (Windows)
Databases
One of the most powerful features of the IPA method is that it is able to integrate the knowledge gained from previous experiments in the annotation process. There are three files that are used as the IPA database:
1. Adducts file (required)
The ipaPy2 library requires a file contains all the information required for the computation of the adducts. An adducts.csv file is provided with the package here. The file contains the most common adducts. If any exotic adduct (or in-source fragment) needs to be considered, the user must modify the file accordingly. The format required for the adducts file is shown below.
2. MS1 database file (required)
The IPA method requires a pandas dataframe containing the database against which the annotation is performed. This dataframe must contain the following columns in this exact order (optional columns can have empty fields):
The column names must be the ones reported here. While users are strongly advised to build their own ad-hoc database to match their specific instrument setup and sample types, here you can find a relatively big example database.
This example databases was obtained considering the KEGG database, the Natural Products Atlas database and the MoNa database (only compounds having at least one fragmentation spectrum obtained with a QExactive). For each entry, only a handful of the most common adducts are considered. To fully exploit the IPA method, it is strongly recommended to constantly update the database when new knowledge is gained from previous experience. Providing a retention time window for compounds previously detected with the analytical system at hand it is particularly useful. For the sake of the example in this tutorial, a reduced example database is also provided.
3. MS2 database file (only required if MS2 data is available)
This new implementation of the IPA method also allows the user to include MS2 data in the annotation pipeline. In order to exploit this functionality an MS2 spectra database must be provided. The MS2 database must be provided as a pandas dataframe including the following columns in this exact order:
It is necessary that the user uses an MS2 database specific to the instrument used to acquire the data. The MS2 database found here, contains all the MS2 spectra found in the MoNa database acquired with a QExactive. This is a relatively big file, and for the sake of this tutorial a drastically reduced version of it has been included within this repository, and can be found here.
Data preparation
Before using the ipaPy2 package, the processed data coming from an untargeted metabolomics experiment must be properly prepared.
1. MS1 data
The data must be organized in a pandas dataframe containing the following columns:
Below is reported an example:
The clustering of the features is a necessary and must be performed before running the IPA method. For this step, the use of widely used data processing software such as mzMatch and CAMERA is recommended. Nevertheless, the ipaPy2 library provides a function (clusterFeatures()) able to perform such step, starting from a dataframe containing the measured intensities across several samples (at least 3 samples, the more samples the better). Such dataframe should be organized as follows:
All information about the function can be found in the help of the function
After running, this function returns a pandas dataframe in the correct format for the ipaPy2 package
2. MS2 data
If fragmentation data was acquired during the experiment, it can be included in the IPA annotation process. To do so, the data must be organized in a pandas dataframe containing the following columns, in this exact order:
Below is reported an example:
Usage
The Integrated Probabilistic Annotation (IPA) method can be applied in different situations, and the ipaPy2 package allows the users to tailor the IPA pipeline around their specific needs.
This brief tutorial describes the most common scenarios the IPA method can be applied to.
1. Mapping isotope patterns
The first step of the IPA pipeline consists in the mapping of the isotope patterns within the dataset considered. This is achieved through the map_isotope_patterns(). The help of this function provides a detailed description of it.
For the sake of this tutorial, the small dataset example introduced above is considered.
Once finished, this function modifies the pandas dataframe provided as input annotating all isotope patterns.
Some data processing pipelines already have an isotope mapping function and the user can use them as long as they organise the data in the correct format.
2. Compute all adducts
The second step of the pipeline consists in the calculation of all possible adducts that could be formed by the compounds included in the database. This is done by the function compute_all_adducts(). This function comes with a very detailed help.
Depending on the size of the dataset used (i.e., number of compounds included), this step can become rather time-consuming, and the use of multiple cores should be considered. In the context of this tutorial, the heavily reduced example dataset introduced before is considered.
If the same database is used for subsequent experiments without introducing new information, it is recommended to save the results of this function into a .csv file. In this case, the user would need to repeat this step in the future only if the DB changed.
3. Annotation based on MS1 information
At this point, the actual annotation process can start. If no fragmentation data is available, the MS1annotation() function should be used. This function annotates the dataset using the MS1 data and the information stored in the dataset. A detailed description of the function can be accessed through the help:
This function returns all the possible annotations for all the mass spectrometry features (excluding the ones previously identified as isotopes). The annotations are provided in the form of a dictionary. The keys of the dictionary are the unique ids for the features present in df. For each feature, all possible annotations are summarised in a dataframe including the following information:
As an example, possible annotations for the feature associated with id=1 (m/z=116.0705438, RT=45.77) is shown below:
It should be noticed that in this example, the prior probabilities associated with L-Proline M+H, D-Proline M+H and 3-Acetamidopropanal are exactly the same. This is because all three ions have exactly the same theoretical mass. However, the post probabilities are different. This is because the retention time associated with this feature is within the retention range reported in the database for L-Proline and outside the one reported for 3-Acetamidopropanal.
An expert in LC/MS-based mass spectrometry would argue that with most chromatographic columns stereoisomers such as L- and D-Proline would share the same RT range. While this is likely to be correct, it must be noted that the IPA method can only use the information present in the database. When populating it, we opted for a more agnostic approach and only included RT ranges for compounds that where actually detected as standards with our experimental setting. If the user wants to include the notion that ‘stereoisomers share the same RT ranges’, they should simply add this information in the database.
Here another example:
Also in this case, all the prior probabilities associated with the four ions are exactly the same since all the ions have the same theoretical mass-to-charge ratio. However, the posterior probabilities are significantly different. Two of these ions (Coproporphyrinogen III M+2H and Coproporphyrinogen I M+2H) have charge +2, while the other two possible annotations have charge +1. The observed isotope pattern is consistent with an ion with charge +1 (i.e., difference between isotopes = 1), and this is reflected in the isotope score pattern and consequently on the posterior probabilities. Moreover, the retention time associated with this feature is within the range reported for L-Phenylalanine in the database. Therefore, the posterior probability associated with L-Phenylalanine 2M+H is the most highest.
4. Annotation based on MS1 and MS2 information
As already mentioned above, fragmentation data can be included in the annotation process by using the MSMSannotation() function. A detailed description of the function can be accessed through the help:
The line below integrates the fragmentation data and the fragmentation database introduced above in the annotation process. The role of the CSunk (“cosine unknown”) parameter should be briefly discussed here. In most cases, the fragmentation database contains fragmentation spectra only for a subset of the compounds in the database. Therefore, when considering a feature for which the fragmentation spectra was acquired, it is often the case that the cosine similarity can only be computed for a subset of the possible annotations. The CSunk value is then assigned to the other possible annotations for comparison.
The output of this function has the same structure as the one from the MSannotation() function, but it also includes the fragmentation pattern scores when the fragmentation data is available. As an example, possible annotations for the feature associated with id=1 is shown below:
In this case, the cosine similarity score for the annotation L-Proline M+H is very high, therefore the posterior probability associated with it is higher than the one obtained without considering the MS2 data.
Here another example for a feature having a very similar mass-to-charge ratio (id=90, m/z=117.0705223, RT=63.45).
In this case, the cosine similarity score for the annotation L-Proline M+H is not very high. Moreover, the retention time assigned to this feature is outside both retention time ranges reported in the database for L-Proline and 3-Acetamidopropanal. Therefore, the most likely annotation for this feature is D-Proline M+H, the one annotation not rejected directly by the available evidence. It should be noted that the fragmentation pattern score has a rather weak effect on the posterior probability associated with L-Proline, given how close it is to the fragmentation pattern score associated features that do not have MS2 info in the database (CSunk=0.7). The main reason why the D-Proline annotation appears to be the most likely is due to the fact that the retention time associated to this feature (63.45 s) is outside the retention time ranges associated with L-Proline and 3-Acetamidopropanal.
5. Computing posterior probabilities integrating adducts connections
Until this point, the putative annotations and the associated probabilities computed for each feature are independent from each other. However, the IPA method can be used to update the probabilities by considering the possible relationship between annotations. For example, the Gibbs_sampler_add() function uses a Gibbs sampler to estimate the posterior probabilities obtained by considering all possible adduct connections.
The help() provides a detailed description of this function:
The function modifies the annotations dictionary by adding two additional columns to each dataframe:
If all_out=True, the function also returns the full list of assignments computed. If provided as an input to the Gibbs sampler, it allows to restart it from where you finished.
As an example, the possible annotations for the feature associated with the id 501 is shown below.
This feature is clustered with feature id=1, whose most likely annotation is L-Proline M+H. As expected, once the adduct connections are considered, the 'post Gibbs' probability associated with L-Proline 2M+H is significantly higher than the alternatives.
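A quick sanity check shows why an M+H and a 2M+H annotation of the same compound can explain two different m/z values in the same cluster. Assuming the convention implied by the adducts file shown earlier (calc = Mult·M + Mass, divided by the absolute charge), the helper below (a hypothetical name, not an ipaPy2 function) computes the expected m/z of both proline adducts:

```python
# Expected m/z of an adduct, following the adducts-file convention:
#   m/z = (Mult * M_neutral + Mass) / |Charge|
def adduct_mz(neutral_mass, mult, mass_shift, charge=1):
    return (mult * neutral_mass + mass_shift) / abs(charge)

proline = 115.063329  # monoisotopic mass of proline (C5H9NO2)

mz_MH = adduct_mz(proline, mult=1, mass_shift=1.007276)   # M+H
mz_2MH = adduct_mz(proline, mult=2, mass_shift=1.007276)  # 2M+H

print(round(mz_MH, 4), round(mz_2MH, 4))  # → 116.0706 231.1339
```

Two features at these two m/z values can therefore both be explained by proline, which is exactly the kind of evidence the adduct-connection Gibbs sampler rewards.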
6. Computing posterior probabilities integrating biochemical connections
The IPA method can also update the probabilities associated with each possible annotation by considering all possible biochemical connections.
Before doing so, it is necessary to provide a pandas dataframe reporting which compounds can be considered biochemically related. The function Compute_Bio() can be used to compute such a dataframe. The help() provides a detailed description of this function:
According to the value assigned to the 'mode' parameter, the function can compute the biochemical connections in two ways. If mode='reactions', the function connects compounds that share one or more reaction id(s), according to what is reported in the database.
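The logic of the 'reactions' mode can be sketched in a few lines of pandas. This is not the ipaPy2 implementation; the column names and the reaction ids below are placeholders chosen for illustration, and only compounds whose reaction-id sets intersect are paired:

```python
from itertools import combinations

import pandas as pd

# Toy database: each compound lists space-separated reaction ids (placeholders).
db = pd.DataFrame({
    "id": ["C00148", "C00763", "C00183"],
    "name": ["L-Proline", "D-Proline", "L-Valine"],
    "reactions": ["R1 R2", "R2", "R3"],
})

# Connect every pair of compounds sharing at least one reaction id.
rx = {row.id: set(row.reactions.split()) for row in db.itertuples()}
pairs = [(a, b) for a, b in combinations(db["id"], 2) if rx[a] & rx[b]]
Bio = pd.DataFrame(pairs, columns=["id1", "id2"])
print(Bio)
```

Here only L-Proline and D-Proline share a reaction id, so the resulting dataframe contains that single pair.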
If mode='connections', the function computes the 'chemical formula difference' for each pair of compounds considered. If this difference appears in the list of connections, the two compounds are considered connected. A default list of connections is provided, but it can be modified.
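The 'connections' mode can likewise be illustrated with a self-contained sketch (again, not the ipaPy2 implementation; the parsing helper and the abbreviated connection list are assumptions). Two formulas are compared atom by atom, and the pair is connected when one formula contains the other and the leftover atoms match an allowed transformation, such as the CH2 added by a methylation:

```python
import re
from collections import Counter

def parse_formula(formula):
    """Parse a simple chemical formula like 'C3H7NO2' into element counts."""
    return Counter({el: int(n or 1)
                    for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula)})

def formula_diff(f1, f2):
    """Return the chemical formula difference, or None if neither
    formula contains the other."""
    c1, c2 = parse_formula(f1), parse_formula(f2)
    big, small = (c1, c2) if sum(c1.values()) >= sum(c2.values()) else (c2, c1)
    diff = big.copy()
    diff.subtract(small)
    if any(v < 0 for v in diff.values()):
        return None
    return "".join(f"{el}{n if n > 1 else ''}"
                   for el, n in sorted(diff.items()) if n > 0)

# Abbreviated connection list (a real list would contain many more entries).
connections = {"CH2", "O", "H2O"}

# glycine (C2H5NO2) vs alanine (C3H7NO2): the difference is CH2, so connected
print(formula_diff("C3H7NO2", "C2H5NO2"))  # → CH2
```

Running this over every pair of database compounds is quadratic in the database size, which is why, as noted below, the computation can become very demanding.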
Depending on the size of the database and of the dataset, computing all possible biochemical connections can be extremely demanding computationally and can drastically increase the time needed for the annotation. For this reason, precomputed lists of biochemical connections based on the provided database (one for each mode, 'reactions' and 'connections') are included in the library and can be used directly, without the need to compute the biochemical connections.
The list of connections computed with mode='connections' needs to be unzipped first.
Alternatively, the user can define their own biochemical connections between database compounds. For example, the following compounds (with their database ids) could be used: L-Proline (C00148), L-Valine (C00183), L-Phenylalanine (C00079), L-Leucine (C00123), 5-Oxoproline (C01879), Betaine (C00719), Hordatine A (C08307), L-Tyrosine (C00082), D-Proline (C00763), D-Phenylalanine (C02265).
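A user-defined connections dataframe can be built directly with pandas. In this sketch the column names are illustrative, not a documented ipaPy2 requirement; the single pair shown, L-Proline with D-Proline, is motivated by the proline racemase connection discussed in the next example, and each row simply lists two database ids considered biochemically related:

```python
import pandas as pd

# One user-defined biochemical connection per row (illustrative column names).
Bio = pd.DataFrame(
    [["C00148", "C00763"]],  # L-Proline <-> D-Proline (proline racemase)
    columns=["id1", "id2"],
)
print(Bio)
```

Additional rows can be appended for any other pair of compounds the user considers related.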
As an example, the possible annotations for the feature with id 992 are shown below.
The probability associated with D-Proline M+H is significantly higher after considering the biochemical connections. This is because D-Proline is biochemically connected to L-Proline (by proline racemase), and the most likely annotation for feature id=1 is L-Proline M+H (>50%).
7. Computing posterior probabilities integrating both adducts and biochemical connections
It is also possible to run the Gibbs sampler considering biochemical and adduct connections at the same time. To do so, one can use the function Gibbs_sampler_bio_add(). The help() provides a detailed explanation of the function.
8. Running the whole pipeline with a single function
Finally, the ipaPy2 library also includes a wrapper function that allows running the whole IPA pipeline in one step. A detailed description of the function can be accessed with help().
Depending on the parameters passed to it, the function produces different end results.
For example, if one wants to use both the MS1 and MS2 data and not use the Gibbs sampler, the following should be used:
If instead one wants to use only the MS1 data and only consider the adducts connections in the Gibbs sampler, one should use the following:
Or, if one wants to use both the MS1 and MS2 data and consider both adducts and biochemical connections in the Gibbs sampler, the following should be used:
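The way a single wrapper can cover these three configurations can be sketched abstractly. The function below is hypothetical (its name and step labels are illustrative, not the ipaPy2 API); it only shows how the choice of parameters selects which pipeline steps are run:

```python
# Hypothetical sketch of a pipeline wrapper: the parameters decide which
# annotation step and which (if any) Gibbs sampler variant is executed.
def plan_pipeline(use_ms2, gibbs=None):
    """Return the ordered list of steps the wrapper would run."""
    steps = ["compute all adducts"]
    steps.append("MS1+MS2 annotation" if use_ms2 else "MS1-only annotation")
    if gibbs == "adducts":
        steps.append("Gibbs sampler (adduct connections)")
    elif gibbs == "bio+adducts":
        steps.append("Gibbs sampler (adduct and biochemical connections)")
    return steps

# the three configurations described above:
print(plan_pipeline(use_ms2=True))                       # MS1+MS2, no Gibbs
print(plan_pipeline(use_ms2=False, gibbs="adducts"))     # MS1 only, adduct connections
print(plan_pipeline(use_ms2=True, gibbs="bio+adducts"))  # both connection types
```

In the real library the same effect is obtained by passing (or omitting) the MS2 dataframe and the connection-related arguments to the wrapper; consult help() on the wrapper function for the exact parameter names.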