Python implementation of the Integrated Probabilistic Annotation (IPA): a Bayesian annotation method for LC/MS data integrating biochemical relations, isotope patterns and adduct formation.
Installation
ipaPy2 requires Python 3.9 or higher
Install via pip (recommended)
pip install ipaPy2
Install via bioconda
conda install -c bioconda ipaPy2
Compiling from source (macOS)
Create a folder in which you want to put the library
mkdir IPA
cd IPA
Download the library. If Homebrew is not installed on your machine, you can install it from https://brew.sh
brew install git
git clone https://github.com/francescodc87/ipaPy2
cd ipaPy2
Create and activate a virtual environment for the folder, install the necessary libraries and build the package.
The wheel file will be stored in the dist/ folder. You can then install the library in a new terminal as follows
pip install /path/to/wheelfile.whl
Databases
One of the most powerful features of the IPA method is that it can integrate the knowledge gained from previous experiments into the annotation process. There are three files that are used as the IPA database:
1. Adducts file (required)
The ipaPy2 library requires a file containing all the information needed for the computation of the adducts. An adducts.csv file is provided with the package here. The file contains the most common adducts; if any exotic adduct (or in-source fragment) needs to be considered, the user must modify the file accordingly. The format required for the adducts file is shown below.
import pandas as pd
import numpy as np
adducts = pd.read_csv('DB/adducts.csv')
adducts.head()
    name          calc  Charge  Mult       Mass  Ion_mode Formula_add Formula_ded  Multi
0    M+H    M+1.007276       1     1   1.007276  positive          H1       FALSE      1
1  M+NH4   M+18.033823       1     1  18.033823  positive        N1H4       FALSE      1
2   M+Na   M+22.989218       1     1  22.989218  positive         Na1       FALSE      1
3    M+K   M+38.963158       1     1  38.963158  positive          K1       FALSE      1
4     M+  M-0.00054858       1     1  -0.000549  positive       FALSE       FALSE      1
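To illustrate how the columns of the adducts file combine, here is a minimal sketch (not ipaPy2 code) computing the theoretical m/z of an ion from a neutral monoisotopic mass, assuming the usual convention m/z = (Mult · M + Mass) / |Charge|; the glucose mass used below is the standard monoisotopic value, not taken from the package:

```python
# Minimal illustration (not ipaPy2 code) of how the adducts-file columns
# may combine, assuming m/z = (Mult * M + Mass) / |Charge|.

def adduct_mz(neutral_mass, mult, mass_shift, charge):
    """Theoretical m/z of an ion given the adduct parameters."""
    return (mult * neutral_mass + mass_shift) / abs(charge)

# D-Glucose (C6H12O6), monoisotopic mass 180.063388
glucose = 180.063388
print(round(adduct_mz(glucose, 1, 1.007276, 1), 6))   # M+H
print(round(adduct_mz(glucose, 1, 22.989218, 1), 6))  # M+Na
print(round(adduct_mz(glucose, 2, 1.007276, 1), 6))   # 2M+H (Mult = 2)
```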
2. MS1 database file (required)
The IPA method requires a pandas dataframe containing the database against which the annotation is performed.
This dataframe must contain the following columns in this exact order (optional columns can have empty fields):
id: unique id of the database entry (e.g., ‘C00031’) - necessary
name: compound name (e.g., ‘D-Glucose’) - necessary
formula: chemical formula (e.g., ‘C6H12O6’) - necessary
inchi: inchi string - optional
smiles: smiles string - optional
RT: if known, retention time range (in seconds) where this compound is expected to elute (e.g., ‘30;60’) - optional
adductsPos: list of adducts that should be considered in positive mode for this entry (e.g., 'M+Na;M+H;M+') - necessary
adductsNeg: list of adducts that should be considered in negative mode for this entry (e.g., 'M-H;M-2H') - necessary
description: comments on the entry - optional
pk: previous knowledge on the likelihood of this compound being present in the sample analysed. The value must be between 1 (compound highly likely to be present in the sample) and 0 (compound cannot be present in the sample).
MS2: id for the MS2 database entries related to this compound - optional
reactions: list of reaction ids involving this compound (e.g., ‘R00010 R00015 R00028’). If required, these can be used to find possible biochemical connections - optional
The column names must be the ones reported here.
While users are strongly advised to build their own ad-hoc database to match their specific instrument setup and sample types, here you can find a relatively big example database.
DB = pd.read_csv('DB/IPA_MS1.csv')
DB.head()
       id   name        formula                                              inchi smiles   RT          adductsPos          adductsNeg description  pk                     MS2                                          reactions
0  C00002    ATP  C10H16N5O13P3  InChI=1S/C10H16N5O13P3/c11-8-5-9(13-2-12-8)15(...    NaN  NaN  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1   EMBL-MCF_spec365637_1  R00002 R00076 R00085 R00086 R00087 R00088 R000...
1  C00003   NAD+  C21H28N7O14P2  InChI=1S/C21H27N7O14P2/c22-17-12-19(25-7-24-17...    NaN  NaN  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1   EMBL-MCF_specxxxxx_10  R00023 R00090 R00091 R00092 R00093 R00094 R000...
2  C00004   NADH  C21H29N7O14P2  InChI=1S/C21H29N7O14P2/c22-17-12-19(25-7-24-17...    NaN  NaN  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1                     NaN  R00023 R00090 R00091 R00092 R00093 R00094 R000...
3  C00005  NADPH  C21H30N7O17P3  InChI=1S/C21H30N7O17P3/c22-17-12-19(25-7-24-17...    NaN  NaN  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1                     NaN  R00105 R00106 R00107 R00108 R00109 R00111 R001...
4  C00006  NADP+  C21H29N7O17P3  InChI=1S/C21H28N7O17P3/c22-17-12-19(25-7-24-17...    NaN  NaN  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1  EMBL-MCF_specxxxxxx_45  R00104 R00106 R00107 R00108 R00109 R00111 R001...
This example database was obtained by combining the KEGG database, the Natural Products Atlas database and the MoNa database (only compounds having at least one fragmentation spectrum acquired with a QExactive).
For each entry, only a handful of the most common adducts are considered.
To fully exploit the IPA method, it is strongly recommended to constantly update the database as new knowledge is gained from each experiment. Providing a retention time window for compounds previously detected with the analytical system at hand is particularly useful.
For the sake of the example in this tutorial, a reduced example database is also provided.
DB = pd.read_csv('DB/DB_test_pos.csv')
DB.head()
       id             name   formula                                              inchi smiles       RT          adductsPos          adductsNeg description  pk                   MS2                                          reactions
0  C00079  L-Phenylalanine  C9H11NO2  InChI=1S/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-...    NaN  120;160  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1            UA005501_1  R00686 R00688 R00689 R00690 R00691 R00692 R006...
1  C00082       L-Tyrosine  C9H11NO3  InChI=1S/C9H11NO3/c10-8(9(12)13)5-6-1-3-7(11)4...    NaN    50;90  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1            UA005601_1  R00031 R00728 R00729 R00730 R00731 R00732 R007...
2  C00114          Choline   C5H14NO  InChI=1S/C5H14NO/c1-6(2,3)4-5-7/h7H,4-5H2,1-3H...    NaN      NaN  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1                   NaN  R01021 R01022 R01023 R01025 R01026 R01027 R010...
3  C00123        L-Leucine  C6H13NO2  InChI=1S/C6H13NO2/c1-4(2)3-5(7)6(8)9/h4-5H,3,7...    NaN   70;110  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1                   NaN  R01088 R01089 R01090 R01091 R02552 R03657 R084...
4  C00148        L-Proline   C5H9NO2  InChI=1S/C5H9NO2/c7-5(8)4-2-1-3-6-4/h4,6H,1-3H...    NaN    35;55  M+H;M+Na;M+2H;2M+H  M-H;2M-H;M-2H;3M-H         NaN   1  EMBL-MCF_specxxxxx_7  R00135 R00671 R01246 R01248 R01249 R01251 R012...
3. MS2 database file (only required if MS2 data is available)
This new implementation of the IPA method also allows the user to include MS2 data in the annotation pipeline.
In order to exploit this functionality an MS2 spectra database must be provided.
The MS2 database must be provided as a pandas dataframe including the following columns in this exact order:
compound_id: unique id for each compound; it must match the ids used in the MS1 database - necessary
id: unique id for the single entry (i.e., spectra) of the database - necessary
name: compound name (e.g., ‘D-Glucose’) - necessary
formula: chemical formula (e.g., ‘C6H12O6’) - necessary
inchi: inchi string - optional
precursorType: the adduct form of the precursor ion (e.g., ‘M+H’) - necessary
instrument: the type of instrument the spectrum was acquired with - optional
collision.energy: the collision energy level used to acquire the spectrum (e.g., ‘15’) - necessary
spectrum: the actual spectrum in the form of a string in the following format ‘mz1:Int1 mz2:Int2 mz3:Int3 …’
The user must use an MS2 database specific to the instrument used to acquire the data.
The MS2 database found here contains all the MS2 spectra in the MoNa database acquired with a QExactive. This is a relatively big file; for the sake of this tutorial, a drastically reduced version of it has been included within this repository and can be found here.
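The 'mz1:Int1 mz2:Int2 mz3:Int3 …' spectrum encoding used by the MS2 database (and by the MS2 input data described later) can be parsed with a few lines of standard Python; a minimal sketch, not part of the library:

```python
def parse_spectrum(spectrum: str):
    """Parse a 'mz1:Int1 mz2:Int2 ...' string into (mz, intensity) pairs."""
    pairs = []
    for peak in spectrum.split():
        mz, intensity = peak.split(':')
        pairs.append((float(mz), float(intensity)))
    return pairs

# Example spectrum string (made-up peaks)
print(parse_spectrum('70.0651:100.0 116.0706:35.2 90.0550:12.8'))
```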
Before using the ipaPy2 package, the processed data coming from an untargeted metabolomics experiment must be properly prepared.
1. MS1 data
The data must be organized in a pandas dataframe containing the following columns:
ids: a unique numeric id for each mass spectrometry feature
rel.ids: relation ids. Features must be clustered based on correlation/peak shape/retention time. Features in the same cluster are likely to come from the same metabolite.
mzs: mass-to-charge ratios, usually the average across different samples.
RTs: retention times in seconds, usually the average across different samples.
Int: representative (e.g., maximum or average) intensity detected for each feature across samples (either peak area or peak intensity)
The clustering of the features is a necessary step and must be performed before running the IPA method. For this step, the use of widely used data processing software such as mzMatch and CAMERA is recommended.
Nevertheless, the ipaPy2 library provides a function (clusterFeatures()) able to perform such step, starting from a dataframe containing the measured intensities across several samples (at least 3 samples, the more samples the better).
Such a dataframe should contain the ids, mzs and RTs columns, followed by one intensity column per sample (see the help of clusterFeatures() below).
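The intuition behind the clustering criterion can be shown with a toy stdlib computation (illustrative only; the feature intensities below are made up, and Pearson correlation is an assumed similarity measure): co-eluting features from the same metabolite should show strongly correlated intensities across samples, above the Cthr threshold.

```python
# Toy intensities for two features across five samples (made-up numbers).
# Features from the same metabolite should correlate strongly across samples.
feat_a = [1.0e9, 2.1e9, 0.8e9, 3.0e9, 1.5e9]
feat_b = [1.2e8, 2.5e8, 0.9e8, 3.4e8, 1.7e8]  # e.g., its 13C isotopologue

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(round(pearson(feat_a, feat_b), 3))  # close to 1 -> same cluster (Cthr=0.8)
```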
from ipaPy2 import ipa
df=ipa.clusterFeatures(df2)
Clustering features ....
0.0 seconds elapsed
All information about the function can be found in its help:
help(ipa.clusterFeatures)
Help on function clusterFeatures in module ipaPy2.ipa:
clusterFeatures(df, Cthr=0.8, RTwin=1, Intmode='max')
Clustering MS1 features based on correlation across samples.
Parameters
----------
df: pandas dataframe with the following columns:
-ids: a unique id for each feature
-mzs: mass-to-charge ratios, usually the average across different
samples.
-RTs: retention times in seconds, usually the average across different
samples.
-Intensities: for each sample, a column reporting the detected
intensities in each sample.
Cthr: Default value 0.8. Minimum correlation allowed in each cluster
RTwin: Default value 1. Maximum difference in RT time between features in
the same cluster
Intmode: Defines how the representative intensity of each feature is
computed. If 'max' (default) the maximum across samples is used.
If 'ave' the average across samples is computed
Returns
-------
df: pandas dataframe in correct format to be used as an input of the
map_isotope_patterns() function
After running, this function returns a pandas dataframe in the correct format for the ipaPy2 package
df.head()
    ids  rel.ids         mzs        RTs           Int
0     1        0  116.070544  45.770423  2.170017e+09
1    88        0  117.073678  45.787586  1.256520e+08
2   501        0  231.133673  46.183948  2.519223e+07
3  4429        0  232.136923  46.176715  2.635594e+06
4     2        1  104.106830  40.843309  1.889172e+09
2. MS2 data
If fragmentation data was acquired during the experiment, it can be included in the IPA annotation process.
To do so, the data must be organized in a pandas dataframe containing the following columns, in this exact order:
id: a unique id for each feature for which the MS2 spectrum was acquired (same as in the MS1 data)
spectrum: string containing the spectrum information in the following format ‘mz1:Int1 mz2:Int2 mz3:Int3 …’
ev: collision energy used to acquire the fragmentation spectrum
The Integrated Probabilistic Annotation (IPA) method can be applied in different situations, and the ipaPy2 package allows the users to tailor the IPA pipeline around their specific needs.
This brief tutorial describes the most common scenarios the IPA method can be applied to.
1. Mapping isotope patterns
The first step of the IPA pipeline consists in mapping the isotope patterns within the dataset considered. This is achieved through the map_isotope_patterns() function. The help of this function provides a detailed description:
help(ipa.map_isotope_patterns)
Help on function map_isotope_patterns in module ipaPy2.ipa:
map_isotope_patterns(df, isoDiff=1, ppm=100, ionisation=1, MinIsoRatio=0.5)
mapping isotope patterns in MS1 data.
Parameters
----------
df : pandas dataframe (necessary)
A dataframe containing the MS1 data including the following columns:
-ids: a unique id for each feature
-rel.ids: relation ids. In a previous step of the data processing
pipeline, features are clustered based on peak shape
similarity/retention time. Features in the same
cluster are likely to come from the same metabolite.
All isotope patterns must be in the same rel.id
cluster.
-mzs: mass-to-charge ratios, usually the average across
different samples.
-RTs: retention times in seconds, usually the average across
different samples.
-Ints: representative (e.g., maximum or average) intensity detected
for each feature across samples (either peak area or peak
intensity)
isoDiff : Default value 1. Difference between isotopes of charge 1, does
not need to be exact
ppm: Default value 100. Maximum ppm value allowed between 2 isotopes.
It is very high on purpose
ionisation: Default value 1. positive = 1, negative = -1
MinIsoRatio: minimum intensity ratio, expressed as a percentage
    (Default value 0.5). Only isotopes with intensity higher than
    MinIsoRatio% of the main isotope are considered.
Returns
-------
df: the main input is modified by adding and populating the following
columns
- relationship: the possible values are:
* bp: basepeak, most intense peak within each rel id
* bp|isotope: isotope of the basepeak
* potential bp: most intense peak within each isotope
pattern (excluding the basepeak)
* potential bp|isotope: isotope of one potential bp
- isotope pattern: feature used to cluster the different isotope
patterns within the same relation id
- charge: predicted charge based on the isotope pattern (1,2,3,4,5 or
-1,-2,-3,-4,-5 are the only values allowed)
For the sake of this tutorial, the small dataset example introduced above is considered.
ipa.map_isotope_patterns(df,ionisation=1)
mapping isotope patterns ....
0.1 seconds elapsed
Once finished, this function modifies the pandas dataframe provided as input annotating all isotope patterns.
df.head()
    ids  rel.ids         mzs        RTs           Int          relationship  isotope pattern  charge
0     1        0  116.070544  45.770423  2.170017e+09                    bp                0       1
1    88        0  117.073678  45.787586  1.256520e+08            bp|isotope                0       1
2   501        0  231.133673  46.183948  2.519223e+07          potential bp                1       1
3  4429        0  232.136923  46.176715  2.635594e+06  potential bp|isotope                1       1
4     2        1  104.106830  40.843309  1.889172e+09                    bp                0       1
Some data processing pipelines already include an isotope mapping function, and the user can rely on it as long as the data is organised in the correct format.
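The charge prediction reported above can be sketched in a few lines of plain Python (an illustration, not the library's implementation): for an ion of charge z, consecutive isotope peaks are separated by roughly isoDiff/z in m/z, so the charge can be inferred from the observed spacing.

```python
def predict_charge(mz_main, mz_iso, iso_diff=1.0033):
    """Infer charge from the m/z spacing between an ion and its isotope.

    For charge z, isotopes are spaced by ~iso_diff / z in m/z. A rough
    sketch, not ipaPy2's implementation (the docstring above notes that
    the isotope spacing does not need to be exact).
    """
    spacing = abs(mz_iso - mz_main)
    return round(iso_diff / spacing)

# Features 1 and 88 from the example dataset above: spacing ~1.003 -> charge 1
print(predict_charge(116.070544, 117.073678))
# A doubly charged ion would show ~0.5 spacing (made-up m/z pair):
print(predict_charge(331.165233, 331.666883))
```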
2. Compute all adducts
The second step of the pipeline consists in the calculation of all the possible adducts that could be formed by the compounds included in the database.
This is done by the compute_all_adducts() function, which comes with a very detailed help:
help(ipa.compute_all_adducts)
Help on function compute_all_adducts in module ipaPy2.ipa:
compute_all_adducts(adductsAll, DB, ionisation=1, ncores=1)
compute all adducts table based on the information present in the database
Parameters
----------
adductsAll : pandas dataframe (necessary)
Dataframe containing information on all possible
adducts. The file must be in the same format as the example
provided in the DB/adducts.csv
DB : pandas dataframe (necessary)
Dataframe containing the database against which the annotation is
performed. The DB must contain the following columns in this exact
order (optional fields can contain None):
- id: unique id of the database entry (e.g., 'C00031') - necessary
- name: compound name (e.g., 'D-Glucose') - necessary
- formula: chemical formula (e.g., 'C6H12O6') - necessary
- inchi: inchi string - optional
- smiles: smiles string - optional
- RT: if known, retention time range (in seconds) where this
compound is expected to elute (e.g., '30;60') - optional
- adductsPos: list of adducts that should be considered in
positive mode for this entry (e.g.,'M+Na;M+H;M+')
- adductsNeg: list of adducts that should be considered in
negative mode for this entry (e.g.,'M-H;M-2H')
- description: comments on the entry - optional
- pk: previous knowledge on the likelihood of this compound to be
present in the sample analysed. The value has to be between
1 (compound likely to be present in the sample) and 0
(compound cannot be present in the sample).
- MS2: id for the MS2 database entries related to this compound
(optional)
- reactions: list of reactions ids involving this compound
(e.g., 'R00010 R00015 R00028')-optional
ionisation : Default value 1. positive = 1, negative = -1
ncores : default value 1. Number of cores used
Returns
-------
allAdds: pandas dataframe containing the information on all the possible
adducts given the database.
Depending on the size of the database used (i.e., the number of compounds included), this step can become rather time-consuming, and the use of multiple cores should be considered.
In the context of this tutorial, the heavily reduced example database introduced before is considered.
If the same database is used for subsequent experiments without introducing new information, it is recommended to save the results of this function to a .csv file; in that case, this step only needs to be repeated when the database changes.
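Conceptually, compute_all_adducts() expands every database entry into the list of adduct forms declared in its adductsPos (or adductsNeg) field. A rough stdlib illustration of that expansion (not the library code; the entries, masses and adduct rules below are simplified assumptions):

```python
# Toy illustration of the (compound x adduct) expansion performed by
# compute_all_adducts(); not the library code. Monoisotopic masses are
# standard values, entered by hand for this sketch.
compounds = [
    {'id': 'C00079', 'name': 'L-Phenylalanine', 'M': 165.078979,
     'adductsPos': 'M+H;M+Na;2M+H'},
    {'id': 'C00148', 'name': 'L-Proline', 'M': 115.063329,
     'adductsPos': 'M+H;M+Na'},
]
adduct_rules = {  # name -> (Mult, mass shift, Charge), as in adducts.csv
    'M+H':  (1, 1.007276, 1),
    'M+Na': (1, 22.989218, 1),
    '2M+H': (2, 1.007276, 1),
}

all_adds = []
for c in compounds:
    for name in c['adductsPos'].split(';'):
        mult, shift, charge = adduct_rules[name]
        mz = (mult * c['M'] + shift) / abs(charge)
        all_adds.append((c['id'], name, round(mz, 6)))

for row in all_adds:
    print(row)
```

Note how L-Proline M+H comes out at ~116.0706, matching the theoretical m/z shown in the annotation tables later in this tutorial.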
3. Annotation based on MS1 information
At this point, the actual annotation process can start. If no fragmentation data is available, the MS1annotation() function should be used. This function annotates the dataset using the MS1 data and the information stored in the database. A detailed description of the function can be accessed through the help:
help(ipa.MS1annotation)
Help on function MS1annotation in module ipaPy2.ipa:
MS1annotation(df, allAdds, ppm, me=0.000548579909065, ratiosd=0.9, ppmunk=None, ratiounk=None, ppmthr=None, pRTNone=None, pRTout=None, ncores=1)
Annotation of the dataset based on the MS1 information. Prior probabilities
are based on mass only, while post probabilities are based on mass, RT,
previous knowledge and isotope patterns.
Parameters
----------
df: pandas dataframe containing the MS1 data. It should be the output of the
function ipa.map_isotope_patterns()
allAdds: pandas dataframe containing the information on all the possible
adducts given the database. It should be the output of either
ipa.compute_all_adducts() or ipa.compute_all_adducts_Parallel()
ppm: accuracy of the MS instrument used
me: accurate mass of the electron. Default 5.48579909065e-04
ratiosd: default 0.9. It represents the acceptable ratio between predicted
intensity and observed intensity of isotopes. It is used to compute
the shape parameters of the lognormal distribution used to
calculate the isotope pattern scores as sqrt(1/ratiosd)
ppmunk: ppm associated to the 'unknown' annotation. If not provided equal
to ppm.
ratiounk: isotope ratio associated to the 'unknown' annotation. If not
provided equal to 0.5
ppmthr: Maximum ppm possible for the annotations. If not provided equal to
2*ppm
pRTNone: Multiplicative factor for the RT if no RTrange present in the
database. If not provided equal to 0.8
pRTout: Multiplicative factor for the RT if measured RT is outside the
RTrange present in the database. If not provided equal to 0.4
ncores: default value 1. Number of cores used
Returns
-------
annotations: a dictionary containing all the possible annotations for the
measured features. The keys of the dictionary are the unique
ids for the features present in df. For each feature, the
annotations are summarized in a pandas dataframe.
annotating based on MS1 information....
0.4 seconds elapsed
This function returns all the possible annotations for all the mass spectrometry features (excluding the ones previously identified as isotopes). The annotations are provided in the form of a dictionary. The keys of the dictionary are the unique ids for the features present in df.
For each feature, all possible annotations are summarised in a dataframe including the following information:
id: Unique id associated with the compound as reported in the database
name: Name of the compound
formula: Chemical formula of the putative annotation
adduct: Adduct type
mz: Theoretical m/z associated with the specific ion
charge: Theoretical charge of the ion
RT range: Retention time range reported in the database for the specific compound
ppm: mass accuracy
isotope pattern score: Score representing how similar the measured and theoretical isotope patterns are
fragmentation pattern score: Cosine similarity. Empty in this case as no MS2 data was provided
prior: Probabilities associated with each possible annotation computed by only considering the mz values (i.e., only considering ppm)
post: Probabilities associated with each possible annotation computed by integrating all the additional information available: retention time range, ppm, isotope pattern score and prior knowledge.
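The ppm column above can be reproduced by hand with a one-line formula; a sketch, assuming the sign convention is observed minus theoretical:

```python
def ppm_error(mz_observed, mz_theoretical):
    """Mass accuracy in parts per million.

    Sign convention assumed here: observed minus theoretical.
    """
    return (mz_observed - mz_theoretical) / mz_theoretical * 1e6

# Feature id=1 (observed m/z 116.0705438) vs the L-Proline M+H ion (116.070605):
print(round(ppm_error(116.0705438, 116.070605), 3))
```

The value obtained (about -0.53 ppm) is close to but not identical to the tabulated -0.523247, so the library presumably applies a slightly different convention (e.g., an electron-mass correction).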
As an example, the possible annotations for the feature associated with id=1 (m/z=116.0705438, RT=45.77) are shown below:
annotations[1]
        id                 name   formula adduct         m/z charge RT range       ppm  isotope pattern score fragmentation pattern score     prior      post
0   C00148            L-Proline  C5H10NO2    M+H  116.070605      1    35;55 -0.523247               0.331946                        None  0.318084  0.454248
1   C00763            D-Proline  C5H10NO2    M+H  116.070605      1     None -0.523247               0.331946                        None  0.318084  0.363398
2   C18170  3-Acetamidopropanal  C5H10NO2    M+H  116.070605      1  500;560 -0.523247               0.331946                        None  0.318084  0.181699
3  Unknown              Unknown      None   None        None   None     None  3.000000               0.004161                        None  0.045748  0.000655
It should be noted that in this example, the prior probabilities associated with L-Proline M+H, D-Proline M+H and 3-Acetamidopropanal M+H are exactly the same. This is because all three ions have exactly the same theoretical mass.
However, the post probabilities are different. This is because the retention time associated with this feature is within the retention range reported in the database for L-Proline and outside the one reported for 3-Acetamidopropanal.
An expert in LC/MS-based metabolomics would argue that, with most chromatographic columns, stereoisomers such as L- and D-Proline share the same RT range. While this is likely to be correct, it must be noted that the IPA method can only use the information present in the database. When populating it, we opted for a more agnostic approach and only included RT ranges for compounds that were actually detected as standards with our experimental setting. If the user wants to include the notion that 'stereoisomers share the same RT ranges', they should simply add this information to the database.
Here is another example:
annotations[999]
        id                    name     formula adduct         m/z charge RT range       ppm  isotope pattern score fragmentation pattern score     prior      post
0   C00079         L-Phenylalanine  C18H23N2O4   2M+H  331.165233      1  120;160 -0.941814               0.472049                        None  0.240106  0.550778
1   C02265         D-Phenylalanine  C18H23N2O4   2M+H  331.165233      1     None -0.941814               0.472049                        None  0.240106  0.440622
2  Unknown                 Unknown        None   None        None   None     None  3.000000               0.055901                        None  0.039575  0.008600
3   C03263  Coproporphyrinogen III  C36H46N4O8   M+2H  331.165233      2     None -0.941814               0.000000                        None  0.240106  0.000000
4   C05768    Coproporphyrinogen I  C36H46N4O8   M+2H  331.165233      2     None -0.941814               0.000000                        None  0.240106  0.000000
Also in this case, the prior probabilities associated with the four ions are exactly the same, since all the ions have the same theoretical mass-to-charge ratio. However, the posterior probabilities are significantly different.
Two of these ions (Coproporphyrinogen III M+2H and Coproporphyrinogen I M+2H) have charge +2, while the other two possible annotations have charge +1. The observed isotope pattern is consistent with an ion of charge +1 (i.e., difference between isotopes ≈ 1), and this is reflected in the isotope pattern scores and consequently in the posterior probabilities. Moreover, the retention time associated with this feature is within the range reported for L-Phenylalanine in the database. Therefore, the posterior probability associated with L-Phenylalanine 2M+H is the highest.
4. Annotation based on MS1 and MS2 information
As already mentioned above, fragmentation data can be included in the annotation process by using the MSMSannotation() function. A detailed description of the function can be accessed through the help:
help(ipa.MSMSannotation)
Help on function MSMSannotation in module ipaPy2.ipa:
MSMSannotation(df, dfMS2, allAdds, DBMS2, ppm, me=0.000548579909065, ratiosd=0.9, ppmunk=None, ratiounk=None, ppmthr=None, pRTNone=None, pRTout=None, mzdCS=0, ppmCS=10, CSunk=0.7, evfilt=False, ncores=1)
Annotation of the dataset based on the MS1 and MS2 information. Prior
probabilities are based on mass only, while post probabilities are based
on mass, RT, previous knowledge and isotope patterns.
Parameters
----------
df: pandas dataframe containing the MS1 data. It should be the output of the
function ipa.map_isotope_patterns()
dfMS2: pandas dataframe containing the MS2 data. It must contain 3 columns
-id: an unique id for each feature for which the MS2 spectrum was
acquired (same as in df)
-spectrum: string containing the spectrum information in the following
format 'mz1:Int1 mz2:Int2 mz3:Int3 ...'
-ev: collision energy used to acquire the fragmentation spectrum
allAdds: pandas dataframe containing the information on all the possible
adducts given the database. It should be the output of either
ipa.compute_all_adducts() or ipa.compute_all_adducts_Parallel()
DBMS2: pandas dataframe containing the database containing the MS2
information
ppm: accuracy of the MS instrument used
me: accurate mass of the electron. Default 5.48579909065e-04
ratiosd: default 0.9. It represents the acceptable ratio between predicted
intensity and observed intensity of isotopes. it is used to compute
the shape parameters of the lognormal distribution used to
calculate the isotope pattern scores as sqrt(1/ratiosd)
ppmunk: ppm associated to the 'unknown' annotation. If not provided equal
to ppm.
ratiounk: isotope ratio associated to the 'unknown' annotation. If not
provided equal to 0.5
ppmthr: Maximum ppm possible for the annotations. If not provided equal to
2*ppm
pRTNone: Multiplicative factor for the RT if no RTrange present in the
database. If not provided equal to 0.8
pRTout: Multiplicative factor for the RT if measured RT is outside the
RTrange present in the database. If not provided equal to 0.4
mzdCS: maximum mz difference allowed when computing cosine similarity
scores. If one wants to use this parameter instead of ppmCS, this
must be set to 0. Default 0.
ppmCS: maximum ppm allowed when computing cosine similarity scores.
If one wants to use this parameter instead of mzdCS, this must be
set to 0. Default 10.
CSunk: cosine similarity score associated with the 'unknown' annotation.
Default 0.7
evfilt: Default value False. If True, only spectra acquired with the same
    collision energy are considered.
ncores: default value 1. Number of cores used
Returns
-------
annotations: a dictionary containing all the possible annotations for the
measured features. The keys of the dictionary are the unique
ids for the features present in df. For each feature, the
annotations are summarized in a pandas dataframe.
The line below integrates the fragmentation data and the fragmentation database introduced above into the annotation process. The role of the CSunk ("cosine unknown") parameter should be briefly discussed here. In most cases, the fragmentation database contains fragmentation spectra only for a subset of the compounds in the database. Therefore, when considering a feature for which the fragmentation spectrum was acquired, it is often the case that the cosine similarity can only be computed for a subset of the possible annotations. The CSunk value is then assigned to all the other possible annotations for comparison.
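The cosine similarity underlying the fragmentation pattern score can be sketched as follows (a simplified stdlib version, not ipaPy2's implementation: peaks are greedily matched within a fixed m/z window, playing the role of mzdCS, whereas the library also supports a ppm-based tolerance via ppmCS):

```python
import math

def parse_spectrum(spectrum):
    """Parse a 'mz:Int mz:Int ...' string into (mz, intensity) pairs."""
    return [tuple(map(float, p.split(':'))) for p in spectrum.split()]

def cosine_similarity(spec1, spec2, mzd=0.01):
    """Simplified cosine similarity between two 'mz:Int' spectra.

    Peaks are greedily matched within an absolute m/z window (mzd);
    a sketch, not ipaPy2's implementation.
    """
    p1, p2 = parse_spectrum(spec1), parse_spectrum(spec2)
    used = set()
    dot = 0.0
    for mz1, i1 in p1:
        for j, (mz2, i2) in enumerate(p2):
            if j not in used and abs(mz1 - mz2) <= mzd:
                dot += i1 * i2
                used.add(j)
                break
    n1 = math.sqrt(sum(i * i for _, i in p1))
    n2 = math.sqrt(sum(i * i for _, i in p2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

s = '70.065:100 116.071:35 90.055:13'  # made-up spectrum
print(round(cosine_similarity(s, s), 3))  # identical spectra -> 1.0
```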
annotating based on MS1 and MS2 information....
0.7 seconds elapsed
The output of this function has the same structure as that of the MS1annotation() function, but it also includes the fragmentation pattern scores where fragmentation data is available.
As an example, the possible annotations for the feature associated with id=1 are shown below:
annotations[1]
        id                 name   formula adduct         m/z charge RT range       ppm  isotope pattern score fragmentation pattern score     prior      post
0   C00148            L-Proline  C5H10NO2    M+H  116.070605      1    35;55 -0.523247               0.331946                    0.999759  0.318084  0.543121
1   C00763            D-Proline  C5H10NO2    M+H  116.070605      1     None -0.523247               0.331946                    0.700000  0.318084  0.304221
2   C18170  3-Acetamidopropanal  C5H10NO2    M+H  116.070605      1  500;560 -0.523247               0.331946                    0.700000  0.318084  0.152110
3  Unknown              Unknown      None   None        None   None     None  3.000000               0.004161                    0.700000  0.045748  0.000548
In this case, the cosine similarity score for the annotation L-Proline M+H is very high, therefore the posterior probability associated with it is higher than the one obtained without considering the MS2 data.
Here is another example, for a feature having a very similar mass-to-charge ratio (id=90, m/z=116.0705223, RT=63.45).
annotations[90]
        id                 name   formula adduct         m/z charge RT range       ppm  isotope pattern score fragmentation pattern score     prior      post
0   C00763            D-Proline  C5H10NO2    M+H  116.070605      1     None -0.708479                   None                    0.700000  0.317329  0.480821
1   C18170  3-Acetamidopropanal  C5H10NO2    M+H  116.070605      1  500;560 -0.708479                   None                    0.700000  0.317329  0.240410
2   C00148            L-Proline  C5H10NO2    M+H  116.070605      1    35;55 -0.708479                   None                    0.599860  0.317329  0.206018
3  Unknown              Unknown      None   None        None   None     None  3.000000                   None                    0.700000  0.048013  0.072751
In this case, the cosine similarity score for the annotation L-Proline M+H is not very high. Moreover, the retention time assigned to this feature is outside both retention time ranges reported in the database (for L-Proline and for 3-Acetamidopropanal). Therefore, the most likely annotation for this feature is D-Proline M+H, the only annotation not directly penalised by the available evidence.
It should be noted that the fragmentation pattern score has a rather weak effect on the posterior probability associated with L-Proline, given how close it is to the fragmentation pattern score assigned to annotations without MS2 information in the database (CSunk=0.7). The main reason the D-Proline annotation appears to be the most likely is that the retention time associated with this feature (63.45 s) is outside the retention time ranges associated with L-Proline and 3-Acetamidopropanal.
Up to this point, the putative annotations and the associated probabilities computed for each feature are independent of each other. However, the IPA method can update these probabilities by considering the possible relationships between annotations.
For example, the Gibbs_sampler_add() function uses a Gibbs sampler to estimate the posterior probabilities obtained by considering all possible adduct connections.
The help() provides a detailed description of this function:
help(ipa.Gibbs_sampler_add)
Help on function Gibbs_sampler_add in module ipaPy2.ipa:
Gibbs_sampler_add(df, annotations, noits=100, burn=None, delta_add=1, all_out=False, zs=None)
Gibbs sampler considering only adduct connections. The function computes
the posterior probabilities of the annotations considering the adducts
connections.
Parameters
----------
df: pandas dataframe containing the MS1 data. It should be the output of the
function ipa.map_isotope_patterns()
annotations: a dictionary containing all the possible annotations for the
measured features. The keys of the dictionary are the unique
ids for the features present in df. For each feature, the
annotations are summarized in a pandas dataframe. Output of
functions MS1annotation(), MS1annotation_Parallel(),
MSMSannotation() or MSMSannotation_Parallel
noits: number of iterations of the Gibbs sampler to be run
burn: number of iterations to be ignored when computing posterior
    probabilities. If None, it is set to 10% of the total iterations
delta_add: parameter used when computing the conditional priors. The
parameter must be positive. The smaller the parameter the more
weight the adducts connections have on the posterior
probabilities. Default 1.
all_out: logical value. If true the list of assignments found in each
iteration is returned by the function. Default False.
zs: list of assignments computed in a previous run of the Gibbs sampler.
Optional, default None.
Returns
-------
annotations: the function modifies the annotations dictionary by adding 2
columns to each entry. One named 'post Gibbs' contains the
posterior probabilities computed. The other is called
'chi-square pval' containing the p-value from a chi-squared
test comparing the 'post' with the 'post Gibbs' probabilities.
zs: optional, if all_out==True, the function returns the full list of
assignments computed. This allows restarting the sampler from where
a previous run finished.
The function modifies the annotations dictionary by adding two additional columns to each dataframe:
post Gibbs: posterior probabilities obtained from the Gibbs sampler.
chi-square pval: to assess whether the posterior probabilities obtained are statistically different from the priors, a chi-square test is used. The resulting p-value is reported in this column.
If all_out=True, the function also returns the full list of assignments computed. If provided as an input to the Gibbs sampler, it allows restarting the sampler from where the previous run finished.
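The idea behind the sampler can be shown with a deliberately minimal sketch. This is NOT the ipaPy2 implementation: the two features, the compounds X/Y/Z and the flat priors are invented for illustration. Two clustered features that support the same compound through different adducts reinforce each other's annotation at every iteration.

```python
# Toy illustration (not the ipaPy2 code) of a Gibbs sampler over annotations.
# Two co-eluting features can both be explained by compound 'X' via different
# adducts, so sampling each feature conditional on the other favours 'X'.
import random

random.seed(0)

# Candidate annotations per feature: (compound, adduct), with flat priors.
candidates = {
    "F1": [("X", "M+H"), ("Y", "M+H")],
    "F2": [("X", "M+Na"), ("Z", "M+H")],
}
priors = {"F1": [0.5, 0.5], "F2": [0.5, 0.5]}
delta_add = 0.1          # smaller delta -> adduct agreement weighs more
noits, burn = 5000, 500  # iterations and burn-in, as in the function above

z = {f: 0 for f in candidates}            # current assignment per feature
counts = {f: [0, 0] for f in candidates}  # post-burn-in tallies

for it in range(noits):
    for f in candidates:
        # compounds currently assigned to the other clustered feature(s)
        others = [candidates[g][z[g]][0] for g in candidates if g != f]
        # conditional prior: boost candidates whose compound is already
        # assigned elsewhere in the cluster (as a different adduct)
        w = [priors[f][i] * (others.count(comp) + delta_add)
             for i, (comp, _) in enumerate(candidates[f])]
        z[f] = random.choices([0, 1], weights=w)[0]
        if it >= burn:
            counts[f][z[f]] += 1

post = {f: [c / sum(counts[f]) for c in counts[f]] for f in counts}
print(post)  # P(F1=X) and P(F2=X) end up well above the 0.5 prior
```

With a small delta_add the agreement between the two features dominates the conditional prior, so the posterior for compound X rises well above its prior, which is the qualitative behaviour illustrated by the example below.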
As an example, the possible annotations for the feature associated with the id 501 are shown below.
annotations[501]
| | id | name | formula | adduct | m/z | charge | RT range | ppm | isotope pattern score | fragmentation pattern score | prior | post | post Gibbs | chi-square pval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00148 | L-Proline | C10H19N2O4 | 2M+H | 231.133933 | 1 | 35;55 | -1.124747 | 0.328045 | None | 0.314538 | 0.453117 | 0.662075 | 4.087367e-186 |
| 1 | C00763 | D-Proline | C10H19N2O4 | 2M+H | 231.133933 | 1 | None | -1.124747 | 0.328045 | None | 0.314538 | 0.362494 | 0.266163 | 4.087367e-186 |
| 2 | C18170 | 3-Acetamidopropanal | C10H19N2O4 | 2M+H | 231.133933 | 1 | 500;560 | -1.124747 | 0.328045 | None | 0.314538 | 0.181247 | 0.071540 | 4.087367e-186 |
| 3 | Unknown | Unknown | None | None | None | None | None | 3.000000 | 0.015864 | None | 0.056386 | 0.003143 | 0.000222 | 4.087367e-186 |
This feature is clustered with feature id=1, the most likely annotation of which is L-Proline M+H. As expected, after considering the adduct connections, the 'post Gibbs' probability associated with L-Proline 2M+H is significantly higher than the alternatives.
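As a quick sanity check, the 2M+H m/z reported for this feature can be reproduced from the monoisotopic mass of proline (C5H9NO2), following the adducts-file convention shown earlier (m/z = Mult * M + Mass, divided by the charge). The element masses below are the standard monoisotopic values.

```python
# Reproduce the 2M+H m/z of proline from first principles.
mono = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.9949146196}

# Monoisotopic mass of proline, C5H9NO2
M = 5 * mono["C"] + 9 * mono["H"] + mono["N"] + 2 * mono["O"]

# 2M+H adduct: Mult = 2, Mass = 1.007276 (proton), Charge = 1
mz = (2 * M + 1.007276) / 1
print(round(mz, 6))  # -> 231.133933, matching the m/z column above
```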
6. Computing posterior probabilities integrating biochemical connections
The IPA method can also update the probabilities associated with each possible annotation by considering all possible biochemical connections.
Before doing so, it is necessary to provide a pandas dataframe reporting which compounds can be considered biochemically related.
The function Compute_Bio() can be used to compute such a dataframe.
The help() provides a detailed description of this function:
help(ipa.Compute_Bio)
Help on function Compute_Bio in module ipaPy2.ipa:
Compute_Bio(DB, annotations=None, mode='reactions', connections=['C3H5NO', 'C6H12N4O', 'C4H6N2O2', 'C4H5NO3', 'C3H5NOS', 'C6H10N2O3S2', 'C5H7NO3', 'C5H8N2O2', 'C2H3NO', 'C6H7N3O', 'C6H11NO', 'C6H11NO', 'C6H12N2O', 'C5H9NOS', 'C9H9NO', 'C5H7NO', 'C3H5NO2', 'C4H7NO2', 'C11H10N2O', 'C9H9NO2', 'C5H9NO', 'C4H4O2', 'C3H5O', 'C10H12N5O6P', 'C10H15N2O3S', 'C10H14N2O2S', 'CH2ON', 'C21H34N7O16P3S', 'C21H33N7O15P3S', 'C10H15N3O5S', 'C5H7', 'C3H2O3', 'C16H30O', 'C8H8NO5P', 'CH3N2O', 'C5H4N5', 'C10H11N5O3', 'C10H13N5O9P2', 'C10H12N5O6P', 'C9H13N3O10P2', 'C9H12N3O7P', 'C4H4N3O', 'C10H13N5O10P2', 'C10H12N5O7P', 'C5H4N5O', 'C10H11N5O4', 'C10H14N2O10P2', 'C10H12N2O4', 'C5H5N2O2', 'C10H13N2O7P', 'C9H12N2O11P2', 'C9H11N2O8P', 'C4H3N2O2', 'C9H10N2O5', 'C2H3O2', 'C2H2O', 'C2H2', 'CO2', 'CHO2', 'H2O', 'H3O6P2', 'C2H4', 'CO', 'C2O2', 'H2', 'O', 'P', 'C2H2O', 'CH2', 'HPO3', 'NH2', 'PP', 'NH', 'SO3', 'N', 'C6H10O5', 'C6H10O6', 'C5H8O4', 'C12H20O11', 'C6H11O8P', 'C6H8O6', 'C6H10O5', 'C18H30O15'], ncores=1)
Compute matrix of biochemical connections. Either based on a list of
possible connections in the form of a list of formulas or based on the
reactions present in the database.
Parameters
----------
DB: pandas dataframe containing the database against which the annotation
is performed. The DB must contain the following columns in this exact
order (optional fields can contain None):
- id: unique id of the database entry (e.g., 'C00031') - necessary
- name: compound name (e.g., 'D-Glucose') - necessary
- formula: chemical formula (e.g., 'C6H12O6') - necessary
- inchi: inchi string - optional
- smiles: smiles string - optional
- RT: if known, retention time range (in seconds) where this compound
is expected to elute (e.g., '30;60') - optional
- adductsPos: list of adducts that should be considered in positive mode
for this entry (e.g.,'M+Na;M+H;M+') - necessary
- adductsNeg: list of adducts that should be considered in negative
mode for this entry (e.g.,'M-H;M-2H') - necessary
- description: comments on the entry - optional
- pk: previous knowledge on the likelihood of this compound to be
present in the sample analysed. The value has to be between 1
(compound likely to be present in the sample) and 0 (compound
cannot be present in the sample).
- MS2: id for the MS2 database entries related to this compound
(optional)
- reactions: list of reactions ids involving this compound
(e.g., 'R00010 R00015 R00028')-optional, but necessary if
mode='reactions'.
annotations: If equal to None (default) all entries in the DB are considered
(used to pre-compute the Bio matrix), alternatively it should be
a dictionary containing all the possible annotations for the
measured features. The keys of the dictionary are the unique ids
for the features present in df. For each feature, the
annotations are summarized in a pandas dataframe. Output of
functions MS1annotation(), MS1annotation_Parallel(),
MSMSannotation() or MSMSannotation_Parallel. In this case
only the entries currently considered as possible annotations
are used.
mode: either 'reactions' (connections are computed based on the reactions
present in the database) or 'connections' (connections are computed
based on the list of connections provided). Default 'reactions'.
connections: list of possible connections between compounds defined as
formulas. Only necessary if mode='connections'. A list of
common biotransformations is provided as default.
ncores: default value 1. Number of cores used
Returns
-------
Bio: dataframe containing all the possible connections computed.
According to the value assigned to the ‘mode’ parameter, the function can compute all possible biochemical connections in two ways.
If mode=’reactions’, the function connects the compounds that share the same reaction id(s) according to what is reported in the database.
Bio = ipa.Compute_Bio(DB,annotations,mode='reactions')
Bio
computing all possible biochemical connections
considering the reactions stored in the database ...
0.0 seconds elapsed
| | 0 | 1 |
|---|---|---|
| 0 | C00082 | C00079 |
| 1 | C00082 | C04368 |
| 2 | C21092 | C00407 |
| 3 | C02265 | C00079 |
| 4 | C00123 | C02486 |
| 5 | C00763 | C00431 |
| 6 | C00079 | C20807 |
| 7 | C00407 | C00183 |
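The reaction-based pairing can be sketched independently of the library. This toy version (not the ipaPy2 code, and the reaction ids attached to each compound here are hypothetical) connects any two entries that share at least one reaction id:

```python
# Toy sketch of Compute_Bio(mode='reactions'): connect database entries
# that share at least one reaction id.
from itertools import combinations

# Hypothetical DB entries: (compound id, space-separated reaction ids)
db = [
    ("C00082", "R00031 R00732"),
    ("C00079", "R00031"),
    ("C04368", "R00732"),
    ("C00407", "R01214"),
]

bio = [
    (id1, id2)
    for (id1, r1), (id2, r2) in combinations(db, 2)
    if set(r1.split()) & set(r2.split())  # shared reaction id -> connected
]
print(bio)  # -> [('C00082', 'C00079'), ('C00082', 'C04368')]
```

In ipaPy2 the result is returned as a two-column pandas dataframe rather than a list of tuples, but the pairing logic is the same.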
If mode=’connections’, the function computes the ‘chemical formula difference’ for each pair of compounds considered. If the difference is included in the list of connections, the two compounds are considered connected.
A list of connections is provided as default, but it can be modified.
Bio = ipa.Compute_Bio(DB,annotations,mode='connections')
Bio
computing all possible biochemical connections
considering the provided connections ...
3.1 seconds elapsed
| | 0 | 1 |
|---|---|---|
| 0 | C04282 | C05131 |
| 1 | C04282 | C22140 |
| 2 | C04282 | C16744 |
| 3 | C01879 | C05131 |
| 4 | C01879 | C22140 |
| 5 | C01879 | C16744 |
| 6 | C01877 | C05131 |
| 7 | C01877 | C22140 |
| 8 | C01877 | C16744 |
| 9 | C05131 | C02237 |
| 10 | C05131 | C04281 |
| 11 | C05131 | C22141 |
| 12 | C02237 | C22140 |
| 13 | C02237 | C16744 |
| 14 | C22140 | C04281 |
| 15 | C22140 | C22141 |
| 16 | C16744 | C04281 |
| 17 | C16744 | C22141 |
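The formula-difference check can be sketched as follows. This is a simplified illustration, not the library code: it assumes the smaller formula is element-wise contained in the larger one, and uses only a small subset of the default biotransformation list.

```python
# Toy sketch of Compute_Bio(mode='connections'): two compounds are
# connected when their chemical formula difference matches one of the
# allowed biotransformations.
import re
from collections import Counter

def parse_formula(formula):
    """'C6H12O6' -> Counter({'C': 6, 'H': 12, 'O': 6})."""
    counts = Counter()
    for elem, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[elem] += int(num) if num else 1
    return counts

def connected(f1, f2, transformations):
    """True if the formula difference equals one of the transformations."""
    c1, c2 = parse_formula(f1), parse_formula(f2)
    # subtract the smaller formula from the larger one
    diff = c1 - c2 if sum(c1.values()) >= sum(c2.values()) else c2 - c1
    return any(diff == parse_formula(t) for t in transformations)

transformations = ["H2O", "C2H2O", "CO2"]  # subset of the default list
print(connected("C6H12O6", "C6H10O5", transformations))  # differ by H2O -> True
print(connected("C6H12O6", "C5H10O5", transformations))  # differ by CH2O -> False
```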
Depending on the size of the database and the dataset, computing all possible biochemical connections can be extremely computationally demanding and can drastically increase the time needed for the annotation. For this reason, a precomputed list of biochemical connections based on the database provided (computed with either the 'reactions' or the 'connections' mode) is included in the library and can be used directly, without the need to compute the biochemical connections.
Bio = pd.read_csv('DB/allBIO_reactions.csv')
The list of connections computed with mode=’connections’ needs to be unzipped first.
import zipfile
with zipfile.ZipFile("DB/allBio_connections.csv.zip", "r") as zip_ref:
    zip_ref.extractall("DB/")
Bio = pd.read_csv('DB/allBio_connections.csv')
Alternatively, the user can define their own biochemical connections.
For example:
L-Proline C00148
L-Valine C00183
L-Phenylalanine C00079
L-Leucine C00123
5-Oxoproline C01879
Betaine C00719
Hordatine A C08307
L-Tyrosine C00082
D-Proline C00763
D-Phenylalanine C02265
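A minimal hand-built Bio dataframe connecting some of the compounds listed above might look like the sketch below. The pairs chosen here are illustrative (the L/D-proline pair reflects the racemisation mentioned later in this tutorial); which compounds to connect is entirely the user's choice.

```python
# Hand-built Bio dataframe: two columns of connected compound ids,
# the same shape as the output of Compute_Bio().
import pandas as pd

Bio = pd.DataFrame([
    ["C00148", "C00763"],  # L-Proline <-> D-Proline (racemisation)
    ["C00079", "C02265"],  # L-Phenylalanine <-> D-Phenylalanine
    ["C00082", "C00079"],  # L-Tyrosine <-> L-Phenylalanine
])
print(Bio)
```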
As an example, the possible annotations for the feature associated with the id 992 are shown below.
annotations[992]
| | id | name | formula | adduct | m/z | charge | RT range | ppm | isotope pattern score | fragmentation pattern score | prior | post | post Gibbs | chi-square pval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00763 | D-Proline | C5H10NO2 | M+H | 116.070605 | 1 | None | -0.65851 | None | 0.7 | 0.317559 | 0.531683 | 0.770667 | 0.0 |
| 2 | C00148 | L-Proline | C5H10NO2 | M+H | 116.070605 | 1 | 35;55 | -0.65851 | None | 0.324512 | 0.317559 | 0.123241 | 0.180889 | 0.0 |
| 1 | C18170 | 3-Acetamidopropanal | C5H10NO2 | M+H | 116.070605 | 1 | 500;560 | -0.65851 | None | 0.7 | 0.317559 | 0.265841 | 0.035556 | 0.0 |
| 3 | Unknown | Unknown | None | None | None | None | None | 3.00000 | None | 0.7 | 0.047324 | 0.079234 | 0.012889 | 0.0 |
The probability associated with the D-Proline M+H is significantly higher after considering the biochemical connections. This is because D-Proline is biochemically connected to L-Proline (by proline racemase), and the most likely annotation for the feature id=1 is L-Proline M+H (>50%).
7. Computing posterior probabilities integrating both adducts and biochemical connections
It is also possible to run the Gibbs sampler considering biochemical and adduct connections at the same time.
To do so, one can use the function Gibbs_sampler_bio_add().
The help() provides a detailed explanation of the function.
help(ipa.Gibbs_sampler_bio_add)
Help on function Gibbs_sampler_bio_add in module ipaPy2.ipa:
Gibbs_sampler_bio_add(df, annotations, Bio, noits=100, burn=None, delta_bio=1, delta_add=1, all_out=False, zs=None)
Gibbs sampler considering both biochemical and adducts connections. The
function computes the posterior probabilities of the annotations
considering the possible biochemical connections reported in Bio and the
possible adducts connection.
Parameters
----------
df: pandas dataframe containing the MS1 data. It should be the output of the
function ipa.map_isotope_patterns()
annotations: a dictionary containing all the possible annotations for the
measured features. The keys of the dictionary are the unique
ids for the features present in df. For each feature, the
annotations are summarized in a pandas dataframe. Output of
functions MS1annotation(), MS1annotation_Parallel(),
MSMSannotation() or MSMSannotation_Parallel
Bio: dataframe (2 columns), reporting all the possible connections between
compounds. It uses the unique ids from the database. It could be the
output of Compute_Bio() or Compute_Bio_Parallel().
noits: number of iterations of the Gibbs sampler to be run
burn: number of iterations to be ignored when computing posterior
probabilities. If None, is set to 10% of total iterations
delta_bio: parameter used when computing the conditional priors.
The parameter must be positive. The smaller the parameter the
more weight the biochemical connections have on the posterior
probabilities. Default 1.
delta_add: parameter used when computing the conditional priors. The
parameter must be positive. The smaller the parameter the more
weight the adducts connections have on the posterior
probabilities. Default 1.
all_out: logical value. If true the list of assignments found in each
iteration is returned by the function. Default False.
zs: list of assignments computed in a previous run of the Gibbs sampler.
Optional, default None.
Returns
-------
annotations: the function modifies the annotations dictionary by adding 2
columns to each entry. One named 'post Gibbs' contains the
posterior probabilities computed. The other is called
'chi-square pval' containing the p-value from a chi-squared
test comparing the 'post' with the 'post Gibbs' probabilities.
zs: optional, if all_out==True, the function returns the full list of
assignments computed. This allows restarting the sampler from where
a previous run finished.
computing posterior probabilities including biochemical and adducts connections
initialising sampler ...
Gibbs Sampler Progress Bar: 100%|██████████| 5000/5000 [00:24<00:00, 204.74it/s]
parsing results ...
Done - 24.5 seconds elapsed
annotations[1]
| | id | name | formula | adduct | m/z | charge | RT range | ppm | isotope pattern score | fragmentation pattern score | prior | post | post Gibbs | chi-square pval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00148 | L-Proline | C5H10NO2 | M+H | 116.070605 | 1 | 35;55 | -0.523247 | 0.331946 | 0.999759 | 0.318084 | 0.543121 | 0.783111 | 1.468730e-271 |
| 1 | C00763 | D-Proline | C5H10NO2 | M+H | 116.070605 | 1 | None | -0.523247 | 0.331946 | 0.7 | 0.318084 | 0.304221 | 0.212444 | 1.468730e-271 |
| 2 | C18170 | 3-Acetamidopropanal | C5H10NO2 | M+H | 116.070605 | 1 | 500;560 | -0.523247 | 0.331946 | 0.7 | 0.318084 | 0.15211 | 0.004444 | 1.468730e-271 |
| 3 | Unknown | Unknown | None | None | None | None | None | 3.000000 | 0.004161 | 0.7 | 0.045748 | 0.000548 | 0.000000 | 1.468730e-271 |
8. Running the whole pipeline with a single function
Finally, the ipaPy2 library also includes a wrapper function that allows running the whole IPA pipeline in one step.
A detailed description of the function can be accessed with the help.
help(ipa.simpleIPA)
Help on function simpleIPA in module ipaPy2.ipa:
simpleIPA(df, ionisation, DB, adductsAll, ppm, dfMS2=None, DBMS2=None, noits=100, burn=None, delta_add=None, delta_bio=None, Bio=None, mode='reactions', CSunk=0.5, isodiff=1, ppmiso=100, ncores=1, me=0.000548579909065, ratiosd=0.9, ppmunk=None, ratiounk=None, ppmthr=None, pRTNone=None, pRTout=None, mzdCS=0, ppmCS=10, evfilt=False, connections=['C3H5NO', 'C6H12N4O', 'C4H6N2O2', 'C4H5NO3', 'C3H5NOS', 'C6H10N2O3S2', 'C5H7NO3', 'C5H8N2O2', 'C2H3NO', 'C6H7N3O', 'C6H11NO', 'C6H11NO', 'C6H12N2O', 'C5H9NOS', 'C9H9NO', 'C5H7NO', 'C3H5NO2', 'C4H7NO2', 'C11H10N2O', 'C9H9NO2', 'C5H9NO', 'C4H4O2', 'C3H5O', 'C10H12N5O6P', 'C10H15N2O3S', 'C10H14N2O2S', 'CH2ON', 'C21H34N7O16P3S', 'C21H33N7O15P3S', 'C10H15N3O5S', 'C5H7', 'C3H2O3', 'C16H30O', 'C8H8NO5P', 'CH3N2O', 'C5H4N5', 'C10H11N5O3', 'C10H13N5O9P2', 'C10H12N5O6P', 'C9H13N3O10P2', 'C9H12N3O7P', 'C4H4N3O', 'C10H13N5O10P2', 'C10H12N5O7P', 'C5H4N5O', 'C10H11N5O4', 'C10H14N2O10P2', 'C10H12N2O4', 'C5H5N2O2', 'C10H13N2O7P', 'C9H12N2O11P2', 'C9H11N2O8P', 'C4H3N2O2', 'C9H10N2O5', 'C2H3O2', 'C2H2O', 'C2H2', 'CO2', 'CHO2', 'H2O', 'H3O6P2', 'C2H4', 'CO', 'C2O2', 'H2', 'O', 'P', 'C2H2O', 'CH2', 'HPO3', 'NH2', 'PP', 'NH', 'SO3', 'N', 'C6H10O5', 'C6H10O6', 'C5H8O4', 'C12H20O11', 'C6H11O8P', 'C6H8O6', 'C6H10O5', 'C18H30O15'])
Wrapper function performing the whole IPA pipeline.
Parameters
----------
df: pandas dataframe containing the MS1 data. It should be the output of the
function ipa.map_isotope_patterns()
DB: pandas dataframe containing the database against which the annotation
is performed. The DB must contain the following columns in this exact
order (optional fields can contain None):
- id: unique id of the database entry (e.g., 'C00031') - necessary
- name: compound name (e.g., 'D-Glucose') - necessary
- formula: chemical formula (e.g., 'C6H12O6') - necessary
- inchi: inchi string - optional
- smiles: smiles string - optional
- RT: if known, retention time range (in seconds) where this compound
is expected to elute (e.g., '30;60') - optional
- adductsPos: list of adducts that should be considered in positive mode
for this entry (e.g.,'M+Na;M+H;M+') - necessary
- adductsNeg: list of adducts that should be considered in negative
mode for this entry (e.g.,'M-H;M-2H') - necessary
- description: comments on the entry - optional
- pk: previous knowledge on the likelihood of this compound to be
present in the sample analysed. The value has to be between 1
(compound likely to be present in the sample) and 0 (compound
cannot be present in the sample).
- MS2: id for the MS2 database entries related to this compound
(optional)
- reactions: list of reactions ids involving this compound
(e.g., 'R00010 R00015 R00028')-optional, but necessary if
mode='reactions'.
adductsAll: a dataframe containing information on all possible adducts.
ppm: accuracy of the MS instrument used
dfMS2: pandas dataframe containing the MS2 data (optional). It must contain
3 columns:
-id: an unique id for each feature for which the MS2 spectrum
was acquired (same as in df)
-spectrum: string containing the spectrum information in the
following format 'mz1:Int1 mz2:Int2 mz3:Int3 ...'
-ev: collision energy used to acquire the fragmentation
spectrum
DBMS2: pandas dataframe containing the database containing the MS2
information (optional)
evfilt: Default value False. If true, only spectra acquired with the same
collision energy are considered.
noits: number of iterations of the Gibbs sampler to be run
burn: number of iterations to be ignored when computing posterior
probabilities. If None, is set to 10% of total iterations
delta_bio: parameter used when computing the conditional priors.
The parameter must be positive. The smaller the parameter the
more weight the biochemical connections have on the posterior
probabilities. Default 1.
delta_add: parameter used when computing the conditional priors. The
parameter must be positive. The smaller the parameter the more
weight the adducts connections have on the posterior
probabilities. Default 1.
Bio: dataframe (2 columns), reporting all the possible connections between
compounds. It uses the unique ids from the database. It could be the
output of Compute_Bio() or Compute_Bio_Parallel().
mode: either 'reactions' (connections are computed based on the reactions
present in the database) or 'connections' (connections are computed
based on the list of connections provided). Default 'reactions'.
CSunk: cosine similarity score associated with the 'unknown' annotation.
Default 0.7
isoDiff: Default value 1. Difference between isotopes of charge 1, does not
need to be exact
ppmiso: Default value 100. Maximum ppm value allowed between 2 isotopes.
It is very high on purpose
ncores: default value 1. Number of cores used
me: accurate mass of the electron. Default 5.48579909065e-04
ratiosd: default 0.9. It represents the acceptable ratio between predicted
intensity and observed intensity of isotopes. it is used to compute
the shape parameters of the lognormal distribution used to
calculate the isotope pattern scores as sqrt(1/ratiosd)
ppmunk: ppm associated to the 'unknown' annotation. If not provided equal
to ppm.
ratiounk: isotope ratio associated to the 'unknown' annotation. If not
provided equal to 0.5
ppmthr: Maximum ppm possible for the annotations. If not provided equal to
2*ppm
pRTNone: Multiplicative factor for the RT if no RTrange present in the
database. If not provided equal to 0.8
pRTout: Multiplicative factor for the RT if measured RT is outside the
RTrange present in the database. If not provided equal to 0.4
mzdCS: maximum mz difference allowed when computing cosine similarity
scores. If one wants to use this parameter instead of ppmCS, this
must be set to 0. Default 0.
ppmCS: maximum ppm allowed when computing cosine similarity scores.
If one wants to use this parameter instead of mzdCS, this must be
set to 0. Default 10.
connections: list of possible connections between compounds defined as
formulas. Only necessary if mode='connections'. A list of
common biotransformations is provided as default.
Output:
annotations: a dictionary containing all the possible annotations for the measured features. The keys of the dictionary are the
unique ids for the features present in df. For each feature, the annotations are summarized in a pandas dataframe.
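The spectrum string format required for dfMS2 ('mz1:Int1 mz2:Int2 ...') and the role of the ppmCS parameter can be illustrated with a toy cosine-similarity sketch. This is not the library's implementation, just the standard dot-product similarity with ppm-based peak matching:

```python
# Toy cosine similarity between two spectra in the dfMS2 string format,
# matching peaks within a ppm tolerance (as the ppmCS parameter does).
import math

def parse_spectrum(s):
    """'70.065:100 116.070:35' -> [(70.065, 100.0), (116.070, 35.0)]."""
    return [(float(mz), float(i)) for mz, i in (p.split(":") for p in s.split())]

def cosine_similarity(s1, s2, ppm=10):
    p1, p2 = parse_spectrum(s1), parse_spectrum(s2)
    dot = 0.0
    for mz1, i1 in p1:
        for mz2, i2 in p2:
            if abs(mz1 - mz2) / mz1 * 1e6 <= ppm:  # peaks match within ppm
                dot += i1 * i2
    n1 = math.sqrt(sum(i * i for _, i in p1))
    n2 = math.sqrt(sum(i * i for _, i in p2))
    return dot / (n1 * n2)

s = "70.065:100 116.070:35"
print(round(cosine_similarity(s, s), 3))  # identical spectra -> 1.0
```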
Depending on the parameters passed to the function, the end result will be different.
For example, running the pipeline with both the MS1 and MS2 data while considering only the adduct connections in the Gibbs sampler produces output like the following:
isotopes already mapped
computing all adducts ....
0.1 seconds elapsed
annotating based on MS1 information....
0.4 seconds elapsed
computing posterior probabilities including adducts connections
initialising sampler ...
Gibbs Sampler Progress Bar: 100%|██████████| 5000/5000 [00:19<00:00, 255.13it/s]
parsing results ...
Done - 19.7 seconds elapsed
Or, if one wants to use both the MS1 and MS2 data and consider both adducts and biochemical connections in the Gibbs sampler, the Bio dataframe together with both delta_bio and delta_add should also be passed to simpleIPA.
ipaPy2
Python implementation of the Integrated Probabilistic Annotation (IPA) - A Bayesian annotation method for LC/MS data integrating biochemical relations, isotope patterns and adduct formation.
Installation
ipaPy2 requires Python 3.9 or higher
Install via pip (recommended)
Install via bioconda
Compiling from source (macOS)
Compiling from source (Linux)
Compiling from source (Windows)
Databases
One of the most powerful features of the IPA method is that it is able to integrate the knowledge gained from previous experiments in the annotation process. There are three files that are used as the IPA database:
1. Adducts file (required)
The ipaPy2 library requires a file contains all the information required for the computation of the adducts. An adducts.csv file is provided with the package here. The file contains the most common adducts. If any exotic adduct (or in-source fragment) needs to be considered, the user must modify the file accordingly. The format required for the adducts file is shown below.
2. MS1 database file (required)
The IPA method requires a pandas dataframe containing the database against which the annotation is performed. This dataframe must contain the following columns in this exact order (optional columns can have empty fields):
The column names must be the ones reported here. While users are strongly advised to build their own ad-hoc database to match their specific instrument setup and sample types, here you can find a relatively big example database.
This example databases was obtained considering the KEGG database, the Natural Products Atlas database and the MoNa database (only compounds having at least one fragmentation spectrum obtained with a QExactive). For each entry, only a handful of the most common adducts are considered. To fully exploit the IPA method, it is strongly recommended to constantly update the database when new knowledge is gained from previous experience. Providing a retention time window for compounds previously detected with the analytical system at hand it is particularly useful. For the sake of the example in this tutorial, a reduced example database is also provided.
3. MS2 database file (only required if MS2 data is available)
This new implementation of the IPA method also allows the user to include MS2 data in the annotation pipeline. In order to exploit this functionality an MS2 spectra database must be provided. The MS2 database must be provided as a pandas dataframe including the following columns in this exact order:
It is necessary that the user uses an MS2 database specific to the instrument used to acquire the data. The MS2 database found here, contains all the MS2 spectra found in the MoNa database acquired with a QExactive. This is a relatively big file, and for the sake of this tutorial a drastically reduced version of it has been included within this repository, and can be found here.
Data preparation
Before using the ipaPy2 package, the processed data coming from an untargeted metabolomics experiment must be properly prepared.
1. MS1 data
The data must be organized in a pandas dataframe containing the following columns:
Below is reported an example:
The clustering of the features is a necessary and must be performed before running the IPA method. For this step, the use of widely used data processing software such as mzMatch and CAMERA is recommended. Nevertheless, the ipaPy2 library provides a function (clusterFeatures()) able to perform such step, starting from a dataframe containing the measured intensities across several samples (at least 3 samples, the more samples the better). Such dataframe should be organized as follows:
All information about the function can be found in the help of the function
After running, this function returns a pandas dataframe in the correct format for the ipaPy2 package
2. MS2 data
If fragmentation data was acquired during the experiment, it can be included in the IPA annotation process. To do so, the data must be organized in a pandas dataframe containing the following columns, in this exact order:
Below is reported an example:
Usage
The Integrated Probabilistic Annotation (IPA) method can be applied in different situations, and the ipaPy2 package allows the users to tailor the IPA pipeline around their specific needs.
This brief tutorial describes the most common scenarios the IPA method can be applied to.
1. Mapping isotope patterns
The first step of the IPA pipeline consists in the mapping of the isotope patterns within the dataset considered. This is achieved through the map_isotope_patterns(). The help of this function provides a detailed description of it.
For the sake of this tutorial, the small dataset example introduced above is considered.
Once finished, this function modifies the pandas dataframe provided as input annotating all isotope patterns.
Some data processing pipelines already have an isotope mapping function and the user can use them as long as they organise the data in the correct format.
2. Compute all adducts
The second step of the pipeline consists in the calculation of all possible adducts that could be formed by the compounds included in the database. This is done by the function compute_all_adducts(). This function comes with a very detailed help.
Depending on the size of the dataset used (i.e., number of compounds included), this step can become rather time-consuming, and the use of multiple cores should be considered. In the context of this tutorial, the heavily reduced example dataset introduced before is considered.
If the same database is used for subsequent experiments without introducing new information, it is recommended to save the results of this function into a .csv file. In this case, the user would need to repeat this step in the future only if the DB changed.
3. Annotation based on MS1 information
At this point, the actual annotation process can start. If no fragmentation data is available, the MS1annotation() function should be used. This function annotates the dataset using the MS1 data and the information stored in the dataset. A detailed description of the function can be accessed through the help:
This function returns all the possible annotations for all the mass spectrometry features (excluding the ones previously identified as isotopes). The annotations are provided in the form of a dictionary. The keys of the dictionary are the unique ids for the features present in df. For each feature, all possible annotations are summarised in a dataframe including the following information:
As an example, possible annotations for the feature associated with id=1 (m/z=116.0705438, RT=45.77) is shown below:
It should be noticed that in this example, the prior probabilities associated with L-Proline M+H, D-Proline M+H and 3-Acetamidopropanal are exactly the same. This is because all three ions have exactly the same theoretical mass. However, the post probabilities are different. This is because the retention time associated with this feature is within the retention range reported in the database for L-Proline and outside the one reported for 3-Acetamidopropanal.
An expert in LC/MS-based mass spectrometry would argue that with most chromatographic columns stereoisomers such as L- and D-Proline would share the same RT range. While this is likely to be correct, it must be noted that the IPA method can only use the information present in the database. When populating it, we opted for a more agnostic approach and only included RT ranges for compounds that where actually detected as standards with our experimental setting. If the user wants to include the notion that ‘stereoisomers share the same RT ranges’, they should simply add this information in the database.
Here another example:
Also in this case, all the prior probabilities associated with the four ions are exactly the same since all the ions have the same theoretical mass-to-charge ratio. However, the posterior probabilities are significantly different. Two of these ions (Coproporphyrinogen III M+2H and Coproporphyrinogen I M+2H) have charge +2, while the other two possible annotations have charge +1. The observed isotope pattern is consistent with an ion with charge +1 (i.e., difference between isotopes = 1), and this is reflected in the isotope score pattern and consequently on the posterior probabilities. Moreover, the retention time associated with this feature is within the range reported for L-Phenylalanine in the database. Therefore, the posterior probability associated with L-Phenylalanine 2M+H is the most highest.
4. Annotation based on MS1 and MS2 information
As already mentioned above, fragmentation data can be included in the annotation process by using the MSMSannotation() function. A detailed description of the function can be accessed through the help:
The line below integrates the fragmentation data and the fragmentation database introduced above in the annotation process. The role of the CSunk (“cosine unknown”) parameter should be briefly discussed here. In most cases, the fragmentation database contains fragmentation spectra only for a subset of the compounds in the database. Therefore, when considering a feature for which the fragmentation spectra was acquired, it is often the case that the cosine similarity can only be computed for a subset of the possible annotations. The CSunk value is then assigned to the other possible annotations for comparison.
The output of this function has the same structure as the one from the MSannotation() function, but it also includes the fragmentation pattern scores when the fragmentation data is available. As an example, possible annotations for the feature associated with id=1 is shown below:
In this case, the cosine similarity score for the annotation L-Proline M+H is very high, therefore the posterior probability associated with it is higher than the one obtained without considering the MS2 data.
Here another example for a feature having a very similar mass-to-charge ratio (id=90, m/z=117.0705223, RT=63.45).
In this case, the cosine similarity score for the annotation L-Proline M+H is not very high. Moreover, the retention time assigned to this feature is outside both retention time ranges reported in the database for L-Proline and 3-Acetamidopropanal. Therefore, the most likely annotation for this feature is D-Proline M+H, the one annotation not rejected directly by the available evidence. It should be noted that the fragmentation pattern score has a rather weak effect on the posterior probability associated with L-Proline, given how close it is to the fragmentation pattern score associated features that do not have MS2 info in the database (CSunk=0.7). The main reason why the D-Proline annotation appears to be the most likely is due to the fact that the retention time associated to this feature (63.45 s) is outside the retention time ranges associated with L-Proline and 3-Acetamidopropanal.
5. Computing posterior probabilities integrating adducts connections
Until this point, the putative annotations and the associated probabilities computed for each feature are independent from each other. However, the IPA method can be used to update the probabilities by considering the possible relationship between annotations. For example, the Gibbs_sampler_add() function uses a Gibbs sampler to estimate the posterior probabilities obtained by considering all possible adduct connections.
The help() provides a detailed description of this function:
The function modifies the annotations dictionary by adding two additional columns to each dataframe:
If all_out=True, the function also returns the full list of assignments computed. If provided as an input to the Gibbs sampler, it allows to restart it from where you finished.
As an example, the possible annotations for the feature associated with the id 501 is shown below.
This feature is clustered with feature id=1, whose most likely annotation is L-Proline M+H. As expected, once the adduct connections are considered, the 'post Gibbs' probability associated with L-Proline 2M+H is significantly higher than the alternatives.
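A quick sanity check shows why an M+H and a 2M+H annotation of the same compound can explain two different m/z values in the same cluster. Assuming the convention implied by the adducts file shown earlier (calc = Mult·M + Mass, divided by the absolute charge), the helper below (a hypothetical name, not an ipaPy2 function) computes the expected m/z of both proline adducts:

```python
# Expected m/z of an adduct, following the adducts-file convention:
#   m/z = (Mult * M_neutral + Mass) / |Charge|
def adduct_mz(neutral_mass, mult, mass_shift, charge=1):
    return (mult * neutral_mass + mass_shift) / abs(charge)

proline = 115.063329  # monoisotopic mass of proline (C5H9NO2)

mz_MH = adduct_mz(proline, mult=1, mass_shift=1.007276)   # M+H
mz_2MH = adduct_mz(proline, mult=2, mass_shift=1.007276)  # 2M+H

print(round(mz_MH, 4), round(mz_2MH, 4))  # → 116.0706 231.1339
```

Two features at these two m/z values can therefore both be explained by proline, which is exactly the kind of evidence the adduct-connection Gibbs sampler rewards.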
6. Computing posterior probabilities integrating biochemical connections
The IPA method can also update the probabilities associated with each possible annotation by considering all possible biochemical connections.
Before doing so, it is necessary to provide a pandas dataframe reporting which compounds can be considered biochemically related. The function Compute_Bio() can be used to compute such a dataframe. The help() provides a detailed description of this function:
According to the value assigned to the 'mode' parameter, the function can compute the biochemical connections in two ways. If mode='reactions', the function connects compounds that share one or more reaction id(s), according to what is reported in the database.
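The logic of the 'reactions' mode can be sketched in a few lines of pandas. This is not the ipaPy2 implementation; the column names and the reaction ids below are placeholders chosen for illustration, and only compounds whose reaction-id sets intersect are paired:

```python
from itertools import combinations

import pandas as pd

# Toy database: each compound lists space-separated reaction ids (placeholders).
db = pd.DataFrame({
    "id": ["C00148", "C00763", "C00183"],
    "name": ["L-Proline", "D-Proline", "L-Valine"],
    "reactions": ["R1 R2", "R2", "R3"],
})

# Connect every pair of compounds sharing at least one reaction id.
rx = {row.id: set(row.reactions.split()) for row in db.itertuples()}
pairs = [(a, b) for a, b in combinations(db["id"], 2) if rx[a] & rx[b]]
Bio = pd.DataFrame(pairs, columns=["id1", "id2"])
print(Bio)
```

Here only L-Proline and D-Proline share a reaction id, so the resulting dataframe contains that single pair.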
If mode='connections', the function computes the 'chemical formula difference' for each pair of compounds considered. If this difference appears in the list of connections, the two compounds are considered connected. A default list of connections is provided, but it can be modified.
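The 'connections' mode can likewise be illustrated with a self-contained sketch (again, not the ipaPy2 implementation; the parsing helper and the abbreviated connection list are assumptions). Two formulas are compared atom by atom, and the pair is connected when one formula contains the other and the leftover atoms match an allowed transformation, such as the CH2 added by a methylation:

```python
import re
from collections import Counter

def parse_formula(formula):
    """Parse a simple chemical formula like 'C3H7NO2' into element counts."""
    return Counter({el: int(n or 1)
                    for el, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula)})

def formula_diff(f1, f2):
    """Return the chemical formula difference, or None if neither
    formula contains the other."""
    c1, c2 = parse_formula(f1), parse_formula(f2)
    big, small = (c1, c2) if sum(c1.values()) >= sum(c2.values()) else (c2, c1)
    diff = big.copy()
    diff.subtract(small)
    if any(v < 0 for v in diff.values()):
        return None
    return "".join(f"{el}{n if n > 1 else ''}"
                   for el, n in sorted(diff.items()) if n > 0)

# Abbreviated connection list (a real list would contain many more entries).
connections = {"CH2", "O", "H2O"}

# glycine (C2H5NO2) vs alanine (C3H7NO2): the difference is CH2, so connected
print(formula_diff("C3H7NO2", "C2H5NO2"))  # → CH2
```

Running this over every pair of database compounds is quadratic in the database size, which is why, as noted below, the computation can become very demanding.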
Depending on the size of the database and of the dataset, computing all possible biochemical connections can be extremely demanding computationally and can drastically increase the time needed for the annotation. For this reason, precomputed lists of biochemical connections based on the provided database (one for each mode, 'reactions' and 'connections') are included in the library and can be used directly, without the need to compute the biochemical connections.
The list of connections computed with mode='connections' needs to be unzipped first.
Alternatively, the user can define their own biochemical connections between database compounds. For example, the following compounds (with their database ids) could be used: L-Proline (C00148), L-Valine (C00183), L-Phenylalanine (C00079), L-Leucine (C00123), 5-Oxoproline (C01879), Betaine (C00719), Hordatine A (C08307), L-Tyrosine (C00082), D-Proline (C00763), D-Phenylalanine (C02265).
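A user-defined connections dataframe can be built directly with pandas. In this sketch the column names are illustrative, not a documented ipaPy2 requirement; the single pair shown, L-Proline with D-Proline, is motivated by the proline racemase connection discussed in the next example, and each row simply lists two database ids considered biochemically related:

```python
import pandas as pd

# One user-defined biochemical connection per row (illustrative column names).
Bio = pd.DataFrame(
    [["C00148", "C00763"]],  # L-Proline <-> D-Proline (proline racemase)
    columns=["id1", "id2"],
)
print(Bio)
```

Additional rows can be appended for any other pair of compounds the user considers related.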
As an example, the possible annotations for the feature with id 992 are shown below.
The probability associated with D-Proline M+H is significantly higher after considering the biochemical connections. This is because D-Proline is biochemically connected to L-Proline (by proline racemase), and the most likely annotation for feature id=1 is L-Proline M+H (>50%).
7. Computing posterior probabilities integrating both adducts and biochemical connections
It is also possible to run the Gibbs sampler considering biochemical and adduct connections at the same time. To do so, one can use the function Gibbs_sampler_bio_add(). The help() provides a detailed explanation of the function.
8. Running the whole pipeline with a single function
Finally, the ipaPy2 library also includes a wrapper function that allows running the whole IPA pipeline in one step. A detailed description of the function can be accessed with help().
Depending on the parameters passed to it, the function produces different end results.
For example, if one wants to use both the MS1 and MS2 data and not use the Gibbs sampler, the following should be used:
If instead one wants to use only the MS1 data and only consider the adducts connections in the Gibbs sampler, one should use the following:
Or, if one wants to use both the MS1 and MS2 data and consider both adducts and biochemical connections in the Gibbs sampler, the following should be used:
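The way a single wrapper can cover these three configurations can be sketched abstractly. The function below is hypothetical (its name and step labels are illustrative, not the ipaPy2 API); it only shows how the choice of parameters selects which pipeline steps are run:

```python
# Hypothetical sketch of a pipeline wrapper: the parameters decide which
# annotation step and which (if any) Gibbs sampler variant is executed.
def plan_pipeline(use_ms2, gibbs=None):
    """Return the ordered list of steps the wrapper would run."""
    steps = ["compute all adducts"]
    steps.append("MS1+MS2 annotation" if use_ms2 else "MS1-only annotation")
    if gibbs == "adducts":
        steps.append("Gibbs sampler (adduct connections)")
    elif gibbs == "bio+adducts":
        steps.append("Gibbs sampler (adduct and biochemical connections)")
    return steps

# the three configurations described above:
print(plan_pipeline(use_ms2=True))                       # MS1+MS2, no Gibbs
print(plan_pipeline(use_ms2=False, gibbs="adducts"))     # MS1 only, adduct connections
print(plan_pipeline(use_ms2=True, gibbs="bio+adducts"))  # both connection types
```

In the real library the same effect is obtained by passing (or omitting) the MS2 dataframe and the connection-related arguments to the wrapper; consult help() on the wrapper function for the exact parameter names.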