The Sea Observations Utility for Reprocessing, Calibration and Evaluation (SOURCE) is a Python 3.x package
developed within the framework of the RITMARE project (http://www.ritmare.it) by the oceanography team at the
Istituto Nazionale di Geofisica e Vulcanologia (INGV, http://www.ingv.it).
SOURCE manages in situ observations and model data from Ocean General Circulation Models (OGCMs) in order to:
assess the quality of sea observations using the original quality flags and reprocess the data using a global range check, spike removal, a stuck value test and a recursive statistical quality check;
return optimized daily and hourly time series of specific Essential Ocean Variables (EOVs);
extract and aggregate in time model data at specific locations and depths;
evaluate OGCM accuracy in terms of difference and absolute error.
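The reprocessing checks listed above can be sketched on a synthetic series as follows. This is a minimal illustration with made-up thresholds, not SOURCE's actual implementation:

```python
import numpy as np

def global_range_check(x, vmin, vmax):
    """Flag values outside the physically plausible range."""
    return (x < vmin) | (x > vmax)

def spike_test(x, threshold):
    """Flag points that deviate strongly from the mean of their neighbours."""
    spike = np.zeros_like(x, dtype=bool)
    spike[1:-1] = np.abs(x[1:-1] - 0.5 * (x[:-2] + x[2:])) > threshold
    return spike

def stuck_value_test(x, min_repeats):
    """Flag runs of identical consecutive values (a stuck sensor)."""
    stuck = np.zeros_like(x, dtype=bool)
    run_start = 0
    for i in range(1, len(x) + 1):
        if i == len(x) or x[i] != x[run_start]:
            if i - run_start >= min_repeats:
                stuck[run_start:i] = True
            run_start = i
    return stuck

# Synthetic temperature series with one spike and one stuck run
temp = np.array([15.1, 15.2, 45.0, 15.3, 15.3, 15.3, 15.3, 15.4])
bad = (global_range_check(temp, -2.5, 40.0)
       | spike_test(temp, threshold=5.0)
       | stuck_value_test(temp, min_repeats=4))
```

Note that the spike test above also flags the immediate neighbours of a large spike; SOURCE's recursive statistical check is more refined.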
SOURCE is written in Python, an interpreted programming language widely adopted over the last decade because it is versatile,
easy to use and fast to develop with. SOURCE is developed and maintained as a module and benefits from
Python’s open source utilities, such as:
vectorized numerical data analysis (NumPy, SciPy, ObsPy and pandas);
machine learning tools (scikit-learn);
hierarchical data storage (netCDF-4, based on HDF5);
relational metadata storage using Structured Query Language (SQL) as the management system.
SOURCE is relocatable in the sense that it can be adapted to any basin worldwide, provided that the input data
follow a specific standard format.
Condition of use
SOURCE is subject to a Creative Commons CC BY-NC-SA license.
How to cite
If you use this software, please cite the following article:
SOURCE: Sea Observations Utility for Reprocessing, Calibration and Evaluation. https://doi.org/10.3389/fmars.2021.750387
Code location
Code development is carried out using git, a distributed version control system,
which makes it possible to track and disseminate all new builds, releases, and bug fixes.
SOURCE is released for public use on the Zenodo platform at http://doi.org/10.5281/zenodo.5008245
under a Creative Commons CC BY-NC-SA license.
Installation
Download the latest release as a zip archive from:
http://doi.org/10.5281/zenodo.5008245
Alternatively, the SOURCE source code can be cloned directly from a branch using git:
After extracting the archive (if needed), the software is installed like any generic Python package,
using the setup.py installer:
python3 setup.py install
Please make sure to have all the prerequisites installed in order to properly use SOURCE.
Module structure
SOURCE is composed of three main modules:
the observations module, which manages in situ data pre- and post-processing and metadata relational database building;
the model post-processing module, which manages model data aggregation and interpolation at the specific platforms defined by
the observations module;
the Calibration and Validation (Cal/Val) module, which assesses the quality of OGCMs against observations.
Run options
SOURCE can be run in two modes:
creation (default): a new in situ, model or Cal/Val database is created from scratch;
update: an in situ, model or Cal/Val database is created from an existing historical database and can
also be concatenated to it to enlarge the collection.
NOTE: as creation mode is the default, only the differences in update mode are noted in this documentation.
How to load
Every part of the module can be loaded with its arguments in two ways:
imported directly into an existing Python environment using the import command;
launched directly from the terminal. The module is OS independent.
Python environment execution
The entire module can be imported using:
import SOURCE
[...]
To load, for example, only the observations module, one can alternatively use:
import SOURCE.obs_postpro
[...]
or
from SOURCE import obs_postpro
[...]
To load, for example, the in situ pre-processing submodule, one can alternatively use:
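For example (the module path below is inferred from SOURCE's file-per-function layout and the function name documented later in this page; treat it as an assumption if your release differs):

```python
# Hypothetical import form for the in situ pre-processing sub-module;
# SOURCE must be installed for the import to succeed.
try:
    from SOURCE.obs_postpro import insitu_tac_pre_processing
except ImportError:
    insitu_tac_pre_processing = None  # SOURCE not installed in this environment
```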
Help and logging
Every component of SOURCE has a short description that is printed when the component is loaded without any arguments,
together with a message listing the mandatory and optional arguments needed to run it correctly.
The most important modules and functions of SOURCE also have a helper that can be called in a Python environment
by using the help function. For example:
Whether called inside a Python environment or in an OS terminal window, every module, sub-module or function
outputs information about what it is doing (logging) to standard output or standard error.
Almost every component of SOURCE has a verbose option that can be disabled by setting it to False.
By default, verbosity is set to True.
Access and download Copernicus marine products
All Copernicus Marine Environment Monitoring Service (CMEMS) related procedures need to access CMEMS data products
already downloaded and stored on the machine where the software is installed (NFS file systems are also supported).
Access to CMEMS data is easy and free of charge, but users need to be registered and logged in.
Here is the link to the CMEMS registration form.
In order to download the data by using the web browser, the following steps are needed:
Search for the needed product(s);
Add selected product(s) to the cart;
On each product in the cart, enter to view the details and click on “Download product”;
Login to CMEMS service;
Choose download options and then ftp access.
The data can also be downloaded directly using CMEMS credentials with wget or curl programs.
in situ relational DB
The in situ relational database gives full metadata information for the processed
platforms. It consists of four files:
devices.csv, CSV table with the following header:
Device ID;
Device Name.
organizations.csv, CSV table with the following header:
Organization ID;
Organization name;
Organization Country (reverse searched from url extension, empty for generic url extensions);
Organization Weblink (if available).
variables.csv, CSV table with the following header:
Variable ID;
Variable long_name attribute;
Variable standard_name attribute;
Variable units.
probes.csv, CSV table with the following header:
Probe ID;
Probe SOURCE platform_code attribute;
Probe name (if available or matched in the probes_names.csv table);
Probe WMO;
Device type ID;
Organization ID;
Variable IDs;
Per variable average longitudes;
Per variable average latitudes;
Per variable record starts;
Per variable record ends;
Per variable sampling times (ddd hh:mm:ss form);
Per variable depth levels;
Per variable quality controls information;
Per variable ancillary notes;
Probe link (if available).
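The four tables are plain CSV and can be read with the standard library. A minimal sketch (the sample values are illustrative; in practice you would open devices.csv from the metadata DB directory):

```python
import csv
import io

# A small in-memory sample stands in for the devices.csv file here;
# the header matches the one documented above.
sample = "Device ID,Device Name\n1,mooring\n2,drifting buoy\n"
with io.StringIO(sample) as fh:
    devices = {int(row["Device ID"]): row["Device Name"]
               for row in csv.DictReader(fh)}
```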
Observations module
The observations module consists of four sub-modules:
pre processing;
reprocessing;
metadata DB merging;
whole in situ DB merging.
How to…
Create a reprocessed in situ historical database (creation mode):
Preprocess the data using insitu_tac_pre_processing from obs_postpro module;
Reprocess the preprocessed data using obs_postpro.
Create a reprocessed in situ update database (update mode):
Preprocess the new data using insitu_tac_pre_processing with update mode activated from obs_postpro module;
Reprocess the preprocessed data using obs_postpro, giving the climatology directory of the historical collection.
Concatenate historical databases with an update:
Merge the in situ relational DBs using metadata_merger;
Merge the in situ databases using real_time_concatenator from obs_postpro module;
Concatenate two preprocessed in situ databases (any mode):
Merge the two in situ databases and metadata using pointwise_datasets_concatenator.
CMEMS in situ Thematic Assembly Center (TAC) observations pre-processing.
Prepare CMEMS observations data sources
In order to properly pre-process CMEMS service observations data,
the data must already have been downloaded from the CMEMS service
to a common directory.
NOTE: all the datasets needed for preprocessing have to be stored in the same directory, without subfolders.
Mandatory inputs
in_dir: CMEMS downloaded in situ observations directory;
in_fields_standard_name_str: input variables’ standard_name attributes to process (space separated string,
for example: “sea_water_temperature sea_water_practical_salinity”; please read the
CF conventions standard name table
to find the correct strings to insert);
work_dir: base working directory;
out_dir: output directory;
valid_qc_values: CMEMS DAC quality flags values to use (space separated string with 0 to 9 values, for example: “0 1 2”).
Please read CMEMS Product User Manual (PUM)
to properly set the flag values.
Optional inputs
update_mode (default False): run the module in update mode, will disable low data platforms filtering;
first_date_str (default None): start date in YYYYMMDD or YYYY-MM-DD HH:MM:SS format;
last_date_str (default None): end date in YYYYMMDD or YYYY-MM-DD HH:MM:SS format;
region_boundaries_str (default “-180 180 0 180”): region longitude - latitude limits (space separated string,
min_lon, max_lon (deg E); min_lat, max_lat (deg N)). Used to draw a LatLon box where to run;
med_sea_masking (default False): masking foreign seas switch for Mediterranean Sea processing.
Used to remove pre processed probes outside the basin when LatLon box of the Mediterranean Sea is selected;
in_instrument_types_str (default None): CMEMS “instrument type” metadata filter (space separated string). Used
to process only certain platform types (for example: “‘mooring’ ‘coastal structure’”.
Please read CMEMS Product User Manual (PUM)
to properly write the attribute string. NOTE: must put quotes outside attributes with spaces to protect them from character escaping);
names_file (default internal file probes_names.csv): CSV table with two columns:
platform_code;
platform name.
verbose (default True): verbosity switch.
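A possible invocation with the inputs above can be sketched as follows. The argument names are taken from this documentation, but the paths are placeholders and the exact signature should be treated as an assumption; SOURCE must be installed for the call itself to run:

```python
# Hypothetical invocation of the pre-processing sub-module.
args = dict(
    in_dir="/data/cmems/insitu",          # placeholder path
    in_fields_standard_name_str="sea_water_temperature sea_water_practical_salinity",
    work_dir="/tmp/source_work",          # placeholder path
    out_dir="/data/source/preprocessed",  # placeholder path
    valid_qc_values="0 1 2",
    update_mode=False,
)
# The space separated strings are the documented input format:
flags = [int(v) for v in args["valid_qc_values"].split()]
fields = args["in_fields_standard_name_str"].split()
try:
    from SOURCE.obs_postpro import insitu_tac_pre_processing
    insitu_tac_pre_processing(**args)
except ImportError:
    pass  # SOURCE not installed in this environment
```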
Outputs
metadata relational database, containing
devices.csv*, organizations.csv, variables.csv and probes.csv
(read specific section);
processing_information.csv: CSV table with the following header:
platform_code;
institution;
platform name;
WMO;
Platform type;
Average longitude;
Average latitude;
processing information.
observations database in netCDF-4 format, divided by variable standard names and with selected quality flags applied,
containing:
platform instantaneous latitude, longitude and depth dimension variables;
platform time;
DAC quality checked time series;
global attributes containing original datasets and pre processing information.
Observations module reprocessing tool working from the preprocessed DB. It may run in creation
mode, or in update mode if platform climatologies are provided instead of being
computed from the processed data itself.
Mandatory inputs
devices.csv*, organizations.csv, variables.csv and probes.csv
(read specific section);
in_dir: Pre processed observations netCDF database directory;
in_fields_standard_name_str: input variables’ standard_name attributes to process (space separated string,
for example: “sea_water_temperature sea_water_practical_salinity”; please read “variables.csv” from the
preprocessed metadata relational DB to find the correct strings to insert);
work_dir: base working directory;
out_dir: output directory;
routine_qc_iterations: routine quality check iteration number (N, integer). Options:
N = -1 for original DAC quality controls only (NO QC);
N = 0 for gross check quality controls only (NO_SPIKES_QC);
N >= 1 for N statistic quality check iterations (STATISTIC_QC_N).
Optional inputs
climatology_dir (default None): platform climatology data directory for update mode;
first_date_str (default None): start date in YYYYMMDD or YYYY-MM-DD HH:MM:SS format;
last_date_str (default None): end date in YYYYMMDD or YYYY-MM-DD HH:MM:SS format;
region_boundaries_str (default “-180 180 0 180”): region longitude - latitude limits (space separated string,
min_lon, max_lon (deg E); min_lat, max_lat (deg N)). Used to draw a LatLon box where to run;
med_sea_masking (default False): masking foreign seas switch for Mediterranean Sea processing.
Used to remove pre processed probes outside the basin when LatLon box of the Mediterranean Sea is selected;
in_instrument_types_str (default None): “instrument type” metadata filter (space separated string). Used
to process only certain platform types (for example: “‘mooring’ ‘coastal structure’”.
Please read the devices table to properly write the attribute string.
NOTE: must put quotes outside attributes with spaces to protect them from character escaping);
verbose (default True): verbosity switch.
Outputs
metadata relational database, edited during reprocessing, containing
devices.csv*, organizations.csv, variables.csv and probes.csv
(read specific section);
rejection_process.csv (if routine_qc_iterations >= 0):
CSV table with the following header:
Probe CMEMS platform_code attribute;
Variable standard_names;
Total data amount for each variable;
Filled data amount for each variable;
Rejection amount for each variable by the global range check;
Rejection amount for each variable by the spike test;
Rejection amount for each variable by the stuck value test;
(if routine_qc_iterations >= 1)
Rejection amount for each variable for each statistic phase.
reprocessed database in netCDF-4 format, divided by variable standard names and with selected quality flags applied,
containing:
probe latitude;
probe longitude;
field depths;
time counter and boundaries;
RAW, post processed and time averaged fields;
global attributes containing original datasets and post-processing specifications.
platform climatologies: per-probe and per-field monthly mean climatology
averages, standard deviation and filtered density profiles dataset.
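The per-probe climatology step amounts to grouping each field by calendar month. A minimal pandas sketch on synthetic data (not SOURCE's actual routine; the seasonal signal and noise are invented):

```python
import numpy as np
import pandas as pd

# Two years of synthetic daily surface temperature with a seasonal cycle
idx = pd.date_range("2019-01-01", "2020-12-31", freq="D")
rng = np.random.default_rng(0)
temp = pd.Series(
    15 + 8 * np.sin(2 * np.pi * (idx.dayofyear - 120) / 365.25)
    + rng.normal(0, 0.5, len(idx)),
    index=idx,
)

# Monthly mean and standard deviation climatology (keys are months 1..12)
clim_mean = temp.groupby(temp.index.month).mean()
clim_std = temp.groupby(temp.index.month).std()
```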
Model data nearest point extractor and concatenator.
Prepare model data sources
There are two different data sources that SOURCE can handle:
CMEMS model data;
model data formatted as NEMO ocean model outputs.
In order to properly process CMEMS service model data,
the data must already have been downloaded from the CMEMS service (read specific section)
to a common directory. Notes:
All the datasets needed for preprocessing have to be stored in the same directory;
All the variables stored in the netCDF files MUST have a properly set standard_name
attribute, otherwise SOURCE will not find them;
There MUST be no time duplication in the model data.
To speed up the concatenation, one suggestion is to split the model datasets in
the input directory into subfolders. Unlike observations data,
model data can also be stored in subfolders. In this case, each folder MUST be named
with the standard_name attribute of the field its datasets contain. Example:
input directory –> sea_water_temperature directory –> all datasets with sea_water_temperature here
etc.
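The per-variable subfolder layout can be prepared with a few lines of pathlib; the directory names are the CF standard_name strings and the base path here is a temporary placeholder:

```python
from pathlib import Path
import tempfile

# Placeholder for the model input directory
in_dir = Path(tempfile.mkdtemp())
for std_name in ("sea_water_temperature", "sea_water_practical_salinity"):
    # One subfolder per CF standard_name, as SOURCE expects
    (in_dir / std_name).mkdir(exist_ok=True)

layout = sorted(p.name for p in in_dir.iterdir())
```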
Mandatory inputs
devices.csv*, organizations.csv, variables.csv and probes.csv
(read specific section);
in_dir: model data input directory;
in_fields_standard_name_str: input variables’ standard_name attributes to process (space separated string,
for example: “sea_water_temperature sea_water_practical_salinity”; please read “variables.csv” from the
preprocessed metadata relational DB to find the correct strings to insert);
work_dir: base working directory;
out_dir: output directory;
grid_observation_distance: grid-to-observation maximum acceptable distance (km);
Optional inputs
mesh_mask_file (default None): model mesh mask file (if not provided land points are taken using model datasets themselves);
first_date_str (default None): start date in YYYYMMDD or YYYY-MM-DD HH:MM:SS format;
last_date_str (default None): end date in YYYYMMDD or YYYY-MM-DD HH:MM:SS format;
Skill assessment module
Mandatory inputs
devices.csv*, organizations.csv, variables.csv and probes.csv
(read specific section);
first_dir: first data input directory (model);
second_dir: second data input directory (observations);
in_fields_standard_name_str: input variables’ standard_name attributes to process (space separated string,
for example: “sea_water_temperature sea_water_practical_salinity”; please read “variables.csv” from the
preprocessed metadata relational DB to find the correct strings to insert);
out_dir: output directory;
Optional inputs
first_date_str (default None): start date in YYYYMMDD or YYYY-MM-DD HH:MM:SS format;
last_date_str (default None): end date in YYYYMMDD or YYYY-MM-DD HH:MM:SS format;
first_title_str (default first): first database title used in output variable names (no spaces);
second_title_str (default second): second database title used in output variable names (no spaces).
Outputs
Cal/Val database in netCDF-4 format, divided by variable standard names, containing:
first horizontal coordinates;
second horizontal coordinates;
field depths;
time counter and boundaries;
first time series;
second time series;
absolute error profile time series;
difference profile time series;
time averaged absolute error profile;
time averaged difference profile;
global attributes containing additional information.
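The skill metrics listed above reduce to the elementwise difference and absolute error plus their time means. A numpy sketch with synthetic model and observation series (invented values):

```python
import numpy as np

model = np.array([15.2, 15.4, 15.1, 15.6])
obs = np.array([15.0, 15.5, 15.3, 15.4])

difference = model - obs             # signed difference per time step
abs_error = np.abs(difference)       # absolute error per time step
mean_difference = difference.mean()  # time-averaged difference (bias)
mean_abs_error = abs_error.mean()    # time-averaged absolute error
```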
KML creator module
Create a Google Earth KML file with probe locations.
Mandatory inputs
devices.csv*, organizations.csv, variables.csv and probes.csv
(read specific section);
in_fields_standard_name_str: input variables’ standard_name attributes to process (space separated string);
out_kml_file: output file;
Optional inputs
first_date_str (default None): start date in YYYYMMDD or YYYY-MM-DD HH:MM:SS format;
last_date_str (default None): end date in YYYYMMDD or YYYY-MM-DD HH:MM:SS format;
region_boundaries_str (default “-180 180 0 180”): region longitude - latitude limits (space separated string,
min_lon, max_lon (deg E); min_lat, max_lat (deg N)). Used to draw a LatLon box where to run;
Outputs
KML file, containing a geo-referenced map of probes with information.
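A minimal geo-referenced placemark of the kind the KML creator writes can be produced with the standard library. This is illustrative (the probe name and coordinates are invented); SOURCE's output contains richer metadata:

```python
import xml.etree.ElementTree as ET

def probe_placemark(name, lon, lat):
    """Build a KML document with a single probe placemark."""
    kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
    doc = ET.SubElement(kml, "Document")
    pm = ET.SubElement(doc, "Placemark")
    ET.SubElement(pm, "name").text = name
    point = ET.SubElement(pm, "Point")
    # KML stores coordinates as lon,lat[,alt]
    ET.SubElement(point, "coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")

kml_text = probe_placemark("hypothetical_buoy", 13.51, 43.62)
```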
The imported functions can then be used with their arguments in the usual way.
Note
The double-dot import or load is needed because every function lives in a Python file with the same name as the function.
Terminal execution
To execute, for example, the in situ pre-processing sub-module from the terminal, one can use:
CMEMS in situ TAC pre processing sub-module
Module wise dependencies
find_variable_name, pointwise_datasets_concatenator, time_check, time_calc
Observations module wise dependencies
insitu_tac_platforms_finder, insitu_tac_timeseries_extractor, data_information_calc, time_from_index, depth_calc, mean_variance_nc_variable, unique_values_nc_variable, quality_check_applier
Reprocessing sub-module
Module wise dependencies
duplicated_records_remover, records_monotonicity_fixer, time_check, time_calc
Observations module wise dependencies
time_averager, time_series_post_processing, quality_check_applier, depth_aggregator, depth_calc
Metadata merger sub-module
Merger for two SOURCE relational metadata databases, with the option of making the first one dominant.
Mandatory inputs
Optional inputs
Outputs
Model data module
The model data module processes CMEMS or NEMO model data.
CMEMS / NEMO model processing module
Module wise dependencies
ptmp_to_temp
Model data module wise dependencies
model_datasets_concatenator, vertical_interpolation
Skill assessment module
Module wise dependencies
find_variable_name, time_calc
Other modules
KML creator module
Create Google Earth KML file with probes locations.
Real time concatenator module
Concatenate an in situ historical database with real time updates.
Mandatory inputs
Optional inputs
Outputs
Module wise dependencies
duplicated_records_remover, pointwise_datasets_concatenator, records_monotonicity_fixer, time_check
Other module-wise functions
Remove duplicated records in netCDF files.
Mandatory inputs
Optional inputs
Outputs
Find variable name given an attribute name and value.
Mandatory inputs
Optional inputs
Outputs
Remove duplicated records in netCDF files.
Mandatory inputs
Optional inputs
Outputs
Module wise dependencies
time_calc
Compute sea water in situ temperature from potential temperature and salinity.
Mandatory inputs
Optional inputs
Outputs
Module wise dependencies
find_variable_name
Reorder decreasing records segments in netCDF files.
Mandatory inputs
Optional inputs
Outputs
Compute most probable record sampling in netCDF files.
Mandatory inputs
Optional inputs
Outputs
Compute time step verification in netCDF files.
Mandatory inputs
Optional inputs
Outputs
Other observations module wise functions
Compute total, valid, invalid, no QC and filled data number for a depth sliced field.
Mandatory inputs
Optional inputs
Outputs
Module wise dependencies
find_variable_name
Aggregate the instantaneous depth variable and average horizontal coordinates.
Mandatory inputs
Optional inputs
Outputs
Compute rounded depth array for non floating platforms.
Mandatory inputs
Optional inputs
Outputs
CMEMS IN SITU TAC surrounding datasets finder.
Mandatory inputs
Optional inputs
Outputs
Extract a field from CMEMS in situ TAC observations netCDF files.
Mandatory inputs
Optional inputs
Outputs
Module wise dependencies
find_variable_name
Compute variable average and variance over time dimension.
Mandatory inputs
Optional inputs
Outputs
Module wise dependencies
find_variable_name
Apply a specific quality check to a variable stored in a netCDF dataset.
Mandatory inputs
Optional inputs
Outputs
Module wise dependencies
find_variable_name
Compute custom weighted mean in oversampled observation files.
Mandatory inputs
Optional inputs
Outputs
Module wise dependencies
find_variable_name
Find the time value for a given index in a netCDF observation file.
Mandatory inputs
Optional inputs
Outputs
Reprocess a field in in situ observations netCDF files.
Mandatory inputs
Optional inputs
Outputs
Module wise dependencies
find_variable_name, time_calc
Compute variable unique values over time dimension.
Mandatory inputs
Optional inputs
Outputs
Module wise dependencies
find_variable_name
Compute SOURCE global rejection statistics from the rejection_process.csv files produced by obs_postpro.py.
Mandatory inputs
Optional inputs
Outputs
Other model data module wise functions
Process model data to in situ platform locations.
Mandatory inputs
Optional inputs
Outputs
Module wise dependencies
find_variable_name
Process model data to in situ platform locations.
Mandatory inputs
Optional inputs
Outputs