Overcoming the challenges of data integration in ecosystem studies with machine learning workflows: an example from the Santos project

Fonseca, Gustavo; Vieira, Danilo Candido

doi:10.1590/2675-2824071.22044gf


Ocean and Coastal Research


Methods • Ocean Coast. Res. 71 (suppl 3) • 2023 • https://doi.org/10.1590/2675-2824071.22044gf


Abstract

Integrating intricate environmental data within a unified analytical framework for extensive conservation and monitoring initiatives poses several challenges. These include defining a conceptual model of cause-and-effect relationships, dealing with data sources that differ in number and information content, handling missing or noisy data, optimizing models, achieving accurate predictions, and coping with imbalanced observations across factors. In the Santos project, which aims to understand the spatio-temporal dynamics of benthic, pelagic, and physical systems in support of conservation and monitoring programs, the random forest (RF) machine learning technique offers notable advantages for modeling univariate data: it handles non-linearity, covariation, and interactive effects among predictors. For multivariate data sets, a hybrid strategy combining a self-organizing map (SOM) and RF is used to address these challenges. For missing values, bagging imputation outperformed the other methods. Both machine learning techniques are robust to noisy data, yet noisy observations can still be identified from model outputs. For imbalanced data sets, we investigate the correlation between the RF model’s overall statistics and those of individual classes. Interpreting these statistics jointly helps to understand model limitations and supports discussion of the environmental mechanisms shaping the observed patterns. We propose two analytical workflows that allow users to explore and improve model accuracy and to investigate potential cause-and-effect relationships in the data. These workflows also lay the foundation for implementing long-term learning algorithms, an important step for monitoring initiatives. The workflows, and the approaches to the analytical challenges discussed here, can be implemented in iMESc, an open-source application.

Keywords:
Self-organizing map; Random forest; Oceanography; Modeling; Santos basin

No.	Metric	Abbreviation	Definition
1	Accuracy	Acc	The proportion of predictions that the model classified correctly
2	Misclassification rate	Mis	The proportion of predictions that the model misclassified
3	Confidence interval	CI	The range within which the true accuracy of the model is likely to lie
4	No-information rate	NIR	The largest proportion among the observed classes
5	Kappa	k	The accuracy of the classifier normalized by the accuracy expected simply by chance
6	p-value	p-value	The significance of the accuracy being better than the no-information rate
7	Sensitivity or Recall	Sens	The proportion of the positive class correctly predicted
8	Specificity	Spec	The proportion of the negative class correctly predicted
9	Precision	Prec	The proportion of positive identifications that were correct
10	Prevalence	Prev	The frequency of the positive class in the data
11	F1 Score	F1	The harmonic mean of precision and sensitivity
12	Positive Predictive Value	PPV	The number of positive class observations correctly predicted as a proportion of all positive class predictions
13	Negative Predictive Value	NPV	The number of negative class observations correctly predicted as a proportion of all negative class predictions
14	Detection Rate	DR	The number of correct positive class predictions as a proportion of all predictions
15	Detection Prevalence	DP	The number of positive class predictions as a proportion of all predictions
16	Balanced Accuracy	BA	The average of the true positive and true negative rates
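The per-class definitions above can be illustrated with a minimal Python sketch that treats one class as "positive" and all others as "negative" (one-vs-rest). The confusion matrix below is made up for illustration, the function name `one_vs_rest_metrics` is hypothetical, and the row/column orientation (rows are observed classes, columns are predicted classes) is an assumption of this sketch rather than something stated in the article.

```python
def one_vs_rest_metrics(cm, k):
    """Per-class statistics for class index k from a square confusion matrix.

    Assumed orientation: cm[i][j] counts observations whose observed class
    is i and whose predicted class is j.
    """
    n = sum(sum(row) for row in cm)
    tp = cm[k][k]                          # class k predicted as k
    fn = sum(cm[k]) - tp                   # class k predicted as another class
    fp = sum(row[k] for row in cm) - tp    # another class predicted as k
    tn = n - tp - fn - fp                  # everything else

    def ratio(num, den):
        # Undefined ratios (empty denominators) are reported as NaN.
        return num / den if den else float("nan")

    sens = ratio(tp, tp + fn)
    spec = ratio(tn, tn + fp)
    ppv = ratio(tp, tp + fp)               # identical to precision
    npv = ratio(tn, tn + fn)
    return {
        "Sens": sens,
        "Spec": spec,
        "PPV": ppv,
        "NPV": npv,
        "Prec": ppv,
        "F1": ratio(2 * ppv * sens, ppv + sens),
        "Prev": (tp + fn) / n,
        "DR": tp / n,
        "DP": (tp + fp) / n,
        "BA": (sens + spec) / 2,
    }


# Made-up 3-class confusion matrix (not data from the Santos project).
cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
stats = one_vs_rest_metrics(cm, 0)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these definitions precision and positive predictive value are the same quantity, so the two always take identical values for a given class.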

Metric	Training	Test
Acc	0.92	0.95
k	0.89	0.94
95% CI	(0.89-0.93)	(0.77-0.99)
NIR	0.39	0.36
p-value	<0.01	<0.01
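As a rough illustration of how overall statistics of this kind relate to a confusion matrix, the sketch below computes accuracy, the no-information rate, Cohen's kappa, and a p-value for accuracy exceeding the no-information rate. The matrix is made up, the function name `overall_stats` is hypothetical, and the use of a one-sided exact binomial test is an assumption of this sketch; the article does not state which test produced the reported p-values. The confidence interval is omitted here.

```python
from math import comb


def overall_stats(cm):
    """Overall statistics from a square confusion matrix.

    Assumed orientation: rows are observed classes, columns are predictions.
    """
    size = len(cm)
    n = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(size))
    acc = correct / n

    observed_totals = [sum(row) for row in cm]
    predicted_totals = [sum(row[j] for row in cm) for j in range(size)]

    # No-information rate: share of the most frequent observed class.
    nir = max(observed_totals) / n

    # Cohen's kappa: accuracy corrected for the agreement expected by chance.
    expected = sum(o * p for o, p in zip(observed_totals, predicted_totals)) / n**2
    kappa = (acc - expected) / (1 - expected) if expected < 1 else float("nan")

    # Assumption: one-sided exact binomial test of H0 "accuracy <= NIR",
    # i.e. P(X >= correct) for X ~ Binomial(n, nir).
    p_value = sum(
        comb(n, x) * nir**x * (1 - nir) ** (n - x) for x in range(correct, n + 1)
    )
    return {"Acc": acc, "NIR": nir, "k": kappa, "p-value": p_value}


# Made-up 3-class confusion matrix (not data from the Santos project).
cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
overall = overall_stats(cm)
print(overall)
```

Comparing accuracy with the no-information rate matters most for imbalanced data, because a model that always predicts the most frequent class already achieves an accuracy equal to the no-information rate.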

Data set	Class	Sens	Spec	PPV	NPV	Prec	F1	Prev	DR	DP	BA
Training	CB	0.92	1.00	0.98	0.99	0.98	0.95	0.08	0.07	0.07	0.96
Training	CFU	0.92	0.98	0.88	0.98	0.88	0.90	0.16	0.14	0.16	0.95
Training	CCS	0.77	0.98	0.78	0.98	0.78	0.77	0.08	0.06	0.08	0.87
Training	DS	0.92	0.99	0.98	0.95	0.98	0.95	0.39	0.36	0.37	0.95
Training	CS	0.96	0.96	0.86	0.99	0.86	0.91	0.21	0.20	0.23	0.96
Training	LPP	0.93	0.99	0.93	0.99	0.93	0.93	0.09	0.08	0.09	0.96
Test	CB	0.50	1.00	1.00	0.95	1.00	0.67	0.09	0.05	0.05	0.75
Test	CFU	1.00	1.00	1.00	1.00	1.00	1.00	0.14	0.14	0.14	1.00
Test	CCS	1.00	0.95	0.67	1.00	0.67	0.80	0.09	0.09	0.14	0.98
Test	DS	1.00	1.00	1.00	1.00	1.00	1.00	0.36	0.36	0.36	1.00
Test	CS	1.00	1.00	1.00	1.00	1.00	1.00	0.23	0.23	0.23	1.00
Test	LPP	1.00	1.00	1.00	1.00	1.00	1.00	0.09	0.09	0.09	1.00
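The derived columns in the per-class table can be cross-checked against the primary ones. The short sketch below recomputes F1 (harmonic mean of precision and sensitivity) and balanced accuracy (mean of sensitivity and specificity) for four rows copied from the table, and compares them with the reported values. The tolerance `TOL` is an assumption chosen to absorb the two-decimal rounding of the published figures.

```python
# (Sens, Spec, Prec, reported F1, reported BA), copied from the per-class table.
reported = {
    ("Training", "CB"): (0.92, 1.00, 0.98, 0.95, 0.96),
    ("Training", "CCS"): (0.77, 0.98, 0.78, 0.77, 0.87),
    ("Test", "CB"): (0.50, 1.00, 1.00, 0.67, 0.75),
    ("Test", "CCS"): (1.00, 0.95, 0.67, 0.80, 0.98),
}

# Assumed tolerance: inputs and outputs are both rounded to two decimals.
TOL = 0.015


def recompute(sens, spec, prec):
    """Return (F1, balanced accuracy) from sensitivity, specificity, precision."""
    f1 = 2 * prec * sens / (prec + sens)
    ba = (sens + spec) / 2
    return f1, ba


checks = {}
for key, (sens, spec, prec, f1_rep, ba_rep) in reported.items():
    f1, ba = recompute(sens, spec, prec)
    checks[key] = abs(f1 - f1_rep) <= TOL and abs(ba - ba_rep) <= TOL
    print(key, round(f1, 3), round(ba, 3), checks[key])
```

The test row for CB shows why overall and per-class statistics are read together: the overall test accuracy is 0.95, yet the sensitivity for this low-prevalence class is only 0.50.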



* Corresponding author: gfonseca.unifesp@gmail.com