
Overcoming the challenges of data integration in ecosystem studies with machine learning workflows: an example from the Santos project

Abstract

Integrating complex environmental data into a unified analytical framework for large-scale conservation and monitoring initiatives poses several challenges: defining a conceptual model of cause-and-effect relationships, handling data sources that differ in quantity and information content, dealing with missing or noisy data, optimizing models, achieving accurate predictions, and coping with imbalanced observations across factors. In the Santos project, which aims to understand the spatio-temporal dynamics of benthic, pelagic, and physical systems in support of conservation and monitoring programs, the random forest (RF) machine learning technique offers notable advantages for modeling univariate data: it handles non-linearity, covariation, and interactive effects among predictors. For multivariate data sets, a hybrid strategy combining a self-organizing map (SOM) with RF is used to address these challenges. For missing values, bagging imputation outperformed other methods. Both machine learning techniques are robust to noisy data, and noisy observations can still be identified from model outputs. For imbalanced data sets, we investigate the correlation between the RF model's overall statistics and those of individual classes. Interpreting these statistics jointly helps to clarify model limitations and supports discussion of the environmental mechanisms shaping the observed patterns. We propose two analytical workflows that enable the exploration and improvement of model accuracy as well as the investigation of potential cause-and-effect relationships in the data. These workflows also lay the foundation for implementing long-term learning algorithms, an important step for monitoring initiatives. The workflows, together with the treatment of the analytical challenges discussed, can be implemented in iMESc, an open-source application.
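The contrast between overall and per-class statistics on imbalanced data can be illustrated with a minimal sketch. The labels below are hypothetical and are not taken from the Santos project, and the choice of recall and balanced accuracy as the per-class and summary statistics is an assumption, since the abstract does not name the statistics used.

```python
from collections import Counter

def overall_accuracy(y_true, y_pred):
    """Fraction of all observations that were classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """Return {class: recall}, i.e. correct predictions / observations of that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical imbalanced data: 90 "common" and 10 "rare" observations.
y_true = ["common"] * 90 + ["rare"] * 10
# Hypothetical classifier output: every "common" case is right,
# but only 2 of the 10 "rare" cases are recovered.
y_pred = ["common"] * 90 + ["rare"] * 2 + ["common"] * 8

acc = overall_accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
# Balanced accuracy: unweighted mean of the per-class recalls.
balanced = sum(recall.values()) / len(recall)

print(f"overall accuracy:  {acc:.2f}")
for label, value in recall.items():
    print(f"recall[{label}]: {value:.2f}")
print(f"balanced accuracy: {balanced:.2f}")
```

In this toy case the overall accuracy is high because the majority class dominates it, while the recall of the minority class is low. Reading the two together, rather than either alone, is what exposes the limitation.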

Keywords:
Self-organizing map; Random forest; Oceanography; Modeling; Santos basin

Instituto Oceanográfico da Universidade de São Paulo, Praça do Oceanográfico 191, CEP 05508-120, São Paulo, SP, Brazil. Tel.: (11) 3091-6501
E-mail: diretoria.io@usp.br