Ecoinformatics Conference Service, International Conference on Ecological Informatics 6

Uncertainty analysis and ensemble selection of statistical and machine learning models that predict species distribution.

Susan Patricia Worner, Gwénaël Leday, Takagoshi Ikeda

Last modified: 2008-09-13

Abstract


Predictions of changes in species distribution range resulting from climate change or new opportunities for invasion are required for effective management of potentially damaging impacts and other consequences. To date, there have been many approaches to modelling changes in species distribution. When sufficient data are available, simulation or mechanistic models have been used. In the absence of more detailed data, approaches range from standard statistical models to more recent machine learning methods. Such models are based on a species' current distribution and its relationship with habitat characteristics, particularly climate variables. Previous studies have shown considerable uncertainty associated with both parametric and machine learning model predictions of future species distribution. In this study, a meta-analysis was performed 1) to determine how nine supervised models would perform over six problem sets using seven performance metrics and 2) to determine whether an ensemble of models would give better predictive performance. The problem sets comprised the current global distributions of six insect species. The independent variables were a range of climate characteristics for 432 regions, in each of which each species was recorded as either present or absent. The models used were: discriminant analysis (linear and quadratic), logistic regression, Bayesian networks, decision trees (CART and conditional tree), k-nearest neighbours, support vector machine and artificial neural networks. The seven performance metrics used were: accuracy, precision, recall, F-score, Kappa, ROC curve and lift chart. As expected, no model was superior to any other over all problem sets. Moreover, prediction accuracy on cross-classified data, as measured by the performance criteria used in this study, was idiosyncratic to each problem set. An ensemble of models did give improvements over individual models, but again the results were inconsistent.
We discuss the limitations of the data used in this study and whether further improvements in model prediction are possible.
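
As an illustration of the threshold-based metrics named above, the following minimal Python sketch computes accuracy, precision, recall, F-score and Cohen's Kappa from binary presence/absence predictions, and combines several models by simple majority vote. The abstract does not state how the ensemble was formed, so majority voting is an assumption made here for illustration only; the function names and the toy labels are hypothetical and are not data from the study. ROC curves and lift charts are omitted because they require continuous scores rather than hard class labels.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for labels coded 1 = present, 0 = absent."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Compute five threshold-based metrics from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    # Cohen's Kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "kappa": kappa,
    }


def majority_vote(predictions):
    """Combine per-model label lists; a region is 'present' only if more
    than half of the models predict presence (ties go to 'absent', an
    arbitrary choice made for this sketch)."""
    return [1 if 2 * sum(votes) > len(votes) else 0
            for votes in zip(*predictions)]


# Hypothetical toy labels for six regions (not data from the study).
y_true = [1, 1, 0, 0, 1, 0]
model_a = [1, 0, 0, 0, 1, 1]
model_b = [1, 1, 0, 1, 0, 0]
model_c = [0, 1, 0, 0, 1, 0]

ensemble = majority_vote([model_a, model_b, model_c])
print(metrics(y_true, model_a))
print(metrics(y_true, ensemble))
```

On these toy labels the ensemble happens to outperform each single model, but that is a property of the invented example and not a general result; the abstract itself reports that ensemble gains were inconsistent across problem sets.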