Kwaku Peprah Adjei

INTRODUCTION

Biodiversity data from surveys and other monitoring programs, such as citizen science (hereafter ‘CS’), play a vital role in estimating species distributions and making conservation decisions. With the influx of data available to researchers through biodiversity databases such as iNaturalist, GBIF, Artsdatabanken (https://www.artsdatabanken.no/) and eBird, amongst many others, there has been increased coverage of, and inference about, species distributions. However, these biodiversity data (especially those from CS) are subject to sampling and systematic biases, raising concerns about their use in scientific research. The sampling bias may be due to sampling variation in space and time, and the systematic bias could arise because of species misreporting and misidentification as well as imperfect detection, amongst many others.

Various statistical approaches have been used to analyse biodiversity data, such as generalised linear (mixed) models, hierarchical models such as N-mixture models, and joint species distribution models. The choice among these methods depends on the type of response variable. Some modellers of biodiversity data use Maxent, a constrained optimisation approach that finds the optimal species density subject to constraints. Because Maxent is a non-stochastic approach, it is impossible to attach any uncertainty to its predictions. However, Maxent has been shown to be equivalent to point process models, and the point process framework can be used to obtain standard errors on model coefficients and predictions. Biodiversity data have also been modelled as typical geostatistical data with a binary response by including pseudo-absences. These pseudo-absences are used to account for the biases in Maxent to generate reliable and unbiased species distribution models.
However, this approach adds an arbitrary amount of data and ignores the spatial autocorrelation between absences. Some studies have nonetheless explored different methods to control the selection of pseudo-absences when fitting species distribution models.

Furthermore, other approaches propose modelling biodiversity data as a thinned point pattern. A point pattern is a collection of points with random locations. That is, the locations of the points are not fixed or previously chosen. A spatial point process is the model that determines the locations and number of points in an area. Thinning is an operation on point patterns that uses a specified rule to determine which points in the point pattern are deleted. For biodiversity data, the thinning of the point pattern of the true species occurrences is caused by biases inherent in the data. For instance, one proposal is to include a single covariate related to a source of bias (e.g. accessibility) in the linear predictor as a log-linear function that determines the probability of retaining an occurrence (i.e. a variable restricted to (−∞, 0]). Other authors, however, proposed integrating extra sources of information, such as professional surveys, to account for these biases. These biases can be explicitly modelled as a function of known covariates if that information is available, or as an extra random effect when data related to the biases are unavailable. However, these approaches do not consider how the various sources of bias discussed in the next paragraphs affect the observed data.

Firstly, biodiversity data are typically affected by biased sampling and unbiased imperfect sampling processes. Observers, mostly from CS projects, usually choose where they go. This decision is frequently influenced by factors such as accessibility and where observers expect to find more occurrences (i.e. preferential sampling). This is known as the biased sampling process.
On the other hand, some areas will be inaccessible because of government regulations or because they are protected areas, or only a sub-sample of biological material is taken at a given location. This sampling incompleteness is an example of an unbiased imperfect sampling process. The role of the sampling process as a thinning factor in the context of point patterns has been acknowledged in earlier work. One such model treats the thinning process as a log-linear function that affects the intensity of the observed point pattern. Further work explored the implications of not properly accounting for the sampling bias in biodiversity data. This was done by comparing the goodness-of-fit and ecological interpretability of a model that accounts for sampling bias with those of approaches such as Maxent or of a model that does not account for the sampling bias at all. In both cases, modelling the sampling bias as a thinning operation on a point pattern reduced the bias in the estimated effect of ecological covariates on the spatial distribution of species occurrences. However, these developments have not explored other sources of bias that can affect the process of collecting biodiversity data.

Secondly, imperfect detection is another source of bias in biodiversity data. Imperfect detection occurs when the observer fails to detect a species even though it is present. Detectability must be accounted for when estimating species trends and abundance because some species are overlooked, and the detection of a species depends on its behaviour. The detection and identification of these species are influenced by the observer’s attention to a particular species and place, by time, and by factors determining visibility, such as weather conditions.
Failure to account for imperfect detection when analysing biodiversity data may bias parameter estimates \citep[especially when the detection probability varies systematically; ][]{welsh2013fitting} and thus make statistical inference less reliable. Integrated species distribution models (hereafter, iSDMs) have been proposed to analyse observed biodiversity data from various sources and sampling protocols \citep[examples of studies that use iSDMs include ][]{koshkina2017integrated, dorazio2014accounting, erickson2021accounting}. These iSDMs capitalise on the strengths of each dataset to better capture species distributions and dynamics.

During most biodiversity data collection, more than one species is observed or reported. A report of one species can be misclassified as another. For example, iNaturalist reports potentially misclassified species when a species of interest is queried on its website (www.inaturalist.org). Therefore, it is essential that we treat biodiversity data as multispecies data with possible misclassifications and model them as such. Misidentification and/or misreporting of species and other sources of false positives (collectively known as misclassification) are critical issues to consider in biodiversity data. Various methods have been developed to model these false positives. These methods can be model-based, which includes taking a subset of the data with verifiable certainty, instructing observers only to record observations they are sure about, and increasing observer experience. These methods can also be design-based, for example, the dependent double-observer method. Failure to account for misclassification in biodiversity data modelling can increase bias and decrease the precision of the parameter estimates, leading to accidental culling of endangered species, flawed assessments of population status and incorrect conservation decisions.
Accounting for the biases in biodiversity data is paramount for users of these data. However, there is no consensus on how this can be done, and it is a growing research field within both statistics and ecology.

In this paper, we propose a hierarchical multispecies model for biodiversity data that simultaneously accounts for multiple sources of bias and specifies each of them probabilistically. Some work has been done to propose a framework for observer-based biases in CS data, but, to our knowledge, no work has been done to develop a Bayesian framework that provides insight into modelling biases in biodiversity data. Our framework relies on a straightforward specification of the observed biodiversity data as a thinned point pattern. This point pattern is affected by common biases such as uneven sampling effort, imperfect detection and misclassification. We propose three stages of thinning, one for each type of bias considered. Each stage produces a probability of retaining a point from the previous stage. This novel approach is flexible and can accommodate biases in biodiversity data beyond the ones discussed in this paper. This paper aims to introduce our modelling framework and its properties.

The proposed framework is not exempt from identifiability issues. A model is not identifiable when multiple combinations of parameter values are solutions to the equation we are interested in solving. In some complex species distribution models (SDMs), identifiability cannot be established analytically and is instead assessed through numerical integration of the Fisher information matrix. In this study, we also check for identifiability issues in the proposed framework by describing it as a generalisation of existing SDMs that account for sources of bias. Because of these identifiability issues, inference is non-trivial and is left to a future paper.

This paper is organised as follows: in section 2, we introduce the multispecies biodiversity data collection process.
In section 3, we examine the properties of the proposed model by exploring the identifiability issues it can face and discussing possible ways to solve them. Finally, in section 4, we make concluding remarks and suggest future work.
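
As a rough illustration of the three-stage thinning idea introduced above, the following Python sketch simulates a true multispecies point pattern on the unit square and passes it through a sampling stage, a detection stage and a classification stage. It is a minimal sketch under stated assumptions, not the model developed in this paper: the function names, the intensity surface, the distance-based sampling probability, the species-specific detection probabilities and the two-species confusion matrix are all hypothetical choices made only for illustration.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw a Poisson count by counting unit-rate exponential arrivals before `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_true_pattern(rng, intensity, max_intensity, species_probs):
    """True occurrences on the unit square: an inhomogeneous Poisson pattern
    (itself obtained by thinning a homogeneous one), each point given a species label."""
    points = []
    for _ in range(poisson_draw(rng, max_intensity)):  # unit area, so mean = max_intensity
        x, y = rng.random(), rng.random()
        if rng.random() < intensity(x, y) / max_intensity:
            species = rng.choices(range(len(species_probs)), weights=species_probs)[0]
            points.append((x, y, species))
    return points


def observe(rng, points, sampling_prob, detection_prob, confusion):
    """Pass the true pattern through three thinning stages (all inputs are assumptions)."""
    reports = []
    for x, y, species in points:
        # Stage 1 (sampling): the location is visited with probability sampling_prob(x, y).
        if rng.random() >= sampling_prob(x, y):
            continue
        # Stage 2 (detection): a visited occurrence is detected with a species-specific probability.
        if rng.random() >= detection_prob[species]:
            continue
        # Stage 3 (classification): a point of true species j enters the reported
        # pattern of species k with probability confusion[j][k].
        reported = rng.choices(range(len(confusion[species])), weights=confusion[species])[0]
        reports.append((x, y, species, reported))
    return reports


def intensity(x, y):
    """Hypothetical intensity surface, bounded above by 300 on the unit square."""
    return 300.0 * math.exp(-2.0 * x)


def sampling_prob(x, y):
    """Hypothetical log-linear retention probability: its log lies in (-inf, 0]."""
    return math.exp(-3.0 * math.hypot(x - 0.5, y - 0.5))


rng = random.Random(42)
truth = simulate_true_pattern(rng, intensity, 300.0, species_probs=[0.6, 0.4])
reports = observe(
    rng,
    truth,
    sampling_prob,
    detection_prob=[0.8, 0.5],
    confusion=[[0.9, 0.1], [0.2, 0.8]],
)
print(len(truth), "true occurrences;", len(reports), "reported occurrences")
```

Because each stage only removes or relabels points, the reported pattern can never contain more points than the true one. Setting every retention probability to one and the confusion matrix to the identity recovers the true pattern, which is a convenient sanity check when experimenting with the assumed inputs.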