Statistical analysis
Each of the four OpenFoam simulations (i.e. for NW, NE, SE and SW wind
directions) resulted in 76,913 data points across all cliff sections.
For each wind direction, the modelled airflow parameters within each
occupied/ unoccupied section were reduced to the following summary
statistics; median, interquartile range (IQR) and skewness. Median
statistics were used to identify the strength of each wind parameter
within a section. The IQR values identify the variability or gradient of
each wind parameter within a section. High skewness statistics for
horizontal wind speed (skewed right) correspond to shelter.
The complete set of 27 airflow parameters was tested for collinearity by
producing a correlation hierarchy table for each initial wind condition.
Highly correlated terms (Pearson correlation coefficient ≥ 0.7) were
removed from the analysis, leading to the inclusion of 15 wind
parameters. The mean slope steepness per section was added to the total
set and parameters were standardized using the MuMin
package52 version 1.43.17.
We considered that it would be stretching the data to model the density
or number of breeding birds as a continuous variable, particularly as
the extent of each section was determined by the need to survey cliffs,
and did not accord with the beginning or end of occupied/ unoccupied
areas. We therefore ran separate models, with colony presence defined as
(i) the presence of any breeding birds, (ii) the largest (iii) the
densest colonies. For the latter two categories, areas with breeding
birds that did not fall into either the largest or densest categories
were excluded from the modelling process instead of defining them as
absence. The excluded areas represented 23.5% and 27.7% of all
breeding birds respectively. This resulted in trained datasets of 43 and
44 sections for the largest and densest datasets.
A two-step approach was used to build the global model and identify the
final, best-fitting models. First, to reduce the large number of
covariates, a random forest classifier was fitted, using the package
randomForest53 version 4.6.14. The 10 most important
terms were selected to build the global logistic regression model. This
parameter set was further simplified using the dredge function (MuMin
package52), to perform stepwise Bayesian information
criterion (BIC) selection, penalising for model size. In the case of the
models for largest and densest colonies in a NW wind, the number of
terms in dredge had to be gradually reduced to eight and seven
respectively, to prevent fitting of models with probabilities of zero
and one. The simplest model among those with a difference in BIC ≤ 2 was
selected as the best final model.
The final model was assessed for goodness of fit using the
McFadden54 pseudo R2. Values between
0.2 – 0.4 were considered as very satisfactory42.
Model performance was also evaluated in terms of overall accuracy (OA);
true skill statistic (TSS), sensitivity and
specificity55,56. The effect size of each predictor
included in the final model was determined by computing the odds ratio.
The previous steps were repeated for all four initial wind directions,
with three logistic regression models of colony presence/absence
implemented per direction, in order to identify links between wind and
slope that were robust to different classifications of breeding colony.
All statistical analyses were conducted in in R57version 3.6.3 and RStudio58 version 1.1.463.