Statistical analysis
Each of the four OpenFoam simulations (i.e. for NW, NE, SE and SW wind directions) resulted in 76,913 data points across all cliff sections. For each wind direction, the modelled airflow parameters within each occupied/ unoccupied section were reduced to the following summary statistics; median, interquartile range (IQR) and skewness. Median statistics were used to identify the strength of each wind parameter within a section. The IQR values identify the variability or gradient of each wind parameter within a section. High skewness statistics for horizontal wind speed (skewed right) correspond to shelter.
The complete set of 27 airflow parameters was tested for collinearity by producing a correlation hierarchy table for each initial wind condition. Highly correlated terms (Pearson correlation coefficient ≥ 0.7) were removed from the analysis, leading to the inclusion of 15 wind parameters. The mean slope steepness per section was added to the total set and parameters were standardized using the MuMin package52 version 1.43.17.
We considered that it would be stretching the data to model the density or number of breeding birds as a continuous variable, particularly as the extent of each section was determined by the need to survey cliffs, and did not accord with the beginning or end of occupied/ unoccupied areas. We therefore ran separate models, with colony presence defined as (i) the presence of any breeding birds, (ii) the largest (iii) the densest colonies. For the latter two categories, areas with breeding birds that did not fall into either the largest or densest categories were excluded from the modelling process instead of defining them as absence. The excluded areas represented 23.5% and 27.7% of all breeding birds respectively. This resulted in trained datasets of 43 and 44 sections for the largest and densest datasets.
A two-step approach was used to build the global model and identify the final, best-fitting models. First, to reduce the large number of covariates, a random forest classifier was fitted, using the package randomForest53 version 4.6.14. The 10 most important terms were selected to build the global logistic regression model. This parameter set was further simplified using the dredge function (MuMin package52), to perform stepwise Bayesian information criterion (BIC) selection, penalising for model size. In the case of the models for largest and densest colonies in a NW wind, the number of terms in dredge had to be gradually reduced to eight and seven respectively, to prevent fitting of models with probabilities of zero and one. The simplest model among those with a difference in BIC ≤ 2 was selected as the best final model.
The final model was assessed for goodness of fit using the McFadden54 pseudo R2. Values between 0.2 – 0.4 were considered as very satisfactory42. Model performance was also evaluated in terms of overall accuracy (OA); true skill statistic (TSS), sensitivity and specificity55,56. The effect size of each predictor included in the final model was determined by computing the odds ratio.
The previous steps were repeated for all four initial wind directions, with three logistic regression models of colony presence/absence implemented per direction, in order to identify links between wind and slope that were robust to different classifications of breeding colony. All statistical analyses were conducted in in R57version 3.6.3 and RStudio58 version 1.1.463.