Daniel Holmberg, Ella Rauth, Mikko Markkinen, Aaro Suominen

Introduction

As part of the Data Science Project I course, our group worked for Pauli Paasonen and Tuomo Nieminen from the Institute for Atmospheric and Earth System Research (INAR). One of their research topics is cloud formation from aerosols, particularly cloud condensation nuclei (CCN). A proxy for the CCN number concentration is the number concentration of particles with dry diameters larger than 100 nm (N100). Studying these aerosols is important because there is still significant uncertainty about the strength of their cooling effect on the global climate. Depending on factors such as the shape, size, and distribution of aerosols, different types of clouds are formed, which in turn have different albedos.

Unfortunately, measuring N100 concentrations directly is very expensive, difficult, and location-specific, so field data are sparse. The topic of our project was therefore to create models that predict N100 concentrations from more easily measurable variables, such as air temperature and carbon monoxide (CO) concentration. Temperature is known to act as an indicator of biogenic aerosol formation, and CO concentration is a tracer for anthropogenic aerosol emissions.

Data Description

Aerosol (N100) concentration is measured in situ at research stations around the world. For our project we had access to data from 22 locations, 15 of which are in Europe. The data sets vary in length, from less than 2 years in Alert, Canada to 15 years in Hyytiälä, Finland. Measurements are recorded at 10-minute intervals, which we aggregated to daily means.

To predict N100 concentrations we used European Centre for Medium-Range Weather Forecasts (ECMWF) Atmospheric Composition Reanalysis 4 (EAC4) data generated by the Copernicus Atmosphere Monitoring Service (CAMS). Reanalysis combines model data with observations from across the globe.
The principle of data assimilation is used: every 12 hours, previous forecasts are combined with newly available observations \cite{store}. However, EAC4 provides estimates more frequently than that, so we have 6-hourly data. EAC4 also interpolates horizontally and onto 60 hybrid sigma/pressure (model) levels in the vertical \cite{wiki}. We chose the lowest model level, at 10 meters above ground.

The two main variables we used for predicting N100 concentrations are temperature (T, measured in K) and carbon monoxide (CO) concentration. In addition, we experimented with using nitrogen monoxide (NO), nitrogen dioxide (NO2), sulfur dioxide (SO2), isoprene (C5H8), and monoterpene (C10H16) concentrations in one of our models. C5H8 and C10H16 are terpenes, organic compounds produced by a variety of plants, which could correlate with aerosol concentration. All reanalysis data were aggregated to daily averages and cut to match the date ranges of the stations in the N100 dataset.

Models

This section describes and compares four of the models that we created for this project. We always split our data into 75% for training and 25% for testing, after shuffling the data points, to avoid creating biases in the model. As main performance metrics we used the root-mean-square error (RMSE) and the coefficient of determination (\(R^2\)) of the non-transformed predictions. Moreover, we plotted observed versus predicted N100 concentrations (log-transformed and non-transformed) to evaluate the models. In scatterplots that show log-transformed data, predictions smaller than zero have been removed, since their logarithm is undefined.

Baseline Linear Regression Model

Our clients noted that when the temperature rises above 0 °C, biogenic emissions (for which T is an indicator) quickly dominate over anthropogenic emissions (for which CO is a tracer). We were therefore advised to multiply the temperature by a constant \(c\) and take the exponential of the product. Fig. 1 shows how the exponential behaves with \(c\ =\ 0.01\).
The optimal value of \(c\) was found by fitting the regression and calculating the RMSE for each \(c\in\left[0.001,\ 0.2\right]\) in steps of \(0.001\). Additionally, we observed that the distribution of CO concentrations has a strong positive skew. We therefore decided to log-transform the CO data to obtain a less skewed distribution, which would make it easier to model the impact of CO on N100. The model equation for our baseline linear regression model is hence the following:

\(N_{100} = a + b\times \exp(c\times T) + d\times\log(\mathrm{CO})\)

where a, b, c, and d are constants. This model should not be very vulnerable to overfitting: the two independent variables (T and CO) have a very low correlation of 0.024, which indicates that they are nearly linearly independent of each other.
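
The fitting procedure described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code. The function names, the synthetic coefficients, and the choice to select \(c\) by training-set RMSE are all assumptions made for the example. The temperature is also shifted by 273.15 K inside the exponential to keep the regression numerically well conditioned; this is an added assumption that only rescales \(b\) and leaves the form of the model unchanged.

```python
import numpy as np

T_REF = 273.15  # assumed reference temperature (K); shifting T only rescales b


def design_matrix(T, co, c):
    """Columns: intercept, exp(c * (T - T_REF)), log(CO)."""
    return np.column_stack([np.ones_like(T), np.exp(c * (T - T_REF)), np.log(co)])


def fit_baseline(T, co, n100, c):
    """Ordinary least squares for (a, b, d) with c held fixed."""
    coef, *_ = np.linalg.lstsq(design_matrix(T, co, c), n100, rcond=None)
    return coef


def predict(T, co, coef, c):
    return design_matrix(T, co, c) @ coef


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def grid_search_c(T, co, n100, c_grid):
    """Fit the regression for every c and keep the one with the lowest RMSE."""
    best_err, best_c, best_coef = np.inf, None, None
    for c in c_grid:
        coef = fit_baseline(T, co, n100, c)
        err = rmse(n100, predict(T, co, coef, c))
        if err < best_err:
            best_err, best_c, best_coef = err, float(c), coef
    return best_c, best_coef


# Synthetic stand-in for the daily station and reanalysis data (hypothetical values).
rng = np.random.default_rng(0)
n = 2000
T = rng.uniform(250.0, 305.0, size=n)             # temperature in K
co = rng.lognormal(mean=0.0, sigma=0.5, size=n)   # positively skewed, arbitrary units
noise = rng.normal(0.0, 20.0, size=n)
n100 = 200.0 + 150.0 * np.exp(0.05 * (T - T_REF)) + 80.0 * np.log(co) + noise

# Shuffle, then split 75% / 25% into training and test sets.
idx = rng.permutation(n)
n_train = int(0.75 * n)
train, test = idx[:n_train], idx[n_train:]

# Search c in [0.001, 0.2] with a step of 0.001 (200 candidate values).
c_grid = np.linspace(0.001, 0.2, 200)
best_c, coef = grid_search_c(T[train], co[train], n100[train], c_grid)

test_pred = predict(T[test], co[test], coef, best_c)
test_rmse = rmse(n100[test], test_pred)
test_r2 = r2(n100[test], test_pred)
print(f"best c = {best_c:.3f}, test RMSE = {test_rmse:.1f}, test R2 = {test_r2:.3f}")
```

Because \(c\) enters the model nonlinearly, holding it fixed turns each fit into an ordinary linear regression in \(a\), \(b\), and \(d\), which is why a simple grid search over \(c\) is sufficient here.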