Air-pollution monitoring is sparse across most of the United States, so geostatistical models are important for reconstructing concentrations of fine particulate air pollution (PM2.5) for use in health studies. We present XGBoost-IDW Synthesis (XIS), a daily high-resolution PM2.5 machine-learning model covering the contiguous US from 2003 through 2021. XIS uses aerosol optical depth from satellites and a parsimonious set of additional predictors to make predictions at arbitrary points, capturing near-roadway gradients and allowing the estimation of address-level exposures. We built XIS with a computationally tractable workflow for extensibility to future years, and we used weighted evaluation to fairly assess performance in sparsely monitored regions. Averaging across all years in site-level cross-validation, the weighted mean absolute error of predictions (MAE) was 2.13 μg/m3, a substantial improvement over the mean absolute deviation from the median, which was 4.23 μg/m3. Comparing XIS to a leading product from the US Environmental Protection Agency, the Fused Air Quality Surface Using Downscaling (FAQSD), we obtained a 22% reduction in MAE. We also found a stronger relationship between PM2.5 and social vulnerability with XIS than with the FAQSD. Thus, XIS has potential for reconstructing environmental exposures, and its predictions have applications in environmental justice and human health.
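As an illustration of the evaluation metrics named above, the following is a minimal sketch of a weighted MAE and of the baseline it is compared against, the weighted mean absolute deviation from the median (the error of always predicting the median). The weighting scheme itself is not specified in the abstract; the function names, the example values, and the idea that each observation carries a single precomputed weight are assumptions made for illustration only.

```python
def weighted_mean(values, weights):
    """Mean of `values`, with each value counted in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value


def weighted_mae(observed, predicted, weights):
    """Weighted mean absolute error of the predictions."""
    errors = [abs(o - p) for o, p in zip(observed, predicted)]
    return weighted_mean(errors, weights)


def weighted_mad(observed, weights):
    """Weighted mean absolute deviation from the weighted median.

    This is the error of a constant model that always predicts the median,
    so it serves as a reference point for judging the MAE.
    """
    median = weighted_median(observed, weights)
    return weighted_mean([abs(o - median) for o in observed], weights)


# Hypothetical PM2.5 observations and predictions (ug/m3). The larger
# weights stand in for monitors in sparsely monitored regions (assumed).
observed = [5.0, 8.0, 12.0, 30.0]
predicted = [6.0, 7.0, 10.0, 25.0]
weights = [1.0, 1.0, 2.0, 4.0]

mae = weighted_mae(observed, predicted, weights)
mad = weighted_mad(observed, weights)
print(f"weighted MAE = {mae:.3f}, weighted MAD = {mad:.3f}")
```

An MAE well below the MAD indicates that the model explains much of the variation that a constant prediction cannot.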
Machine-learning algorithms are increasingly used to predict ambient PM2.5 concentrations at high spatial resolution (1 × 1 km) from satellite-based aerosol optical depth (AOD). Most machine-learning models have aimed to predict 24-h averaged PM2.5 concentrations (mean PM2.5). For Mexico, no model has been developed to predict subdaily peak levels, such as the maximum daily one-hour concentration (max PM2.5). We present a new modeling approach based on extreme gradient boosting (XGBoost) and inverse-distance weighting that uses AOD data, meteorology, and land-use variables to predict mean and max PM2.5 in Central Mexico (including the Mexico City Metropolitan Area) from 2004 through 2019. Our models for mean and max PM2.5 exhibited good performance, with overall cross-validated mean absolute errors (MAE) of 3.68 and 9.21 μg/m3, respectively, compared to mean absolute deviations from the median (MAD) of 8.55 and 15.64 μg/m3. We also investigated applications of our mean PM2.5 predictions that can aid local authorities in air-quality management and public-health surveillance, such as the co-occurrence of high PM2.5 and heat, compliance with local air-quality standards, and the relationship of PM2.5 exposure with social marginalization.
Background: Accurate and precise estimates of ambient air temperature that capture fine-scale within-day variability are necessary for studies of air temperature and health. Method: We developed statistical models for predicting temperature at each hour in each cell of a 927-m square grid across the Northeast and Mid-Atlantic United States from 2003 to 2019, using data from ~4,000 meteorological stations from the Integrated Mesonet and inputs such as elevation, an inverse distance-weighted interpolation of temperature, and satellite-based vegetation and land surface temperature. We used a rigorous spatial cross-validation scheme and spatially weighted the errors to estimate how well model predictions would generalize to new cell-days. As an example application, we assessed the within-county association of temperature and social vulnerability during a heat wave. Results: We found that a model based on the XGBoost machine-learning algorithm was fast and accurate, obtaining weighted root mean square errors (RMSEs) around 1.6 K, compared to standard deviations around 11.0 K. We found similar accuracy when validating our model on an external dataset from Weather Underground. Assessing predictions from the North American Land Data Assimilation System-2 (NLDAS-2), another hourly model, in the same way, we found it was much less accurate, with RMSEs around 2.5 K. Finally, we demonstrated the health relevance of our model by showing that our temperature estimates were associated with social vulnerability across the region during a heat wave, whereas the NLDAS-2 showed a much weaker association. Conclusion: Our high spatiotemporal resolution air temperature model is a strong contribution to future health studies in this region.
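The inverse distance-weighted interpolation used as a model input can be sketched as follows. This is a generic illustration of the technique, not the authors' implementation: the planar coordinates, the exponent of 2, the function name `idw`, and the station values are all assumptions, and a real application would need geographic distances and a choice of which stations to include.

```python
import math


def idw(stations, target, power=2.0):
    """Inverse distance-weighted estimate at `target`.

    `stations` is a list of (x, y, value) tuples in planar coordinates
    (an assumption; real station data would use geographic distances).
    Each station's weight is distance ** -power, so nearer stations
    dominate the estimate.
    """
    numerator = 0.0
    denominator = 0.0
    for x, y, value in stations:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            # The target coincides with a station: return its value directly
            # rather than dividing by zero.
            return value
        weight = distance ** -power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Hypothetical stations: (x in km, y in km, temperature in K).
stations = [(0.0, 0.0, 290.0), (2.0, 0.0, 294.0), (0.0, 2.0, 292.0)]

estimate = idw(stations, (1.0, 0.0))
print(f"IDW temperature at (1, 0): {estimate:.2f} K")
```

In a predictive model of this kind, such an interpolated value would be computed without the station being predicted, so that it acts as a spatial summary of neighboring observations rather than leaking the target.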