Alqamah Sayeed - Authorea

Health and environmental hazards related to high pollutant concentrations have become a serious issue for both public policy and human health. The objective of this research is to improve grid-wise estimates of PM2.5, a criteria pollutant, by using a machine learning (ML) approach to reduce the systematic bias that arises when PM2.5 is estimated empirically from the aerosol speciation provided by the Modern-Era Retrospective analysis for Research and Applications, version 2 (MERRA-2). We present a unique application of ML for estimating hourly PM2.5 concentrations at MERRA-2 grid points. The model was trained using various meteorological parameters and aerosol species simulated by MERRA-2, together with ground measurements from Environmental Protection Agency (EPA) Air Quality System (AQS) monitors. The ML approach significantly improved performance and reduced mean bias in the 0-10 µg m⁻³ range. We trained a separate Random Forest model for each EPA region using one year of collocated data. The resulting regional models were validated, and the aggregate data set has a Pearson correlation of 0.88 (RMSE = 4.8 µg m⁻³) for training and 0.82 (RMSE = 5.8 µg m⁻³) for testing. With temporal averaging, the correlation improved to 0.89, 0.95, and 0.94, and the RMSE decreased to 4.0, 1.6, and 1.1 µg m⁻³, for daily, monthly, and yearly averages, respectively. Results from an initial implementation of the ML model at global scale are encouraging, but more research and development is required to overcome challenges associated with data gaps in many parts of the world.
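The evaluation statistics quoted above (Pearson correlation, RMSE, and mean bias, computed on hourly values and again after temporal averaging) can be illustrated with a minimal sketch. This is not the authors' code: the function names (`pearson_r`, `rmse`, `mean_bias`, `average_by`) are hypothetical, and the six "hourly" values are synthetic numbers chosen only to show how averaging can cancel hour-to-hour errors.

```python
import math
from collections import defaultdict


def pearson_r(obs, est):
    """Pearson correlation between observed and estimated values."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_e = sum(est) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(obs, est))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_e = sum((e - mean_e) ** 2 for e in est)
    return cov / math.sqrt(var_o * var_e)


def rmse(obs, est):
    """Root-mean-square error of the estimates."""
    return math.sqrt(sum((e - o) ** 2 for o, e in zip(obs, est)) / len(obs))


def mean_bias(obs, est):
    """Mean of (estimate - observation); positive means overestimation."""
    return sum(e - o for o, e in zip(obs, est)) / len(obs)


def average_by(keys, values):
    """Average values that share a key (e.g. a day), ordered by key."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return [sum(groups[k]) / len(groups[k]) for k in sorted(groups)]


# Synthetic example (assumed values, not data from the study):
# three "hourly" PM2.5 samples on each of two days, in µg m^-3.
days = [1, 1, 1, 2, 2, 2]
obs = [8.0, 12.0, 10.0, 20.0, 16.0, 18.0]   # stand-in for AQS monitor values
est = [10.0, 9.0, 11.0, 17.0, 19.0, 18.0]   # stand-in for model estimates

hourly_rmse = rmse(obs, est)
hourly_r = pearson_r(obs, est)
hourly_bias = mean_bias(obs, est)

# Aggregate both series to daily means before recomputing the statistics.
daily_obs = average_by(days, obs)
daily_est = average_by(days, est)
daily_rmse = rmse(daily_obs, daily_est)

print(f"hourly: r={hourly_r:.2f}, RMSE={hourly_rmse:.2f}, bias={hourly_bias:.2f}")
print(f"daily:  RMSE={daily_rmse:.2f}")
```

In this toy case the hourly errors within each day cancel exactly, so the daily RMSE drops to zero. Real data would show a partial reduction instead, consistent with the lower RMSE reported for daily, monthly, and yearly averages.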