Conclusion

Building a model to predict the number of gas leaks in a given area of New York City could save lives and help streamline inspections. We set out to use publicly available data sets to identify predictors of gas leaks and to find the model that best predicts the number of leaks in a given area from previous years' data. After evaluating a series of complex models, we found that, measured by Root Mean Squared Error, none of them performed as well as the naive model at predicting the number and location of gas leaks across New York City, at either the census tract or the zip code level. The more complex models were, however, able to identify predictors of these leaks, including population dynamics, characteristics of the built environment, and construction in the area.
It is important to note that with such a complex set of factors, including the physical state of the gas distribution network, streets, and buildings, as well as (mostly unobserved) human behaviors that introduce risk, it would be extremely difficult to assign causation to gas leaks. A conventional statistical analysis might focus on correlative relationships, but that is complicated when the number of features is large and feature selection is non-trivial. Our research intuition is that predictive models would not address cause and may not offer deep insight into the correlative factors for gas leaks, but could add value when used to prioritize more deterministic methods, such as on-site inspections. Such probabilistic models would prove their worth through their accuracy and add value by optimizing the inspection effort. Additionally, as predictive models become more powerful and research with these datasets continues, a predictive modeling approach could lend insight into important correlations that might be used to develop risk profiles for characteristics of the built environment, such as building type.