Several models were implemented: ordinary linear regression and linear regression with Lasso (L1) and Ridge (L2) regularization. Top features were identified through multiple techniques, including linear correlation, Lasso regularization, the maximal information coefficient (analogous to a linear correlation coefficient, but able to capture non-linear relationships), recursive feature elimination, and Ridge regularization. A random forest model and a simple multi-layer perceptron (MLP) neural network were also implemented. All of the models were built in Python using the scikit-learn machine learning package,\cite{scikit-learn} except the neural network, which was developed using TensorFlow and its Python bindings.\cite{tensorflow2015-whitepaper} The code for these models is available in the public repository accompanying this paper.\cite{github}
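
As a rough illustration of this setup (not the exact code from the repository), the sketch below instantiates the scikit-learn regression models and applies one of the feature-selection techniques, recursive feature elimination. The arrays X_2013 and y_2013 and the number of retained features are placeholders introduced here for the example.

\begin{verbatim}
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Placeholder data: rows are geographic units (zip codes or census
# tracts), columns are features; y_2013 is the 2013 leak count.
rng = np.random.default_rng(0)
X_2013 = rng.random((100, 20))
y_2013 = rng.random(100)

# The regression models described in the text (regularization
# strengths and tree count are illustrative, not tuned values).
models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.01),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100,
                                           random_state=0),
}
for name, model in models.items():
    model.fit(X_2013, y_2013)

# One of the feature-selection techniques: recursive feature
# elimination on a linear model (the number kept is arbitrary here).
rfe = RFE(LinearRegression(), n_features_to_select=5).fit(X_2013, y_2013)
selected_features = np.flatnonzero(rfe.support_)
\end{verbatim}
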
Each model was initially trained on 2013 data, using 2013 gas leaks as the target variable to fit the model parameters. The models were then validated by predicting 2014 gas leaks from 2013 features, on the assumption that a true predictive model would have only the previous year's data available to predict the next year's leaks. In this manner, each model was optimized by tuning hyper-parameters (such as the specific features selected for linear regression, or the learning rate and number of hidden units in the neural network). Once sufficiently optimized, each model was tested by predicting 2015 gas leaks from 2014 features, and the models were compared on overall root-mean-square error (RMSE).
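
This year-over-year scheme can be summarized in a few lines of Python. The following is a hedged sketch rather than the repository code, using placeholder arrays (X_2013, X_2014, y_2013 through y_2015) and an illustrative Ridge hyper-parameter grid to show the train/validate/test split described above.

\begin{verbatim}
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Placeholder arrays standing in for the real features and leak counts.
rng = np.random.default_rng(0)
X_2013, X_2014 = rng.random((100, 20)), rng.random((100, 20))
y_2013, y_2014, y_2015 = rng.random(100), rng.random(100), rng.random(100)

# Train on 2013 targets, validate by predicting 2014 leaks from 2013
# features, and keep the hyper-parameter with the lowest validation RMSE.
best_alpha, best_rmse = None, np.inf
for alpha in (0.01, 0.1, 1.0, 10.0):        # illustrative grid
    model = Ridge(alpha=alpha).fit(X_2013, y_2013)
    val_rmse = rmse(y_2014, model.predict(X_2013))
    if val_rmse < best_rmse:
        best_alpha, best_rmse = alpha, val_rmse

# Final test: predict 2015 leaks from 2014 features with the tuned model.
final_model = Ridge(alpha=best_alpha).fit(X_2013, y_2013)
test_rmse = rmse(y_2015, final_model.predict(X_2014))
\end{verbatim}
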

Results

Tables I and II and Figures \ref{576363} and \ref{508370} give tabular and visual performance indicators for each of the models. At both the zip code and census tract levels, the naive model achieved the lowest overall RMSE: 0.002446 at the zip code level and 0.3173 at the census tract level.
\begin{table}[h]
\centering
\caption{Top model performance (zip code level)}
\begin{tabular}{lc}
\hline
Model & Total RMSE \\
\hline
Naive & 0.002446 \\
Random forest & 0.002946 \\
Ridge regression & 0.003661 \\
Linear regression (select features) & 0.003663 \\
MLP neural network & 0.005156 \\
\hline
\end{tabular}
\end{table}