.. EE 508 / DS 537: Data Science for Conservation Decisions

.. only:: instructor

   Move to RMSE / ``root_mean_squared_error`` (and perhaps R2)?

.. _lab3-1:

Predicting land acquisition cost and forest change across Massachusetts
=======================================================================

In this exercise, we will use multivariate regressions and machine learning models (tree ensembles) to estimate the parcel-level **opportunity cost of land protection** (proxied by private value) and the **probability of forest loss** (based on historical patterns) across hundreds of thousands of parcels in Massachusetts, United States.

.. admonition:: Deliverables

   - Estimation of land values:

     - A table of within-sample mean squared errors (MSEs) for three linear regressions.
     - Maps of predicted values and residuals of a linear regression model with quadratic terms and interactions (2 maps).
     - A table of the average out-of-sample mean squared errors (MSEs) for all statistical models covered in this exercise (5-fold random cross-validation).
     - Maps of (a) predicted values and (b) residuals of the best-performing extremely randomized trees model (2 maps).

   - A map of predicted forest cover change.
   - Optional (extra credit): a folder containing information on a better predictive model for land acquisition cost (code, cross-validated MSE, data).

.. admonition:: Due date

   Friday, Oct 31, 2025, by 6:00 p.m.

Understand how data on costs and threat can help with policy targeting
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The analysis we conduct in this lab is a variation of a study published by Newburn et al. about Sonoma County, California, a renowned vineyard region in the U.S.

`Newburn et al. (2006) Habitat and Open Space at Risk of Land-Use Conversion: Targeting Strategies for Land Conservation. American Journal of Agricultural Economics `_

This is the direct link to the journal article at JSTOR.
You can log in with your Boston University credentials (search for *Boston University*, then use your Kerberos password).

`David A. Newburn `_, associate professor in environmental and resource economics at the University of Maryland, developed this analysis as a PhD candidate at UC Berkeley. We will build a very similar analysis in this lab, moving from the wine-growing countryside near San Francisco to the forests of Massachusetts.

Read the essential chapters of Newburn *et al.*'s analysis. I expect you to have understood the following sections to do well on writeups and tests:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Chapter
     - Instructions
   * - Introduction
     - Great overview of funding for locally-funded conservation action in the United States.
   * - Modeling Framework for Prioritizing Conservation Easements
     - Read only as far as it is accessible to you. I won't ask you to digest, reproduce, or explain the algebra (pages 30-33), although I recommend it if you have the resources. Note that we'll simplify the problem: we'll use a single budget year.
   * - Empirical Procedure

       - Research Study Area
       - Environmental Benefits Index
     - Introduction to Sonoma and the (simple) definition of the objective function
   * -
       - Land-Use Change Model
       - Valuation of Development Rights Model
     - For the optimization to work, we need to estimate land cover change and conservation cost. These two chapters introduce the approach. You don't need to understand every piece. But study **Table 1** and **Table 2** to get an idea of the parcel-level data the authors argue is needed and available for modeling the likelihood of land cover change and the cost of halting it.
   * - Results and Discussion on Targeting Simulations
     - **Figure 1** and **Table 5** contain the essential findings. You can read them now, or revisit the reading when we start :ref:`Lab 3-2 `.
Continue reading when you have studied **Table 1** and **Table 2** and have pondered the question of how you would produce such a dataset from scratch.

In this first part of Lab 3, we start with the same problem: we need to estimate conservation cost (the value of partial property rights) and the likelihood of loss (land cover conversion). We'll use different data, methods, and assumptions, but the structure of the problem and solution process is very similar.

Meet the modeling packages
~~~~~~~~~~~~~~~~~~~~~~~~~~

We will explore statistical models with ``statsmodels`` and ``scikit-learn``, two of the most popular Python packages for fitting statistical models to data. Both include basic models, such as linear or logistic regressions. They differ in philosophy, usage, and the overall suite of models they support. Learning to work with both allows you to build on their respective strengths.

- ``statsmodels`` follows the classical regression framework for defining, analyzing, and selecting statistical models. It is best for users who are primarily interested in **hypothesis testing, causal inference, and explanation.** ``statsmodels`` makes it easy to test for the significance of relationships between independent and dependent variables. Its usage is similar to R, Stata, SPSS, and similar programs: upon model fitting, the package reports results from a range of statistical tests that can assist with model assessment and selection, including measures of model fit (R2, AIC, etc.) and significance tests of parameter estimates. Besides linear regression, its `suite of models `_ includes logistic regressions, quantile regressions, survival analysis, and others.
- ``scikit-learn`` is one of the most popular Python libraries for predictive modeling using simple machine learning methods. Its design places an emphasis on **predictive accuracy** rather than on statistical inference. The central model selection criteria are the accuracy and precision with which a model trained on a subset of the data (training data) can predict data that was **not** used to fit the model (test data) – a technique called cross-validation. ``scikit-learn`` has a very active support community and offers a range of `modern machine learning methods `_, including random forests, extremely randomized trees, boosted regressions, support vector machines, and simple neural networks.

Neither ``statsmodels`` nor ``scikit-learn`` provides strong support for deep learning methods (e.g. convolutional neural networks). If you want to use these, look into ``tensorflow`` or ``pytorch``. You are welcome to throw new models at the problem as part of your deliverables. Your ``ee508`` environment already has ``xgboost``, ``lightgbm``, and ``catboost`` installed.

Examine the parcel data
~~~~~~~~~~~~~~~~~~~~~~~

Get the data
------------

You will need two vector data files and a metadata file for this project:

| :file:`USMA_state_outline.parquet`
| :file:`USMA_parcels_places-2019.parquet`
| :file:`places_metadata.xlsx`

You can find all three in: `~/ee508/data/external/lab3/ `_

Load the parcels
----------------

Start a new Jupyter notebook. Import ``numpy``, ``pandas``, ``geopandas``, and ``matplotlib.pyplot`` to their usual aliases. Set up your directory structure and paths.

Use ``gpd.read_parquet()`` to read :file:`USMA_state_outline.parquet` into a ``GeoDataFrame`` you call ``state_outline``. Plot it. Does it look familiar?

Read :file:`USMA_parcels_places-2019.parquet` into a ``GeoDataFrame`` you call ``parcels``.

.. tip::

   The vector data in this lab is provided in the `GeoParquet `_ format (``.parquet`` file extension). It's **great** for storing large ``GeoDataFrame`` data. Compared to other formats, it's incredibly fast for reading and writing large vector data volumes. It keeps the indices. It often compresses well, too. See for yourself: after importing the parcels, use ``.to_file()`` to write them to a GeoPackage (``.gpkg``). Read it back to memory with ``gpd.read_file()`` and compare speeds. You won't need any further persuasion.

   Not all programs support it. QGIS 3.44.2 supports it on Windows, but on Mac, you need to use ``conda`` to install a separate, GeoParquet-compatible QGIS installation (see instructions on `GeoParquet `_).

This parcel dataset is an extract from `PLACES `_, a parcel-level data processing system designed to expedite the synthesis of datasets for land conservation in the United States. It is written in Python using the same packages you use in this class. The dataset includes all parcels in Massachusetts that were larger than one hectare and could be linked up to tract-level demographic data on income and population density (n\ :sub:`obs` = 218,716). Look up the description of the meaning, source, and computation of each variable in :file:`places_metadata.xlsx`. The underlying vector data, including the last sale data and the parcel ID (``'parcel_id'``), are sourced from the `Standardized Assessor's Parcels `_ provided by MassGIS. The ``'parcel_id'`` is unique for all parcels in Massachusetts.

Speed up the visualization with centroids
-----------------------------------------

We will frequently make maps of values of new variables (e.g., predicted values) for tens of thousands of parcels. Because the vector polygon data is large (195MB compressed), making maps of parcel **polygons** with ``plot`` will be slow (~20 sec. on my machine). A faster workaround is to plot only the parcel **centroids** as points, and scale them by the size of the parcel: this reduces plotting time by a factor of four.

- Create a ``.copy()`` of ``parcels`` that you call ``parcel_centroids``. Set its geometry to the centroid of the polygons:

  .. code-block::

     parcel_centroids['geometry'] = parcel_centroids['geometry'].centroid

- Join a column (e.g. ``'slope'``) from ``parcels`` as a ``Series`` to ``parcel_centroids`` and plot its values. Recall that you can do this "on the fly":

  ::

     parcel_centroids.join(parcels['slope']).plot('slope', linewidth=0)

Joining ``parcels`` to ``parcel_centroids`` "on the fly" (i.e., right before plotting) allows you to always use the most recent data in ``parcels``. This will become useful later when we plot and compare model results.

You can scale the point size in proportion to the size of the parcels by calling ``plot`` with a ``markersize`` argument that receives either:

- a column name from the ``GeoDataFrame`` (e.g., ``markersize='ha'``), or
- a numeric ``Series`` with the same length as the number of rows in your ``GeoDataFrame`` (e.g. ``markersize=parcels['ha']``).

We will also be happy to have a few additional columns that store the markersize for each parcel (so you can pass ``markersize='ms'`` or ``markersize='ms_large'`` as an argument to ``plot()``):

::

   parcel_centroids['ms'] = parcels['ha'] / 25
   parcel_centroids['ms_large'] = parcels['ha'] / 10

Pass these arguments to the ``plot`` function for all centroid plots in this lab:

- ``markersize='ms'`` for maps of all parcels.
- ``markersize='ms_large'`` for maps that contain only a selection of parcels, e.g. parcels selected into a program (to see smaller parcels better).
- ``linewidth=0`` to remove the boundary lines, which confound the visual impression of the area-based mapping.
- ``alpha=0.5`` (or similar) to reduce the opacity of the points to 50%. This will make the points partially transparent, which is useful if you have overlapping points and wish to see all of them.

If any of your plots of a continuous variable does not show much heterogeneity due to the presence of extreme max/min values, recall that you can use the ``scheme`` argument to ask ``geopandas`` to add a classification system for the color mapping (e.g. ``scheme='quantiles'``).
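The on-the-fly join and the markersize scaling rely only on index alignment, so they can be sketched with plain ``pandas``. The toy values below are hypothetical stand-ins for the real ``'slope'`` and ``'ha'`` columns:

```python
import pandas as pd

# Toy stand-ins for the real data (hypothetical values; the actual
# 'slope' and 'ha' columns come from the PLACES parcel dataset).
parcels = pd.DataFrame(
    {'slope': [2.0, 5.5, 1.2], 'ha': [50.0, 25.0, 100.0]},
    index=pd.Index([101, 102, 103], name='parcel_id'),
)
parcel_centroids = pd.DataFrame(index=parcels.index)

# On-the-fly join: attach the current 'slope' values right before use,
# so the centroids never hold a stale copy of the column.
joined = parcel_centroids.join(parcels['slope'])

# Precomputed marker sizes, scaled to parcel area as in the lab.
parcel_centroids['ms'] = parcels['ha'] / 25
parcel_centroids['ms_large'] = parcels['ha'] / 10
```

Because the join matches on ``'parcel_id'``, the order of rows in the two frames does not matter, only their shared index.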
Create filters and features
---------------------------

Before we jump into the modeling process, we have to pre-compute a few variables that will be needed in all of the following models:

- The outcome variable for our deforestation risk estimation will be the change in the percentage of parcel-level forest cover between 1990 and 2010:

  .. dropdown:: Reveal solution

     ::

        parcels['perc_forest_change'] = (
            parcels['perc_forest_2010'] - parcels['perc_forest_1990']
        )

- The outcome variable for our cost estimation will be the log of the per-hectare sale price of the last sale of the property (``'lastsale_price'``). Using logged prices is very common in property value prediction: similar to individual wealth and income distributions, the distribution of property prices tends to be very skewed. Taking the logarithm brings us closer to a common regression assumption: that errors are normally distributed.

  .. dropdown:: Reveal solution

     ::

        parcels['lastsale_price_per_ha_log'] = np.log(
            parcels['lastsale_price'] / parcels['ha']
        )

- We will need the year of the last sale to identify recent sales and to include the year among the predictors. The following code transforms a ``Series`` with dates in string format to a date format, and then uses the ``dt`` (short for date) accessor to obtain its ``year``:

  ::

     parcels['lastsale_year'] = pd.to_datetime(parcels['lastsale_date']).dt.year

- We also need to drop properties that had an easement when they sold, because we want our estimates to reflect the full value of an unencumbered property. We can achieve this by creating a new variable that gives us the percentage of easement (``'perc_easement'``) on a property *at the time of its last sale* (``_sale``). This also means that we assume that, before the year of the easement creation (``'easement_year'``), there was no easement on the property:

  .. dropdown:: Reveal solution

     ::

        parcels['perc_easement_sale'] = np.where(
            parcels['easement_year'] <= parcels['lastsale_year'],
            parcels['perc_easement'],
            0
        )

- Note the use of ``np.where(condition, x, y)``, which is similar to ``ifelse`` in R. For each element, if ``condition`` is ``True``, it returns the value of ``x`` (or the corresponding element in ``x``, if ``x`` is a ``Series``), otherwise it returns the value of ``y`` (or its corresponding element in ``y``, if ``y`` is a ``Series``).

- Finally, let's normalize the building area to the parcel size (building area per parcel area - akin to building density).

  .. dropdown:: Reveal solution

     ::

        parcels['bldg_area_per_ha'] = parcels['bldg_area_m2'] / parcels['ha']

The next step is to build a suite of datasets that only contain the data that we'll use in the statistical estimation (dropping everything else). We need one dataset to predict last sale prices (for which we will use the suffix ``_sale``) and another one to estimate the probability of forest change (for which we will use the suffix ``_forest``). For each dataset, we will need

- a DataFrame, ``X`` (uppercase by convention), that contains all predictor variables (with suffixes: ``X_sale`` or ``X_forest``), and
- a Series, ``y`` (lowercase by convention), that contains the dependent variable (with suffixes: ``y_sale`` or ``y_forest``).

We will use different sets of predictor variables for the sale price and the forest change model. This is for three reasons:

- We'll end up using different time periods for each predictive model. Forest change is a phenomenon that happens rarely, so we choose as long a time period as possible (20 years) to get a sense of patterns. However, last sales data is better for recent time periods, and errors increase as you go back in time; we will therefore use sales data starting in 2000. To the extent possible, the predictor variables will have to be from the beginning of the respective time period.
- We will use year-2011 land cover classes and recent building data in the sales price model, as they provide a proxy for land use (agricultural, forest, development, etc.). However, we won't use them in the forest change model, as this could confound the outcome prediction (i.e., we would risk predicting 1990-2010 forest change by its 2011 outcome).
- The year of sale is only relevant in the land price model.

::

   XCOLS_SALE = [
       'slope', 'travel', 'perc_wetland', 'coast_2500',
       'river_frontage', 'lake_frontage',
       'pop_dens_tract_2012-2016', 'hh_inc_avg_tract_2012-2016',
       'perc_forest_2000', 'perc_prot_2000_500',
       'perc_dev_2000_200', 'perc_dev_2000_1000', 'perc_dev_2000_5000',
       'perc_dev_open', 'perc_dev_low', 'perc_dev_medium', 'perc_dev_high',
       'perc_crops', 'perc_pasture', 'perc_grassland',
       'bldg_area_per_ha', 'lastsale_year',
   ]

   XCOLS_FOREST = [
       'slope', 'travel', 'perc_wetland', 'coast_2500',
       'river_frontage', 'lake_frontage',
       'pop_dens_tract_2012-2016', 'hh_inc_avg_tract_2012-2016',
       'perc_forest_1990', 'perc_prot_1990_500',
       'perc_dev_1990_200', 'perc_dev_1990_1000', 'perc_dev_1990_5000',
   ]

The **sale price prediction dataset** should only contain parcels:

1. that have no missing values in the predictors (``XCOLS_SALE``) and outcomes,
2. that were sold between 2000 and 2019 (inclusive of both years), and
3. that were not protected by an easement at the time they were sold.

Go ahead and identify those rows. There should be 44,424 of them. Once you have identified them, use ``parcels.loc[..., ...].copy()`` to create ``X_sale`` and ``y_sale``.

One way of coding up a multi-criteria selection is to build a boolean ``Series`` - let's call it ``is_selected_sale`` - that is ``True`` for selected parcels with a last sale. You can then use that ``Series`` to select rows for both datasets (``X`` and ``y``) - and check how many items will be selected by taking its sum: ``is_selected_sale.sum()``.

.. dropdown:: Reveal solution

   ::

      is_selected_sale = (
          parcels[XCOLS_SALE].isnull().sum(1).eq(0)
          & parcels['lastsale_year'].ge(2000)
          # ... add the other criteria ...
      )
      X_sale = parcels.loc[is_selected_sale, XCOLS_SALE].copy()
      y_sale = parcels.loc[is_selected_sale, 'lastsale_price_per_ha_log'].copy()

Now that ``X_sale`` has no empty values in ``'lastsale_year'`` anymore, let's turn its values from floats into integers. This will look better in future steps (e.g. when using year dummies).

.. dropdown:: Reveal solution

   ::

      X_sale['lastsale_year'] = X_sale['lastsale_year'].astype(int)

When you're done, you should have a DataFrame ``X_sale`` that has 22 columns and 44,424 rows, and a Series ``y_sale`` with 44,424 items.

For the **forest change dataset**, we want to retain rows:

1. that have no missing values in the predictors, and
2. that were never protected (because protection is expected to change the default forest change path).

We'll implement the latter condition by selecting parcels whose ``'perc_prot'`` value is smaller than or equal to ``5`` (i.e., 5%). This arbitrary threshold allows us to keep parcels that have a very small overlap with neighboring protection polygons (which is, in most instances, due to mapping errors). 177,277 parcels should meet these two criteria. Once you have identified them, use ``parcels.loc[..., ...].copy()`` to create ``X_forest`` and ``y_forest``. Recall that the ``y`` column is ``'perc_forest_change'``.

Fit explanatory models with statsmodels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have prepared the dataset, running the regression in ``statsmodels`` is easy:

::

   import statsmodels.api as sm

   model_sale_ols = sm.OLS(y_sale, sm.add_constant(X_sale)).fit()
   model_sale_ols.summary()

.. note::

   ``sm.add_constant()`` is ``statsmodels``' approach to adding a constant intercept to the linear regression. The command adds a column named ``const`` with a constant value (:input:`1`) to the ``DataFrame``.
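To see what ``sm.add_constant()`` produces without running the full regression, here is a pandas-only sketch of the equivalent operation (the column values are made up for illustration):

```python
import pandas as pd

# Toy predictor table with hypothetical values.
X = pd.DataFrame({'slope': [2.0, 5.5], 'travel': [10.0, 30.0]})

# Equivalent of sm.add_constant(X): prepend a 'const' column of ones,
# which becomes the intercept term in the OLS fit.
X_const = X.copy()
X_const.insert(0, 'const', 1.0)
```

Without this column, the fitted regression line would be forced through the origin.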
The output of ``.summary()`` is what you would expect from a linear regression in R, Stata, SPSS, or similar statistical packages: you'll find statistical measures of overall fit (R-squared, adjusted R-squared, AIC), as well as coefficients and significance values for the constant intercept and parameter estimates.

You can obtain the predicted (fitted) values as a ``Series`` with ``model_sale_ols.fittedvalues`` and the residuals with ``model_sale_ols.resid``. Note that both ``Series`` already have the correct index (the ``'parcel_id'``). This makes it easy to join them to your parcel data and centroids.

Create a map of the predicted values.

.. dropdown:: Reveal solution

   ::

      parcels_to_plot = parcel_centroids.join(
          model_sale_ols.fittedvalues.rename('pred'), how='inner'
      )

      fig, ax = plt.subplots(figsize=(9, 4.5))
      state_outline.boundary.plot(ax=ax, color='black', linewidth=0.2)
      parcels_to_plot.plot(
          'pred', ax=ax, markersize='ms_large', alpha=0.5,
          cmap='RdYlBu_r', scheme='quantiles', k=10, linewidth=0,
          legend=True,
          legend_kwds={'loc': 'upper left', 'bbox_to_anchor': (1, 1)},
      )
      ax.axis('off')

- You can only ``join`` a ``Series`` to a ``DataFrame`` if the ``Series`` has a ``name``. That's why we ``rename()`` the ``Series`` of fitted values (which has no name).
- The ``'inner'`` join is necessary so that only observations are retained for which we have a predicted value (in this case, the observations we used in the model).
- I used ``markersize='ms_large'`` as we're not displaying all parcels.

Create a map of the residuals (prediction errors). Do you observe any spatial patterns in the residuals (clustering of colors), or does this map look perfectly random?

- If you're wondering what I mean, find Martha's Vineyard and Nantucket (the islands).

Remove time trends with yearly dummies
--------------------------------------

Let's see whether we have a temporal trend in the residuals by plotting the mean of the residuals in each year against the year.
We first need to create a DataFrame that has ``lastsale_year`` and the residuals as columns:

::

   df = X_sale[['lastsale_year']].join(model_sale_ols.resid.rename('resid'))

Then we use ``groupby`` to compute and plot the ``mean`` of the residuals by ``lastsale_year``:

::

   df.groupby('lastsale_year')['resid'].mean().plot()

Is this a linear or exponential growth curve – or is something happening around 2008?

Our simple visualization of the residuals reveals that there are spatial and temporal patterns that are not well captured by the current model and its covariates. We will try to improve this model. Before we do, we need to define what we mean by "improvement". In other words, we need a measure of prediction error (or "loss function" in machine learning) that we can use as a yardstick to compare all models. In this lab, we will aim to minimize the mean squared error (MSE): the mean of all squared differences between actual values and their corresponding fitted (predicted) values.

What is the MSE of the current model? As usual, there are several ways to find out. In this case, the simplest one would be:

::

   model_sale_ols.resid.pow(2).mean()

Alternatively, you can use ``sklearn``'s built-in functions:

::

   from sklearn.metrics import mean_squared_error

   mean_squared_error(y_sale, model_sale_ols.fittedvalues)

The MSE of our current model is approximately 1.98. Let's see whether we can improve it.

One way to remove the (irregular) time trend is to use yearly dummies. This approach assumes that prices vary across years by an amount that is constant within each year. The constants can be estimated in a regression with dummy variables: for each year except one (the reference / intercept), we will create a variable that has the value 1 if the sale occurred in that year, and 0 otherwise. This approach is called "one-hot" encoding.
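As a quick sanity check of the dummy idea, ``pandas`` offers ``pd.get_dummies`` for the same one-hot transformation. A minimal sketch with made-up years (the lab itself uses scikit-learn's ``OneHotEncoder``, which has the advantage of remembering the fitted categories for later transforms):

```python
import pandas as pd

# Hypothetical sale years for four parcels.
years = pd.Series([2000, 2001, 2001, 2002], name='lastsale_year')

# One 0/1 column per year; drop_first removes the reference
# category (here 2000), which is absorbed by the intercept.
dummies = pd.get_dummies(years, prefix='', prefix_sep='', drop_first=True)
```

Each row has at most one non-zero entry; a row of all zeros marks the reference year.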
Creating dummy variables from a categorical variable is so common that ``sklearn`` provides a function for it, the ``OneHotEncoder``:

::

   from sklearn.preprocessing import OneHotEncoder

   enc = OneHotEncoder(categories='auto').fit(X_sale[['lastsale_year']])
   dummies_array = enc.transform(X_sale[['lastsale_year']]).toarray()
   dummies_df = pd.DataFrame(
       dummies_array,
       index=X_sale.index,
       columns=[v[14:] for v in enc.get_feature_names_out()],
   )

What is happening here?

- We first create a ``OneHotEncoder`` object and let it find all the categories (years) automatically by "fitting" it to the ``lastsale_year`` column.
- We use this object to transform the data in the ``lastsale_year`` column into a 2D array of dummy values.
- We turn that array into a ``DataFrame``.
- We use the index of the input DataFrame (``'parcel_id'``) and the year names as the column names.
- The small list comprehension cuts off the column name prefix (``'lastsale_year_'``) from the column names.

Once we have the ``dummies_df`` DataFrame, we can create a new version of ``X_sale`` (let's call it ``X_sale2``) in which we replace the ``lastsale_year`` column with year dummies. Note that we need to choose one reference category (year) and omit it from the regression (we omit ``'2000'``).

::

   X_sale2 = (
       X_sale.drop(columns='lastsale_year').join(dummies_df).drop(columns='2000')
   )

Run the regression with the new dataset of explanatory variables ``X_sale2``.

::

   model_sale_ols2 = sm.OLS(y_sale, sm.add_constant(X_sale2)).fit()
   model_sale_ols2.summary()

Has model fit improved?

::

   mean_squared_error(y_sale, model_sale_ols2.fittedvalues)

Plot the coefficient estimates of the year dummies:

::

   model_sale_ols2.params.loc['2001':'2017'].plot()

Add variable transformations and interactions
---------------------------------------------

As with any regression, you can add variable transformations (e.g., polynomial terms) and interactions between variables.
The easiest way to do so is to create additional columns with the values of those quadratic terms and interactions. Create a new dataset ``X_sale3`` that includes quadratic terms and interactions for all variables other than the year dummies:

::

   from itertools import combinations

   X_sale3 = X_sale.drop(columns='lastsale_year').copy()

   new_cols = {}
   for v in X_sale3.columns:
       new_cols[v + '^2'] = X_sale3[v] ** 2
   for v1, v2 in combinations(X_sale3.columns, 2):
       new_cols[v1 + '*' + v2] = X_sale3[v1] * X_sale3[v2]

   X_sale3 = pd.concat(
       [X_sale3, pd.DataFrame(new_cols, index=X_sale3.index)], axis=1
   )
   print(f'{len(X_sale3.columns)} columns')

Run the regression with the new dataset of explanatory variables ``X_sale3``. Call the model ``model_sale_ols3``. Do you find at least some of the quadratic terms and interactions to be significant? Has model fit improved? How about the spatial patterns of the residuals?

Create a new CSV file (e.g. using Excel) and note your findings:

:file:`~/ee508/reports/lab3/1_mean_prediction_error_within_sample.csv`

The file should be a simple two-column table. The first column, ``'model'``, will contain the name of the model (without the 'model' prefix), and the second, ``'mean_squared_error'``, the MSE from that prediction. It is probably quickest to fill in these values manually in Excel; just make sure you save this file in the :file:`.csv` format (not as an Excel sheet).

+-----------+------------------------+
| **model** | **mean_squared_error** |
+===========+========================+
| ols       | 1.980                  |
+-----------+------------------------+
| ols2      | ...                    |
+-----------+------------------------+
| ols3      | ...                    |
+-----------+------------------------+

Create maps of the predicted values and residuals of the last model. Use the colormap ``'Spectral_r'``.
Add a legend and a meaningful title to each plot. Save the maps as:

| :file:`~/ee508/reports/lab3/1_sale_ols3_resid.png`
| :file:`~/ee508/reports/lab3/1_sale_ols3_pred.png`

Fit predictive models with scikit-learn
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We have seen that there are important non-linearities and interactions between several of the predictor variables and sales prices. However, we have not been able to remove the clustering of spatial errors in those predictions. We now transition to ``scikit-learn``, which gives us access to a wider suite of models that might help us improve our predictive performance.

As mentioned, the machine learning perspective of ``scikit-learn`` emphasizes predictive power over explainability: models are primarily judged by the accuracy with which they can predict the value of observations that they have not seen yet (i.e. observations that haven't been used to fit the model). This approach is referred to as "out-of-sample" prediction or "cross-validation", and needs at least two distinct steps: 1) training (fitting the model on one subset of the data) and 2) testing (measuring its predictive power with another subset of the data).

We have already imported ``mean_squared_error`` from ``scikit-learn`` above. We also need to import a few predictive models and functions for cross-validation:

::

   from sklearn.linear_model import LinearRegression
   from sklearn.ensemble import (
       ExtraTreesRegressor, GradientBoostingRegressor, RandomForestRegressor
   )
   from sklearn.model_selection import train_test_split, KFold

As with ``statsmodels``, fitting a model is straightforward once you have the datasets ready. The syntax is just slightly different, but consistent across all ``sklearn`` models. You first define the model, then you fit it (note that the position of ``X`` and ``y`` is different from ``statsmodels``!).
::

   model = LinearRegression()
   model.fit(X_sale, y_sale)

Once the model is fitted, the predicted values can be obtained with:

::

   model.predict(X_sale)

And the MSE with:

::

   mean_squared_error(y_sale, model.predict(X_sale))

Try it out with the different predictor datasets ``X_sale``, ``X_sale2``, and ``X_sale3``. Do you get the same MSEs as above? You should. OLS regressions are deterministic, so results should be identical.

One downside of working with ``scikit-learn`` is that there is no ``summary()`` function for the OLS regression that allows you to assess model quality for statistical inference. Coefficient estimates can be obtained from the ``model`` object of a regression (``model.coef_``, ``model.intercept_``), and R² values with ``model.score(X, y)``. However, if you need significance levels of coefficients or other indicators of model fit (e.g. AIC/BIC), you have to compute these yourself. If your interest is statistical inference, it is easier to run the regression in ``statsmodels`` instead.

Predict out-of-sample
---------------------

How good is the **out-of-sample** prediction of our simple linear model? To find out, we need to split our original dataset randomly into training and testing subsets, fit the model on the training sample, and test its predictive accuracy with the test sample. One way to do this is to use ``sklearn``'s function ``train_test_split``, which does the random splitting for you.

::

   X_train, X_test, y_train, y_test = train_test_split(X_sale, y_sale)

   model = LinearRegression()
   model.fit(X_train, y_train)
   mean_squared_error(y_test, model.predict(X_test))

Run this code a few times and look at the error. Does it change? It should, because each model is based on a different random subset of the dataset. By default, ``train_test_split`` retains 25% of the sample for testing. You can change that by passing a ``test_size`` argument (e.g., ``test_size=0.5``).
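Averaging the test error over many random splits is a simple loop around these steps. A self-contained sketch on synthetic data (with the lab data, you would split ``X_sale`` and ``y_sale`` instead):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for X_sale / y_sale.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Repeat the split-fit-score cycle and average the test MSEs.
mses = []
for seed in range(20):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    mses.append(mean_squared_error(y_test, model.predict(X_test)))

avg_mse = np.mean(mses)
```

Fixing ``random_state`` per iteration only makes the sketch reproducible; in the lab, letting the split vary freely is fine.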
To get a good sense of the predictive power of a given model, you can nest the above code in a ``for`` loop that repeats the out-of-sample MSE computation for a large number of random splits between training and test data. This gives you an *average* expected MSE, which is a more suitable indicator for comparing the predictive power of different models than the result from a single split.

Run a cross-validation
----------------------

Alternatively, you can use **k-fold cross-validation**: this approach splits the original dataset into **k** different "folds" (subsets) of equal size (where **k** is some integer value). It then estimates **k** models, where each model is trained on a sample that consists of **k-1** folds (leaving one fold out), and then tested with the fold that was not used for fitting the model. The resulting MSE values are averaged across all models. This approach guarantees that each observation is used only once for validation, which makes the test statistics more robust to outliers.

We will use ``scikit-learn``'s ``KFold`` object to run our cross-validation.

::

   X = X_sale
   y = y_sale

   mse_train = []
   mse_test = []
   for i_train, i_test in KFold(n_splits=5, shuffle=True).split(X):
       print('.', end='')  # print progress: one dot per loop
       X_train, X_test = X.iloc[i_train], X.iloc[i_test]
       y_train, y_test = y.iloc[i_train], y.iloc[i_test]
       model = LinearRegression()
       model.fit(X_train, y_train)
       mse_train += [mean_squared_error(y_train, model.predict(X_train))]
       mse_test += [mean_squared_error(y_test, model.predict(X_test))]

   print('Avg MSE (within-sample):', round(np.mean(mse_train), 4))
   print('Avg MSE (out-of-sample):', round(np.mean(mse_test), 4))

First, we assign the dataset we want to use to ``X`` and ``y``. The reason for doing so is that it allows us to switch out datasets quickly (e.g. ``X_sale2``, ``X_sale3``, ``X_forest``) without having to edit multiple lines.
Careful: always make sure ``X`` and ``y`` have the same order and index of
observations!

We also create two empty lists (``mse_train`` and ``mse_test``) in which we
store the MSEs for within-sample and out-of-sample predictions, respectively.

We then create a ``KFold`` object, tell it how many folds we want to use
(``5``), and indicate that we want to ``shuffle`` the dataset before
splitting it. This **shuffling is important** if you have a dataset that is
ordered in ways that introduce systematic differences between parts of the
dataset. Here, the dataset is ordered by Massachusetts counties. If we were
to fit a model on training data from some counties and then tried to predict
the outcomes in a test sample that is mostly from another county, prediction
errors would be larger than if we fit the model to a random sample and use it
to predict another random sample.

The ``KFold.split(X)`` function returns an iterator. For each iteration in
the ``for`` loop, this iterator returns a tuple. The first element of that
tuple is an array of the **row numbers** (not the index!) of the training
sample (here assigned to ``i_train``). The second element of the tuple is an
array of the row numbers of the test sample (here assigned to ``i_test``).
Using those arrays, we can obtain the training and test samples for each
iteration with ``X.iloc[...]`` and ``y.iloc[...]``.

``np.mean`` returns the average of the lists of MSEs (which is fine to use
here as a measure of overall performance, as each fold is approximately the
same size).

How large is the mean MSE for out-of-sample predictions of this simple
regression compared to the MSE of the within-sample prediction? The two
should be quite similar.

What MSEs do you get when using the datasets ``X_sale2`` and ``X_sale3``? You
will notice that ``X_sale3`` needs a few seconds to run the cross-validation.
Recall that our cross-validation fits 5 separate models.
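To see concretely what ``KFold.split`` yields, here is a tiny self-contained
example with five observations, so each of the five test folds contains
exactly one row number:

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(5, 2)  # 5 observations, 2 features

# Each iteration yields a (train_rows, test_rows) tuple of positional
# row numbers (suitable for .iloc), not index labels.
splits = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X_demo))
for i_train, i_test in splits:
    print('train rows:', i_train, ' test row:', i_test)

# Every observation lands in exactly one test fold.
all_test_rows = sorted(int(t) for _, test in splits for t in test)
print(all_test_rows)  # [0, 1, 2, 3, 4]
```

``random_state=0`` is used here only to make the toy output reproducible; the
lab code leaves the shuffling random.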
Some of the machine learning models we will use later are even slower to fit.
If you want to monitor the progress of your ``for`` loop while it is running,
you can add a ``print`` statement at its beginning, e.g.:

::

    print('.', end='')

Notice how the differences between the within-sample and out-of-sample
predictions tend to increase as you add variables. This is a typical pattern
of overfitting.

Create a :file:`.csv` file to start documenting the average out-of-sample MSE
you get for each model. Give the three models the same names that you used
above (**ols, ols2, ols3**).

:file:`~/ee508/reports/lab3/1_mean_prediction_error_out_of_sample.csv`

*This is a different CSV file with a different name than the previous one!*

Predict with tree ensembles
---------------------------

Let's see whether we can beat the predictive power of our linear regressions
with machine learning algorithms. Ensemble methods tend to be good at fitting
models to non-linear and irregular patterns. We will test three of them:
gradient boosting, random forests, and extremely randomized trees.

Switching the predictive model in ``sklearn`` is easy: we only need to swap
out the model object. So far, we used a linear regression model
(``model = LinearRegression()``). Switch the model to a gradient boosting
regressor:

::

    model = GradientBoostingRegressor()

Using ``X_sale`` as the predictor dataset, run the cross-validation. This
will take a little longer than the regressions. Do you get a better
out-of-sample MSE? Note the value in your CSV (with the model name **gb**).

Switch to a random forest. Let's start with 20 trees. Run the
cross-validation:

::

    model = RandomForestRegressor(n_estimators=20)

This should take about the same time as the boosted regression. Do you get a
better out-of-sample MSE? Note it in your CSV (with the model name **rf**).

Note the differences between the within-sample and out-of-sample prediction
MSE.
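Since only the ``model = ...`` line changes between runs, it can be
convenient to wrap the cross-validation in a small helper function that takes
the model as an argument. A sketch with synthetic stand-in data (the helper
name ``cv_mse`` is ours, not part of the lab code):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cv_mse(model, X, y, n_splits=5):
    """Average out-of-sample MSE of `model` over a k-fold cross-validation."""
    mse_test = []
    for i_train, i_test in KFold(n_splits=n_splits, shuffle=True).split(X):
        model.fit(X[i_train], y[i_train])
        mse_test.append(mean_squared_error(y[i_test], model.predict(X[i_test])))
    return float(np.mean(mse_test))

# Synthetic stand-in data with a non-linear signal the linear model misses.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.1, size=300)

print('linear: ', round(cv_mse(LinearRegression(), X, y), 3))
print('boosted:', round(cv_mse(GradientBoostingRegressor(), X, y), 3))
```

On the lab data, you would call the helper with ``X_sale.values`` and
``y_sale.values`` (or adapt the indexing to ``.iloc``) and simply pass a
different model object for each row of your CSV.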
Random forests and extremely randomized trees show large discrepancies
between the two. This is because predictions from tree-based ensembles are
derived directly from the y-values (prices) of the observations that fall
into the final bins (leaves) of each individual decision tree. If we use
decision trees to predict the values of their own training data, the
resulting predictions will mostly be based on those observations' own values!
That is not really a "prediction", but a self-reference. Reporting the
within-sample prediction error for a random forest or extremely randomized
trees algorithm is therefore misleading. What counts is the out-of-sample
prediction error.

In this case, we observe that the out-of-sample prediction error is already
better for a simple random forest than for our most "advanced" linear
regression.

Try the extremely randomized trees regressor. Note its out-of-sample MSE in
the CSV with the model name **et**.

::

    model = ExtraTreesRegressor(n_estimators=50)

All three ensemble methods we have used allow you to inspect the relative
importance of the predictor variables (called "features" in machine
learning). The higher the feature importance, the more important the feature
is for predicting the outcome. More specifically, a feature gets a higher
importance value if its splits separate high from low outcome values (prices)
more cleanly and for a larger share of observations.

::

    feat_imp = pd.Series(model.feature_importances_, index=X.columns)
    feat_imp.sort_values().plot(kind='barh')

Do the residuals of our models exhibit any spatial pattern?
Let's find out: If we were to follow the same logic for computing and
plotting residuals as in the linear regression examples, we could compute and
map the difference between the model predictions and the observed values as
follows:

::

    y_pred = pd.Series(model.predict(X), index=X.index)
    resid = (y - y_pred).rename('resid')

    fig, ax = plt.subplots(figsize=(9, 4.5))
    parcel_centroids.join(resid, how='inner').plot(
        'resid', ax=ax, cmap='Spectral', markersize='ms', linewidth=0,
        scheme='quantiles', legend=True,
        legend_kwds={'loc': 'upper left', 'bbox_to_anchor': (1, 1)},
    )

Note the first line: ``model.predict()`` returns a numpy array. Because numpy
arrays don't have indices, we have to cast it into a ``Series`` to give it an
index. Do you see how this is similar to what we did with the outputs of
``zonal_stats``?

However, as we have just learned, computing the residuals in this way is
misleading, especially for random forests and extremely randomized trees. To
get a sense of how the prediction error is *actually* distributed, we should
only use predictions from models that have not seen the observations whose
values are to be predicted. There are a few ways we could achieve that.

- One, we could use a cross-validation approach as above, using only part of
  the training data to predict the remainder.
- Two, for random forests and extremely randomized trees, there is an easier
  option: out-of-bag estimates.

Out-of-bag (OOB) estimates are predictions that, for each observation, use
only the decision trees that did not see that observation during training. In
sklearn, you can compute OOB estimates by passing the arguments
``bootstrap=True`` and ``oob_score=True`` to the ``ExtraTreesRegressor`` or
``RandomForestRegressor``.
::

    model = ExtraTreesRegressor(n_estimators=50, bootstrap=True, oob_score=True)
    model.fit(X, y)

    y_pred = pd.Series(model.oob_prediction_, index=X.index)
    resid = (y - y_pred).rename('resid')

    fig, ax = plt.subplots(figsize=(9, 4.5))
    parcel_centroids.join(resid, how='inner').plot(
        'resid', ax=ax, cmap='Spectral', markersize='ms', linewidth=0,
        scheme='quantiles', legend=True,
        legend_kwds={'loc': 'upper left', 'bbox_to_anchor': (1, 1)},
    )

Examine the legend in detail: do you notice how its range differs between the
previous map and this one?

Tweak your models
-----------------

To incorporate differences across space that are not captured by our other
variables, we can add the ``x`` and ``y`` coordinates of each parcel to
``X_sale``. Create a dataframe with the ``x`` and ``y`` coordinates of the
centroids:

::

    parcel_centroids['x'] = parcel_centroids['geometry'].x
    parcel_centroids['y'] = parcel_centroids['geometry'].y
    parcel_xy = parcel_centroids[['x', 'y']]

Run the cross-validation with the three above methods and an input dataset
that contains the columns from both ``X_sale`` and ``parcel_xy``:

::

    X = X_sale.join(parcel_xy)

How much does the average out-of-sample MSE improve? Note the three new
results in your CSV file (under the names **gb_xy**, **rf_xy**, and
**et_xy**).

All three ensemble-based methods have **hyperparameters** that you can tune
to improve the predictive power of your model. An important one is the number
of estimators used to fit the model (``n_estimators``). In boosted
regressions, ``n_estimators`` refers to the number of boosting stages. In
random forests and extremely randomized trees, it refers to the number of
trees.

Increase the number of boosting stages in your boosted regression from 100 to
1000. What average out-of-sample MSE do you get? Note it in the CSV
(**gb_xy_1000**).

::

    model = GradientBoostingRegressor(n_estimators=1000)

Increase the number of trees in your random forests and extremely randomized
trees from 20 and 50 to 100.
What average out-of-sample MSE do you get? Note them in the CSV
(**rf_xy_100, et_xy_100**).

::

    model = RandomForestRegressor(n_estimators=100, oob_score=True)

    model = ExtraTreesRegressor(
        n_estimators=100, bootstrap=True, oob_score=True
    )

Note: ``bootstrap=True`` is already the default for
``RandomForestRegressor``, so you don't have to pass this argument
explicitly.

Among the models we have examined so far, extremely randomized trees with a
high number of estimators give us the lowest out-of-sample MSE for a given
amount of processing power. For the rest of this section, we will work with
an **extremely randomized trees regressor with 250 estimators**. Go ahead and
compute the out-of-sample MSE of this model using the above k-fold
cross-validation method. Note it in the CSV (**et_xy_250**).

Model fitting will take some time. If you want to speed up the process, you
can add the argument ``n_jobs=-1`` to your model. This tells Python to use
all available CPUs for fitting the model. It will speed up the model fitting,
but might slow down other programs.

You have the opportunity to improve the predictions further for extra credit
(see below).

Save your model (optional)
--------------------------

You will note that models with more estimators take longer to fit. If you do
not want to fit the same model more than once, you can save time by saving
fitted models from your computer's memory to your drive, using the Python
module ``pickle``.

To save any Python object to your hard drive, use ``pickle.dump``:

::

    import pickle

    pickle.dump(model, open(DIR_MODELS / 'model_name.pkl', 'wb'))

To load the saved model afterwards, use ``pickle.load``:

::

    model = pickle.load(open(DIR_MODELS / 'model_name.pkl', 'rb'))

.. note::

   The space usage of ``scikit-learn`` models varies tremendously. Linear
   regressions are tiny. However, random forests and extremely randomized
   trees can easily need a few GB. Still, loading them is often faster than
   fitting them.
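As a side note, the ``scikit-learn`` documentation suggests ``joblib`` (which
is installed alongside ``scikit-learn``) as an alternative to plain
``pickle`` for model persistence, since it handles the large numpy arrays
inside tree ensembles more efficiently. A minimal sketch; the temporary
folder and file name here are ours, whereas in the lab you would write to
``DIR_MODELS`` as above:

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.linear_model import LinearRegression

# A tiny fitted model to stand in for the lab's tree ensemble.
model = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

# In the lab, this would be DIR_MODELS / 'model_name.joblib'.
path = Path(tempfile.mkdtemp()) / 'model_name.joblib'

joblib.dump(model, path)      # save the fitted model to disk
reloaded = joblib.load(path)  # load it back later

print(reloaded.predict([[3.0]]))  # same predictions as the original model
```

The interface mirrors ``pickle.dump``/``pickle.load``, so either approach
works for this lab.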
Fit a full model
----------------

Once we are satisfied with the choice of our predictive model, we can create
predictions of land value for the entire state of Massachusetts.

Find the largest sample of parcels for which we can *predict* land cost. It
is the subset of all unprotected parcels for which all covariates (minus the
year of sale) exist. We also need to set a year for which we would like to
make the prediction (2010 here).

::

    xcols_sale_pred = [v for v in XCOLS_SALE if v != 'lastsale_year']
    is_sale_pred = (
        parcels[xcols_sale_pred].isnull().sum(1).eq(0)
        & parcels['perc_prot'].le(5)
    )
    X_sale_pred = parcels.loc[is_sale_pred, xcols_sale_pred]
    X_sale_pred['lastsale_year'] = 2010

We then fit our favorite model to the full sale data sample
(``X_sale.join(parcel_xy)`` and ``y_sale``) and predict land conservation
cost for all unencumbered parcels for which we have data.

::

    model = ExtraTreesRegressor(
        n_estimators=250, n_jobs=-1, bootstrap=True, oob_score=True
    )
    model.fit(X_sale.join(parcel_xy), y_sale)
    y_pred = pd.Series(
        model.predict(X_sale_pred.join(parcel_xy)), index=X_sale_pred.index
    )

Go ahead and create maps of predicted values and residuals for the best
extremely randomized trees regressor we have run so far (call it
:file:`et_xy_250`), using the same syntax as before. Add the outline of
Massachusetts, add a title, and remove the box. Use the ``'Spectral_r'``
colormap. Save the files as follows (use the analogous name, e.g.
:file:`1_sale_rf_xy_100_pred.png`, if you map a different model):

| :file:`~/ee508/reports/lab3/1_sale_et_xy_250_pred.png`
| :file:`~/ee508/reports/lab3/1_sale_et_xy_250_resid.png`

For the map of residuals, make sure to compute them from the differences
between actual values and OOB predictions (not within-sample predictions).
Remember why?

For the map of predictions, make sure you include all 177,277 parcels in the
map (not only the ones that sold). To keep things simple, you don't need to
use OOB predictions for the parcels that were also part of the training data,
i.e.
you can use the full set of predictions from ``model.predict()``. If you
wanted to create a map of actual predictions, you would have to identify the
training observations and replace their predictions from ``model.predict()``
with OOB predictions.

If you want to create fancier-looking maps, you can use ``parcels``
(polygons) instead of ``parcel_centroids`` (points) to plot the values. This
will look better, but will also take about 4 times as long. In the interest
of students with slower machines, there is no extra credit for this.

Don't forget to save your :file:`.csv` file with the mean out-of-sample MSEs.
It should now contain values for at least 13 models.

:file:`~/ee508/reports/lab3/1_mean_prediction_error_out_of_sample.csv`

Improve your predictions (optional)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want, you can try to improve the predictions further. If you find a
model that consistently reduces the out-of-sample MSE for the sales price
prediction, you can gain up to 10% extra credit on this lab. Some rules
apply:

- You have to use the same observations (rows) of the original dataset and
  predict the same outcome variable.
- You can add any additional predictor variables to the set of predictors.
  The predictors have to be based on real data (i.e., they cannot be
  artificially generated by you to improve fit), and, of course, they cannot
  include the outcome. The dataset already has a few variables we haven't
  used yet. If you want, you can create your own (e.g., using zonal
  statistics).
- You can use any predictive model, including those we haven't used yet. You
  can often improve predictions by modifying the hyperparameters of the
  machine learning methods. Look up the ``scikit-learn`` help pages for
  ``GradientBoostingRegressor``, ``RandomForestRegressor``, and
  ``ExtraTreesRegressor`` for more information.
- The model has to be able to be fit to the data within a reasonable time
  frame, so we do not give an undue advantage to students with more powerful
  machines.
While it is hard to enforce a hard limit due to differences between systems,
there will be a small penalty if your model exceeds the processing time of
our reference model (extremely randomized trees with 250 estimators, using
``X_sale`` with coordinates) by more than a factor of 5.

To submit your model, submit (1) the code to run the model, (2) the column
names of the predictors (if you changed the set of predictors), (3) the mean
out-of-sample MSE, and, optionally, (4) the data of your predictors (only if
you created a new predictor). Put everything in the subfolder:

:file:`~/ee508/reports/lab3/best_model/`

I will test the predictive power of all submitted models using the mean
out-of-sample MSE from a 5-fold cross-validation, with 5 repetitions. The
winner gets an extra credit of 10%. Everyone else will get extra credit of up
to 10% in proportion to how close they got to the best model, as compared to
the reference model:
10% × (MSE\ :sub:`ref` – MSE\ :sub:`yours`) / (MSE\ :sub:`ref` – MSE\ :sub:`best`).

Predict forest change
~~~~~~~~~~~~~~~~~~~~~

We will use similar machine learning methods to predict rates of forest
change across Massachusetts.

Measures of parcel-level forest change are usually neither normally nor
binomially distributed. Instead, the majority of observations will be zeroes
(no change), whereas the remainder will be a continuous measure (% of the
parcel). For the purpose of this lab, we will not look in depth into these
statistical issues. Instead, we will treat forest change as a continuous
variable. We can still interpret the resulting estimates as expected rates of
forest change (probability × amount).

Using the forest change dataset we created earlier (``X_forest`` and
``y_forest``) and an extremely randomized trees regressor with at least 250
estimators, create a map of predicted forest cover change for all unprotected
parcels in Massachusetts (``perc_prot`` ≤ :input:`5`) for which you have
complete data.
In this case, all parcels are part of the training sample, so make sure to
use only OOB predictions for the map! Save the map as:

:file:`~/ee508/reports/lab3/1_forestchange_et_pred.png`

If you want, feel free to change the covariates and tweak the regression to
improve the out-of-sample prediction error. You don't need to do so, as we
will use pre-computed predicted values in the next exercise. However,
practicing how to fit good predictive models is always useful for building an
intuition for the relationships in the data.

Wrap up
~~~~~~~

Submit to the Blackboard Google Assignment a zipped archive of:

:file:`~/ee508/reports/lab3/`

It should contain the following files:

| :file:`1_mean_prediction_error_out_of_sample.csv`
| :file:`1_mean_prediction_error_within_sample.csv`
| :file:`1_forestchange_et_pred.png`
| :file:`1_sale_ols3_pred.png`
| :file:`1_sale_ols3_resid.png`
| :file:`1_sale_et_xy_250_pred.png`
| :file:`1_sale_et_xy_250_resid.png`

In :ref:`Lab 3-2 `, we will use the estimated cost and forest loss risk data
to run targeting simulations.