3.1. Predicting land acquisition cost and forest change across Massachusetts

In this exercise, we will use multivariate regressions and machine learning models (tree ensembles) to estimate the parcel-level opportunity cost of land protection (proxied by private value) and the probability of forest loss (based on historical patterns) across hundreds of thousands of parcels in Massachusetts, United States.

Deliverables

  • Estimation of land values:

    • A table of within-sample mean squared errors (MSEs) for three linear regressions.

    • Maps of predicted values and residuals of a linear regression model with quadratic terms & interactions (2 maps).

    • A table of the average out-of-sample mean squared errors (MSEs) for all statistical models covered in this exercise (5-fold random cross-validation).

    • Maps of (a) predicted values and (b) residuals of the best-performing extremely randomized trees model (2 maps).

  • A map of predicted forest cover change

  • Optional (extra credit): a folder containing information on a better predictive model for land acquisition cost (code, cross-validated MSE, data).

Due date

Friday, Oct 31, 2025, by 6:00 p.m.

3.1.1. Understand how data on costs and threat can help with policy targeting

The analysis we conduct in this lab is a variation of a study published by Newburn et al. about Sonoma County, California, a renowned vineyard region in the U.S.

Newburn et al. (2006) Habitat and Open Space at Risk of Land-Use Conversion: Targeting Strategies for Land Conservation. American Journal of Agricultural Economics

This is the direct link to the journal article at JSTOR. You can log in with your Boston University credentials (search for Boston University, then use your Kerberos password).

David A. Newburn, associate professor in environmental and resource economics at the University of Maryland, developed this analysis as a PhD candidate at UC Berkeley. We will build a very similar analysis in this lab, moving from the wine-growing countryside near San Francisco to the forests of Massachusetts.

Read the essential chapters of Newburn et al.’s analysis. I expect you to have understood the following sections to do well on writeups and tests:

Chapter

Instructions

Introduction

A great overview of the funding landscape for locally-funded conservation action in the United States.

Modeling Framework for Prioritizing Conservation Easements

Read only as far as it is accessible to you.

I won’t ask you to digest, reproduce, or explain the algebra (pages 30-33), although I recommend it if you have the resources.

Note that we’ll simplify the problem: we’ll use a single budget year.

Empirical Procedure

  • Research Study Area

  • Environmental Benefits Index

Introduction to Sonoma and the (simple) definition of the objective function

  • Land-Use Change Model

  • Valuation of Development Rights Model

For the optimization to work, we need to estimate land cover change and conservation cost. These two chapters introduce the approach.

You don’t need to understand every piece.

But study Table 1 and Table 2 to get an idea of the parcel-level data the authors argue is needed and available for modeling the likelihood of land cover change and the cost of halting it.

Results and Discussion on Targeting Simulations

Figure 1 and Table 5 contain the essential findings.

You can read them now, or revisit the reading when we start Lab 3-2.

Continue reading when you have studied Table 1 and Table 2 and have pondered the question of how you would produce such a dataset from scratch.

In this first part of Lab 3, we start with the same problem: we need to estimate conservation cost (the value of partial property rights) and the likelihood of loss (land cover conversion). We’ll use different data, methods, and assumptions, but the structure of the problem and solution process is very similar.

3.1.2. Meet the modeling packages

We will explore statistical models with statsmodels and scikit-learn.

statsmodels and scikit-learn are among the most popular Python packages for fitting statistical models to data. Both include basic models, such as linear or logistic regressions. They differ in philosophy, usage, and the overall suite of models they support. Learning to work with both allows you to build on their respective strengths.

  • statsmodels follows the classical regression framework for defining, analyzing, and selecting statistical models.

    It is best for users who are primarily interested in hypothesis testing, causal inference, and explanation.

    statsmodels makes it easy to test for the significance of relationships between independent and dependent variables. Its usage is similar to R, Stata, SPSS, and similar programs: upon model fitting, the package reports results from a range of statistical tests that can assist with model assessment and selection, including measures of model fit (R2, AIC, etc.) and significance tests of parameter estimates.

    Besides linear regression, its suite of models includes logistic regressions, quantile regressions, survival analysis, and others.

  • scikit-learn is one of the most popular Python libraries for predictive modeling using simple machine learning methods.

    Its design places an emphasis on predictive accuracy rather than on statistical inference.

    The central model selection criteria are the accuracy and precision with which a model trained on a subset of the data (training data) can predict data that was not used to fit the model (test data) – a technique called cross-validation.

    scikit-learn has a very active support community, and offers a range of modern machine learning methods, including random forests, extremely randomized trees, boosted regressions, support vector machines, and simple neural networks.

Neither statsmodels nor scikit-learn provides strong support for deep learning methods (e.g. convolutional neural networks). If you want to use these, look into tensorflow or pytorch.

You are welcome to throw new models at the problem as part of your deliverables. Your ee508 environment already has xgboost, lightgbm and catboost installed.

3.1.3. Examine the parcel data

3.1.3.1. Get the data

You will need two vector data files and a metadata file for this project:

USMA_state_outline.parquet
USMA_parcels_places-2019.parquet
places_metadata.xlsx

You can find all three in:

3.1.3.2. Load the parcels

Start a new Jupyter notebook. Import numpy, pandas, geopandas, and matplotlib.pyplot to their usual aliases. Set up your directory structure and paths.

Use gpd.read_parquet() to read USMA_state_outline.parquet into a GeoDataFrame you call state_outline. Plot it. Does it look familiar?

Read USMA_parcels_places-2019.parquet into a GeoDataFrame you call parcels.

Tip

The vector data in this lab is provided in the GeoParquet format (.parquet file extension).

It’s great for storing large GeoDataFrame data. Compared to other formats, it’s incredibly fast for reading and writing large vector data volumes. It keeps the indices. It often compresses well, too.

See for yourself: after importing the parcels, use .to_file() to write them to a GeoPackage (.gpkg). Read it back into memory with gpd.read_file() and compare speeds. You won’t need any further persuasion.

Not all programs support it. QGIS 3.44.2 supports it on Windows, but on Mac, you need to use conda to install a separate, Geoparquet-compatible QGIS installation (see instructions on GeoParquet).

This parcel dataset is an extract from PLACES, a parcel-level data processing system designed to expedite the synthesis of datasets for land conservation in the United States. It is written in Python using the same packages you use in this class.

The dataset includes all parcels in Massachusetts that were larger than one hectare and could be linked up to tract-level demographic data on income and population density (nobs = 218,716).

Look up the description of the meaning, source, and computation of each variable in places_metadata.xlsx.

The underlying vector data, including the last sale data and the parcel ID ('parcel_id'), are sourced from the Standardized Assessor’s Parcels provided by MassGIS. The 'parcel_id' is unique for all parcels in Massachusetts.

3.1.3.3. Speed up the visualization with centroids

We will frequently make maps of values of new variables (e.g., predicted values) for tens of thousands of parcels. Because the vector polygon data is large (195MB compressed), making maps of parcel polygons with plot will be slow (~20 sec. on my machine).

A faster workaround is to plot only the parcel centroids as points, and scale them by the size of the parcel: this reduces plotting time by a factor of four.

  • Create a copy of parcels that contains only the 'geometry' column and call it parcel_centroids. Then set its geometry to the centroid of the polygons:

    parcel_centroids = parcels[['geometry']].copy()
    parcel_centroids['geometry'] = parcel_centroids['geometry'].centroid

    Keeping only the geometry avoids column-name clashes later: join raises an error when both DataFrames contain a column of the same name.
    
  • Join a column (e.g. 'slope') from parcels as a Series to parcel_centroids and plot its values.

    Recall that you can do this “on-the-fly”:

    parcel_centroids.join(parcels['slope']).plot('slope', linewidth=0)
    

    Joining parcels to parcel_centroids “on the fly” (i.e., right before plotting) allows you to always use the most recent data in parcels. This will become useful later when we plot and compare model results.

You can scale the point size in proportion to the size of the parcels by calling plot with a markersize argument that receives either:

  • a column name from the GeoDataFrame (e.g., markersize='ha') or

  • a numeric Series with the same length as the number of rows in your GeoDataFrame (e.g. markersize=parcels['ha']).

It will also be handy to have a few additional columns that store the marker size for each parcel (so you can pass markersize='ms' or markersize='ms_large' as an argument to plot()).

parcel_centroids['ms'] = parcels['ha'] / 25
parcel_centroids['ms_large'] = parcels['ha'] / 10

Pass these arguments to the plot function for all centroid plots in this lab:

  • markersize='ms' for maps of all parcels

  • markersize='ms_large' for maps that contain only a selection of parcels, e.g. parcels selected into a program (to see smaller parcels better).

  • linewidth=0 to remove the boundary lines, which confound the visual impression of the area-based mapping.

  • alpha=0.5 (or similar) to reduce the opacity of the points to 50%. This will make the points partially transparent. This is useful if you have overlapping points and wish to see all of them.

If any of your plots of a continuous variable does not show much heterogeneity due to the presence of extreme max/min values, recall that you can use the scheme argument to ask geopandas to add a classification system for the color mapping (e.g. scheme='quantiles').

3.1.3.4. Create filters and features

Before we jump into the modeling process, we have to pre-compute a few variables that will be needed in all of the following models:

  • The outcome variable for our deforestation risk estimation will be the change in the percentage of parcel-level forest cover between 1990 and 2010:

Reveal solution
parcels['perc_forest_change'] = (
    parcels['perc_forest_2010'] - parcels['perc_forest_1990']
)
  • The outcome variable for our cost estimation will be the log of the per-hectare sale price of the last sale of the property ('lastsale_price').

    Using logged prices is very common in property value prediction: similar to individual wealth and income distributions, the distribution of property prices tends to be very skewed. Taking the logarithm brings us closer to a common regression assumption: that errors are normally distributed.

Reveal solution
parcels['lastsale_price_per_ha_log'] = np.log(
    parcels['lastsale_price'] / parcels['ha']
)
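A quick synthetic check of this point (simulated lognormal “prices”, not the lab data):

```python
import numpy as np
import pandas as pd

# Simulated lognormal "prices": heavily right-skewed, like real sale prices
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(12, 1, size=5000)))

print(round(prices.skew(), 2))          # strongly positive skew
print(round(np.log(prices).skew(), 2))  # close to zero after the log transform
```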
  • We will need the year of the last sale to identify recent sales and to include the year among the predictors.

    The following code transforms a Series with dates in string format to a date format, and then uses the dt (short for date) accessor to obtain its year:

    parcels['lastsale_year'] = pd.to_datetime(parcels['lastsale_date']).dt.year
    
  • We also need to drop properties that had an easement when they sold, because we want our estimates to reflect the full value of an unencumbered property. We can achieve this by creating a new variable that gives us the percentage of easement ('perc_easement') on a property at the time of its last sale (_sale). This also means that we assume that, before the year of the easement creation ('easement_year'), there was no easement on the property:

Reveal solution
parcels['perc_easement_sale'] = np.where(
    parcels['easement_year'] <= parcels['lastsale_year'],
    parcels['perc_easement'],
    0
)
  • Note the use of np.where(condition, x, y), which is similar to ifelse in R. For each element, if condition is True, it returns the value of x (or the corresponding element in x, if x is a Series), otherwise it returns the value of y (or its corresponding element in y, if y is a Series).
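A tiny self-contained illustration of np.where with toy values (not the lab data):

```python
import numpy as np
import pandas as pd

easement_year = pd.Series([1995, 2010, np.nan])
lastsale_year = pd.Series([2000, 2000, 2000])
perc_easement = pd.Series([40.0, 60.0, 80.0])

# The easement percentage counts only if the easement existed at sale time;
# a NaN easement year compares as False, so those parcels get 0
result = np.where(easement_year <= lastsale_year, perc_easement, 0)
print(result)  # [40.  0.  0.]
```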

  • Finally, let’s normalize the building area to the parcel size (building area per parcel area - akin to building density).

Reveal solution
parcels['bldg_area_per_ha'] = parcels['bldg_area_m2'] / parcels['ha']

The next step is to build a suite of datasets that only contain the data that we’ll use in the statistical estimation (dropping everything else).

We need one dataset to predict last sale prices (for which we will use the suffix _sale) and another one to estimate the probability of forest change (for which we will use the suffix _forest).

For each dataset, we will need

  • a DataFrame, X (uppercase by convention), that contains all predictor variables (with suffixes: X_sale or X_forest), and

  • a Series, y (lowercase by convention), that contains the outcome (dependent) variable (with suffixes: y_sale or y_forest).

We will use different sets of predictor variables for the sale price and the forest change model. This is for three reasons:

  • We’ll end up using different time periods for each predictive model. Forest change is a phenomenon that happens rarely, so we choose as long a time period as possible (20 years) to get a sense of patterns. However, sales data is more reliable for recent time periods, and errors increase as you go back in time; we will therefore use sales data starting in 2000. To the extent possible, the predictor variables will have to be from the beginning of the respective time period.

  • We will use year-2011 land cover classes and recent building data in the sales price model, as they provide a proxy for land use (agricultural, forest, development, etc.). However, we won’t use them in the forest change model, as this could confound the outcome prediction (i.e., we would risk predicting 1990-2010 forest change by its 2011 outcome).

  • The year of sale is only relevant in the land price model.

XCOLS_SALE = [
    'slope',
    'travel',
    'perc_wetland',
    'coast_2500',
    'river_frontage',
    'lake_frontage',
    'pop_dens_tract_2012-2016',
    'hh_inc_avg_tract_2012-2016',
    'perc_forest_2000',
    'perc_prot_2000_500',
    'perc_dev_2000_200',
    'perc_dev_2000_1000',
    'perc_dev_2000_5000',
    'perc_dev_open',
    'perc_dev_low',
    'perc_dev_medium',
    'perc_dev_high',
    'perc_crops',
    'perc_pasture',
    'perc_grassland',
    'bldg_area_per_ha',
    'lastsale_year',
]

XCOLS_FOREST = [
    'slope',
    'travel',
    'perc_wetland',
    'coast_2500',
    'river_frontage',
    'lake_frontage',
    'pop_dens_tract_2012-2016',
    'hh_inc_avg_tract_2012-2016',
    'perc_forest_1990',
    'perc_prot_1990_500',
    'perc_dev_1990_200',
    'perc_dev_1990_1000',
    'perc_dev_1990_5000',
]

The sale price prediction dataset should only contain parcels:

  1. that have no missing values in the predictors (XCOLS_SALE) and outcomes

  2. that were sold between 2000 and 2019 (inclusive of both years), and

  3. that were not protected by an easement at the time they were sold.

Go ahead and identify those rows. There should be 44,424 of them. Once you have identified them, use parcels.loc[..., ...].copy() to create X_sale and y_sale.

One way of coding up a multi-criteria selection is to build a boolean Series - let’s call it is_selected_sale - that is True for selected parcels with a last sale. You can then use that Series to select rows for both datasets (X and y) - and check how many items will be selected by taking its sum: is_selected_sale.sum().

Reveal solution
is_selected_sale = (
    parcels[XCOLS_SALE].isnull().sum(axis=1).eq(0)
    & parcels['lastsale_year'].ge(2000)
    # ... add the other criteria ...
)
X_sale = parcels.loc[is_selected_sale, XCOLS_SALE].copy()
y_sale = parcels.loc[is_selected_sale, 'lastsale_price_per_ha_log'].copy()

Now that X_sale no longer has missing values in 'lastsale_year', let’s turn its values from floats into integers. This will look better in future steps (e.g. when using year dummies).

Reveal solution
X_sale['lastsale_year'] = X_sale['lastsale_year'].astype(int)

When you’re done, you should have a DataFrame X_sale that has 22 columns and 44,424 rows, and a Series y_sale with 44,424 items.

For the forest change dataset, we want to retain rows:

  1. that have no missing values in the predictors

  2. that were never protected (because protection is expected to change the default forest change path).

We’ll implement the latter condition by selecting parcels whose 'perc_prot' value is smaller than or equal to 5 (i.e., 5%). This arbitrary threshold allows us to keep parcels that have a very small overlap with neighboring protection polygons (which is, in most instances, due to mapping errors).

177,277 parcels should meet these two criteria. Once you have identified them, use parcels.loc[..., ...].copy() to create X_forest and y_forest. Recall that the y column is 'perc_forest_change'.

3.1.4. Fit explanatory models with statsmodels

Once you have prepared the dataset, running the regression in statsmodels is easy:

import statsmodels.api as sm
model_sale_ols = sm.OLS(y_sale, sm.add_constant(X_sale)).fit()
model_sale_ols.summary()

Note

sm.add_constant() is statsmodels’ approach to adding a constant intercept to the linear regression. The command adds a column named const with a constant value (1) to the DataFrame.

The output of .summary() is what you would expect from a linear regression in R, Stata, SPSS, or similar statistical packages: you’ll find statistical measures of overall fit (R-squared, adjusted R-squared, AIC), as well as coefficients and significance values for the constant intercept and parameter estimates.

You can obtain the predicted (fitted) values as a Series with model_sale_ols.fittedvalues and the residuals with model_sale_ols.resid. Note that both Series already have the correct index (the 'parcel_id'). This makes it easy to join them to your parcel data and centroids.

Create a map of the predicted values.

Reveal solution
parcels_to_plot = parcel_centroids.join(
    model_sale_ols.fittedvalues.rename('pred'), how='inner'
)

fig, ax = plt.subplots(figsize=(9, 4.5))
state_outline.boundary.plot(ax=ax, color='black', linewidth=0.2)
parcels_to_plot.plot(
    'pred',
    ax=ax,
    markersize='ms_large',
    alpha=0.5,
    cmap='RdYlBu_r',
    scheme='quantiles',
    k=10,
    linewidth=0,
    legend=True,
    legend_kwds={'loc': 'upper left', 'bbox_to_anchor': (1, 1)},
)
ax.axis('off')
  • You can only join a Series to a DataFrame if the Series has a name. That’s why we rename() the Series of fitted values (which has no name).

  • The 'inner' join is necessary so that only observations are retained for which we have a predicted value (in this case, the observations we used in the model).

  • I used markersize='ms_large' as we’re not displaying all parcels.

Create a map of the residuals (prediction errors).

Do you observe any spatial patterns in the residuals (clustering of colors), or does this map look perfectly random?

  • If you’re wondering what I mean, find Martha’s Vineyard and Nantucket (the islands).

3.1.4.2. Add variable transformations and interactions

As with any regression, you can add variable transformations (e.g., polynomial terms) and interactions between variables. The easiest way to do so is to create additional columns with the values of those quadratic terms and interactions.

Create a new dataset X_sale3 that includes quadratic terms and interactions for all variables other than the year dummies:

from itertools import combinations

X_sale3 = X_sale.drop(columns='lastsale_year').copy()

new_cols = {}

for v in X_sale3.columns:
    new_cols[v + '^2'] = X_sale3[v] ** 2

for v1, v2 in combinations(X_sale3.columns, 2):
    new_cols[v1 + '*' + v2] = X_sale3[v1] * X_sale3[v2]

X_sale3 = pd.concat(
    [X_sale3, pd.DataFrame(new_cols, index=X_sale3.index)], axis=1
)

print(f'{len(X_sale3.columns)} columns')

Run the regression with the new dataset of explanatory variables X_sale3. Call the model model_sale_ols3. Do you find at least some of the quadratic terms and interactions to be significant? Has model fit improved? How about the spatial patterns of the residuals?

Create a new CSV file (e.g. using Excel) and note your findings:

~/ee508/reports/lab3/1_mean_prediction_error_within_sample.csv

The file should be a simple two-column table: the first column, 'model', contains the name of the model (without the ‘model’ prefix), and the second, 'mean_squared_error', the MSE from that prediction. It is probably quickest to fill in these values manually in Excel; just make sure you save the file in the .csv format (not as an Excel workbook).

model   mean_squared_error
ols     1.980
ols2
ols3
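If you would rather build this CSV with pandas than in Excel, a minimal sketch (the 1.980 value is the example from the table above; the None entries and the output path are placeholders for you to fill in):

```python
import pandas as pd

# Placeholder table -- replace the None entries with your own MSEs,
# and point the output path at your reports folder
results = pd.DataFrame({
    'model': ['ols', 'ols2', 'ols3'],
    'mean_squared_error': [1.980, None, None],
})
results.to_csv('1_mean_prediction_error_within_sample.csv', index=False)
```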

Create maps of the predicted values and residuals of the last model. Use the colormap 'Spectral_r'. Add a legend and a meaningful title to each plot. Save the map as:

~/ee508/reports/lab3/1_sale_ols3_resid.png
~/ee508/reports/lab3/1_sale_ols3_pred.png

3.1.5. Fit predictive models with scikit-learn

We have seen that there are important non-linearities and interactions between several of the predictor variables and sales prices. However, we have not been able to remove the clustering of spatial errors in those predictions. We now transition to scikit-learn, which gives us access to a wider suite of models that might help us improve our predictive performance.

As mentioned, the machine learning perspective of scikit-learn emphasizes predictive power over explainability: models are primarily judged by the accuracy with which they can predict the value of observations that they have not seen yet (i.e. observations that were not used to fit the model). This approach is referred to as “out-of-sample” prediction or “cross-validation”, and needs at least two distinct steps: 1) training (fitting the model on one subset of the data) and 2) testing (measuring its predictive power with another subset of the data).

We have already imported mean_squared_error from scikit-learn above. We also need to import a few predictive models and functions for cross-validation:

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor
)
from sklearn.model_selection import train_test_split, KFold

As with statsmodels, fitting a model is straightforward once you have the datasets ready. The syntax is just slightly different, but consistent across all sklearn models. You first define the model, then you fit it (note that the position of X and y is swapped relative to statsmodels!).

model = LinearRegression()
model.fit(X_sale, y_sale)

Once the model is fitted, the predicted values can be obtained with:

model.predict(X_sale)

And the MSE with:

mean_squared_error(y_sale, model.predict(X_sale))

Try it out with the different predictor datasets X_sale, X_sale2, and X_sale3. Do you get the same MSEs as above?

You should. OLS regressions are deterministic, so results should be identical.

One downside of working with scikit-learn is that there is no summary() function for the OLS regression that allows you to assess model quality for statistical inference. Coefficient estimates can be obtained from the fitted model object (model.coef_, model.intercept_) and R² values with model.score(X, y). However, if you need significance levels of coefficients or other indicators of model fit (e.g. AIC/BIC), you have to compute these yourself. If your interest is statistical inference, it is easier to run the regression in statsmodels instead.

3.1.5.1. Predict out-of-sample

How good is the out-of-sample prediction of our simple linear model? To find out, we need to split our original dataset randomly into training and testing subsets, fit the model on the training sample, and test its predictive accuracy with the test sample. One way to do this is to use sklearn’s function train_test_split, which does the random splitting for you.

X_train, X_test, y_train, y_test = train_test_split(X_sale, y_sale)
model = LinearRegression()
model.fit(X_train, y_train)
mean_squared_error(y_test, model.predict(X_test))

Run this code a few times and look at the error. Does it change?

It should, because each model is based on a different random subset of the dataset.

By default, train_test_split retains 25% of the sample for testing. You can change that by passing a test_size argument (e.g., test_size=0.5).

To get a good sense of the predictive power of a given model, you can nest the above code in a for loop that repeats the out-of-sample MSE computation for a large number of random splits between training and test data. This gives you an average expected MSE, which is a more suitable indicator for comparing the predictive power of different models than the result from a single split.
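Such a loop might look like this (a sketch using synthetic data as a stand-in for X_sale and y_sale; the number of splits is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X_sale / y_sale -- the looping pattern is what matters
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=1000)

mses = []
for _ in range(20):  # 20 random train/test splits
    X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo)
    model = LinearRegression().fit(X_train, y_train)
    mses += [mean_squared_error(y_test, model.predict(X_test))]

print('Average out-of-sample MSE:', round(float(np.mean(mses)), 4))
```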

3.1.5.2. Run a cross-validation

Alternatively, you can use k-fold cross-validation: this approach splits the original dataset into k different “folds” (subsets) of equal size (where k is some integer value). It then estimates k models, where each model is trained on the k-1 folds that remain after leaving one fold out, and then tested on the held-out fold. The resulting MSE values are averaged across all models. This approach guarantees that each observation is used exactly once for validation, which makes the test statistics more robust to outliers.

We will use scikit-learn’s KFold object to run our cross-validation.

X = X_sale
y = y_sale
mse_train = []
mse_test = []
for i_train, i_test in KFold(n_splits=5, shuffle=True).split(X):
    print('.', end='')  # print progress: one dot per loop
    X_train, X_test = X.iloc[i_train], X.iloc[i_test]
    y_train, y_test = y.iloc[i_train], y.iloc[i_test]
    model = LinearRegression()
    model.fit(X_train, y_train)
    mse_train += [mean_squared_error(y_train, model.predict(X_train))]
    mse_test += [mean_squared_error(y_test, model.predict(X_test))]

print('Avg MSE (within-sample):', round(np.mean(mse_train), 4))
print('Avg MSE (out-of-sample):', round(np.mean(mse_test), 4))

First, we assign the dataset we want to use to X and y. The reason for doing so is that it allows us to switch out datasets quickly (e.g. X_sale2, X_sale3, X_forest) without having to edit multiple lines. Careful: always make sure X and y have the same order and index of observations! We also create two empty lists (mse_train and mse_test) in which we store the MSEs for within-sample and out-of-sample predictions, respectively.

We then create a KFold object, tell it how many folds we want to use (5), and indicate that we want to shuffle the dataset before splitting it. This shuffling is important if your dataset is ordered in ways that introduce systematic differences between its parts. Here, the dataset is ordered by Massachusetts county. If we were to fit a model on training data from some counties and then predict the outcomes in a test sample that is mostly from another county, prediction errors would be larger than if we fit the model to a random sample and used it to predict another random sample.

The KFold.split(X) function returns an iterator. For each iteration in the for loop, this iterator returns a tuple. The first element of that tuple is an array of the row numbers (not the index!) for the training sample (here assigned to i_train). The second element of the tuple is an array of the row numbers of the test sample (here assigned to i_test).

Using those arrays, we can obtain the training and test samples for each iteration with X.iloc[...] and y.iloc[...].

np.mean returns the average of the lists of MSEs (which is fine to use here as a measure of overall performance, as each fold is approximately the same size).

How large is the mean MSE for out-of-sample predictions of this simple regression compared to the MSE of the within-sample prediction? The two should be quite similar.

What MSEs do you get when using the datasets X_sale2 and X_sale3?

You will notice that X_sale3 needs a few seconds to run the cross-validation. Recall that our cross-validation fits 5 separate models. Some of the machine learning models we will use later are even slower to fit. If you want to monitor the progress of your for loop while it is running, you can add a print statement to the beginning of the loop body (as the cross-validation loop above already does), e.g.:

print('.', end='')

Notice how the difference between the within-sample and out-of-sample MSEs tends to increase as you add variables. This is a typical pattern of overfitting.
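You can reproduce the pattern on synthetic data: the within-sample MSE can only fall as terms are added, while the out-of-sample MSE typically stagnates or worsens (the degrees, sample size, and seed below are arbitrary choices, not from the lab data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: the true relationship is linear plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(300, 1))
y = x[:, 0] + rng.normal(size=300)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

mse = {}
for degree in (1, 12):
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    mse[degree] = (
        mean_squared_error(y_train, model.predict(poly.transform(x_train))),
        mean_squared_error(y_test, model.predict(poly.transform(x_test))),
    )
    print(degree, [round(m, 3) for m in mse[degree]])  # (within, out-of-sample)
```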

Create a .csv file to start documenting the average out-of-sample MSE you get for each model. Give the three models the same name that you used above (ols, ols2, ols3).

~/ee508/reports/lab3/1_mean_prediction_error_out_of_sample.csv

This is a different CSV file with a different name than the previous one!

3.1.5.3. Predict with tree ensembles

Let’s see whether we can beat the predictive power of our linear regressions with machine learning algorithms. Ensemble methods tend to be good at fitting models to non-linear and irregular patterns. We will test three of them: gradient boosting, random forests, and extremely randomized trees.

Switching the predictive model in sklearn is easy: we only need to swap out the model object; the rest of the code stays the same. So far, we used a linear regression model (model = LinearRegression()).

Switch the model to a gradient boosting regressor.

model = GradientBoostingRegressor()

Using X_sale as the predictor dataset, run the cross-validation. This will take a little longer than the regressions. Do you get a better out-of-sample MSE? Note the value in your CSV (with the model name gb).

Switch to a random forest. Let’s start with 20 trees. Run the cross-validation:

model = RandomForestRegressor(n_estimators=20)

This should take about the same time as the boosted regression. Do you get a better out-of-sample MSE? Note it in your CSV (with the model name rf).

Note the differences between the within-sample and out-of-sample prediction MSE. Random forests and extremely randomized trees have large discrepancies between the two. This is because predictions from tree-based ensembles are directly derived from the y-values (prices) of observations found in the final bins (leaves) of each individual decision tree. If we use decision trees to predict the values of their own training data, the resulting predictions will mostly be based on their own values! That’s not really a “prediction”, more of a self-reference. Reporting the within-sample prediction error for a random forest or extremely randomized tree algorithm is therefore misleading. What counts is the out-of-sample prediction error. In this case, we observe that the out-of-sample prediction error is already better for a simple random forest than for our most “advanced” linear regression.

Try the extremely randomized trees regressor. Note its out-of-sample MSE in the CSV with the model name et.

model = ExtraTreesRegressor(n_estimators=50)

All three ensemble methods we have used allow you to inspect the relative importance of the predictor variables (called "features" in machine learning). The higher the feature importance, the more the feature contributes to predicting the outcome. More specifically, a feature receives a higher importance value if its splits separate high from low outcome values (prices) more strongly and for a larger share of observations.

feat_imp = pd.Series(model.feature_importances_, index=X.columns)
feat_imp.sort_values().plot(kind='barh')

Do the residuals of our models exhibit any spatial pattern? Let’s find out:

If we were to follow the same logic for computing and plotting residuals as in the linear regression examples, we could compute and map the difference between the model predictions and the observed values as follows:

y_pred = pd.Series(model.predict(X), index=X.index)
resid = (y - y_pred).rename('resid')

fig, ax = plt.subplots(figsize=(9, 4.5))
parcel_centroids.join(resid, how='inner').plot(
    'resid',
    ax=ax,
    cmap='Spectral',
    markersize='ms',
    linewidth=0,
    scheme='quantiles',
    legend=True,
    legend_kwds={'loc': 'upper left', 'bbox_to_anchor': (1, 1)},
)

Note the first line: model.predict() returns a numpy array. Because numpy arrays don’t have indices, we have to cast it into a Series to give it an index. Do you see how this is similar to what we did with the outputs of zonal_stats?

However, as we have just learned, computing the residuals in this way is misleading, especially for random forests and extremely randomized trees. To get a sense of how the prediction error is actually distributed, we should only use predictions from models that have not seen the observations being predicted. There are a few ways to achieve that.

  • One, we could use a cross-validation approach as above, using only part of the training data to predict the remainder.

  • Two, for random forests and extremely randomized trees, there is an easier option: out-of-bag (OOB) estimates. An observation's OOB estimate is a prediction that uses only the decision trees that did not see that observation during training. In sklearn, you can compute OOB estimates by passing the arguments bootstrap=True and oob_score=True to the ExtraTreesRegressor or RandomForestRegressor.

model = ExtraTreesRegressor(n_estimators=50, bootstrap=True, oob_score=True)
model.fit(X, y)
y_pred = pd.Series(model.oob_prediction_, index=X.index)
resid = (y - y_pred).rename('resid')

fig, ax = plt.subplots(figsize=(9, 4.5))
parcel_centroids.join(resid, how='inner').plot(
    'resid',
    ax=ax,
    cmap='Spectral',
    markersize='ms',
    linewidth=0,
    scheme='quantiles',
    legend=True,
    legend_kwds={'loc': 'upper left', 'bbox_to_anchor': (1, 1)},
)

Examine the legend in detail: do you notice how its range differs between the previous map and this one?

3.1.5.4. Tweak your models

To incorporate differences across space that are not captured by our other variables, we can add x and y coordinates of each parcel to X_sale. Create a dataframe with the x and y coordinates of the centroids:

parcel_centroids['x'] = parcel_centroids['geometry'].x
parcel_centroids['y'] = parcel_centroids['geometry'].y
parcel_xy = parcel_centroids[['x', 'y']]

Run the cross-validation with the three above methods and an input dataset that contains columns from both X_sale and parcel_xy:

X = X_sale.join(parcel_xy)

How much does the average out-of-sample MSE improve? Note the three new results in your CSV file (under the names gb_xy, rf_xy, and et_xy).

All three ensemble-based methods have hyperparameters that you can tune to improve the predictive power of your model. An important one is the number of estimators used to fit the model (n_estimators). In boosted regressions, n_estimators refers to the number of boosting stages. In random forests and extremely randomized trees, it refers to the number of trees.

Increase the number of boosting stages in your boosted regression from 100 to 1000. What average out-of-sample MSE do you get? Note it in the CSV (gb_xy_1000).

model = GradientBoostingRegressor(n_estimators=1000)

Increase the number of trees in your random forests and extremely randomized trees from 20 and 50 to 100. What average out-of-sample MSE do you get? Note them in the CSV (rf_xy_100, et_xy_100).

model = RandomForestRegressor(n_estimators=100, oob_score=True)

model = ExtraTreesRegressor(
    n_estimators=100, bootstrap=True, oob_score=True
)

Note: bootstrap=True is already the default for RandomForestRegressor, so you don’t have to pass this argument explicitly.

Among the models we have examined so far, extremely randomized trees with a high number of estimators give us the lowest out-of-sample MSE for a given amount of processing power. For the rest of this section, we will work with an extremely randomized trees regressor with 250 estimators.

Go ahead and compute the out-of-sample MSE of this model using the above k-fold cross-validation method. Note it in the CSV (et_xy_250).

Model fitting will take some time. If you want to speed up the process, you can add the argument n_jobs=-1 to your model. This tells Python to use all available CPUs for fitting the model. It will speed up the model fitting, but might slow down other programs.
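For example (this simply combines the two pointers above; n_jobs=-1 is a standard sklearn argument, not something specific to this lab):

```python
from sklearn.ensemble import ExtraTreesRegressor

# 250-tree model, fit using all available CPU cores
model = ExtraTreesRegressor(n_estimators=250, n_jobs=-1)
```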

You have the opportunity to improve the predictions further for extra credit (see below).

3.1.5.5. Save your model (optional)

You will note that models with more estimators take longer to fit.

If you do not want to fit the same model more than once, you can save time by saving fitted models from your computer’s memory to your drive, using the Python module pickle.

To save any Python object to your hard drive, use pickle.dump:

import pickle
pickle.dump(model, open(DIR_MODELS / 'model_name.pkl', 'wb'))

To load the saved model afterwards, use pickle.load:

model = pickle.load(open(DIR_MODELS / 'model_name.pkl', 'rb'))

Note

The space usage of scikit-learn models varies tremendously. Linear regressions are tiny. However, random forests and extremely randomized trees can easily need a few GB. Still, loading them is often faster than fitting them.
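If you want to convince yourself that a pickled model round-trips faithfully, a small self-contained check (using a temporary directory in place of the lab's DIR_MODELS, and a tiny model instead of the full one) might look like this:

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Fit a tiny model on synthetic data
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))
y = X[:, 0] + rng.normal(size=50)
model = ExtraTreesRegressor(n_estimators=10, random_state=0).fit(X, y)

# Dump to disk and load it back (tmp_dir stands in for DIR_MODELS)
tmp_dir = Path(tempfile.mkdtemp())
with open(tmp_dir / 'model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open(tmp_dir / 'model.pkl', 'rb') as f:
    reloaded = pickle.load(f)

# The reloaded model reproduces the original predictions exactly
same = np.array_equal(model.predict(X), reloaded.predict(X))
```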

3.1.5.6. Fit a full model

Once we are satisfied with the choice of our predictive model, we can create predictions of land value for the entire state of Massachusetts.

Find the largest sample of parcels for which we can predict land cost. It is the subset of all unprotected parcels for which all covariates (minus the year of sale) exist. We also need to set a year for which we would like to make the prediction (2010 here).

xcols_sale_pred = [v for v in XCOLS_SALE if v != 'lastsale_year']

is_sale_pred = (
    parcels[xcols_sale_pred].isnull().sum(1).eq(0)
    & parcels['perc_prot'].le(5)
)

X_sale_pred = parcels.loc[is_sale_pred, xcols_sale_pred]

X_sale_pred['lastsale_year'] = 2010

We then fit our favorite model to the full sale data sample (X_sale.join(parcel_xy) and y_sale) and predict land conservation cost for all unencumbered parcels for which we have data.

model = ExtraTreesRegressor(
   n_estimators=250, n_jobs=-1, bootstrap=True, oob_score=True
)

model.fit(X_sale.join(parcel_xy), y_sale)

y_pred = pd.Series(
   model.predict(X_sale_pred.join(parcel_xy)), index=X_sale_pred.index
)

Go ahead and create maps of predicted values and residuals for the best extremely randomized tree regressor we have run so far (call it et_xy_250), using the same syntax as before. Add the outline of Massachusetts, a title, and remove the box. Use the 'Spectral_r' colormap. Save the file as follows:

~/ee508/reports/lab3/1_sale_et_xy_250_pred.png (adjust the name if you map a different model, e.g. 1_sale_rf_xy_100_pred.png)
~/ee508/reports/lab3/1_sale_et_xy_250_resid.png

For the map of residuals, make sure to compute them from differences between actual values and OOB predictions (not within-sample predictions). Remember why?

For the map of predictions, make sure you include all 177,277 parcels in the map (not only the ones that sold). To keep things simple, you don’t need to use OOB predictions for the parcels that were also part of the training data, i.e. you can use the full set of predictions from model.predict(). If you wanted a map of true out-of-sample predictions, you would have to identify the training observations and replace their predictions from model.predict() with OOB predictions.
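If you do want to attempt that replacement, the key step could be sketched as follows; the data here are synthetic stand-ins for the lab's training sample and full parcel set, and the essential part is the final .loc assignment:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(2)

# Stand-ins: 100 "training" parcels plus 50 parcels that never sold
X_train = pd.DataFrame(rng.normal(size=(100, 3)))
y_train = X_train[0] * 2 + rng.normal(size=100)
X_all = pd.concat(
    [X_train, pd.DataFrame(rng.normal(size=(50, 3)), index=range(100, 150))]
)

model = ExtraTreesRegressor(
    n_estimators=50, bootstrap=True, oob_score=True, random_state=0
)
model.fit(X_train, y_train)

# Start with within-sample predictions for every parcel...
y_pred = pd.Series(model.predict(X_all), index=X_all.index)

# ...then overwrite the training parcels with their OOB predictions
y_pred.loc[X_train.index] = model.oob_prediction_
```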

If you want to create fancier-looking maps, you can use parcels (polygons) instead of parcel_centroids (points) to plot the values. This looks better, but also takes about four times as long. In the interest of students with slower machines, there is no extra credit for this.

Don’t forget to save your .csv file with the mean out-of-sample MSEs. It should now contain values for at least 13 models.

~/ee508/reports/lab3/1_mean_prediction_error_out_of_sample.csv

3.1.6. Improve your predictions (optional)

If you want, you can try to improve the predictions further. If you find a model that consistently reduces the out-of-sample MSE for the sales price prediction, you can gain up to 10% extra credit on this lab. Some rules apply:

You have to use the same observations (rows) of the original dataset and predict the same outcome variable.

You can add any additional predictor variables to the set of predictors. The predictors have to be based on real data (i.e. they cannot be artificially generated by you to improve fit), and, of course, they cannot include the outcome. The dataset already has a few variables we haven’t used yet. If you want, you can create your own (e.g. using zonal statistics).

You can use any predictive model, including those we haven’t used yet.

You can often improve predictions by modifying the hyperparameters of the machine learning methods. Look up the scikit-learn help pages for GradientBoostingRegressor, RandomForestRegressor and ExtraTreesRegressor for more information.

The model has to be able to be fit to the data within a reasonable time frame, so that we do not give an undue advantage to students with more powerful machines. While it is hard to enforce a strict limit given differences between systems, there will be a small penalty if your model exceeds the processing time of our reference model (extremely randomized trees with 250 estimators, using X_sale with coordinates) by more than a factor of 5.

To submit your model, submit 1) the code to run the model, 2) the column names of the predictors (if you changed the set of predictors), 3) the mean out-of-sample MSE, and, optionally, 4) the data of your predictors (only if you created a new predictor). Put everything in the subfolder:

~/ee508/reports/lab3/best_model/

I will test the predictive power of all submitted models using the mean out-of-sample MSE from a 5-fold cross-validation, with 5 repetitions. The winner gets an extra credit of 10%. Everyone else will get extra credit in proportion to how far their model moved from the reference model toward the best model: 10% × (MSEref − MSEyours) / (MSEref − MSEbest).

3.1.7. Predict forest change

We will use similar machine learning methods to predict rates of forest change across Massachusetts. Measures of parcel-level forest change are usually neither normally nor binomially distributed. Instead, the majority of observations are zeros (no change), while the remainder take a continuous measure (% of the parcel). For the purposes of this lab, we will not look in depth into these statistical issues. Instead, we will treat forest change as a continuous variable. We can still interpret the resulting estimates as expected rates of forest change (probability × amount).
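A tiny numeric check of the "probability × amount" interpretation: for a zero-inflated variable, the overall mean equals the probability of change times the mean amount conditional on change. The data below are synthetic, purely for illustration:

```python
import numpy as np

# Simulate a zero-inflated change variable: 20% of parcels change,
# and changed parcels lose a uniform 0-50% of their area
rng = np.random.default_rng(3)
change = np.where(rng.random(10_000) < 0.2,
                  rng.uniform(0, 50, 10_000), 0.0)

p_change = (change > 0).mean()          # probability of any change
amount = change[change > 0].mean()      # amount, conditional on change

# change.mean() and p_change * amount coincide: the "expected rate"
```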

Using the forest change dataset we created earlier (X_forest and y_forest) and an extremely randomized trees regressor with at least 250 estimators, create a map of predicted forest cover change for all unprotected parcels in Massachusetts (perc_prot ≤ 5) for which you have complete data. In this case, all parcels are part of the training sample, so make sure to use only OOB predictions for the map!

Save the map as:

~/ee508/reports/lab3/1_forestchange_et_pred.png

If you want, feel free to change the covariates and tweak the regression to improve the out-of-sample prediction error. You don’t need to, as we will use pre-computed predicted values in the next exercise. However, practicing fitting good predictive models is always useful for building an intuition for the relationships in the data.

3.1.8. Wrap up

Submit to the Blackboard Google Assignment a zipped archive of:

~/ee508/reports/lab3/

It should contain the following files:

1_mean_prediction_error_out_of_sample.csv
1_mean_prediction_error_within_sample.csv
1_forestchange_et_pred.png
1_sale_ols3_pred.png
1_sale_ols3_resid.png
1_sale_et_xy_250_pred.png
1_sale_et_xy_250_resid.png

In Lab 3-2, we will use the estimated cost and forest-loss-risk data in simulation models.