.. EE 508 / DS 537: Data Science for Conservation Decisions

.. _lab4:

Estimating the effects of protected areas on Amazon deforestation
=================================================================

In this lab, we will use a popular quasi-experimental statistical method - matching - to estimate the effects of policies on land use.

Matching is a quasi-experimental method designed to *emulate* a random experiment from observational data. Starting with a set of units that have been subject to a given program (the **treatment units**, which may refer to people, properties, pixels, ...), matching selects from units that were *not* subject to the program (the **"pool" of potential control units**) the ones that are *as similar as possible* to the treatment units (**matched control units**).

"Similarity" is defined in terms of **confounders**. Confounders are variables having an effect on *both* the outcome and the likelihood of treatment. As a result, confounders "confound" the treatment effect (leading to an over- or underestimation of the policy effect if not controlled for).

The first application of statistical matching to conservation policy questions was a 2008 study on the `impacts of protected areas on deforestation in Costa Rica (Andam et al. 2008) `_, which used satellite data to show that parks did have an impact on forest cover, but that the effect was small after controlling for confounders, because parks were mostly located on land that was not at risk of losing forests.

I found Andam's paper in 2010 when I started my PhD at the University of Michigan. It inspired this analysis and paper, which provides the data for this lab:

`Nolte et al. (2013) Governance regime and location influence avoided deforestation success of protected areas in the Brazilian Amazon. PNAS `_

.. image:: images/lab4_nolte_pnas_fig2a.png
   :alt: Indigenous lands had the lowest deforestation in high-pressure locations.
   :width: 300
   :align: right

*"We show that, for any given level of deforestation pressure, strictly protected areas consistently avoided more deforestation than sustainable use areas. Indigenous lands were particularly effective at avoiding deforestation in locations with high deforestation pressure. Findings were stable across two time periods featuring major shifts in the intensity of government enforcement. We also observed shifting trends in the location of protected areas, documenting that, between 2000 and 2005, strictly protected areas were more likely to be established in high-pressure locations than sustainable use areas and indigenous lands. Our findings confirm that all protection regimes helped reduce deforestation in the Brazilian Amazon."*

You don't have to read the paper or memorize its content, but you're welcome to check it out and use it as a reference (e.g., if you want to learn more about the underlying datasets). We're going to use the same data for a different use case scenario:

.. admonition:: Who was more successful in reducing deforestation?

   April 27, 2016. The Brazilian government is in a political crisis. To regain the support of its electorate, the ruling Workers' Party (PT) commissions a series of analyses to demonstrate the positive impacts of its policies. The PT is particularly proud of its success in reducing Amazon deforestation: since President Lula took office on January 1\ :sup:`st`, 2003, deforestation declined by more than 70%, making Brazil eligible for billions of dollars in international funding for Reducing Emissions from Deforestation and Forest Degradation (REDD).

   The PT's main adversary, the Brazilian Social Democracy Party (PSDB), attempts to sabotage the image campaign. A PSDB press release reports that the previous government under president Cardoso contributed more to protecting the Amazon than Lula's.
   PSDB analysts point out that, between 1998 and 2002, the Cardoso government created larger protected areas and indigenous lands (PA/IL) than the PT in the following five years. The PT counters that the PA/ILs declared by the PSDB were situated in areas far away from deforestation pressure, and therefore did not contribute much to the observed reductions.

   The PT does not have sufficient in-house research expertise to make its point. It therefore commissions an independent external study. Your consultancy has won the contract. The Terms of Reference (ToR) ask you to answer the following question: How much deforestation has been avoided by PA/ILs declared by the late Cardoso government (PSDB, 1998-2002) as compared to PA/ILs declared by the early Lula government (PT, 2003-2007)? Is there evidence that the PA/ILs created by the PT are having a higher overall impact?

.. admonition:: Deliverables

   - Map of pixels classified by protected area type in the Brazilian Amazon
   - Map of pixels protected under president Cardoso with matched control pixels
   - Two density plots of the distribution of a covariate across different treatment and control groups: before vs. after matching
   - A :file:`.csv` table with matching quality indicators and impact estimates for areas protected under Cardoso and Lula

.. admonition:: Due date

   Tue, Nov 25, 2025, by 6:00 p.m.

Prepare software and data
~~~~~~~~~~~~~~~~~~~~~~~~~

Install R and RStudio
---------------------

If you don't have an existing integrated development environment (IDE) for R, `download R and RStudio `_ and install both.

Download the data
-----------------

We need two data tables for this analysis:

| :file:`amazon_pixel_sample.csv`
| :file:`amazon_protected_areas.csv`

You can find both in: `~/ee508/data/external/lab4/ `_

Set up packages and filepaths
-----------------------------

Open your R environment, e.g., :gui:`RStudio`. Install the packages ``tidyverse``, ``Matching``, and ``styler`` with ``install.packages()``:
.. code-block:: R

   install.packages('tidyverse')
   # etc.

Activate ``tidyverse`` and ``Matching``:

.. code-block:: R

   library(tidyverse)
   library(Matching)

- ``styler`` is available in your RStudio menu to help you with quick code styling: :gui:`Addins` > :gui:`Style active file` (it does not enforce line lengths).

Set your working directory and filepaths to the downloaded data, e.g.:

.. code-block:: R

   # Set working directory - change this to your EE 508 directory
   setwd('~/ee508')

   # Set filepaths to downloaded data (here: relative to the working directory)
   pixels_path <- file.path('data', 'external', 'lab4', 'amazon_pixel_sample.csv')
   protected_areas_path <- file.path(
     'data', 'external', 'lab4', 'amazon_protected_areas.csv'
   )

- Note the use of ``<-`` to assign variables in R.
- You can obtain help for any function in R with ``?``, for instance: ``?file.path``
- Note (or remember) that R allows dots (``.``) in function and variable names. ``file.path`` is a standalone function name (not the attribute ``path`` of the object ``file``, as it would be interpreted in Python).

Explore the Amazon data
~~~~~~~~~~~~~~~~~~~~~~~

Import the two data tables:

.. code-block:: R

   pixels <- read.csv(pixels_path)
   protected_areas <- read.csv(protected_areas_path, fileEncoding = 'latin1')

- Inspect the tables with ``head(pixels)`` and ``head(protected_areas)``.

| :file:`amazon_pixel_sample.csv`
| a 1% sample of 1km\ :sup:`2` pixels in the Brazilian Amazon with forests in 2000.

.. list-table::
   :header-rows: 1
   :widths: 14 86

   * - **Variable**
     - **Meaning**
   * - x
     - x coordinate in WGS84 (longitude)
   * - y
     - y coordinate in WGS84 (latitude)
   * - wdpaid
     - ID of the protected area in the World Database of Protected Areas (WDPA, `www.protectedplanet.net `_). If this value is 0, the pixel is not protected.
   * - state
     - ID of Brazilian state
   * - vcf00
     - Tree cover in 2000 (%)
   * - vcf05
     - Tree cover in 2005 (%)
   * - distedg00
     - Distance to the forest edge in 2000 (m)
   * - distedg05
     - Distance to the forest edge in 2005 (m)
   * - distpa
     - Distance to the nearest protected area (m)
   * - traveltime
     - Travel time to major cities (min)
   * - slope
     - Slope (%)
   * - elev
     - Elevation (m)
   * - floodable
     - Percentage of area that is floodable (%)
   * - prod0105
     - Forest loss 2001-05 according to PRODES (Brazil's official monitoring system for clear-cut deforestation), as a fraction of the pixel area (0-1)
   * - prod0610
     - Forest loss 2006-10 according to PRODES, as a fraction of the pixel area (0-1)

| :file:`amazon_protected_areas.csv`
| Protected areas and indigenous territories in the Brazilian Amazon (ca. 2012)

.. list-table::
   :header-rows: 1
   :widths: 14 86

   * - **Variable**
     - **Meaning**
   * - wdpaid
     - ID of the protected area in the WDPA
   * - name
     - Name of the protected area / indigenous territory
   * - status_yr
     - Designation year of the protected area / indigenous territory
   * - patype
     - Governance category (SP: strict protection, SU: sustainable use, IT: indigenous territory)

Join the two datasets on their ``wdpaid``:

.. code-block:: R

   d <- merge(pixels, protected_areas, by = 'wdpaid', all.x = TRUE)

What does the argument ``all.x=TRUE`` do? Find out with ``?merge``. Is this a left join?

Make a scatterplot of the pixels:

.. code-block:: R

   ggplot(d) + geom_point(aes(x=x, y=y))

- ``ggplot2`` is a powerful data visualization package for R that is part of the ``tidyverse``. Covering its full potential is beyond the scope of this course. For a fantastic introduction, see the Data Visualisation chapter in the free online book: `R for Data Science `_.
- ``aes`` refers to the "*aes*\ thetic mapping" of the data.

With just small modifications, you can create quick visualizations of the Amazon data. Go through the following examples.

Color the pixels as a function of one of the variables:
.. code-block:: R

   ggplot(d) + geom_point(aes(x=x, y=y, color=vcf00))

You'll note that some of the pixels are large and overlap. You can change that:

.. code-block:: R

   ggplot(d) + geom_point(aes(x=x, y=y, color=vcf00), alpha=0.5, size=0.25)

Change the color palette to one that reflects the meaning of the variable better:

.. code-block:: R

   ggplot(d) +
     geom_point(aes(x=x, y=y, color=vcf00), alpha=0.5, size=0.25) +
     scale_color_gradient(low='white', high='forestgreen')

Note how you "chain" ``ggplot`` commands together with the plus (``+``) symbol. It must appear at the end of the previous line (not at the beginning of the next).

If you want to be sure that the axes have the same scale (required for maps), add ``coord_fixed()``:

.. code-block:: R

   ggplot(d) +
     geom_point(aes(x=x, y=y, color=vcf00), alpha=0.5, size=0.25) +
     scale_color_gradient(low='white', high='forestgreen') +
     coord_fixed()

With just a few more lines of code, you can add a title and save the plot as an image.

.. code-block:: R

   ggplot(d) +
     geom_point(aes(x=x, y=y, color=vcf00), alpha=0.5, size=0.25) +
     scale_color_gradient(low='white', high='forestgreen') +
     coord_fixed() +
     ggtitle('Tree Cover, 2000')

   ggsave(
     filename = file.path('reports', 'lab4', 'extra', 'vcf00.png'),
     width = 9,
     height = 6
   )

What happens if you visualize the variable ``state`` instead of ``vcf00``? ``state`` is currently interpreted as a number, not as a category. In fact, we need ``state`` to be a number (to match on it later). If you want to map ``state`` as a category, you can create a new variable that transforms ``state`` into a ``factor``, and map that new variable instead:

.. code-block:: R

   d$state_factor <- factor(d$state)

Note how the dollar sign operator (``$``) allows us to access columns of dataframes. What happens if you map ``state_factor``?
(Tip: don't use a ``scale_color_gradient``.)

Make a map of protected area types (``patype``) and save it as: :file:`~/ee508/reports/lab4/1_protected_area_types.png`

For a full list of geometries ``ggplot2`` can plot (lines, bar charts, boxplots, violin plots, heat maps, etc.), consult the `ggplot2 reference `_.

Examine covariate distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First, let's find out where the protected areas are situated: create a new column in ``d`` (let's call it ``cat`` for category) that identifies, for each pixel, whether it was protected between 1998 and 2002 (first treatment group), between 2003 and 2007 (second treatment group), or never protected (control group).

.. code-block:: R

   d$cat <- ifelse(
     d$wdpaid == 0,
     'unprotected',
     ifelse(
       d$status_yr >= 1998 & d$status_yr <= 2002,
       'cardoso',
       ifelse(d$status_yr >= 2003 & d$status_yr <= 2007, 'lula', NA)
     )
   )

Note: ``ifelse`` in R has the same functionality as ``np.where`` in Python: if the condition is ``TRUE``, it returns the first value, otherwise the second, element-wise. You can nest these calls, as we do here.

The pixels with no ``cat`` value were protected in other time periods. We will discard them.

.. code-block:: R

   d <- d[!is.na(d$cat), ]

The exclamation mark (``!``) in R has the same functionality as the tilde (``~``) in Python: it inverts boolean values.

.. note::

   Selecting rows from R dataframes looks similar to selecting rows with ``.loc`` in ``pandas``: if you provide a boolean array, it returns the rows for which the array is ``TRUE`` (boolean indexing). The comma after the boolean array tells R that you are selecting rows (not columns). It cannot be omitted.

How many pixels are there of each group?

.. code-block:: R

   table(d$cat)

Where are these three groups?

.. code-block:: R

   ggplot(d) +
     geom_point(aes(x=x, y=y, color=cat), alpha=0.5, size=0.25) +
     coord_fixed()

Do the groups differ in terms of their covariates?
.. code-block:: R

   ggplot(d) + geom_density(aes(x=traveltime, color=cat))

Add the means of the covariates and save the plot:

.. code-block:: R

   d_means <- d %>%
     group_by(cat) %>%
     summarize(cat_mean = mean(traveltime))

   ggplot(d) +
     geom_density(aes(x = traveltime, color = cat)) +
     geom_vline(data = d_means, aes(xintercept = cat_mean, color = cat), linetype = 'dashed') +
     ggtitle('Covariates before matching: traveltime')

   ggsave(
     filename = file.path('reports', 'lab4', '1_covdist_original_traveltime.png'),
     width = 4,
     height = 2
   )

Repeat the plotting for ``distedg05`` and ``slope``. Save all three plots as: :file:`~/ee508/reports/lab4/1_covdist_original_<covariate>.png`, where :file:`<covariate>` is replaced by the name of the covariate.

Clearly, the areas declared protected under Lula differ from those declared protected under Cardoso. And both of them differ substantially from unprotected pixels.

Conduct a naïve impact analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The PSDB hired their own consultants to quantify the impacts of the protected areas. The consultants infer impact from the differences in five-year deforestation rates observed between 2006 and 2010 in protected areas declared under Cardoso vs. in unprotected areas. They then repeat the same analysis for protected areas declared under Lula vs. unprotected areas. (Economists like to refer to this approach as "naïve" impact estimation - in contrast to "rigorous", which implies controlling for confounders.)

How much deforestation was observed, on average, in each group between 2006 and 2010? You can extract the values for any group as follows:

.. code-block:: R

   d[d$cat=='lula', 'prod0610']

and then apply the ``mean()`` function.

.. admonition:: Questions

   How large is the difference between areas protected under Cardoso vs. unprotected areas? How large is the difference between areas protected under Lula vs. unprotected areas? Optional: are these differences significant?
   ``t.test(x, y)`` returns the results of a Student's t-test for the distributions of ``x`` and ``y``.

How much avoided deforestation (in km\ :sup:`2`) does this impact estimate translate to in each case? Multiply the estimated mean difference by the estimated total area that is protected. Recall that each observation in this dataset is a 1km\ :sup:`2` pixel, and that the dataset is a 1% random sample: this means that each pixel is representative of 100km\ :sup:`2`.

Create a :file:`.csv` file to summarize the values and save it under the following file path: :file:`~/ee508/reports/lab4/1_impact_estimates.csv`

.. list-table::
   :header-rows: 1
   :widths: 21 30 25 27 35 18

   * - Approach
     - Cardoso: number of treated units used for impact estimate
     - Cardoso: average abs. SMD of six covariates
     - Cardoso: estimated average impact (in % deforestation)
     - Cardoso: estimated avoided deforestation (km\ :sup:`2`, rounded)
     - Lula: ... (create the same 4 columns)
   * - Naive (no matching)
     - 6557
     - *Leave this column blank; we'll fill it in later.*
     - -2.08
     - 13,664
     -
   * - Matching
     -
     -
     -
     -
     -
   * - ...
     -
     -
     -
     -
     -

.. note::

   The estimated average impact should have a negative sign if deforestation is estimated to have been reduced by the treatment. The average impact is to be reported as a percentage (with values between 0-100). Don't forget that the deforestation variables are measured as fractions (0-1), not percentages.

Given these results, does it seem as if the protected areas declared under Cardoso's government had lower deforestation rates and, therefore, greater impact?

Estimate impacts with matched controls
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We have observed two things: first, the areas protected under Cardoso seem to have lower deforestation rates than the areas protected under Lula, which might indicate that they are more "successful".
On the other hand, our analysis of covariate distributions also indicates that the areas protected under Cardoso are, on average, further removed from cities and from the forest edge. This indicates that they might have been exposed to lower threat than the areas protected under Lula. But Cardoso's areas also have, on average, lower slopes, which is usually associated with higher threat levels. Which of the two groups, then, is having more of an impact? To find out, we turn to matching.

Look up the help file for the ``Match`` function (``?Match``). You will see that the function needs at least three arguments:

- ``Y``: a vector of the outcome in question (here: ``prod0610``)
- ``Tr``: a vector that is 1 for treated units and 0 for all untreated units
- ``X``: a matrix of the variables you want to match on (covariates)

Let's build these datasets. For this first run, we will use 1) pixels protected under Cardoso as treatment units and 2) unprotected pixels as the pool of potential control units.

Create a new dataframe that contains the pixels from each group. Call the first ``d.tr`` and the second ``d.ct``. This will involve boolean indexing, as we used earlier when we discarded pixels we didn't need.

Add a treatment column to each dataframe. Let's call it ``tr``. Set it to ``1`` for the treated and to ``0`` for the untreated units.

Combine both subsets into a single input dataset: ``d.in <- rbind(d.tr, d.ct)``.

.. note::

   ``rbind`` combines datasets vertically (``r`` for rows), ``cbind`` horizontally (``c`` for columns).

Run ``Match`` with deforestation 2006-2010 as the outcome variable and the following covariates: elevation, slope, flooding probability, baseline tree cover (in 2005), distance to forest edge (in 2005), and travel time to major cities. These are the same covariates we used in the original study.

.. code-block:: R

   cov <- c('elev', 'slope', 'floodable', 'vcf05', 'distedg05', 'traveltime')
   m <- Match(d.in$prod0610, d.in$tr, d.in[, cov])
.. note::

   ``c()`` is R syntax for creating a vector (similar to a list in Python).

Look up the help file of the ``Match`` function to understand the output. We are particularly interested in the meaning of ``est``, ``se``, ``index.treated``, and ``index.control``. Use the latter two to retrieve the matched dataset.

.. code-block:: R

   d.m <- rbind(d.in[m$index.treated, ], d.in[m$index.control, ])

For the purpose of mapping, create a new categorical treatment variable. We need to do that because ``tr`` needs to remain a numeric variable for the rest of this exercise.

.. code-block:: R

   d.m$group <- ifelse(d.m$tr, 'treatment units', 'matched controls')

Use ``ggplot`` to make a map of the matched units. Use an aesthetic mapping that gives you a different color for treated units and matched control units. Do you see how the matched control pixels appear to come from areas that are more comparable to the treated areas than if we had taken a random sample of all unprotected areas? Save the map as: :file:`~/ee508/reports/lab4/1_matched_units_cardoso.png`

Use ``ggplot`` to plot the covariate distributions of ``traveltime``, ``distedg05``, and ``slope`` for the treatment units vs. the matched controls, with means. Do the distributions look more similar than they initially did? Are their means more similar? Save the plots under the following filename (replace :file:`<covariate>` by the name of the covariate): :file:`~/ee508/reports/lab4/1_covdist_cardoso_matched_<covariate>.png`

If you are happy with the covariate distributions, look at the impact estimate using ``summary(m)``. How large is the estimated effect compared to the simple difference in means you computed for the naïve analysis? Why do you think that is? If you want to directly extract the value of the treatment effect, you can use ``m$est``.

Save the results in your :file:`.csv` table. In the ``Approach`` column, call this impact estimate "``Matching``". Compute how much deforestation (km\ :sup:`2`) has been avoided.
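To build intuition for what ``Match`` just did, here is a minimal base-R sketch of one-nearest-neighbor matching with replacement on a single made-up covariate. This is a toy illustration only, not the ``Matching`` package's actual implementation (which matches on several covariates at once, using a Mahalanobis-type distance, and handles ties and weights):

.. code-block:: R

   # Toy data (made up for illustration): x is a covariate, y an outcome
   treated  <- data.frame(id = 1:3, x = c(2.0, 5.0, 9.0),        y = c(0.1, 0.3, 0.0))
   controls <- data.frame(id = 4:8, x = c(1.5, 4.0, 6.5, 8.5, 20.0),
                          y = c(0.2, 0.4, 0.1, 0.1, 0.9))

   # For each treated unit, find the control with the smallest covariate distance
   nearest <- sapply(treated$x, function(xt) which.min(abs(controls$x - xt)))
   matched <- controls[nearest, ]

   # The naive comparison uses all controls (including the remote outlier at
   # x = 20); the matched comparison uses only the comparable ones
   naive_diff   <- mean(treated$y) - mean(controls$y)
   matched_diff <- mean(treated$y) - mean(matched$y)

Here the matched estimate differs noticeably from the naive one because the control pool contains a unit (``x = 20``) that is nothing like any treated unit; matching simply never selects it.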
Repeat the analysis for the areas protected under Lula and save the results in your :file:`.csv` table.

.. note::

   When looking up the "number of treated units used for impact estimate", use the *weighted* number of observations. ``summary(m)`` reports it as "Matched number of observations". You can also obtain it with ``m$wnobs``. When ``Match`` finds more than one nearest neighbor at the exact same Mahalanobis distance to a given treatment unit, it will match all of these nearest neighbors as controls, but then weigh the matched controls equally (e.g., 50% each) in the computation of the impact.

Assess the quality of the matches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's explore the implications of different matching parameters for the quality of the matching (i.e., the comparability of the treated and matched control groups). Before we change the parameters, we have to decide on the metrics we'll use to compare them.

The ``Matching`` package has its own ``MatchBalance`` function, which runs a series of tests on the covariate distributions before and after matching. It is pretty straightforward to use:

.. code-block:: R

   mb <- MatchBalance(
     tr ~ elev + slope + floodable + vcf05 + distedg05 + traveltime,
     data = d.in,
     match.out = m,
     nboots = 0
   )

.. note::

   The first argument of ``MatchBalance`` is an R formula: the dependent variable is written to the left of the tilde (``~``), and the independent variables are to the right of it. This type of formula is used to run most models (e.g., regressions) in R.

Take a moment to examine the outputs. You see that, for each covariate, the function prints the results of a range of checks conducted both before and after matching. Of particular interest for us is the standardized mean difference (SMD, ``std mean diff``), which gives us a way to compare the balance across covariates that have wildly different ranges of values.

.. admonition:: Question

   Which covariate has the worst SMD after matching? Which one has the best?
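If you want to see what a standardized mean difference actually computes, here is a hand-rolled version for a single covariate, with made-up numbers. Conventions vary across packages; this sketch assumes the convention of standardizing by the treated-group standard deviation and reporting a percentage (check ``?MatchBalance`` / ``?balanceUV`` for the exact formula the ``Matching`` package uses):

.. code-block:: R

   # Standardized mean difference (one common convention):
   # difference in group means, divided by the treated-group SD, times 100
   smd <- function(x_tr, x_ct) {
     100 * (mean(x_tr) - mean(x_ct)) / sd(x_tr)
   }

   x_tr <- c(10, 12, 14, 16, 18)  # covariate values, treated units (made up)
   x_ct <- c(6, 8, 10, 12, 14)    # covariate values, control units (made up)

   smd(x_tr, x_ct)
   # Positive: treated units have higher values on this covariate.
   # Dividing by the SD is what makes covariates with very different
   # units (meters, minutes, percent) comparable on one scale.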
The other tests look more specifically at the similarity in the covariate *distributions* of matched treatment and control units. These include summary statistics for the empirical cumulative density function (eCDF) and empirical quantile-quantile plots (eQQ). We will not use them here, but you can learn more about their meaning with ``?MatchBalance``.

For the purpose of the next step, we want a single indicator to compare the covariate balance across different matching parametrizations. We'll use the mean of the absolute values of the SMDs across all six covariates (mean absolute SMD: MASMD). We average *absolute* SMDs because SMDs can be positive or negative, yet in both cases, the closer the SMD is to zero, the better the balance.

You can extract the SMD for each covariate with:

.. code-block:: R

   mb$AfterMatching[[i]]$sdiff

where ``i`` is the number of the covariate (starting at 1). Compute the MASMD across all six covariates with a simple ``for`` loop:

.. code-block:: R

   smd <- c()
   for (i in 1:length(cov)) {
     smd <- c(smd, mb$AfterMatching[[i]]$sdiff)
   }
   mean(abs(smd))

Note the syntax of a ``for`` loop in R: the iterator variable is enclosed by parentheses (``()``), and the code block of the ``for`` loop is enclosed by curly brackets (``{}``).

Compute the MASMD for the unmatched (before matching) and matched (after matching) treatment vs. control groups of areas protected under Cardoso and Lula. Note: the MASMD for the unmatched units is the same as for the "naïve" comparison. To compute it, replace ``AfterMatching`` with ``BeforeMatching``. Fill in the missing columns in your CSV sheet (which should now contain 2x8 values). By how much did the MASMD improve?

You can create a quick barplot to inspect how much balance improved for each covariate. To do so, save the ``smd`` vectors you obtained earlier for both before matching (``smd_before``) and after matching (``smd_after``), then create a barplot that combines both:
.. code-block:: R

   balance <- data.frame(v = cov, before = smd_before, after = smd_after) %>%
     pivot_longer(
       cols = c(before, after),
       names_to = 'Matching',
       values_to = 'smd'
     )

   ggplot(balance) +
     geom_bar(aes(y = v, x = smd, fill = Matching), stat = 'identity', position = 'dodge')

Test alternative specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now that we can compare the quality of covariate balance across different runs, let's start exploring the implications of changes to the parameters. All modifications involve only minor changes to the code. Completing the following steps will be quickest if the entire code block (from building your matching dataset to computing and showing the values for the CSV file) can be run at once. Consider rewriting your code accordingly before you proceed.

Match with calipers
-------------------

Some treatment units might be so different from any control unit (e.g., extremely remote) that you don't want to consider even the "nearest" control a suitable match. You can define "calipers": maximum acceptable distances (in terms of covariate values) between your matches. These maximum distances are usually defined in terms of standard deviations (SD) of each covariate. Here we will use the same caliper of 0.5 standard deviations for all covariates:

.. code-block:: R

   m <- Match(d.in$prod0610, d.in$tr, d.in[, cov], caliper = 0.5)

``summary(m)`` shows that some treated observations have been dropped because they were not sufficiently comparable. As you fill in the CSV spreadsheet, make sure the "number of treated units used for impact estimate", as well as the "estimated avoided deforestation", are based only on this new, smaller number of observations. In essence, we reduce the area of protected areas for which we say we can confidently estimate impacts, thus ignoring the areas that were dropped. In the writeup, you will need to make sure you account for that.
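The caliper logic can be sketched in a few lines of base R. The data below are made up, and inside ``Match`` the caliper is applied per covariate on standardized distances; this single-covariate toy version only shows *why* remote treated units get dropped:

.. code-block:: R

   # Toy sketch of a caliper (made-up data, single covariate): drop treated
   # units whose nearest control is farther away than 0.5 SD of the covariate
   treated_x <- c(2.0, 5.0, 30.0)      # covariate values, treated units
   control_x <- c(1.5, 4.0, 6.5, 8.5)  # covariate values, control pool

   caliper_sd <- 0.5
   threshold  <- caliper_sd * sd(c(treated_x, control_x))

   # Distance from each treated unit to its nearest control
   nearest_dist <- sapply(treated_x, function(xt) min(abs(control_x - xt)))
   kept    <- treated_x[nearest_dist <= threshold]
   dropped <- treated_x[nearest_dist > threshold]

The treated unit at ``x = 30`` has no control anywhere near it, so it falls outside the caliper and is excluded; the impact estimate then only speaks for the ``kept`` units.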
You can use ``m$index.dropped`` (similar to ``m$index.treated`` and ``m$index.control``) to identify and map the units that have been dropped. Are they more likely to have higher or lower deforestation than the matched parcels? Has balance improved? Did the estimated impact change?

Exact matching
--------------

Sometimes we want to make sure that treatment units and matched controls have *exactly* the same value in one or more covariates. In the Brazilian Amazon, **states** can differ in policies that are relevant to deforestation (enforcement, agricultural subsidies, land tenure). If we want to make sure that controls come from the same state, we can add ``state`` as a covariate. Because ``state`` is a number, ``Match`` will interpret it as a continuous variable by default. However, as you can see in your map of states, there is no reason to believe that state ``5`` is "similar to" state ``4``. You therefore have to let ``Match`` know that you want to match *exactly* by state, i.e., you only accept matches that have the same value.

.. code-block:: R

   cov2 <- c('elev', 'slope', 'floodable', 'vcf05', 'distedg05', 'traveltime', 'state')
   m <- Match(
     d.in$prod0610,
     d.in$tr,
     d.in[, cov2],
     exact = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE),
     caliper = 0.5
   )

.. note::

   The ``exact`` argument expects a boolean array of the same order and length as the covariates.

How many treated units are included in the analysis? Did the impact estimate change? Has the MASMD for the six continuous variables improved or worsened?

.. caution::

   Do not add ``state`` as a new covariate to the ``MatchBalance`` formula: it is a categorical variable without continuous interpretation (and you already know you are getting an exact match on it).

Bias adjustment
---------------

Sometimes it is hard to achieve really good balance on covariates, even with calipers and exact matching.
If that is the case, treatment and matched controls will still differ in factors that affect the outcome (as we have seen, the MASMD is never zero). Matching allows you to incorporate that remaining bias in the impact estimate through a regression-based bias adjustment.

.. code-block:: R

   m <- Match(
     d.in$prod0610,
     d.in$tr,
     d.in[, cov2],
     caliper = 0.5,
     BiasAdjust = TRUE,
     exact = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
   )

Did the number of treatment units or the MASMD change? (It shouldn't: we haven't changed anything about the included units.) Did the impact estimate change?

Make sure :file:`1_impact_estimates.csv` contains the results of all five impact estimation approaches as rows, and both Cardoso and Lula as columns.

Write up an executive summary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Take some time to appreciate the results. How did the conclusions about the relative impact of the different protected areas change from the "naïve" comparison?

Write up a short executive summary for the PT (800-1000 words), which explains the analytical technique and summarizes your findings. Advise the PT on how it should frame and interpret the findings in the light of its objective. Mention all important caveats. Save the document under the following name: :file:`~/ee508/reports/lab4/1_impact_findings.docx`

Make sure your writeup includes:

1. a brief explanation of matching (what it tries to achieve and how it works)
2. a brief discussion of the caveats of unobservable confounders
3. a brief discussion of how the dropping of parcels can affect your findings (for the runs with calipers / exact matching), and how you can (or cannot) account for that in your overall assessment

Consider adding a barplot of your results (with Python or ggplot's `geom_bar `_).
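A minimal base-R version of such a barplot. The numbers below are placeholders for illustration only - substitute the estimates from your own :file:`1_impact_estimates.csv`:

.. code-block:: R

   # Hypothetical impact estimates (% deforestation) - replace with your values
   results <- data.frame(
     approach    = c('naive', 'matching', 'caliper', 'exact', 'bias-adjusted'),
     est_cardoso = c(-2.1, -1.2, -1.0, -1.1, -1.0),
     est_lula    = c(-1.5, -1.8, -1.7, -1.8, -1.9)
   )

   # Grouped bars: one pair (Cardoso, Lula) per estimation approach
   barplot(
     t(as.matrix(results[, -1])),
     beside = TRUE,
     names.arg = results$approach,
     legend.text = c('Cardoso', 'Lula'),
     ylab = 'Estimated impact (% deforestation)'
   )

With ``ggplot2``, the equivalent figure uses ``pivot_longer()`` plus ``geom_bar(stat = 'identity', position = 'dodge')``, as in the balance barplot earlier in this lab.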
Wrap up
~~~~~~~

Your folder :file:`~/ee508/reports/lab4/` should now contain the following six deliverables:

| :file:`1_protected_area_types.png`
| :file:`1_covdist_original_distedg05.png`
| :file:`1_impact_estimates.csv`
| :file:`1_matched_units_cardoso.png`
| :file:`1_covdist_cardoso_matched_distedg05.png`
| :file:`1_impact_findings.docx`

Zip them into an archive (only these six - omit any additional figures) and submit it to the Google Assessment on Blackboard.

**Congratulations!** You just conducted your first matching-based quasi-experimental impact assessment. If you wonder how the findings have held up (under Bolsonaro and then again Lula), you can make this your :ref:`individual project `, e.g., by creating a 2000-2024 forest loss layer from the Global Forest Change dataset we used in :ref:`Lab 1-5 `.