.. _lab1-3:

Countries and threatened species with geopandas
===============================================

This lab will get us started with ``geopandas``, a Python package for geospatial vector data (shapefiles, geopackages, geodatabases, geoparquet, etc.). We import, filter, and reproject geospatial polygons, compute areas, and make maps. We will also start using ``zonal_stats``, a way to compute statistics on pixels grouped by (separate) polygon vector data.

.. admonition:: Deliverables

   - A map of average threatened species richness by country subdivision, saved as a :file:`.png` image.

   - A list of the ten countries that contain the country subdivisions with the highest average threatened species richness, saved as a :file:`.csv` file.

.. admonition:: Due date

   Friday, September 19, 2025, by 6:00 p.m.


Create a new notebook
~~~~~~~~~~~~~~~~~~~~~

Start Jupyter, open a new notebook (:file:`~/ee508/notebooks/lab1/lab1-3.ipynb`), and run the following ``import`` commands:

::

   %matplotlib inline
   import geopandas as gpd
   import matplotlib.pyplot as plt
   from pathlib import Path
   import pandas as pd
   from rasterstats import zonal_stats

Are these in the right order? Right-click to the left of cell > :gui:`Format cell` to find out.

Define your filepaths, e.g. with ``pathlib``.

::

   # Directories and file paths
   DIR_EE508 = (
       Path("~").expanduser() / 'Dropbox' / 'Outputs' / 'Courses' / 'ee508'
   )
   DIR_LAB1_1 = DIR_EE508 / 'data' / 'external' / 'lab1' / 'part1'
   DIR_REPORTS_LAB1_3 = DIR_EE508 / 'reports' / 'lab1' / 'part3'

   PATH_NE_COUNTRIES = (
       DIR_LAB1_1
       / 'ne_10m_admin_0_countries_lakes'
       / 'ne_10m_admin_0_countries_lakes.shp'
   )
   PATH_NE_SUBDIVISIONS = (
       DIR_LAB1_1
       / 'ne_10m_admin_1_states_provinces_lakes'
       / 'ne_10m_admin_1_states_provinces_lakes.shp'
   )


Load and explore the vector data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use ``geopandas`` to load the country shapefile from :doc:`lab1-1`.

::

   countries = gpd.read_file(PATH_NE_COUNTRIES)

What variable type is ``countries``? ``type(countries)`` will tell you.

- A ``GeoDataFrame`` is a special kind of ``DataFrame`` that comes with a geo-referenced ``geometry`` column. This colum contains the geo-referenced vector data (points, lines, polygons, multi-polygons) linked to each row in the frame.

Plot the object: ``countries.plot(figsize=(7, 4))``. You should see a map of the world.

-  The ``figsize`` argument (width, height in inches) is optional. However, if you just call ``countries.plot()``, the map tends to be on the small side.

-  Similar to ``DataFrame`` and ``Series`` objects in ``pandas``, the ``GeoDataFrame`` from ``geopandas`` offers this ``plot()`` shortcut to ``matplotlib``.

-  As in pandas, you can first create a ``Figure`` and ``Axes`` using ``plt.subplots()``, tell the ``GeoDataFrame`` to ``plot()`` on the same ``ax``. After that, we can tweak the plot further.

   ::

      fig, ax = plt.subplots()
      countries.plot(ax=ax)
      fig.set_size_inches(7, 4)

-  Alternatively, you can capture the ``Axes`` object returned by ``geopandas``' ``plot`` function to edit the plot further:

   ::

      ax = countries.plot(figsize=(7, 4))
      ax.axis('off')

What columns are available to you in the attribute table? ``countries.columns`` will tell you.

The ``geometry`` column contains the geospatial data as a special type of ``Series``. Which type? Run ``type(countries['geometry'])`` to find out.

.. Caution::

   If you drop the ``geometry`` column, the GeoDataFrame loses its vector data and can't be plotted as a map anymore. Try:

   ::

      countries.drop(columns='geometry').plot()

   So: if you want to keep the vector data, don't drop the geometry column.

Any slice (subset) of the GeoDataFrame that does **not** contain a geometry column will be treated as a ``pandas`` ``DataFrame``. See what Python returns for each of these two commands:

::

   type(countries[['NAME', 'POP_EST', 'geometry']])
   type(countries[['NAME', 'POP_EST']])


Index and filter
~~~~~~~~~~~~~~~~

You select rows and columns of a GeoDataFrame in same way you’d do it with a DataFrame. For instance, here is how you can make a map of mainland China:

::

   countries[countries['SOVEREIGNT'].eq('China')].plot()

Look at a sample of the country attribute data using: ``countries.sample(5).T``.

The attribute table has many columns we don't need. Let's keep only what we want.

Before we start filtering, let's give ``countries`` a good index.

I like to keep my indices unique (avoids duplicates when joining) and meaningful: the index should make it easy (a) to understand what type of unit each row refers to, and (b) to recognize a unit by its index value (e.g., a name or widespread code).

.. caution::

   Like ``pandas``, ``geopandas`` allows non-unique indices in its ``GeoDataFrame``.

   **You** are responsible for making sure your index is unique (if you need it to be).

Looking at the sample, the column ``'SOVEREIGNT'`` seems to be a good candidate for the index. Does it only have unique values? Find out:

::

   countries['SOVEREIGNT'].duplicated().any()

-  You can get the same result by calling ``duplicated()`` on a ``DataFrame`` instead of a ``Series`` as long as you pass the name of a column (or a list of them):

   ::

      countries.duplicated('SOVEREIGNT').any()

Looks like ``'SOVEREIGNT'`` has almost 24% duplicates. Which ones?

::
   
   countries[countries.duplicated('SOVEREIGNT', keep=False)][['SOVEREIGNT', 'ADMIN']]

Notice Hong Kong, Guantanamo, Greenland, and Puerto Rico

Does ``'NAME'`` have duplicates?

It appears ``'NAME'`` is a suitable index.

Let's use it and sort it alphabetically:

::

   countries = countries.set_index('NAME').sort_index()

Let's keep only a few columns that we are interested in:

::

   COLS = [
       'SOVEREIGNT',
       'CONTINENT',
       'REGION_WB',
       'POP_EST',
       'ADM0_A3',
       'geometry',
   ]
   countries = countries[COLS]


Create a map of country-level population density
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Population density is a very skewed variable and a good training case for **color mapping**: assigning colors to values or classes of a value distribution (e.g., when making choropleth maps).


.. _lab1-3-area-computation:

Compute areas
-------------

To compute population density, we first need a column of each country's area.

We can calculate this area directly from the geo-located polygons. However, we need to choose a suitable projection (CRS) to do so.

What's the CRS of ``countries``? Find out: ``countries.crs``

``geopandas`` has several ways to define CRS (see `Managing Projections <http://geopandas.org/projections.html>`_).

-  The easiest is to refer to an ID from the `EPSG geodetic parameter dataset <https://epsg.io/>`_ (``'epsg:4326'``), ESRI, or other providers of CRS information.

-  Alternatively, projections can be defined with a :gui:`proj4` string. The syntax of these strings looks cryptic but follows a common standard. For instance, the proj4 string for EPSG:4326 is:

   :gui:`+proj=longlat +datum=WGS84 +no_defs`.

   If you want to find the :gui:`proj4` string for your projection, you can:

   -  Look up your projection on `spatialreference.org <http://www.spatialreference.org>`_ 

   -  Copy-and-paste it from the layer CRS window in :gui:`QGIS` (later in this lab).

Importantly for our task: EPSG:4326 is **not** an equal-area projection. We need to reproject the geometry into an equal-area projection to compute the areas.

There are several global equal-area projections. We will use EPSG:6933.

How does the world look in this projection? Find out:

::

   countries.to_crs('epsg:6933').plot(figsize=(7, 4))

The map looks like a typical equal-area map: countries near the poles look "squished", providing a more accurate measure of their actual surface area. Note the differences in the actual size of Africa vs. Antarctica.

.. note::

   We projected "on the fly" - we converted, then plotted, in a single cell.

   This means that we didn't change the CRS of the original ``GeoDataFrame`` object ``countries``. Instead, we obtained a separate, projected version of ``countries`` that we plotted as soon as we had it.

   Verify: ``countries.plot(figsize=(7, 4))`` should give you the same map we made earlier. Even quicker, check the CRS: ``countries.crs``

   If you wanted to keep the reprojected version, the command would be:

   ::

      countries = countries.to_crs('epsg:6933')

   Don't do this for now. We will keep ``countries`` in ``'epsg:4326'`` projection for our maps - most people are used to it. We only need ``'epsg:6933'`` for the purpose of computing areas.


.. caution::

   Avoid reprojecting your data back and forth between CRS. Besides requiring extra processing time, it can introduce new rounding errors that can mess up the topology of polygon layers.


``geopandas`` makes it easy to compute areas. Each GeoDataFrame has an ``.area`` attribute which returns a Series of the polygon areas.

Give it a try. What does ``countries.area`` return?

-  A ``Series`` of numbers, showing the polygon's area in :gui:`EPSG:4326` projection unit (degrees).

   This is not what we want. Areas computed in a lat/long CRS don't have an interpretable unit (degrees\ :sup:`2`) and aren't proportional to actual area.

If you reproject ``countries`` into an equal-area projection *before* getting the ``.area``, you get areas in a . :gui:`EPSG:6933` uses meters as units, so areas will be in m\ :sup:`2`.

::

   countries.to_crs('epsg:6933').area

Save this ``Series`` as a new column in your GeoDataFrame:

::

   countries['m2'] = countries.to_crs('epsg:6933').area


.. note::

   Every time you call ``to_crs``, ``geopandas`` runs a full reprojection of whatever ``GeoDataFrame`` you provide. This can be slow for large datasets. To be efficient, keep the use of ``to_crs`` to a minimum.


Compute population density
--------------------------

Compute population density as people per km\ :sup:`2`. Now that you
have computed the area column, this is a simple one-liner:

::

   countries['popdens'] = countries['POP_EST'].div(countries['m2']).mul(1e6)

Alternatively, you can write:

::

   countries['popdens'] = countries['POP_EST'] / countries['m2'] * 1000000


.. admonition:: Question

   Which ten countries have the highest population density?

::

   countries.sort_values('popdens', ascending=False).head(10)


Visualize population density as a choropleth map
------------------------------------------------

To create a choropleth map of any variable in a ``GeoDataFrame``, call ``.plot()`` with the name of the variable to be plotted as the first argument:

::

   countries.plot('popdens', figsize=(7, 4))

The resulting map doesn't look very informative, because the colors are the same for almost all countries. Why is that? Adding a legend will make it clear.

::

   countries.plot('popdens', figsize=(7, 4), legend=True)

By default, ``geopandas`` maps values to colors in a linear fashion in the choropleth maps. This is not always visually informative, especially if distributions are highly skewed. How does the distribution of the variable ``popdens`` look?

::

   countries['popdens'].hist(bins=100)

Let's plot the map with a quantile-based classification scheme:

::

   countries.plot('popdens', scheme='quantiles', figsize=(7, 4), legend=True)

If you want to change the number of classes, you can modify ``k`` (counts of classes):

::

   countries.plot(
       'popdens',
       scheme='quantiles',
       k=3,
       figsize=(7, 4),
       legend=True,
   )

``geopandas`` supports all classification schemes provided by `mapclassify <https://github.com/pysal/mapclassify>`_. This includes ``'natural_breaks'``, ``'fisher_jenks'``, and ``'user_defined'``.

-  Some need additional arguments. For instance, if you want to set the breaks manually (``'user_defined'``), you have to pass a dictionary that contains a list of the upper boundaries of each bin and pass that dictionary to ``classification_kwds``:

::

   countries.plot(
       'popdens',
       scheme='user_defined',
       classification_kwds={'bins': [1e1, 1e2, 1e3, 1e4, 1e5]},
       figsize=(7, 4),
       legend=True,
   )

A little long, but it gives us what we want.

You can change the color used for classification by changing the colormaps:

::

   countries.plot(
       'popdens',
       scheme='quantiles',
       cmap='YlOrRd',
       figsize=(7, 4),
       legend=True
   )

Check out the list of `standard colormaps <https://matplotlib.org/stable/tutorials/colors/colormaps.html>`_ in ``matplotlib``.

All standard colormaps can be reversed by adding the suffix ``'_r'``.

::

   countries.plot(
       'popdens',
       scheme='quantiles',
       cmap='RdYlGn_r',
       figsize=(7, 4),
       legend=True,
   )

Because you can use ``RdYlGn_r``, there is no ``GnYlRd`` colormap.

You can add an outline by setting ``edgecolor`` (and, optionally, ``linewidth``).

::

   countries.plot(
       'popdens',
       scheme='quantiles',
       cmap='YlOrRd',
       edgecolor='black',
       linewidth=0.25,
       figsize=(7, 4),
       legend=True,
   )

You can also plot categorical and string variables. Try:

::

   countries.plot('CONTINENT', figsize=(7, 4), legend=True)

.. tip::

   If you want to modify the look of the legend, you can pass a dictionary named ``legend_kwds``, and include any parameter that can be passed to `matplotlib.pyplot.legend <https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html>`_.

::

   countries.plot(
       'CONTINENT',
       figsize=(7, 4),
       legend=True,
       legend_kwds={
           'fontsize': 20,
           'markerscale': 2,
           'loc': 'lower left'
       },
   )

Because both ``pandas`` and ``geopandas`` use ``matplotlib`` for
plotting, many commands we have learned can be used on these maps as
well. For instance, you can:

-  Add a title:

   ::

      ax.set_title('Population Density', fontsize=25, pad=25)
 
-  Save the plot with ``plt.savefig()``:

-  Initialize a figure with ``plt.subplots()`` and then pass the ``Axes`` arguments to ``geopandas``' ``plot`` function to create plots with multiple panels:

   ::

       fig, axes = plt.subplots(1, 2)
       countries.plot('popdens', scheme='quantiles', ax=axes[0]) 
       countries.plot('CONTINENT', ax=axes[1]) fig.set_size_inches(7, 2.5)


Make a map of average threatened species richness by subdivision
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We will now revisit the species richness mapping at the subdivision level from Lab 1.1: :doc:`lab1-1`. This time, we'll use ``geopandas`` and ``zonal_stats`` instead of :gui:`QGIS`.

Prepare the subdivisions vector data
------------------------------------

Load the data (recall to use the right folder path for your system)

::

   subdivisions = gpd.read_file(PATH_NE_SUBDIVISIONS)

Plot it. Does this look right?

::

   subdivisions.plot(
       figsize=(7, 4),
       color='none',
       edgecolor='black',
       linewidth=0.25,
   )

Using what we learned earlier, find a column that would make a suitable index. Choose the best from the following three options: ``'iso_3166_2'``, ``'adm1_code'``, ``'name'.`` Once you identified the best column, use ``set_index()`` to use it as the index. Then use ``sort_index()`` to sort the ``GeoDataFrame`` by its new index.

Let's create a meaningful labeling column that contains the names of both the subdivisions and the country. ``pandas`` and ``geopandas`` make it easy, as you can add (concatenate) string columns elementwise with a simple ``+`` sign:

::

   subdivisions['label'] = subdivisions['name'] + ', ' + subdivisions['admin']

Keep only the most important columns: ``['label', 'type_en', 'adm0_a3', 'geometry']``

Compute a column named ``'km2'`` that contains the area of each country subdivision in square kilometers. Don't forget to reproject. Recall that the above CRS returns area in square meters.

.. admonition:: Questions

   -  Which is the smallest state/province? How large is it?

   -  Which is the largest state/province? How large is it?


Reproject the subdivisions layer to the raster CRS
--------------------------------------------------

Zonal statistics requires input layers to have the same CRS. Currently, the original ``subdivisions`` layer and the species raster layers have different projections.

.. tip::
   
   It is usually better to reproject the vector layer to the raster CRS rather than the other way around. This is because raster reprojections often lead to a more substantial loss of information: they have to re-do their grid, whereas vectors have ``float`` coordinates that stay precise.

Find the CRS of the species rasters. Here's how you can do this in QGIS:

-  Import one of the raster files (drag-and-drop).

-  Right-click on the raster layer > :gui:`Properties` > :gui:`Source`

-  Under :gui:`Set source coordinate reference system`, you see the name of the CRS. It has a unique identifier. Note that the identifier doesn't start with :gui:`EPSG:`...

-  Using this string identifier, you should be able to reproject ``subdivisions`` using ``to_crs()``.

Because there are an infinite number of potential CRS, but only a finite number of named ones, you can always resort to the :gui:`proj4` string to pass the parameters that define the projection. :gui:`proj4` is a widespread format for CRS information (see: `en.wikipedia.org/wiki/PROJ <http://en.wikipedia.org/wiki/PROJ>`_)

.. image:: img/lab1-3_select_crs_button.png
   :alt: DataFrame of subdivisions with new columns
   :align: right

-  To find the :gui:`proj4` string, return to your raster layer in QGIS > :gui:`Properties` > :gui:`Source`, then click on the button right to the projection name in :gui:`Assigned Coordinate Reference System`:

-  The :gui:`Coordinate Reference System Selector` will open, with the current CRS selected. In the bottom right corner, you'll find the proj4 string of the selected projection (below :gui:`Proj4:`). Copy the string, paste it back into the Jupyter cell, and don't forget to put it into quotation marks.

Reproject the ``GeoDataFrame`` to the *World Eckert IV* projection. Save it as a new Python object, as we want to keep the original version around:

::

   subdivisions_reproj = subdivisions.to_crs('your proj4 string goes here')


Use ``zonal_stats`` to compute species richness stats by subdivision
--------------------------------------------------------------------

``zonal_stats`` conducts the same operation as QGIS/SAGA's Zonal Statistics.

For each polygon in the input vector data (a ``GeoDataFrame`` or filepath):

-  It identifies all pixels in the raster that fall within the polygon's boundaries,

-  and then computes statistics on the pixels's values (e.g., the ``mean``, ``max``, ``min``, ``count``, ``sum``).

Run the following command:

::

   zs = zonal_stats(
       subdivisions_reproj,
       raster_path,
       stats=['mean', 'count'],
       all_touched=True,
       nodata=-1,
   )

Replace :code:`raster_path` with the path to the raster file you'd like to use.

-  We'll start with amphibians, so use the file :file:`amph_thr.tif`.

.. note:: Rasterization strategy

   By default, ``zonal_stats`` selects pixels **whose centroid falls inside the polygon**.

   Setting the argument ``all_touched=True`` will make the algorithm select all raster pixels that fall inside, **intersect**, or **touch** the polygon.

   This can be necessary whenever you work with a vector file that contains polygons which are small relative to the raster pixels (as is true here). If ``all_touched`` is set to ``False``, the algorithm only selects pixels whose center falls within the polygon, leaving some small polygons with no pixels (and no value).

   This illustration from the ``rasterstats`` documentation explains it best:

      `<https://pythonhosted.org/rasterstats/manual.html#rasterization-strategy>`_

Although this raster does not have a nodata value (a placeholder for empty values), we provide a ``nodata`` argument (``-1``, which does not exist in the data) to silence a warning that ``zonal_stats`` throws otherwise.

How does the result look? Have a look at ``zs``.

It is a list of dictionaries, with one list item for each polygon, in the same order of the ``GeoDataFrame`` you used (``subdivisions_reproj``).

We need to convert this list of dictionaries to a ``DataFrame`` and give it the same index that we are using for ``subdivisions_reproj``. Luckily, that's easy:

::

   subdivisions_zs = pd.DataFrame(zs, index=subdivisions_reproj.index)

Have a look at ``subdivisions_zs``. Is this a ``DataFrame`` with the right
index?

Now we need to save this column as a new column in ``subdivisions`` and give it a unique and meaningful name.

The simplest way to achieve this goal is this one-liner:

::

   subdivisions['amph_thr'] = subdivisions_zs['mean']

Alternatively, you can use the ``join`` command to join one ``DataFrame`` to another, based on their index (make sure the index is unique):

::

   subdivisions = subdivisions.join(subdivisions_zs['mean'])

If you use ``join``, you still have to rename the column after joining:

::

   subdivisions = subdivisions.rename(columns={'mean': 'amph_thr'})

Or you rename the ``Series`` before joining it:

::

   subdivisions = subdivisions.join(
       subdivisions_zs['mean'].rename('amph_thr')
   )

Make a choropleth map of the new column. Does this look right?

::

   subdivisions.plot(
       'amph_thr',
       scheme='quantiles',
       k=20,
       cmap='YlGn',
       edgecolor='black',
       linewidth=0.25,
       figsize=(7, 4),
   )


``mapclassify`` will warn you that it will use a smaller number of classes than you asked for (7 instead of 20). This is because the quantile classification algorithm realizes that the first 13 quantiles are all the same (0), and automatically drops those that are redundant. If you don't like that, try a different classification method (e.g., ``natural_breaks``).

**Great!**

Repeat the above steps for threatened birds and mammals.

Try to implement it in a ``for`` loop that starts like:

::

   for r_name in ['amph_thr', 'bird_thr', 'mamm_thr']:


When you are done, ``subdivisions.head()`` should give you a result that looks like this:

.. image:: img/lab1-3_subdivision_frame.png
   :alt: DataFrame of subdivisions with new columns


Compute and visualize average threatened species richness across all three groups:

This is now very simple:

::

   subdivisions['thr'] = subdivisions[['amph_thr', 'bird_thr', 'mamm_thr']].sum(1)

Recall that :code:`sum(1)` computes the sum for each row.

Plot the result. How does it look?

::

   subdivisions.plot(
       'thr',
       scheme='natural_breaks',
       cmap='YlGn',
       k=9,
       legend=True,
       edgecolor='black',
       linewidth=0.25,
       figsize=(7, 4),
   )

You can drop the outlines for subdivisions and add the outlines of the countries instead (with a transparent fill).

You can plot one ``GeoDataFrame`` on top of each other. As explained earlier, ``plot()`` returns the ``Axes`` object it plots on. You can save that argument and pass it on to the next ``plot()`` command:

::

   ax = subdivisions.plot(
       'thr',
       scheme='natural_breaks',
       cmap='YlGn',
       k=9,
       legend=True,
       edgecolor='black',
       linewidth=0.1,
       figsize=(7, 4),
   )

   countries.plot(ax=ax, color='none', edgecolor='black', linewidth=0.5)

Add a meaningful title and save a high-resolution version of this map (with legend) in your class folder under the following filename:


   |3_avg_thr_spec_richness|
      
      .. |3_avg_thr_spec_richness| raw:: html
      
         <span class="file">~/ee508/reports/figures/lab1/<span style="font-weight:bold">3_avg_thr_spec_richness.png</span></span>


Find the countries with the most biodiversity-rich subdivisions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. admonition:: Task

   Suppose each country were to submit its most species-rich subdivision to a global biodiversity richness contest of subdivisions. Which 10 countries would win?

   Find the ten countries with the highest average threatened species richness found amongst *any* of its subdivisions (note: this is not the country-wide mean).

Here are two possible solution paths:

-  **Option 1**: sort the subdivisions by descending species richness, remove all entries with duplicate country identifiers (keeping only the first), and pick the top 10. This gives you a subset of ``subdivisions`` with all of its fields. Because the field ``adm0_a3`` is now unique, you can use it as the index.

   ::

      # Sort by threat, most threatened first (on top)
      subdivisions_subset = subdivisions.sort_values('thr', ascending=False)

      # Drop duplicate county IDs (keep the first)
      subdivisions_subset = subdivisions_subset[~subdivisions_subset.duplicated('adm0_a3')]

      # Keep the top 10
      top10 = subdivisions_subset.head(10).set_index('adm0_a3')[['thr']]

   .. important::

      The tilde sign (``~``) inverts a boolean Series (it sets all values of ``True`` to ``False`` and vice versa). So, the second line reads "give me all rows whose ``adm0_a3`` value is *not* a duplicate of any earlier row".

-  **Option 2**: you can use `groupby <https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html>`_, a powerful way to process records (rows) in a ``pandas`` ``DataFrame`` in groups. The following line groups all subdivision by country, then returns the highest (``max()``) value of threatened species richness found in the ``'thr'`` column of all subdivisions in that country, sorts the resulting Series from high to low, selects the top 10, and converts the ``Series`` into a ``DataFrame``.

   ::

      # Get the highest value of ``'thr'`` among subdivisions in a county
      thr_max = subdivisions.groupby('adm0_a3')['thr'].max()

      # Sort, select, and cast to DataFrame
      top10 = thr_max.sort_values(ascending=False).head(10).to_frame()

   -  To get a sense of what each piece of the above command is doing, run it in parts, and see what result each step returns. Note how the variable used to group the original ``DataFrame`` becomes the index of the Series ``thr_max``, and then of the resulting ``DataFrame``.

Finally, we need to turn the three-letter codes (``adm0_a3``) into the actual country names. Recall that the relationship between the three-letter codes and country names is saved in ``countries``. We can join the country name onto the states/counties using this three-letter code.

::

   code_to_name = countries.reset_index().set_index('ADM0_A3')['NAME']
   top10 = top10.join(code_to_name.rename('country'))


Save the result as a CSV file with ``to_csv()``.

::

   top10.to_csv(path)


Replace ``path`` with a string that saves the CSV file in your class folder in the following subfolder under the following filename:

   |3_top_10_countries|
      
      .. |3_top_10_countries| raw:: html
      
         <span class="file">~/ee508/reports/lab1/<span style="font-weight:bold">3_top_10_countries.csv</span></span>

Which are the countries that have subdivisions (at least one) with a globally significant high threatened species density? What do these countries have in common?

Finally, save the new average threatened species richness columns of ``subdivisions``, so you can use them in the next exercise. Consider two options:

-  **Option 1**: You can either use ``subdivisions.reset_index().to_file(path)`` (see above) to save the vector data to a location of your choice.

   .. admonition:: Note on vector formats.

      There are many file formats for vector data out there. ``geopandas`` uses the file extension to figure out which format you want to save in.

      If you pick the shapefile (:file:`.shp`) format to save your vector data, note that:

      -  ``geopandas`` doesn't save the index of the ``GeoDataFrame`` to the shapefile. If you want to keep the index (as here), you need to use ``reset_index()`` before saving the ``GeoDataFrame``.

      -  An important inconvenience with saving tabular data in the shapefile format is that you can only use column names with 10 characters, because the tabular data is saved in the :file:`.dbf` format. The remainder of the column names will be automatically truncated and made unique with numeric suffixes (``~1``, ``~2``, etc.)

      - You have to keep all companion files in the same folder as the :file:`.shp` file.

      If you want to skip these inconveniences, Geopackages (:file:`.gpkg`) are usually the better option. Sometimes, the consume more space. Sometimes, they're painfully slow to read and write.

      My favorite format for writing, reading, and storing vector data in 2025 is `Geoparquet <https://geoparquet.org/>`_: blazingly fast, space-efficient, and keeps your indices intact (so you don't have to set them after reading). It is handled a little differently: to save as ``GeoDataFrame`` in the Geoparquet format, you need to use ``gpd.to_parquet()``. To read it, ``gpd.read_parquet()``.

-  **Option 2**: Use ``subdivisions[your_list_of_column_names].to_csv(path)`` to save only the new columns to a CSV file of your choice. The index will be automatically saved as a separate column. As long as your output file keeps the unique index, you can join it back to the original shapefile the next time you want to use it.

   -  Benefit: this needs less space than the previous command (geometries are not saved) and it allows you to use column names of any length.

   -  Downside: you'll need one additional line of code next time you want to join the new columns to the vector data.


Wrap up
~~~~~~~

You :file:`reports` folder should now contain both deliverables:

   | :file:`~/ee508/reports/lab1/3_top_10_countries.csv`
   | :file:`~/ee508/reports/figures/lab1/3_avg_thr_spec_richness.png`

-  Copy the two files into the same folder and compress them into a :file:`.zip` archive.

-  Find the Assignment with the lab title on the `Blackboard course website <https://learn.bu.edu/ultra/courses/_264297_1/outline>`_:

-  Upload your :file:`.zip` archive.


.. admonition:: You're done!

   I hope you enjoyed your first steps in ``geopandas``.

   Bring your questions to class!