.. _lab1-2:

Prior knowledge and interests of EE 508 students: pandas
========================================================

This exercise gets you started with the basic functionality of ``pandas``: importing data, analyzing it, visualizing it, computing new indicators, and plotting the results.

Before you start, do the introductory readings for ``pandas``:

-  `10 Minutes to pandas <https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html>`_

-  `Intro to data structures <https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html>`_


.. admonition:: Deliverables

   -  Three histograms of the distribution of the average prior knowledge of Fall 2025 students by category (coding, statistics, spatial data & processing).

   -  Two horizontal bar plots of the average level of software knowledge and the thematic interests of Fall 2025 students (one bar for each software / interest).


.. admonition:: Due date

   Tuesday, September 16, 2025, by 6:00 p.m.


Get the survey data
~~~~~~~~~~~~~~~~~~~

Download the survey data from the EE 508 drive:

-  `data/processed/lab1/part2 <https://drive.google.com/drive/folders/17xP8-Hm9z1ytVL4UR1T7u4mZP3wrXzon?usp=drive_link>`_

   -  :file:`ee508_survey_anonymized.csv` - the survey data

   -  :file:`ee508_survey_fields.csv` - shortcuts for field names


Create a new notebook
~~~~~~~~~~~~~~~~~~~~~

Activate your ``conda`` environment and start ``jupyter notebook``.

Open a new Jupyter notebook: :file:`~/ee508/notebooks/lab1/lab1-2.ipynb`

Load the required modules by running this code at the beginning of the notebook. The first line makes plots appear below the cells.

::

   %matplotlib inline
   import pandas as pd
   import matplotlib.pyplot as plt
   from pathlib import Path

But wait ... these aren't in the "right" order. 

Good that we have a code formatter installed. Right-click on the white area to the left of the cell > :gui:`Format cell`.

The cell input should change to:

::

   %matplotlib inline
   from pathlib import Path

   import matplotlib.pyplot as plt
   import pandas as pd

This regrouping follows the (very reasonable) rules of ``isort``: modules from the Python standard library first, then third-party packages, then your own modules, with each group sorted alphabetically.

.. important::

   Please use :gui:`Format cell` religiously across your Jupyter notebooks. This allows you to submit consistent-looking code for all of your assignments. It's like running spell check before you hand in a report.


Define your file paths
----------------------

Throughout EE 508, we will be using ``pathlib``'s ``Path`` object to write paths to directories and files. That way, our filepaths will work across operating systems (Mac, Windows, Linux).

Here's how I define my directories and filepaths for Lab 1.2:

::

   # EE 508 home directory
   DIR_EE508 = (
       Path('~').expanduser() / 'Dropbox' / 'Outputs' / 'Courses' / 'ee508'
   )

   # Directory for Lab 1.2
   DIR_LAB1_2 = DIR_EE508 / 'data' / 'processed' / 'lab1' / 'part2'

   # Path to survey data file
   PATH_SURVEY_ANONYMIZED = DIR_LAB1_2 / 'ee508_survey_anonymized.csv'

   # Path to survey fields file
   PATH_SURVEY_FIELDS = DIR_LAB1_2 / 'ee508_survey_fields.csv'

   # Directory for figures for Lab 1
   DIR_FIGURES_LAB1 = DIR_EE508 / 'reports' / 'figures' / 'lab1'

Adapt it to your directory structure. I'll use these constants later in the script.

.. note::

   I used UPPERCASE letters for all filepaths because they are **constants**.

   Following PEP8 convention:

   -  Please always use uppercase letters for script-level constants (variables that you define and don't change for the remainder of the script)

   -  Define all script-level constants at the **beginning** of your script, right after the ``import`` statements.
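
To get a feel for how ``Path`` objects behave, here is a minimal sketch. It writes into a temporary directory (an assumption made only so the example runs on any machine), and the names ``DIR_DEMO``, ``DIR_FIGURES_DEMO``, and ``PATH_DEMO`` are hypothetical.

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for your EE 508 home directory (assumption)
DIR_DEMO = Path(tempfile.mkdtemp())

# The / operator joins path components with the right separator for your OS
DIR_FIGURES_DEMO = DIR_DEMO / 'reports' / 'figures' / 'lab1'

# Create the directory, including missing parents; no error if it exists
DIR_FIGURES_DEMO.mkdir(parents=True, exist_ok=True)

PATH_DEMO = DIR_FIGURES_DEMO / 'example.png'
print(PATH_DEMO.name)         # example.png
print(PATH_DEMO.suffix)       # .png
print(PATH_DEMO.parent.name)  # lab1
print(PATH_DEMO.exists())     # False: the file has not been written yet
```

Calling ``mkdir(parents=True, exist_ok=True)`` once at the top of a notebook is a convenient way to make sure an output folder exists before you save anything into it.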


Import the survey data
~~~~~~~~~~~~~~~~~~~~~~

Load the survey data into a ``DataFrame`` with ``pandas``. Use ``pd.read_csv`` to load the data and assign it to a new Python object that you name ``d`` (shorthand for ``data``).

   -  If you forgot how to call ``pd.read_csv``, find it again in `10 Minutes to pandas <https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html>`_.

   -  Or ask the function for its help:

      ::
      
         help(pd.read_csv)

      You can also use this ``IPython`` shortcut:

      ``?pd.read_csv``

      You get the same content, different presentation.


.. tip::
   You can expand your Python skills rapidly by calling ``help()`` on every function you use.

Because ``read_csv`` has so many options, its help is a little unwieldy. Luckily, it comes with easy examples at the bottom. ``pandas`` is good about that.


Assign an index
~~~~~~~~~~~~~~~

Check out what columns exist: ``d.columns``. They have short names.

Use the (anonymized) student names as the index (row names):

::

   d = d.set_index('name')

.. note::

   Setting a column as the index with ``set_index()`` will remove the column from the list of columns. Therefore, you can run this exact command only once after importing the file. If you run it twice without re-importing, you get an error.

   If you wish to keep the column ``'name'``, you can set the index with:

   ::

      d.index = d['name']

.. caution::

   ``pandas`` allows you to use non-unique indices (i.e., indices with duplicates). This can lead to unexpected results, e.g., the creation of duplicates when joining data. If you want to ensure your index is unique, you can double-check it as follows:

   ::

      d.index.duplicated().any()
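
The index behavior described in the two boxes above can be tried out on a small made-up table. This is only a sketch: the names and scores are invented, and ``demo`` is a hypothetical stand-in for ``d``.

```python
import pandas as pd

# Hypothetical mini-survey; names and scores are made up for illustration
demo = pd.DataFrame({
    'name': ['ada', 'ben', 'cy', 'ben'],
    'sw_r': [1, 3, 5, 2],
})

# drop=False keeps 'name' as a regular column in addition to the index
demo = demo.set_index('name', drop=False)
print(demo.columns.tolist())

# True means that at least one index value occurs more than once
print(demo.index.duplicated().any())

# is_unique answers the opposite question: are all index values distinct?
print(demo.index.is_unique)

# keep=False marks every occurrence of a duplicated value, not only the repeats
print(demo[demo.index.duplicated(keep=False)])
```

Passing ``drop=False`` to ``set_index()`` is another way to keep the ``'name'`` column, besides assigning to ``d.index`` directly.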

-  The questions associated with each column are in :file:`ee508_survey_fields.csv`.

-  Load this tabular data into a variable you name ``fields``.

-  Display ``fields`` and compare the list of ``'new'`` column names to the list of columns in ``d``.


Explore the survey data
~~~~~~~~~~~~~~~~~~~~~~~

``d.head()`` shows you the first 5 rows. It's a good way to get a first impression of what data is in a ``DataFrame``.

.. tip::

   Notice how the ``pandas`` ``DataFrame`` doesn't display all columns by default.

   The middle ones are omitted (:gui:`...`). This avoids the accidental display of very large DataFrames, which can slow down your browser.

   Run ``pd.options.display.max_columns`` to see how many columns are allowed.

   If you want to see more, you can change this value:

   ::

      pd.options.display.max_columns = 100

   For more info, consult the full list of `pandas options <https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html#available-options>`_.

   During exploratory analysis, I often find myself changing ``display.max_rows``, ``display.min_rows``, and ``display.max_colwidth``.

If the top rows are not representative, ``d.sample(5)`` gives you five random rows. ``d.T`` will transpose a DataFrame. Whenever I open a new dataset, I like to get a quick first sense of columns, values, and completeness by running ``d.sample(5).T``.

.. admonition:: Question

   How have students rated their prior experience with spatial data
   processing in R?

``fields`` tells us that the corresponding column has the name ``'sw_r_spatial'``.

-  Run ``d['sw_r_spatial']``. What does it do?

-  Run ``d[['sw_r_spatial']]``. What does it do?

-  To understand the difference between the two commands, enclose both of the above statements with ``type()`` and run them again. In what way are the outputs different?

-  If you forgot why, revisit the `Selection <https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#selection>`_ section in `10 minutes to pandas <https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html>`_.

Run ``d['sw_r_spatial'].ge(3)``. What does this function do?

-  Explore the list of available `binary operator functions <https://pandas.pydata.org/docs/reference/frame.html#binary-operator-functions>`_.

-  What do you get if you run ``d['sw_r_spatial'] >= 3``?

-  What do you get if you run ``d[['sw_r_spatial']].ge(3)``? What explains the difference?

-  Run ``d['sw_r_spatial'].ge(3).sum()``. What does this
   number represent?

-  Explore the list of available `descriptive statistics <https://pandas.pydata.org/docs/user_guide/basics.html#descriptive-statistics>`_.

.. admonition:: Question

   Can you correct this code, so it produces the same output as ``d['sw_r_spatial'].ge(3).sum()``?
   ::

      d['sw_r_spatial']>=3.sum()


Run ``d['sw_r_spatial'].value_counts()``. What does this function do?

How many students are in the dataset? ``len(d)``

-  This is more students than are in the class. Why?
   ``d['class'].value_counts()``

-  The dataset contains students from previous years. Let's filter them
   out:

   ::

       d = d[d['class'].eq('2025_fall')]
       len(d)

   Note the three instructions in the first line of code and their execution order:
   
   -  We create a boolean Series that is ``True`` for all rows we want to keep (those with the string ``'2025_fall'`` in the ``'class'`` column) and ``False`` for all others.

   -  We index (using square brackets) the DataFrame ``d`` with the boolean Series, which returns the filtered DataFrame.

   -  We assign the returned filtered DataFrame to the (same) variable name ``d``. Because of this last step, ``d`` now contains only current students.

   This code gives you the same result:

   ::

      d = d[d['class'] == '2025_fall']
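
The three steps can also be written out one by one. The sketch below uses a tiny made-up table (``demo`` and its values are hypothetical) and stores the result under a new name so the unfiltered table stays available.

```python
import pandas as pd

# Hypothetical responses; the 'class' labels follow the pattern used in this lab
demo = pd.DataFrame(
    {'class': ['2024_fall', '2025_fall', '2025_fall'], 'sw_r': [2, 4, 1]},
    index=['s1', 's2', 's3'],
)

# Step 1: boolean Series, True for the rows to keep
keep = demo['class'].eq('2025_fall')

# Step 2: index the DataFrame with the boolean Series
filtered = demo[keep]

# Step 3: assign the result (here to a new name, so demo stays intact)
demo_current = filtered

print(keep.tolist())                 # [False, True, True]
print(len(demo), len(demo_current))  # 3 2
```

Keeping the unfiltered table under its own name can be handy when you need all cohorts again later.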

.. important::

   From here onward, this lab continues with the data for **current** students (make sure to keep the filtering step in your pipeline if you re-run the code).


.. admonition:: Review

   Review this section and commit to memory how ``pandas`` interprets indexing very differently depending on whether you pass:

   -  a string (``str``, returns a column as a ``Series``)
   -  a ``list`` of strings (returns a ``DataFrame`` with selected columns)
   -  a ``Series`` of ``bool`` (returns a selection / slice of rows).
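
A compact way to see the three cases side by side, using a made-up table (the column names and values in ``demo`` are hypothetical):

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({'sw_r': [1, 4, 3], 'sw_qgis': [2, 2, 5]})

col = demo['sw_r']               # str -> one column as a Series
sub = demo[['sw_r', 'sw_qgis']]  # list of str -> DataFrame with those columns
rows = demo[demo['sw_r'].ge(3)]  # boolean Series -> DataFrame with matching rows

print(type(col).__name__, type(sub).__name__, type(rows).__name__)
print(rows.index.tolist())       # [1, 2]
```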

Visualize the data
~~~~~~~~~~~~~~~~~~

Plot the histogram of the variable we just examined:

::

   d['sw_r_spatial'].hist()

Histograms of small counts of integer values don't look so great by default: bars can be to the right, middle, or left of their respective value.

A better alternative is a barplot of counts:

::

    dp = d['sw_r_spatial'].value_counts()
    dp.plot(kind='bar')


``value_counts()`` automatically sorts observed values by frequency, most frequent first. If you prefer the index values to be displayed in order, you can add ``.sort_index()`` to the Series returned by ``value_counts()``:

::

    dp = d['sw_r_spatial'].value_counts().sort_index()
    dp.plot(kind='bar')

What kind of object does the ``plot()`` command return? Find out
by reading it into a variable that you name ``ax`` (i.e., prefix
the last line of code with ``ax =``), then run:

::

   type(ax)

The object is an ``Axes`` object, defined by ``matplotlib`` (`<https://matplotlib.org/stable/api/axes_api.html>`_).

``matplotlib`` is the most widely used plotting package in Python. In ``matplotlib``, everything that is drawn on a figure is an ``Artist``; an ``Axes`` is the ``Artist`` that holds most elements of a plot (bars, lines, ticks, labels).

Each ``Axes`` object defines a subplot on a ``Figure``: a `cartesian coordinate system <https://en.wikipedia.org/wiki/Cartesian_coordinate_system>`_ (2D axes) on which we will plot all data we want to visualize in this class (graphs, vector maps, images).

Each ``Axes`` object belongs to a matplotlib ``Figure`` (customarily named ``fig``). Figures can have one or more ``Axes`` (subplots, customarily named ``ax``):

.. image:: img/lab1-2_figax.png
   :alt: Figure and Axes
   :align: right

Why is this important?

-  You need ``Axes`` objects to manipulate details of the graph you're working on (e.g., add layers or data, change labels, colors, grids, or ticks).

-  You need the ``Figure`` object to change attributes of the overall figure (e.g., size, layout, saving).

-  If you would like to have more control over your visuals, I recommend consulting the `matplotlib Quick start guide and Pyplot tutorials <https://matplotlib.org/stable/tutorials/introductory/quick_start.html>`_.

The ``plot()`` command we called on the ``Series`` ``dp`` is a shortcut provided by ``pandas`` for quick visualization of data via ``matplotlib``. If you want to get the ``Figure`` and ``Axes`` objects to tweak the figure, you have two options:

-  Initialize the matplotlib plot with ``plt.subplots()``.

   This function returns the ``Figure`` and ``Axes`` object(s) as a ``tuple``. You can save the ``ax`` object and pass it as an argument to ``pandas``' ``plot()`` function.

   ::

       fig, ax = plt.subplots()
       dp.plot(kind='bar', ax=ax)

-  Alternatively, you can save the ``Axes`` object returned by ``pandas``' ``plot()`` shortcut and get the figure from ``matplotlib`` with ``plt.gcf()`` ("get current figure").

   ::

       ax = dp.plot(kind='bar')
       fig = plt.gcf()

-  I recommend the first option if you plan to go beyond quick exploratory analysis.
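
To convince yourself that both routes lead to the same objects, you can compare them directly. This is a small sketch with made-up counts (``dp_demo`` is a hypothetical stand-in for ``dp``):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts per response value (made up for illustration)
dp_demo = pd.Series([4, 2, 6], index=[1, 2, 3])

# Option 1: create Figure and Axes first, then hand the Axes to pandas
fig, ax = plt.subplots()
returned_ax = dp_demo.plot(kind='bar', ax=ax)

# pandas draws on the Axes we passed in and returns that same object
print(returned_ax is ax)

# Every Axes knows its Figure, so ax.figure is another way to reach it
print(ax.figure is fig)

# One bar (a matplotlib patch) was drawn per value in the Series
print(len(ax.patches))
```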

You can also pass the number of rows and columns as the first two arguments to create layouts with multiple plots. In this case, ``plt.subplots`` returns not a single ``Axes`` object, but an ``array`` (rows and/or columns) of several ``Axes`` objects. In that setup, the individual subplots can be accessed with their indices in the array:

::

    fig, axes = plt.subplots(1, 2)
    dp.plot(kind='bar', ax=axes[0])
    dp.plot(kind='bar', ax=axes[1], color='crimson')
    fig.set_size_inches(7, 3)

.. important::

   When you configure Jupyter such that it visualizes ``matplotlib`` plots below the cell (``%matplotlib inline``), ``matplotlib`` discards the figure after plotting it. This means that you need to keep all commands for a plot in the same notebook cell.

Tweak the resulting plot by adding a title (in the same cell).
   
::

   ax.set_title('Spatial R', fontsize=20, pad=15)

-  The optional ``pad`` argument defines the distance between the plot and the title.

-  Another way of adding a title (without the ``ax`` object) is via the ``matplotlib.pyplot`` interface, which we imported at the beginning as ``plt``:

   ::

       plt.title('Spatial R', fontsize=20, pad=15)


.. admonition:: Question
 
   How many students said they have prior experience with R?


Find the right column name in ``fields``, then plot the values.

-  You will note that one possible value (``2``) does not exist in the data. If you want to show the full range of values (``1`` to ``5``) in the plot, you need to add the missing values to the Series created by ``value_counts()``.

   -  One way is to save the ``Series`` returned by ``value_counts()``, add a zero (no responses) to the index value ``2`` (which also means the value ``2`` is added to the index of the ``Series``), and sort the ``Series`` by its index before plotting the values:

      ::

          dp = d['sw_r'].value_counts()
          dp[2] = 0
          dp.sort_index().plot(kind='barh')
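
Another possible route, sketched here under the assumption that the response scale runs from 1 to 5, is to let ``reindex()`` add every missing value at once. The ``responses`` Series is made up for illustration.

```python
import pandas as pd

# Hypothetical responses in which nobody chose the value 2
responses = pd.Series([1, 3, 3, 5, 4, 1, 3])

# value_counts() lists only observed values; reindex() adds the missing
# ones from the full 1-5 scale and fills their counts with zero
counts = responses.value_counts().reindex(range(1, 6), fill_value=0)

print(counts.index.tolist())  # [1, 2, 3, 4, 5]
print(counts.tolist())        # [2, 0, 3, 1, 1]
```

Because ``reindex()`` returns the values in the order you ask for, no separate ``sort_index()`` call is needed in this variant.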


Create and visualize new indicators
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. admonition:: Question

   What is the distribution of the knowledge of all listed spatial data skills across all students?


For each student, compute the average number of points that they have given themselves across all spatial data processing skills (those beginning with ``'spat'``).

Generate a list of all these variables with a Python `list comprehension <https://www.w3schools.com/python/python_lists_comprehension.asp>`_.
   
-  List comprehensions are concise expressions to generate lists. A very similar syntax with curly braces generates dictionaries and sets.

   ::

       spat_cols = [v for v in d.columns if v.startswith('spat')]

   This is an elegant ("Pythonic") way of writing this four-line ``for`` loop:

      ::

          spat_cols = []
          for v in d.columns:
              if v.startswith('spat'):
                  spat_cols.append(v)  # alternative: spat_cols += [v]

   -  Note how ``d.columns`` is used as an iterable that yields the column names one by one as strings (``v``).

   -  ``startswith()`` is one of many useful Python string methods.
      
      -  Check out the Python documentation on `String Methods <https://docs.python.org/3/library/stdtypes.html#string-methods>`_, especially ``endswith()``, ``upper()``, ``lower()``, ``title()``, ``replace()``, ``strip()``, and ``zfill()``.

-  You can now subset the ``DataFrame`` with your new list:

   ::

       d[spat_cols]

-  Or you can do both in one line:

   ::

       d[[v for v in d.columns if v.startswith('spat')]]
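
Comprehensions and string methods need no ``pandas`` at all, so you can practice them on a plain list. The column names below are hypothetical; they only imitate the prefixes used in this lab.

```python
# Hypothetical column names following the prefixes used in this lab
columns = ['name', 'class', 'spat_gis', 'spat_raster', 'sw_r', 'sw_qgis']

# List comprehension with a condition
spat_demo = [v for v in columns if v.startswith('spat')]
print(spat_demo)  # ['spat_gis', 'spat_raster']

# Dictionary comprehension: same idea, but with key: value pairs
labels = {v: v.replace('sw_', '').upper() for v in columns if v.startswith('sw_')}
print(labels)  # {'sw_r': 'R', 'sw_qgis': 'QGIS'}

# A few of the string methods mentioned above
print('  fall '.strip().title())          # Fall
print('7'.zfill(3))                       # 007
print('lab1-2.ipynb'.endswith('.ipynb'))  # True
```

A dictionary like ``labels`` can later be useful for turning short column names into readable plot labels.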

Once you have subset the DataFrame, you can compute column or row means:

-  One mean per column (skill), averaged over all students: ``d[spat_cols].mean()``

-  One mean per row (student), averaged over all skills: ``d[spat_cols].mean(axis=1)``
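
The direction of the averaging is easy to mix up, so here is a small sketch with made-up scores (three hypothetical students, two hypothetical skills) where the two results have different lengths:

```python
import pandas as pd

# Hypothetical scores of three students for two skills (made-up values)
demo = pd.DataFrame(
    {'spat_gis': [1, 2, 3], 'spat_raster': [1, 3, 2]},
    index=['s1', 's2', 's3'],
)

# Default (axis=0): one mean per column, averaged over all students
print(demo.mean().tolist())        # [2.0, 2.0]

# axis=1: one mean per row, averaged over all skills of each student
print(demo.mean(axis=1).tolist())  # [1.0, 2.5, 2.5]
```

Checking the length or the index of the result is a quick way to confirm which direction you averaged in.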

Plot the means quickly by adding ``.plot(kind='bar')`` to the end.

The vertical bar plot has vertically oriented labels on the x axis. These can be difficult to read. If you have long labels, a horizontal bar plot is usually the better choice. That takes just one more letter: ``.plot(kind='barh')``

One inconvenience of the horizontal bar plot is that, by default, the first items are at the bottom of the graph. To change that, invert the Series you plot by using ``[::-1]``:

::

    d[spat_cols].mean()[::-1].plot(kind='barh')

-  Note: the odd-looking command ``[::-1]`` can also reverse any
   list or tuple.

You can add a number of arguments to the plot call (after ``kind='...'``, separated with commas) to make the horizontal bar plot look more the way you want:

-  ``figsize=(10, 7)`` will increase your plot size (width and height in inches)

-  ``fontsize=15`` will increase the label fontsize.

-  ``xlim=(1, 3)`` will re-scale the x-axis to the correct value range.

For all of these shortcut arguments provided by pandas, there are also corresponding commands that you can access via ``ax`` or ``fig``:

::

    fig.set_size_inches(10, 7)
    ax.tick_params(axis='both', labelsize=15)
    ax.set_xlim(1, 3)

-  Although the latter commands are usually a little longer, it's good to practice using them. These commands give you fine-grained control over any ``matplotlib`` plot, regardless of which package or shortcut was used to create that plot. For instance, ``hist()`` doesn't accept an ``xlim`` or ``title`` argument, but you might still be interested in changing these attributes after ``hist()`` has been called. Luckily, you can pass an ``ax`` argument from an initialized plot to ``hist()``:

   ::

       fig, ax = plt.subplots()
       d['sw_r_spatial'].hist(ax=ax)

-  ``hist()`` automatically plots a grid. If you don't like the grid, you can turn it off via the ``Axes`` object: ``ax.grid(False)``
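
Put together, a cell that adjusts a histogram through ``ax`` and ``fig`` could look like the sketch below. The values in ``avg_demo``, the number of bins, and the title are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-student averages on a 1-3 scale (made-up values)
avg_demo = pd.Series([1.0, 1.5, 2.0, 2.0, 2.5, 3.0])

fig, ax = plt.subplots()
avg_demo.hist(ax=ax, bins=4)

# Adjust the plot through the Axes and Figure objects after hist() has run
ax.set_xlim(1, 3)
ax.set_title('Hypothetical skill averages')
ax.grid(False)
fig.set_size_inches(4, 3)
```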


Save a figure
~~~~~~~~~~~~~

You have two options to save your figure:

-  The quick version: hold :gui:`Shift` and right-click on the image in Jupyter to save it as a file. (Holding :gui:`Shift` is necessary on Jupyter notebook 7 to access your browser's normal right-click popup menu). You obtain a low-resolution image that is fine for email, but not necessarily for publication.

-  For a high-resolution version, add these lines to the end of the same Jupyter cell in which you called the ``plot`` or ``hist`` function:

   ::

       PATH_FIGURE = DIR_FIGURES_LAB1 / '2_fig_example.png'

       plt.savefig(PATH_FIGURE, bbox_inches='tight', dpi=150)

   -  Depending on your ``matplotlib`` configuration, the saved figure may have a transparent background. If you prefer a solid white background, add the argument ``facecolor='white'``.
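
``savefig()`` does not create missing folders, so it can help to create the target directory first. The sketch below saves into a temporary folder (an assumption so that it runs anywhere); ``DIR_FIGURES_DEMO`` and ``PATH_FIGURE_DEMO`` are hypothetical names.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

# A temporary folder stands in for your figures directory (assumption)
DIR_FIGURES_DEMO = Path(tempfile.mkdtemp()) / 'reports' / 'figures' / 'lab1'

# Create the folder (and any missing parents) before saving
DIR_FIGURES_DEMO.mkdir(parents=True, exist_ok=True)
PATH_FIGURE_DEMO = DIR_FIGURES_DEMO / 'demo_figure.png'

fig, ax = plt.subplots()
ax.bar(['a', 'b'], [1, 2])

# fig.savefig() is the method counterpart of plt.savefig() for this figure
fig.savefig(PATH_FIGURE_DEMO, bbox_inches='tight', dpi=150, facecolor='white')
print(PATH_FIGURE_DEMO.exists())
```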

**Nice job!**

You just learned a number of skills that are useful to have in your digital toolset. Now you can apply what you learned to generate the deliverables:


Create the deliverables
~~~~~~~~~~~~~~~~~~~~~~~

-  A figure with three histograms (arranged horizontally) showing the distribution of *per-student average skill levels* within each category:

   .. image:: img/lab1-2_skills.png
      :alt: Schematic illustration of skills plot
      :align: right
      :width: 280

   -  Use ``plt.subplots()`` to initialize a figure with three ``Axes`` arranged horizontally.
   
   -  Make it 7" wide and 3" high.

   -  *For each student in the current class*, compute their **average score across all skills within each category**.

      -  Do this separately for the three categories: coding (``code_``), statistics (``stat_``), spatial data (``spat_``).

      -  This should give you **one average value per student per category**.  

   -  For each of the three categories, create a histogram (not ``barh``) that shows the **distribution of these per-student averages**. Plot each histogram on one of the ``Axes`` objects.

      -  For instance, if you saved the (``numpy``) array of the three ``Axes`` returned by ``plt.subplots()`` as ``axes``, you can select individual ``Axes`` to pass them to the plotting function (e.g., ``ax=axes[0]``).

   -  Make sure the x-axis extent covers the right range (1–3). Add an informative title to each histogram. Save a 150 dpi version of the figure at this filepath:

      :file:`~/ee508/reports/figures/lab1/2_ee508_skills_hist.png`

   -  **Important:** Here you are grouping by student (averaging within each student, then comparing between students).

-  A figure with two barplots showing the *class average level for each software skill and each interest*.

   .. image:: img/lab1-2_sw.png
      :alt: Schematic illustration of software & interests plot
      :align: right
      :width: 120

   -  Use ``plt.subplots()`` to initialize a figure with two ``Axes`` arranged vertically. Make it 4" wide and 6" high.

   -  For each software type (``sw_``), compute the **average score across all students** (i.e. one value per software type). Create a horizontal bar plot that shows these averages and plot it on the upper axes. Add a title, make it look nice, and choose the correct x-axis extent (1–5).

   -  Do the same for each self-reported interest (``int_``): compute the **average score across all students** (i.e. one number per interest). Plot these averages as a horizontal barplot on the lower axes.

   -  The title of the lower plot might overlap with the labels of the y-axis of the upper plot. The quickest way to resolve this issue is a matplotlib shortcut that automatically chooses a spacing without such overlaps:

      ::

          plt.tight_layout()

   -  Save a 150 dpi version of the figure at:

      :file:`~/ee508/reports/figures/lab1/2_ee508_sw_int.png`

   -  **Important:** Here you are grouping by skill/interest (averaging across students, then comparing between skills).

.. admonition:: Reflect

   Take some time to appreciate the results. How diverse are the prior skills of the students in this course? What are you, as a group, most interested in?


Independent coding challenge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. admonition:: Question

   Does this cohort look different than the previous ones?


Create a new version of :file:`~/ee508/reports/figures/lab1/2_ee508_skills_hist.png`, plotting the histograms of students that are **not** in the current cohort behind the histogram of students that are in the current cohort. (You will have to backtrack and re-load the survey responses to get all students back.)

You can plot two histograms (or any matplotlib layers) on top of each other by passing them the same ``ax`` object. For instance, if you have created a Series with values of current students (``d_current``) and a Series with the values of prior students (``d_prior``), you can plot their histograms on top of each other like this (``ax`` has to already exist).

::

    d_prior.hist(ax=ax, alpha=0.5, color='blue')
    d_current.hist(ax=ax, alpha=0.5, color='red')


.. tip::

   You will reuse this "layering" of plot elements: this is how we plot map elements on top of each other (e.g., a county boundary on top of a raster image).


-  ``alpha=0.5`` adds transparency, so the lower histogram stays
   visible.

-  Save a 150 dpi version at:

   :file:`~/ee508/reports/figures/lab1/2_ee508_skills_hist_prior.png`
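
Two layered histograms are easier to compare when both use the same bin edges. The sketch below is one possible setup, not the required solution: the two Series hold made-up values, and the choice of eight bins between 1 and 3 is an assumption.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical per-student averages for two cohorts (made-up values)
prior_demo = pd.Series([1.0, 1.2, 1.8, 2.0, 2.2, 2.4, 2.6, 3.0])
current_demo = pd.Series([1.4, 2.0, 2.5, 2.8])

# Shared bin edges: 8 equal-width bins between 1 and 3
bins = np.linspace(1, 3, 9)

fig, ax = plt.subplots()

# Layers are drawn in call order, so the first histogram ends up behind
prior_demo.hist(ax=ax, bins=bins, alpha=0.5, color='blue', label='prior')
current_demo.hist(ax=ax, bins=bins, alpha=0.5, color='red', label='current')
ax.legend()
```

The ``label`` arguments are passed through to ``matplotlib`` and are picked up by ``ax.legend()``, which tells the reader which color belongs to which cohort.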


Wrap up
~~~~~~~

Your folder :file:`~/ee508/reports/figures/lab1/` should now contain all deliverables:

   | :file:`2_ee508_skills_hist.png`
   | :file:`2_ee508_sw_int.png`
   | :file:`2_ee508_skills_hist_prior.png`

-  Compress the three files into a single :file:`.zip` archive.

-  Find the Google Assignment with the lab title on the `Blackboard course website <https://learn.bu.edu/ultra/courses/_264297_1/outline>`_.

-  Upload your :file:`.zip` archive.


.. admonition:: You're done!

   That's it.

   I hope you enjoyed your first steps in ``pandas``!

   Bring your questions and opinions to class!