Data analyses II - data visualization and analyses#

Before we get started …#

import warnings
warnings.filterwarnings("ignore")

What we will do in this session of the course is a short introduction to Python for data analyses, including basic data operations like file reading and wrangling, as well as statistics and data visualization. The goal is to showcase crucial tools/resources and their underlying working principles to allow further, more in-depth exploration and direct application.

It is divided into the following chapters:

  • Getting ready

  • Basic data operations

    • Reading data

    • Exploring data

    • Data wrangling

  • Basic data visualization

    • Underlying principles

    • “standard” plots

    • Going further with advanced plots

  • Statistics in python

    • Descriptive analyses

    • Inferential analyses

  • Interactive data visualization

Here’s what we will focus on in the second block:

  • Basic data visualization

    • Underlying principles

    • “standard” plots

    • Going further with advanced plots

  • Statistics in python

    • Descriptive analyses

    • Inferential analyses

  • Interactive data visualization

Recap - Getting ready#

What’s the first thing we have to check/evaluate before we start working with data, no matter if in Python or any other software? That’s right: getting everything ready!

This includes outlining the core workflow and its respective steps. Quite often (this notebook and its content included), this entails the following:

  1. What kind of data do I have and where is it?

  2. What is the goal of the data analyses?

  3. How will the respective steps be implemented?

So let’s check these aspects out in slightly more detail.

Recap - What kind of data do I have and where is it#

The first crucial step is to get a brief idea of the kind of data we have, where it is, etc., in order to outline the subsequent parts of the workflow (python modules to use, analyses to conduct, etc.). At this point it's important to note that Python and its modules work tremendously well for basically all kinds of data out there, no matter if behavior, neuroimaging, etc. To keep things rather simple, we will keep using the behavioral dataset from the prior section, which contains reaction times, accuracies and demographic information from a group of 30 participants (ah, the classics…).

We already worked with the dataset quite a bit during the last session, including:

  • reading data

  • extracting data of interest

  • converting it to different, more intelligible structures and forms

At the end, we had two kinds of DataFrames: two per participant (i.e. one per session) and one containing the data of all participants.

For this session, we will continue to explore aspects of data visualization and analyses via the DataFrame containing the data of all participants. Thus, let's move to our data storage and analyses directory and load it accordingly using pandas!

from os import chdir
chdir('/Users/peerherholz/Desktop/choice_rtt/derivatives/concatenation/')
from os import listdir
listdir('.')
['group_task-choiceRTT_beh.tsv',
 'pairplot.html',
 'boxplot_data_points.html',
 'heatmap_cor.html']
import pandas as pd

df_all_part = pd.read_csv('group_task-choiceRTT_beh.tsv', sep='\t')

df_all_part.head(n=20)
participant_id age left-handed Do you like this session? session stim_file response response_time trial_type trial
0 1 24 False Yes post ../../stimuli/shapes/target_plus.jpg 1 0.513755 practice shapes
1 1 24 False Yes post ../../stimuli/shapes/target_cross.jpg 0 0.639930 practice shapes
2 1 24 False Yes post ../../stimuli/shapes/target_square.jpg 1 0.613897 practice shapes
3 1 24 False Yes post ../../stimuli/shapes/target_plus.jpg 1 0.996120 practice shapes
4 1 24 False Yes post ../../stimuli/shapes/target_square.jpg 1 0.423148 practice shapes
5 1 24 False Yes post ../../stimuli/shapes/target_square.jpg 1 0.312653 practice shapes
6 1 24 False Yes post ../../stimuli/shapes/target_cross.jpg 1 0.425176 experiment shapes
7 1 24 False Yes post ../../stimuli/shapes/target_plus.jpg 1 0.556528 experiment shapes
8 1 24 False Yes post ../../stimuli/shapes/target_plus.jpg 0 0.820919 experiment shapes
9 1 24 False Yes post ../../stimuli/shapes/target_cross.jpg 1 0.804658 experiment shapes
10 1 24 False Yes post ../../stimuli/shapes/target_cross.jpg 1 0.515643 experiment shapes
11 1 24 False Yes post ../../stimuli/shapes/target_plus.jpg 1 0.679778 experiment shapes
12 1 24 False Yes post ../../stimuli/shapes/target_plus.jpg 1 0.656170 experiment shapes
13 1 24 False Yes post ../../stimuli/shapes/target_cross.jpg 1 0.745433 experiment shapes
14 1 24 False Yes post ../../stimuli/shapes/target_square.jpg 1 0.475323 experiment shapes
15 1 24 False Yes post ../../stimuli/shapes/target_square.jpg 0 0.712910 experiment shapes
16 1 24 False Yes post ../../stimuli/shapes/target_cross.jpg 0 0.985225 experiment shapes
17 1 24 False Yes post ../../stimuli/shapes/target_square.jpg 1 0.640720 experiment shapes
18 1 24 False Yes post ../../stimuli/images/target_wombat.jpg 1 0.461130 practice images
19 1 24 False Yes post ../../stimuli/images/target_capybara.jpg 1 0.649435 practice images

For this section, we want to focus on the experiment trials and will thus use a respective sub-DataFrame. We will then briefly summarize the data as a function of trial again:

df_all_part = df_all_part[df_all_part['trial_type']=='experiment']

for index, df in df_all_part.groupby('trial'):
    print('Showing information for subdataframe: %s' %index)
    print(df['response_time'].describe())
Showing information for subdataframe: images
count    720.000000
mean       0.884940
std        0.319951
min        0.300281
25%        0.616388
50%        0.897373
75%        1.106691
max        1.498425
Name: response_time, dtype: float64
Showing information for subdataframe: shapes
count    720.000000
mean       0.895266
std        0.325224
min        0.300439
25%        0.626882
50%        0.893830
75%        1.154183
max        1.498475
Name: response_time, dtype: float64
for index, df in df_all_part.groupby('trial'):
    print('Showing information for subdataframe: %s' %index)
    print(df['response'].describe())
Showing information for subdataframe: images
count    720.000000
mean       0.775000
std        0.417873
min        0.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: response, dtype: float64
Showing information for subdataframe: shapes
count    720.000000
mean       0.776389
std        0.416954
min        0.000000
25%        1.000000
50%        1.000000
75%        1.000000
max        1.000000
Name: response, dtype: float64
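
By the way, the loops above are mainly for readability; the same summaries can also be produced in a single call (a minimal, equivalent sketch):

# One-call equivalent of the two loops above: describe both columns per trial category
df_all_part.groupby('trial')[['response_time', 'response']].describe()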

Great! With these basics set, we can continue and start thinking about the potential goal of the analyses.

Recap - What is the goal of the data analyses#

There are obviously many different routes we could pursue when it comes to analyzing data. Ideally, we would know that before starting (pre-registration much?) but we all know how these things go… For this dataset we aimed at the following, with steps in parentheses indicating operations we already conducted:

  • (read in single participant data)

  • (explore single participant data)

  • (extract needed data from single participant data)

  • (convert extracted data to more intelligible form)

    • (repeat for all participant data)

    • (combine all participant data in one file)

  • (explore data from all participants)

    • (general overview)

    • basic plots

  • analyze data from all participants

    • descriptive stats

    • inferential stats

Nice, that’s a lot. The next step on our list would be data exploration by means of data visualization, which will also lead into data analyses.

Recap - How will the respective steps be implemented#

After creating some sort of outline/workflow, we thought about the respective steps in more detail and set overarching principles. Regarding the former, we also gathered a list of potentially useful python modules. Given the pointers above, this entailed the following:

  • numpy and pandas for data wrangling/exploration

  • matplotlib, seaborn and plotly for data visualization

  • pingouin and statsmodels for data analyses/stats

Regarding the latter, we went back to standards and principles concerning computational work:

  • use a dedicated computing environment

  • provide all steps and analyses in a reproducible form

  • nothing will be done manually, everything will be coded

  • provide as much documentation as possible

Important: these aspects should be followed no matter what you’re working on!

So, after “getting ready” and conducting the first set of processing steps, it’s time to continue via basic data visualization.

Basic data visualization#

Given that we already explored our data a bit, including basic descriptive statistics and data types, we will go one step further and continue this process via basic data visualization to get a different kind of overview, one that can indicate important aspects concerning the data analyses. As mentioned above, we will do so via the following steps, addressing different aspects of data visualization. Throughout each, we will get to know the respective python modules and functions.

  • Underlying principles

  • “standard” plots

  • Going further with advanced plots

Underlying principles#

When talking about visualization, one might want to differentiate between data exploration and data analyses, but in practice one can drastically influence the other. Here, we are going to look at both, that is, facilitating data understanding in many ways and creating high-quality results figures.

Unsurprisingly, python is nothing but fantastic when it comes to data visualization:

  • python provides a wide array of options

  • Low-level and high-level plotting APIs

  • static images vs. HTML output vs. interactive plots

  • domain-general and domain-specific packages

  • optimal visualization environment as it’s both efficient and flexible

    • produce off-the-shelf high-quality plots very quickly

    • with more effort, gives you full control over the plot

While python has a large amount of amazing modules targeting data visualization, we are going to utilize the three most common and general ones, as they provide the basis for everything else going further: matplotlib, seaborn and plotly.

The first two produce static images, while the last one produces HTML output and allows much more interactive plots. We will talk about each one as we go along.
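
If you want to follow along, a quick sanity check that all three are installed might help (a minimal sketch; the printed versions will of course depend on your environment):

# Check that the three plotting stacks used below are available
import matplotlib
import seaborn
import plotly

print(matplotlib.__version__, seaborn.__version__, plotly.__version__)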

matplotlib#

  • the most widely-used python plotting library

  • initially modeled on MATLAB’s plotting system

  • designed to provide complete control over a plot

matplotlib, as well as all the high-level APIs that build upon it, operates on a set of underlying principles and respective building blocks:

In the most basic sense matplotlib graphs your data on Figures (e.g., windows, Jupyter widgets, etc.), each of which can contain one or more Axes, an area where points can be specified in terms of x-y coordinates (or theta-r in a polar plot, x-y-z in a 3D plot, etc.).

  • figures

    • the entire graphic

    • keep track of everything therein (axes, titles, legends, etc.)

  • axes

    • usually contains two or three axis objects

    • includes title, x-label, y-label

  • axis

    • ticks and tick labels to provide scale for data

  • artist

    • everything visible on the figure: text, lines, patches, etc.

    • drawn to the canvas

A bit too “theoretical”, eh? Let’s dive in and create some plots!

But before we start, two important points to remember. First, when plotting in jupyter notebooks, make sure to run the %matplotlib inline magic before your first graphic; this results in the graphics being embedded in the jupyter notebook and not in the digital void. (NB: this is true for most, but not all, plotting modules/functions.)

%matplotlib inline

Second, when using matplotlib you can choose between explicitly creating Figures and Axes yourself, or using the plt interface, which automatically creates and manages them and adds the graphics. Quite often you might want to use the latter.

import matplotlib.pyplot as plt
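
To make that distinction concrete, here is a minimal sketch of both styles (the plotted values are arbitrary placeholders):

# Implicit pyplot interface: plt keeps track of the "current" figure and axes
plt.plot([1, 2, 3], [4, 5, 6])
plt.title('implicit interface')

# Explicit object-oriented interface: create and address Figure and Axes yourself
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])
ax.set_title('explicit interface')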

standard plots#

Obviously, matplotlib comes with support for all the “standard plots” out there: barplots, scatterplots, histograms, boxplots, errorbars, etc. For a great overview of what’s possible, make sure to check the gallery of the matplotlib documentation. For now, we are going to start simply…how about some univariate data visualization, e.g. a histogram?

For example, say we are interested in the distribution of age in our dataset. Using matplotlib, we need to create a figure and draw something inside. As our data is in long format, we initially have to extract a list containing the age of each participant only once, for example using a list comprehension.

plt.figure(figsize=(10, 5))
plt.hist([df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()])
(array([4., 1., 3., 2., 4., 3., 3., 1., 2., 7.]),
 array([18. , 20.2, 22.4, 24.6, 26.8, 29. , 31.2, 33.4, 35.6, 37.8, 40. ]),
 <BarContainer object of 10 artists>)
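
By the way, the list comprehension is not the only way to get there; here is a sketch of an equivalent, more pandas-native extraction that keeps each participant's first row:

# Equivalent extraction: one age value per participant via drop_duplicates
ages = df_all_part.drop_duplicates('participant_id')['age']

plt.figure(figsize=(10, 5))
plt.hist(ages)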

While the information we wanted is there, the plot itself looks kinda cold and misses a few pieces to make it intelligible, e.g. axis labels and a title. These can easily be added via matplotlib’s plt interface.

plt.figure(figsize=(10, 5))
plt.hist([df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()])
plt.xlabel('Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Distribution of age', fontsize=15);

We could also add a grid to make it easier to situate the given values:

plt.figure(figsize=(10, 5))
plt.hist([df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()])
plt.xlabel('Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Distribution of age', fontsize=15);
plt.grid(True)

Seeing this distribution of age, we could also have a look at how it might interact with responses, e.g. do younger participants exhibit different response patterns than older participants? Thus, we would create a bivariate visualization with linear data. As an example, let’s look at the mean accuracy of responses to shapes:

age_list = [df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()]
acc_means = [df_all_part[(df_all_part['participant_id']==part) & (df_all_part['trial']=="shapes")]['response'].to_numpy().mean() for part in df_all_part['participant_id'].unique()]

plt.figure(figsize=(10, 5))
plt.scatter(age_list, acc_means)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Accuracy for shapes', fontsize=12)
plt.title('Comparing accuracy and age', fontsize=15);

Sometimes, we might want to have different subplots within one main plot. matplotlib makes this straightforward via two options: creating the subplots one at a time via plt.subplot() and adding the respective graphics, or creating all subplots at once via plt.subplots() and adding the respective graphics via the returned axes. Let’s check the first option:

plt.subplot(1, 2, 1)
plt.hist([df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()])
plt.xlabel('Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Distribution of age', fontsize=15);
plt.grid(True)

plt.subplots_adjust(right=4.85)

plt.subplot(1, 2, 2)
plt.scatter(age_list, acc_means)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Accuracy for shapes', fontsize=12)
plt.title('Comparing accuracy and age', fontsize=15);

plt.show()

Hm, kinda ok but we would need to adapt the size and spacing. This is actually easier using the second option, subplots(), which is also recommended by the matplotlib community:

fig, axs = plt.subplots(1, 2, figsize=(20, 5))

axs[0].hist([df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()])
axs[0].set_xlabel('Age', fontsize=12)
axs[0].set_ylabel('Count', fontsize=12)
axs[0].set_title('Distribution of age', fontsize=15);
axs[0].grid(True)

axs[1].scatter(age_list, acc_means)
axs[1].set_xlabel('Age', fontsize=12)
axs[1].set_ylabel('Accuracy for shapes', fontsize=12)
axs[1].set_title('Comparing accuracy and age', fontsize=15);

As matplotlib provides access to all parts of a figure, we could furthermore adapt various aspects, e.g. the color and size of the drawn markers.

fig, axs = plt.subplots(1, 2, figsize=(20, 5))

axs[0].hist([df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()])
axs[0].set_xlabel('Age', fontsize=12)
axs[0].set_ylabel('Count', fontsize=12)
axs[0].set_title('Distribution of age', fontsize=15);
axs[0].grid(True)

axs[1].scatter(age_list, acc_means, c='black', s=80)
axs[1].set_xlabel('Age', fontsize=12)
axs[1].set_ylabel('Accuracy for shapes', fontsize=12)
axs[1].set_title('Comparing accuracy and age', fontsize=15);

This provides just a glimpse, but matplotlib is infinitely customizable; thus, as in most modern plotting environments, you can do virtually anything. The problem is: you just have to be willing to spend enough time on it. Luckily for us and everyone else, there are many modules/libraries that provide a high-level interface to matplotlib. However, before we check one of them out, we should quickly summarize the pros and cons of matplotlib.

Pros#
  • provides low-level control over virtually every element of a plot

  • completely object-oriented API; plot components can be easily modified

  • close integration with numpy

  • extremely active community

  • tons of functionality (figure compositing, layering, annotation, coordinate transformations, color mapping, etc.)

Cons#
  • steep learning curve

  • API is extremely unpredictable: redundancy and inconsistency are common

  • some simple things are hard; some complex things are easy

  • lacks systematicity/organizing syntax: every plot is its own little world

  • simple plots often require a lot of code

  • default styles are not optimal

High-level interfaces to matplotlib#

  • matplotlib is very powerful and very robust, but the API is hit-and-miss

  • many high-level interfaces to matplotlib have been written

  • abstract away many of the annoying details

  • best of both worlds: easy generation of plots, but retain matplotlib’s power

    • Seaborn

    • ggplot

    • pandas

    • etc.

  • many domain-specific visualization tools are built on matplotlib (e.g., nilearn and mne in neuroimaging)

Going further with advanced plots#

This also marks the transition to more “advanced plots” as the respective libraries allow you to create fantastic and complex plots with ease!

Seaborn#

Seaborn abstracts away many of the complexities and minutiae of matplotlib and provides a high-level API for creating aesthetic plots.

  • arguably the premier matplotlib interface for high-level plots

  • generates beautiful plots in very little code

  • beautiful styles and color palettes

  • wide range of supported plots

  • modest support for structured plotting (via grids)

  • exceptional documentation

  • generally, the best place to start when exploring and visualizing data

  • (can be quite slow (e.g., with permutation))

For example, recreating the plots from above is as easy as:

import seaborn as sns

sns.histplot([df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()])
plt.xlabel('Age')
plt.title('Distribution of age')
Text(0.5, 1.0, 'Distribution of age')
sns.scatterplot(x=age_list, y=acc_means)
plt.xlabel('Age')
plt.title('Comparing accuracy and age')
Text(0.5, 1.0, 'Comparing accuracy and age')

You might wonder: “well, that doesn’t look so different from the plots we created before and it’s also not way faster/easier”.

True that, but so far this was based on our data and the things we wanted to plot. Seaborn actually integrates fantastically with pandas dataframes and allows you to achieve amazing things rather easily.

Let’s go through some examples!

How about evaluating response_time as a function of age, separated by handedness? Sounds wild? Using seaborn’s pairplot, this is achieved with just one line of code:

sns.pairplot(df_all_part[['age', 'left-handed', 'response_time']], hue='left-handed')
<seaborn.axisgrid.PairGrid at 0x12c8dbf70>

Or how about response times to the animals as a function of age, separately for each animal? Same approach, but restricted to a subdataframe that only contains the trials of the animal category! However, before we create the plot, we will adapt the stim_file column from the image’s path to the animal’s name, for plotting purposes.

# Define a function to replace the stim_file based on the presence of specific animal names
def replace_stim_file_images(row):
    animals = ['capybara', 'wombat', 'platypus']  # List of animals to check for
    for animal in animals:
        if animal in row['stim_file']:
            return animal  # Replace with the animal name if found
    return row['stim_file']  # Return the original value if none of the animals are found

# Apply the function to each row
df_all_part['stim_file'] = df_all_part.apply(replace_stim_file_images, axis=1)

And do the same for the shapes to be consistent.

# Define a function to replace the stim_file based on the presence of specific shape names
def replace_stim_file_shapes(row):
    shapes = ['square', 'plus', 'cross']  # List of shapes to check for
    for shape in shapes:
        if shape in row['stim_file']:
            return shape  # Replace with the shape name if found
    return row['stim_file']  # Return the original value if none of the shapes are found

# Apply the function to each row
df_all_part['stim_file'] = df_all_part.apply(replace_stim_file_shapes, axis=1)
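
Both apply-based replacements could also be collapsed into one vectorized pass; here is a sketch using pandas' str.extract, assuming the six stimulus names listed in the regex are the only ones present:

# Vectorized alternative: pull the stimulus name out of the file path via regex;
# rows without a match keep their original value
pattern = r'(capybara|wombat|platypus|square|plus|cross)'
extracted = df_all_part['stim_file'].str.extract(pattern, expand=False)
df_all_part['stim_file'] = extracted.fillna(df_all_part['stim_file'])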

Now we’re ready for plotting!

sns.pairplot(df_all_part[df_all_part['trial']=='images'][['age', 'stim_file', 'response_time']], hue='stim_file')
<seaborn.axisgrid.PairGrid at 0x12eb96110>

Assuming we want to check the responses to the images further, we will create a respective subdataframe.

df_images = df_all_part[df_all_part['trial']=='images']
df_images.head(n=20)
participant_id age left-handed Do you like this session? session stim_file response response_time trial_type trial
24 1 24 False Yes post capybara 1 0.495692 experiment images
25 1 24 False Yes post platypus 1 0.918446 experiment images
26 1 24 False Yes post platypus 0 0.823403 experiment images
27 1 24 False Yes post platypus 1 0.967150 experiment images
28 1 24 False Yes post wombat 1 0.531525 experiment images
29 1 24 False Yes post capybara 1 0.686935 experiment images
30 1 24 False Yes post platypus 1 0.700605 experiment images
31 1 24 False Yes post capybara 1 0.986232 experiment images
32 1 24 False Yes post wombat 1 0.352742 experiment images
33 1 24 False Yes post wombat 1 0.513988 experiment images
34 1 24 False Yes post platypus 1 0.433638 experiment images
35 1 24 False Yes post wombat 1 0.487932 experiment images
60 1 24 False Yes test platypus 1 1.119574 experiment images
61 1 24 False Yes test capybara 1 0.952908 experiment images
62 1 24 False Yes test platypus 0 1.091557 experiment images
63 1 24 False Yes test platypus 1 1.418296 experiment images
64 1 24 False Yes test capybara 1 1.027042 experiment images
65 1 24 False Yes test platypus 1 0.885462 experiment images
66 1 24 False Yes test capybara 1 1.049408 experiment images
67 1 24 False Yes test wombat 0 1.434780 experiment images

We can now make use of the fantastic pandas-seaborn friendship. For example, let’s go back to scatterplots like the age/accuracy one we created before. How could we improve such a plot? Maybe by adding the distribution of each variable to it? That’s easily done via jointplot():

sns.jointplot(x='age', y='response_time', data=df_images, hue='session')
<seaborn.axisgrid.JointGrid at 0x10933c0a0>

Wouldn’t it be cool if we could also briefly explore whether there might be some statistically relevant effects going on here? Say no more, as we can add a regression to the plot by setting the kind argument to reg:

sns.jointplot(x='age', y='response_time', data=df_images, kind='reg')
<seaborn.axisgrid.JointGrid at 0x12ef98ee0>

Is there maybe a slight preference for one animal over another? We might want to take a closer look at the response_time for each animal. One possibility to do so could be a boxplot():

plt.figure(figsize=(8,6))
sns.boxplot(x='response_time', y='stim_file', data=df_images, hue='stim_file', palette='vlag', dodge=False)
plt.legend([],[], frameon=False)
<matplotlib.legend.Legend at 0x12f082e30>

However, we know that boxplots have their fair share of problems: given that they only show summary statistics, clusters and multimodalities remain hidden.

That’s actually one important aspect everyone should remember concerning data visualization, no matter if for exploration or analyses: show as much data and information as possible! With seaborn, we can easily address this by adding the individual data points to our plot via stripplot():

plt.figure(figsize=(8,6))

sns.boxplot(x='response_time', y='stim_file', data=df_images, hue='stim_file', palette='vlag', dodge=False)
sns.stripplot(x='response_time', y='stim_file', data=df_images, color='black')
plt.legend([],[], frameon=False)
<matplotlib.legend.Legend at 0x12c8d8dc0>

Ah yes, that’s better! Seeing the individual data points, we might want to check the respective distributions. Using seaborn’s violinplot(), this is done in no time.

plt.figure(figsize=(6,6))

sns.violinplot(data=df_images, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='vlag')
sns.despine(left=True)

As you might have seen, we also adapted the style of our plot a bit via sns.despine(), which removed the y-axis spine. This actually outlines another important point: seaborn is fantastic when it comes to customizing plots with little effort (i.e. getting rid of many lines of matplotlib code). This includes themes, contexts and colormaps, among other things. The subdataframe containing response_time for shapes might be a good candidate to explore these aspects.

df_shapes = df_all_part[df_all_part['trial']=='shapes']

Starting with themes, which set a variety of aesthetic factors, including background color, grids, etc., here are some rather different examples, beginning with the whitegrid style:

sns.set_style("whitegrid")
sns.violinplot(data=df_shapes, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='vlag')
sns.despine(left=True)

and the dark style:

sns.set_style("dark")
sns.violinplot(data=df_shapes, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='vlag')
sns.despine(left=True)

While this is already super cool, seaborn goes one step further and even lets you define the context for which your figure is intended, adapting it accordingly. For example, there’s a big difference if you want to include your figure in a poster:

sns.set_style('whitegrid')
sns.set_context("poster")
sns.violinplot(data=df_shapes, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='vlag')
sns.despine(left=True)

or a talk:

sns.set_context("talk")
sns.violinplot(data=df_shapes, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='vlag')
sns.despine(left=True)

or a paper:

sns.set_context("paper")
sns.violinplot(data=df_shapes, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='vlag')
sns.despine(left=True)

No matter the figure and context, another very crucial aspect everyone should always look out for is the colormap or color palette! Some of the most common ones are actually suboptimal in multiple regards; for one, they misrepresent the data.

It gets even worse: they don’t work for people with color vision deficiencies!

That’s obviously not ok and we/you need to address this! With seaborn, some of these important aspects are easily addressed, for example by setting the color palette to colorblind:

sns.set_context("notebook")
sns.violinplot(data=df_shapes, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='colorblind')
sns.despine(left=True)

or by using one of the suitable color palettes that also address the data representation problem, i.e. perceptually uniform color palettes:

sns.color_palette("flare", as_cmap=True)
flare
flare colormap
under
bad
over
sns.color_palette("crest", as_cmap=True)
crest
crest colormap
under
bad
over
sns.cubehelix_palette(as_cmap=True)
seaborn_cubehelix
seaborn_cubehelix colormap
under
bad
over
sns.cubehelix_palette(start=.5, rot=-.5, as_cmap=True)
seaborn_cubehelix
seaborn_cubehelix colormap
under
bad
over
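
Any of these palettes can be passed straight to the plots from above; for example, here is a sketch of the shapes violinplot with the perceptually uniform crest palette:

# Same violinplot as before, but with a perceptually uniform palette
sns.violinplot(data=df_shapes, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='crest')
sns.despine(left=True)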

Let’s see a few of those in action, for example within a heatmap that displays the correlations between the response_times for the different stimuli. For this, we need to reshape our data back to wide format, which is straightforward using pandas’ pivot_table function.

df_all_part_wide = df_all_part.pivot_table(index=['participant_id', 'session'], 
                         columns='stim_file', 
                         values='response_time', 
                         aggfunc='mean')  

df_all_part_wide = df_all_part_wide[['cross', 'plus', 'square', 'capybara', 'platypus', 'wombat']]
df_all_part_wide
stim_file cross plus square capybara platypus wombat
participant_id session
1 post 0.695227 0.678349 0.609651 0.722953 0.768648 0.471547
test 1.186744 1.123207 1.139514 0.957431 1.092030 1.344081
2 post 0.762689 0.619683 0.599985 0.673366 0.737820 0.587512
test 1.274003 1.100577 0.923209 1.124963 0.945678 1.254009
3 post 0.530374 0.521553 0.992972 0.652133 0.404850 0.716720
test 1.162258 1.051083 1.194107 0.959460 1.010056 1.061552
4 post 0.633965 0.757075 0.596069 0.578154 0.646218 0.662923
test 1.112594 1.113806 1.079497 1.048941 1.083535 1.247528
5 post 0.423638 0.649743 0.773591 0.674606 0.605312 0.618300
test 1.268744 1.115380 1.127952 1.249809 1.192149 1.131877
6 post 0.683104 0.625368 0.519447 0.633972 0.866682 0.734251
test 1.401519 1.227546 1.217396 1.078705 NaN 1.079305
7 post 0.634570 0.582086 0.737552 0.561176 0.648876 0.565289
test NaN 1.217030 1.160113 1.153621 1.091095 1.073692
8 post 0.735386 0.628580 0.544224 0.700269 NaN 0.553094
test 1.160200 0.991310 1.168924 1.106350 1.132035 1.142475
9 post 0.607519 0.755562 0.614723 0.771245 0.789756 0.657574
test 1.109647 1.197718 1.117669 1.251627 1.093382 1.057321
10 post 0.502849 0.686854 0.573754 0.621630 0.755898 0.590473
test 1.173531 1.121304 1.111777 1.055491 1.224844 1.201307
11 post 0.565770 0.796409 0.526394 0.662049 0.652107 0.359739
test 1.484344 1.101375 1.355650 1.177220 1.111482 1.154950
12 post 0.825229 0.469693 0.696174 0.586533 0.683366 0.709453
test 1.210542 1.353745 1.308511 1.068612 1.110585 1.346530
13 post 0.559779 0.705641 0.750242 0.632453 0.545109 0.643767
test 1.042256 1.174450 1.087483 1.261020 1.093822 1.178421
14 post 0.773779 0.695953 0.463407 0.736323 0.898039 0.618524
test 1.165927 0.999943 0.881409 0.962234 1.271395 1.126580
15 post 0.565924 0.748914 0.533429 0.657298 0.738657 0.618882
test 1.191236 1.256513 1.124252 1.064684 1.125852 1.157209
16 post 0.490668 0.629922 0.749302 0.609369 0.499990 0.540915
test 1.238226 1.097461 1.201405 1.009574 1.103565 1.120550
17 post 0.670037 0.806978 0.614988 0.535507 0.548741 0.765582
test 1.024884 0.862128 0.957512 1.112053 1.056240 1.260760
18 post 0.505096 0.500771 0.622511 0.731418 0.651869 0.550223
test 1.139586 1.203622 1.109744 0.940470 1.157193 1.080493
19 post 0.630026 0.570403 0.553740 0.670729 0.702371 0.539150
test 1.305638 1.197644 1.057489 1.068612 1.101864 1.059130
20 post 0.554464 0.518257 0.706713 0.674570 0.482590 0.630534
test 1.202076 1.084012 1.095869 1.086556 1.095437 1.197563
21 post 0.663010 0.674987 0.773452 0.651549 0.766257 0.589517
test 1.196278 1.307578 1.218819 1.160398 1.183626 0.818807
22 post 0.802519 0.522100 0.677609 0.607053 0.410310 0.653369
test 1.158462 1.082238 1.187047 1.128951 1.184014 NaN
23 post 0.449296 0.664222 0.672239 0.811811 0.687664 0.722097
test 1.234483 1.000650 1.247278 1.373307 0.971880 1.083474
24 post 0.727099 0.669498 0.544538 0.780169 0.539558 0.512699
test 1.146860 1.318709 1.013310 1.179416 1.116052 1.426554
25 post 0.667413 0.749908 0.674152 0.510016 0.624600 NaN
test 1.172720 1.189687 1.082873 1.142845 1.271394 1.181100
26 post 0.750412 0.672995 0.430227 0.855770 0.761978 0.597756
test 1.163184 1.116851 1.337332 0.907621 1.154835 0.949386
27 post 0.569827 0.647740 0.602968 0.666668 0.573412 0.675750
test 1.209065 1.085824 0.975633 1.254210 1.124254 0.977895
28 post 0.823674 0.549932 0.501510 0.616241 0.904126 0.566966
test 1.261421 0.984909 1.358329 1.026843 1.057895 1.168361
29 post 0.609549 0.651714 0.866044 0.385282 0.588151 0.628211
test 1.183397 1.097331 1.084794 1.112685 1.186779 1.221239
30 post 0.499469 0.769347 0.680312 0.449260 0.617595 0.606390
test 1.144440 1.246675 1.208878 1.333720 1.123456 1.048837

Then we can use another built-in method of pandas dataframes: .corr(), which computes the pairwise correlations between all columns:

plt.figure(figsize=(8,6))
sns.heatmap(df_all_part_wide.corr(), xticklabels=False, cmap='rocket')
<Axes: xlabel='stim_file', ylabel='stim_file'>

Nice, how does the crest palette look?

plt.figure(figsize=(10,7))
sns.heatmap(df_all_part_wide.corr(), xticklabels=False, cmap='crest')
<Axes: xlabel='stim_file', ylabel='stim_file'>

Also fantastic! However, it’s easy to get fooled by beautiful graphics, so maybe think about adding information to your plot whenever possible. For example, we could change the heatmap to a clustermap and adjust the colormap!

plt.figure(figsize=(6,4))
sns.clustermap(df_all_part_wide.corr(), xticklabels=False, cmap='mako', center=0)
<seaborn.matrix.ClusterGrid at 0x12f609600>
<Figure size 600x400 with 0 Axes>

However, to be on the safe side, please also check your graphics for the mentioned points, e.g. via tools like Color Oracle that let you simulate color vision deficiencies!

Make use of amazing resources like the python graph gallery, the data to viz project and the colormap decision tree in Crameri et al. 2020, which in combination allow you to find and use the best graphic and colormap for your data!

And NEVER USE JET!

While the things we briefly explored were already super cool and a lot, we cannot conclude the data visualization section without at least mentioning the up-and-coming next-level graphics: raincloud plots, as they combine various aspects of the things we’ve talked about! In python, they are available via the ptitprince library.

from ptitprince import PtitPrince as pt

f, ax = plt.subplots(figsize=(10, 7))

pt.RainCloud(data = df_images, x = "stim_file", y = "response_time", 
             ax = ax, orient='h', hue='session', dodge = True, alpha = .65, bw = 0.2, move = .2)
<Axes: xlabel='response_time', ylabel='stim_file'>
../../_images/71791b21393f9c7365b8681eb104997e8fb8221283536512c33e6a0e5d1146b9.png

You want to go even further? Web-based and interactive plots are a must? Say no more, python of course has that covered as well.

Great examples for libraries are plotly, bokeh and holoviews.

Let’s have a brief look at the first one, plotly.

  • a python visualization engine that outputs directly to the web

  • lets you generate interactive web-based visualizations in pure python (!)

  • you get interactivity for free, and can easily customize many different aspects

  • works seamlessly in Jupyter notebooks

How about we re-create some of the graphics from before but as interactive versions?

Let’s start with the heatmap that displays the correlations between the response_times of the different stimuli.

Depending on the graphic you want to build and its complexity, plotly offers different solutions. The easiest is to use plotly.express which lets you create a lot of different plots in a straightforward manner.

import plotly.express as px
fig = px.imshow(df_all_part_wide.corr(), x=df_all_part_wide.corr().columns, y=df_all_part_wide.corr().index)
fig.update_layout(width=500,height=500)

# run this when going through the notebook 
#fig.show()


### This part is required to include the graphics in the built Jupyter-Book
### It can be ignored during the session and should not be run
from plotly.offline import init_notebook_mode, plot
from IPython.display import display, HTML

init_notebook_mode(connected=True)

plot(fig, filename = 'heatmap_cor.html')

display(HTML('heatmap_cor.html'))

As with the other packages before, we can easily update the different graphic properties, including changing the colormap and axis titles.

fig = px.imshow(df_all_part_wide.corr(), x=df_all_part_wide.corr().columns, y=df_all_part_wide.corr().index, 
                color_continuous_scale=px.colors.sequential.Viridis,
                labels=dict(x="Stimulus", y="Stimulus"))
fig.update_layout(width=500,height=500)

# run this when going through the notebook 
#fig.show()

### This part is required to include the graphics in the built Jupyter-Book
### It can be ignored during the session and should not be run
from plotly.offline import init_notebook_mode, plot
from IPython.core.display import display, HTML

init_notebook_mode(connected=True)

plot(fig, filename = 'heatmap_cor_vir.html')

display(HTML('heatmap_cor_vir.html'))
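
To close the circle, the boxplot-with-data-points from the seaborn part could also be rebuilt interactively; here is a sketch via px.box (points='all' overlays every observation; presumably something like this produced the 'boxplot_data_points.html' we saw in the directory listing earlier):

# A sketch: interactive boxplot of response_time per stimulus, showing all points
fig = px.box(df_images, x='response_time', y='stim_file', color='stim_file', points='all',
             labels=dict(response_time='Response time (s)', stim_file='Stimulus'))
fig.update_layout(width=700, height=500, showlegend=False)

# run this when going through the notebook
# fig.show()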