Data analyses II - data visualization and analyses#
Before we get started …#
Most of what you’ll see within this lecture was prepared by Ross Markello, Michael Notter and Peer Herholz and further adapted for this course by Peer Herholz
based on Tal Yarkoni’s “Introduction to Python” lecture at Neurohackademy 2019
based on 10 minutes to pandas
import warnings
warnings.filterwarnings("ignore")
What we will do in this session of the course is a short introduction to Python for data analyses, including basic data operations like file reading and wrangling, as well as statistics and data visualization. The goal is to showcase crucial tools/resources and their underlying working principles to allow further, more in-depth exploration and direct application.
It is divided into the following chapters:
Getting ready
Basic data operations
Reading data
Exploring data
Data wrangling
Basic data visualization
Underlying principles
“standard” plots
Going further with advanced plots
Statistics in python
Descriptive analyses
Inferential analyses
Interactive data visualization
Here’s what we will focus on in the second block:
Basic data visualization
Underlying principles
“standard” plots
Going further with advanced plots
Statistics in python
Descriptive analyses
Inferential analyses
Interactive data visualization
Recap - Getting ready#
What’s the first thing we have to check/evaluate before we start working with data, no matter if in Python
or any other software? That’s right: getting everything ready!
This includes outlining the core workflow and its respective steps. Quite often, this notebook and its content included, this entails the following questions:
What kind of data do I have and where is it?
What is the goal of the data analyses?
How will the respective steps be implemented?
So let’s check these aspects out in slightly more detail.
Recap - What kind of data do I have and where is it#
The first crucial step is to get a brief idea of the kind of data we have, where it is, etc., to outline the subsequent parts of the workflow (python modules to use, analyses to conduct, etc.). At this point it’s important to note that Python and its modules work tremendously well for basically all kinds of data out there, no matter if behavior, neuroimaging, etc. To keep things rather simple, we will keep using the behavioral dataset from the prior section that contains reaction times, accuracies and demographic information from a group of 30 participants (ah, the classics…).
We already worked with the dataset quite a bit during the last session, including:
reading data
extracting data of interest
converting it to different, more intelligible structures and forms
At the end, we had two kinds of DataFrames: two per participant (i.e., one per session) and one containing the data of all participants.
Does anyone remember what formats datasets commonly have and how they differ?
Usually, you either deal with DataFrames in long or wide format. Which one you get and/or need depends on your data acquisition method and desired analysis. Luckily, pandas makes this conversion very easy, i.e., changing the shape of our DataFrame from a few rows that contain information across a wide range of columns (wide) to a few columns that contain information stacked across a long range of rows (long), and vice versa. This is achieved via the .melt() and .pivot() functions, as sketched below.
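As a quick refresher, here is a minimal sketch of both directions on a toy DataFrame (the column names and values are made up for illustration, not part of our dataset):

import pandas as pd
# toy wide-format data: one row per participant, one column per condition
df_wide = pd.DataFrame({'participant_id': [1, 2],
                        'shapes': [0.5, 0.6],
                        'images': [0.7, 0.8]})
# wide -> long: stack the condition columns into rows
df_long = df_wide.melt(id_vars='participant_id', var_name='trial', value_name='response_time')
# long -> wide: spread the rows back out into columns
df_back = df_long.pivot(index='participant_id', columns='trial', values='response_time')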
For this session, we will continue to explore aspects of data visualization and analyses via the DataFrame containing the data of all participants. Thus, let’s move to our data storage and analyses directory and load it accordingly using pandas!
from os import chdir
chdir('/Users/peerherholz/Desktop/choice_rtt/derivatives/concatenation/')
from os import listdir
listdir('.')
['group_task-choiceRTT_beh.tsv',
'pairplot.html',
'boxplot_data_points.html',
'heatmap_cor.html']
import pandas as pd
df_all_part = pd.read_csv('group_task-choiceRTT_beh.tsv', sep='\t')
df_all_part.head(n=20)
  | participant_id | age | left-handed | Do you like this session? | session | stim_file | response | response_time | trial_type | trial
---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_plus.jpg | 1 | 0.513755 | practice | shapes |
1 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_cross.jpg | 0 | 0.639930 | practice | shapes |
2 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_square.jpg | 1 | 0.613897 | practice | shapes |
3 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_plus.jpg | 1 | 0.996120 | practice | shapes |
4 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_square.jpg | 1 | 0.423148 | practice | shapes |
5 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_square.jpg | 1 | 0.312653 | practice | shapes |
6 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_cross.jpg | 1 | 0.425176 | experiment | shapes |
7 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_plus.jpg | 1 | 0.556528 | experiment | shapes |
8 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_plus.jpg | 0 | 0.820919 | experiment | shapes |
9 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_cross.jpg | 1 | 0.804658 | experiment | shapes |
10 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_cross.jpg | 1 | 0.515643 | experiment | shapes |
11 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_plus.jpg | 1 | 0.679778 | experiment | shapes |
12 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_plus.jpg | 1 | 0.656170 | experiment | shapes |
13 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_cross.jpg | 1 | 0.745433 | experiment | shapes |
14 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_square.jpg | 1 | 0.475323 | experiment | shapes |
15 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_square.jpg | 0 | 0.712910 | experiment | shapes |
16 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_cross.jpg | 0 | 0.985225 | experiment | shapes |
17 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_square.jpg | 1 | 0.640720 | experiment | shapes |
18 | 1 | 24 | False | Yes | post | ../../stimuli/shapes/target_wombat.jpg | 1 | 0.461130 | practice | images |
19 | 1 | 24 | False | Yes | post | ../../stimuli/images/target_capybara.jpg | 1 | 0.649435 | practice | images |
For this section, we want to focus on the experiment trials and will thus use a respective sub-DataFrame. We will then briefly summarize the data as a function of trial again:
df_all_part = df_all_part[df_all_part['trial_type']=='experiment']
for index, df in df_all_part.groupby('trial'):
    print('Showing information for subdataframe: %s' % index)
    print(df['response_time'].describe())
Showing information for subdataframe: images
count 720.000000
mean 0.884940
std 0.319951
min 0.300281
25% 0.616388
50% 0.897373
75% 1.106691
max 1.498425
Name: response_time, dtype: float64
Showing information for subdataframe: shapes
count 720.000000
mean 0.895266
std 0.325224
min 0.300439
25% 0.626882
50% 0.893830
75% 1.154183
max 1.498475
Name: response_time, dtype: float64
for index, df in df_all_part.groupby('trial'):
    print('Showing information for subdataframe: %s' % index)
    print(df['response'].describe())
Showing information for subdataframe: images
count 720.000000
mean 0.775000
std 0.417873
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 1.000000
Name: response, dtype: float64
Showing information for subdataframe: shapes
count 720.000000
mean 0.776389
std 0.416954
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 1.000000
Name: response, dtype: float64
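For a more compact overview, both summaries could also be computed in one go; a sketch using pandas’ .agg() (the chosen statistics are just examples):

# aggregate both columns per trial type in a single table
df_all_part.groupby('trial')[['response_time', 'response']].agg(['mean', 'std', 'min', 'max'])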
Great! With these basics set, we can continue and start thinking about the potential goal of the analyses.
Recap - What is the goal of the data analyses#
There are obviously many different routes we could pursue when it comes to analyzing data. Ideally, we would know them before starting (pre-registration much?) but we all know how these things go… For the dataset we aimed at the following, with steps in () indicating operations we already conducted:
(read in single participant data)
(explore single participant data)
(extract needed data from single participant data)
(convert extracted data to more intelligible form)
(repeat for all participant data)
(combine all participant data in one file)
(explore data from all participants)
(general overview)
basic plots
analyze data from all participants
descriptive stats
inferential stats
Nice, that’s a lot. The next step on our list would be data exploration by means of data visualization, which will also lead into data analyses.
Recap - How will the respective steps be implemented#
After creating some sort of outline/workflow, we thought about the respective steps in more detail and set overarching principles. Regarding the former, we also gathered a list of potentially useful python modules to use. Given the pointers above, this entailed the following:
numpy and pandas for data wrangling/exploration
matplotlib, seaborn and plotly for data visualization
pingouin and statsmodels for data analyses/stats
Regarding the latter, we went back to standards and principles concerning computational work:
use a dedicated computing environment
provide all steps and analyses in a reproducible form
nothing will be done manually, everything will be coded
provide as much documentation as possible
Important: these aspects should be followed no matter what you’re working on!
So, after “getting ready” and conducting the first set of processing steps, it’s time to continue with basic data visualization.
Basic data visualization#
Given that we already explored our data a bit, including basic descriptive statistics and data types, we will go one step further and continue this process via basic data visualization to get a different kind of overview that can potentially indicate important aspects concerning data analyses. As mentioned above, we will do so via the following steps, addressing different aspects of data visualization. Throughout each, we will get to know the respective python modules and functions.
Underlying principles
“standard” plots
Going further with advanced plots
Underlying principles#
When talking about visualization, one might want to differentiate data exploration and analyses, but one can actually drastically influence the other. Here, we are going to check out both, that is, facilitating data understanding in many ways and creating high-quality results figures.
Unsurprisingly, python is nothing but fantastic when it comes to data visualization:
provides a wide array of options
low-level and high-level plotting APIs
static images vs. HTML output vs. interactive plots
domain-general and domain-specific packages
optimal visualization environment as it’s both efficient and flexible
produce off-the-shelf high-quality plots very quickly
with more effort, gives you full control over the plot
While python has a large number of amazing modules targeting data visualization, we are going to utilize the three most common and general ones, as they provide the basis for everything else going further: matplotlib, seaborn and plotly. The first two produce static images, while the last one produces HTML output and allows much more interactive plots. We will talk about each one as we go along.
matplotlib#
the most widely-used python plotting library
initially modeled on MATLAB’s plotting system
designed to provide complete control over a plot
matplotlib and all other high-level APIs that build upon it operate on underlying principles and respective parts:

In the most basic sense, matplotlib graphs your data on Figures (e.g., windows, Jupyter widgets, etc.), each of which can contain one or more Axes, an area where points can be specified in terms of x-y coordinates (or theta-r in a polar plot, x-y-z in a 3D plot, etc.).
figure
the entire graphic
keeps track of everything therein (axes, titles, legends, etc.)
axes
usually contains two or three axis objects
includes title, x-label, y-label
axis
ticks and tick labels to provide scale for the data
artist
everything visible on the figure: text, lines, patches, etc.
drawn to the canvas
A bit too “theoretical”, eh? Let’s dive in and create some plots!
But before we start, one important point to remember: when plotting in jupyter notebooks, make sure to run the %matplotlib inline magic before your first graphic, which results in the graphics being embedded in the jupyter notebook and not in the digital void. (NB: this is true for most but not all plotting modules/functions.)
%matplotlib inline
When using matplotlib, you can choose between explicitly creating Figures and axes, or using the plt interface to automatically create and manage them, as well as adding graphics. Quite often you might want to use the latter. A minimal sketch contrasting the two approaches follows right after the import.
import matplotlib.pyplot as plt
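To make the distinction concrete, here is a small sketch of both approaches (using toy data, not part of our dataset):

# explicit (object-oriented) approach: create and address Figure and Axes yourself
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 2, 5])
ax.set_title('explicit interface')

# implicit approach: let plt manage the current Figure/Axes for you
plt.figure()
plt.plot([1, 2, 3], [4, 2, 5])
plt.title('implicit interface')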
standard plots#
Obviously, matplotlib comes with support for all the “standard plots” out there: barplots, scatterplots, histograms, boxplots, errorbars, etc. For a great overview of what’s possible, make sure to check the gallery of the matplotlib documentation. For now, we are going to start simply… how about some univariate data visualization, e.g. a histogram?
For example, say we are interested in the distribution of age in our dataset. Using matplotlib, we need to create a figure and draw something inside. As our data is in long format, we have to initially extract a list containing the age of each participant only once, for example using a list comprehension.
plt.figure(figsize=(10, 5))
plt.hist([df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()])
(array([4., 1., 3., 2., 4., 3., 3., 1., 2., 7.]),
array([18. , 20.2, 22.4, 24.6, 26.8, 29. , 31.2, 33.4, 35.6, 37.8, 40. ]),
<BarContainer object of 10 artists>)

While the information we wanted is there, the plot itself looks kinda cold and misses a few pieces to make it intelligible, e.g. axes labels and a title. These can easily be added via matplotlib’s plt interface.
plt.figure(figsize=(10, 5))
plt.hist([df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()])
plt.xlabel('Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Distribution of age', fontsize=15);

We could also add a grid to make it easier to situate the given values:
plt.figure(figsize=(10, 5))
plt.hist([df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()])
plt.xlabel('Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Distribution of age', fontsize=15);
plt.grid(True)

Seeing this distribution of age, we could also have a look at how it might interact with responses, e.g. do younger participants exhibit different response patterns than older participants? Thus, we would create a bivariate visualization with linear data. As an example, let’s look at the mean accuracy of responses to shapes:
age_list = [df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()]
acc_means = [df_all_part[(df_all_part['participant_id']==part) & (df_all_part['trial']=='shapes')]['response'].to_numpy().mean() for part in df_all_part['participant_id'].unique()]
plt.figure(figsize=(10, 5))
plt.scatter(age_list, acc_means)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Accuracy for shapes', fontsize=12)
plt.title('Comparing accuracy and age', fontsize=15);

Sometimes, we might want to have several subplots within one main plot. matplotlib makes this straightforward via two options: creating each subplot via plt.subplot() and adding the respective graphics, or creating multiple subplots via plt.subplots() and adding the respective graphics via the axes. Let’s check the first option:
plt.subplot(1, 2, 1)
plt.hist([df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()])
plt.xlabel('Age', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.title('Distribution of age', fontsize=15);
plt.grid(True)
plt.subplots_adjust(right=4.85)
plt.subplot(1, 2, 2)
plt.scatter(age_list, acc_means)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Accuracy for shapes', fontsize=12)
plt.title('Comparing accuracy and age', fontsize=15);
plt.show()

Hm, kinda ok, but we would need to adapt the size and spacing. This is actually easier using the second option, subplots(), which is also the one recommended by the matplotlib community:
fig, axs = plt.subplots(1, 2, figsize=(20, 5))
axs[0].hist([df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()])
axs[0].set_xlabel('Age', fontsize=12)
axs[0].set_ylabel('Count', fontsize=12)
axs[0].set_title('Distribution of age', fontsize=15);
axs[0].grid(True)
axs[1].scatter(age_list, acc_means)
axs[1].set_xlabel('Age', fontsize=12)
axs[1].set_ylabel('Accuracy for shapes', fontsize=12)
axs[1].set_title('Comparing accuracy and age', fontsize=15);

As matplotlib provides access to all parts of a figure, we could furthermore adapt various aspects, e.g. the color and size of the drawn markers.
fig, axs = plt.subplots(1, 2, figsize=(20, 5))
axs[0].hist([df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()])
axs[0].set_xlabel('Age', fontsize=12)
axs[0].set_ylabel('Count', fontsize=12)
axs[0].set_title('Distribution of age', fontsize=15);
axs[0].grid(True)
axs[1].scatter(age_list, acc_means, c='black', s=80)
axs[1].set_xlabel('Age', fontsize=12)
axs[1].set_ylabel('Accuracy for shapes', fontsize=12)
axs[1].set_title('Comparing accuracy and age', fontsize=15);

This provides just a glimpse, but matplotlib is infinitely customizable; thus, as in most modern plotting environments, you can do virtually anything. The problem is: you just have to be willing to spend enough time on it. Lucky for us and everyone else, there are many modules/libraries that provide a high-level interface to matplotlib. However, before we check one of them out, we should quickly summarize the pros and cons of matplotlib.
Pros#
provides low-level control over virtually every element of a plot
completely object-oriented API; plot components can be easily modified
close integration with numpy
extremely active community
tons of functionality (figure compositing, layering, annotation, coordinate transformations, color mapping, etc.)
Cons#
steep learning curve
API is extremely unpredictable: redundancy and inconsistency are common; some simple things are hard, some complex things are easy
lacks a systematic/organizing syntax: every plot is its own little world
simple plots often require a lot of code
default styles are not optimal
High-level interfaces to matplotlib#
matplotlib is very powerful and very robust, but the API is hit-and-miss
many high-level interfaces to matplotlib have been written
abstract away many of the annoying details
best of both worlds: easy generation of plots, but retain matplotlib’s power
Seaborn
ggplot
pandas (whose built-in plotting is sketched below)
etc.
many domain-specific visualization tools are built on matplotlib (e.g., nilearn and mne in neuroimaging)
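As a quick taste of such high-level interfaces, pandas itself ships a thin plotting wrapper around matplotlib; a one-line sketch using our existing DataFrame:

# pandas' .plot() accessor wraps matplotlib, e.g. a quick histogram of reaction times
df_all_part['response_time'].plot(kind='hist', bins=20, title='response_time')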
Going further with advanced plots#
This also marks the transition to more “advanced plots”, as the respective libraries allow you to create fantastic and complex plots with ease!
Seaborn#
Seaborn abstracts away many of the complexities of dealing with such minutiae and provides a high-level API for creating aesthetic plots.
arguably the premier matplotlib interface for high-level plots
generates beautiful plots in very little code
beautiful styles and color palettes
wide range of supported plots
modest support for structured plotting (via grids)
exceptional documentation
generally, the best place to start when exploring and visualizing data
(can be quite slow, e.g. with permutation)
For example, recreating the plots from above is as easy as:
import seaborn as sns
sns.histplot([df_all_part[df_all_part['participant_id']==part]['age'].to_numpy()[0] for part in df_all_part['participant_id'].unique()])
plt.xlabel('Age')
plt.title('Distribution of age')
Text(0.5, 1.0, 'Distribution of age')

sns.scatterplot(x=age_list, y=acc_means)
plt.xlabel('Age')
plt.title('Comparing accuracy and age')
Text(0.5, 1.0, 'Comparing accuracy and age')

You might wonder: “well, that doesn’t look so different from the plots we created before, and it’s also not way faster/easier”. True that, but so far this was based on our data and the things we wanted to plot. Seaborn actually integrates fantastically with pandas dataframes and allows you to achieve amazing things rather easily. Let’s go through some examples!
How about evaluating response_time as a function of age, separated by handedness? Sounds wild? Using seaborn’s pairplot, this is achieved with just one line of code:
sns.pairplot(df_all_part[['age', 'left-handed', 'response_time']], hue='left-handed')
<seaborn.axisgrid.PairGrid at 0x12c8dbf70>

Or how about response_times as a function of age, separately for each animal? Same approach, but restricted to a subdataframe that only contains the trials of the animal category! However, before we create the plot, we will adapt the stim_file column from the path to the image to the animal name for plotting reasons.
# Define a function to replace the stim_file based on the presence of specific animal names
def replace_stim_file_images(row):
    animals = ['capybara', 'wombat', 'platypus']  # List of animals to check for
    for animal in animals:
        if animal in row['stim_file']:
            return animal  # Replace with the animal name if found
    return row['stim_file']  # Return the original value if none of the animals are found

# Apply the function to each row
df_all_part['stim_file'] = df_all_part.apply(replace_stim_file_images, axis=1)
And do the same for the shapes, to be consistent.
# Define a function to replace the stim_file based on the presence of specific shape names
def replace_stim_file_shapes(row):
    shapes = ['square', 'plus', 'cross']  # List of shapes to check for
    for shape in shapes:
        if shape in row['stim_file']:
            return shape  # Replace with the shape name if found
    return row['stim_file']  # Return the original value if none of the shapes are found

# Apply the function to each row
df_all_part['stim_file'] = df_all_part.apply(replace_stim_file_shapes, axis=1)
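As an aside, both replacements could also be done in one vectorized step; a sketch, where str.extract() pulls out the first matching name and fillna() keeps unmatched values unchanged:

# one-step alternative: extract any of the six names from the file path
pattern = r'(capybara|wombat|platypus|square|plus|cross)'
extracted = df_all_part['stim_file'].str.extract(pattern, expand=False)
df_all_part['stim_file'] = extracted.fillna(df_all_part['stim_file'])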
Now we’re ready for plotting!
sns.pairplot(df_all_part[df_all_part['trial']=='images'][['age', 'stim_file', 'response_time']], hue='stim_file')
<seaborn.axisgrid.PairGrid at 0x12eb96110>

Assuming we want to check the responses to the images further, we will create a respective subdataframe.
df_images = df_all_part[df_all_part['trial']=='images']
df_images.head(n=20)
  | participant_id | age | left-handed | Do you like this session? | session | stim_file | response | response_time | trial_type | trial
---|---|---|---|---|---|---|---|---|---|---
24 | 1 | 24 | False | Yes | post | capybara | 1 | 0.495692 | experiment | images |
25 | 1 | 24 | False | Yes | post | platypus | 1 | 0.918446 | experiment | images |
26 | 1 | 24 | False | Yes | post | platypus | 0 | 0.823403 | experiment | images |
27 | 1 | 24 | False | Yes | post | platypus | 1 | 0.967150 | experiment | images |
28 | 1 | 24 | False | Yes | post | wombat | 1 | 0.531525 | experiment | images |
29 | 1 | 24 | False | Yes | post | capybara | 1 | 0.686935 | experiment | images |
30 | 1 | 24 | False | Yes | post | platypus | 1 | 0.700605 | experiment | images |
31 | 1 | 24 | False | Yes | post | capybara | 1 | 0.986232 | experiment | images |
32 | 1 | 24 | False | Yes | post | wombat | 1 | 0.352742 | experiment | images |
33 | 1 | 24 | False | Yes | post | wombat | 1 | 0.513988 | experiment | images |
34 | 1 | 24 | False | Yes | post | platypus | 1 | 0.433638 | experiment | images |
35 | 1 | 24 | False | Yes | post | wombat | 1 | 0.487932 | experiment | images |
60 | 1 | 24 | False | Yes | test | platypus | 1 | 1.119574 | experiment | images |
61 | 1 | 24 | False | Yes | test | capybara | 1 | 0.952908 | experiment | images |
62 | 1 | 24 | False | Yes | test | platypus | 0 | 1.091557 | experiment | images |
63 | 1 | 24 | False | Yes | test | platypus | 1 | 1.418296 | experiment | images |
64 | 1 | 24 | False | Yes | test | capybara | 1 | 1.027042 | experiment | images |
65 | 1 | 24 | False | Yes | test | platypus | 1 | 0.885462 | experiment | images |
66 | 1 | 24 | False | Yes | test | capybara | 1 | 1.049408 | experiment | images |
67 | 1 | 24 | False | Yes | test | wombat | 0 | 1.434780 | experiment | images |
And now we can make use of the fantastic pandas-seaborn friendship. For example, let’s go back to the scatterplot of age and response_times. How could we improve this plot? Maybe by adding the distribution of each variable to it? That’s easily done via jointplot():
sns.jointplot(x='age', y='response_time', data=df_images, hue='session')
<seaborn.axisgrid.JointGrid at 0x10933c0a0>

Wouldn’t it be cool if we could also briefly explore whether there might be some statistically relevant effects going on here? Say no more, as we can add a regression to the plot via setting the kind argument to 'reg':
sns.jointplot(x='age', y='response_time', data=df_images, kind='reg')
<seaborn.axisgrid.JointGrid at 0x12ef98ee0>

Is there maybe a slight preference for one animal over another? We might want to take a closer look at the response_time for each animal. One possibility to do so could be a boxplot():
plt.figure(figsize=(8,6))
sns.boxplot(x='response_time', y='stim_file', data=df_images, hue='stim_file', palette='vlag', dodge=False)
plt.legend([],[], frameon=False)
<matplotlib.legend.Legend at 0x12f082e30>

However, we know that boxplots have their fair share of problems… given that they show summary statistics, clusters and multimodalities are hidden.

That’s actually one important aspect everyone should remember concerning data visualization, no matter if for exploration or analyses: show as much data and information as possible! With seaborn we can easily address this by adding the individual data points to our plot via stripplot():
plt.figure(figsize=(8,6))
sns.boxplot(x='response_time', y='stim_file', data=df_images, hue='stim_file', palette='vlag', dodge=False)
sns.stripplot(x='response_time', y='stim_file', data=df_images, color='black')
plt.legend([],[], frameon=False)
<matplotlib.legend.Legend at 0x12c8d8dc0>

Ah yes, that’s better! Seeing the individual data points, we might want to check the respective distributions. Using seaborn’s violinplot(), this is done in no time.
plt.figure(figsize=(6,6))
sns.violinplot(data=df_images, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='vlag')
sns.despine(left=True)

As you might have seen, we also adapted the style of our plot a bit via sns.despine(), which removed the y axis spine. This actually outlines another important point: seaborn is fantastic when it comes to customizing plots with little effort (i.e. getting rid of many lines of matplotlib code). This includes themes, contexts and colormaps, among other things. The subdataframe including response_time for shapes might be a good candidate to explore these aspects.
df_shapes = df_all_part[df_all_part['trial']=='shapes']
Starting with themes, which set a variety of aesthetic factors, including background color, grids, etc., here are some very different examples, showcasing the whitegrid style
sns.set_style("whitegrid")
sns.violinplot(data=df_shapes, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='vlag')
sns.despine(left=True)

and the dark style:
sns.set_style("dark")
sns.violinplot(data=df_shapes, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='vlag')
sns.despine(left=True)

While this is already super cool, seaborn goes one step further and even lets you define the context for which your figure is intended, and adapts it accordingly. For example, there’s a big difference if you want to include your figure in a poster:
sns.set_style('whitegrid')
sns.set_context("poster")
sns.violinplot(data=df_shapes, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='vlag')
sns.despine(left=True)

or a talk:
sns.set_context("talk")
sns.violinplot(data=df_shapes, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='vlag')
sns.despine(left=True)

or a paper:
sns.set_context("paper")
sns.violinplot(data=df_shapes, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='vlag')
sns.despine(left=True)

No matter the figure and context, another very crucial aspect everyone should always look out for is the colormap or color palette! Some of the most common ones are actually suboptimal in multiple regards. This entails a misrepresentation of data:



It gets even worse: they don’t work for people with color vision deficiencies!

That’s obviously not ok and we/you need to address this! With seaborn, some of these important aspects are easily addressed, for example via setting the color palette to colorblind:
sns.set_context("notebook")
sns.violinplot(data=df_shapes, x="response_time", y="stim_file", inner="quart", linewidth=1, palette='colorblind')
sns.despine(left=True)

or using one of the suitable color palettes that also address the data representation problem, i.e. perceptually uniform color palettes:
sns.color_palette("flare", as_cmap=True)
sns.color_palette("crest", as_cmap=True)
sns.cubehelix_palette(as_cmap=True)
sns.cubehelix_palette(start=.5, rot=-.5, as_cmap=True)
Let’s see a few of those in action, for example within a heatmap that displays the correlations between the response_times for the different stimuli. For this, we need to reshape our data back to wide format, which is straightforward using pandas’ pivot_table function.
df_all_part_wide = df_all_part.pivot_table(index=['participant_id', 'session'],
columns='stim_file',
values='response_time',
aggfunc='mean')
df_all_part_wide = df_all_part_wide[['cross', 'plus', 'square', 'capybara', 'platypus', 'wombat']]
df_all_part_wide
participant_id | session | cross | plus | square | capybara | platypus | wombat
---|---|---|---|---|---|---|---
1 | post | 0.695227 | 0.678349 | 0.609651 | 0.722953 | 0.768648 | 0.471547 |
test | 1.186744 | 1.123207 | 1.139514 | 0.957431 | 1.092030 | 1.344081 | |
2 | post | 0.762689 | 0.619683 | 0.599985 | 0.673366 | 0.737820 | 0.587512 |
test | 1.274003 | 1.100577 | 0.923209 | 1.124963 | 0.945678 | 1.254009 | |
3 | post | 0.530374 | 0.521553 | 0.992972 | 0.652133 | 0.404850 | 0.716720 |
test | 1.162258 | 1.051083 | 1.194107 | 0.959460 | 1.010056 | 1.061552 | |
4 | post | 0.633965 | 0.757075 | 0.596069 | 0.578154 | 0.646218 | 0.662923 |
test | 1.112594 | 1.113806 | 1.079497 | 1.048941 | 1.083535 | 1.247528 | |
5 | post | 0.423638 | 0.649743 | 0.773591 | 0.674606 | 0.605312 | 0.618300 |
test | 1.268744 | 1.115380 | 1.127952 | 1.249809 | 1.192149 | 1.131877 | |
6 | post | 0.683104 | 0.625368 | 0.519447 | 0.633972 | 0.866682 | 0.734251 |
test | 1.401519 | 1.227546 | 1.217396 | 1.078705 | NaN | 1.079305 | |
7 | post | 0.634570 | 0.582086 | 0.737552 | 0.561176 | 0.648876 | 0.565289 |
test | NaN | 1.217030 | 1.160113 | 1.153621 | 1.091095 | 1.073692 | |
8 | post | 0.735386 | 0.628580 | 0.544224 | 0.700269 | NaN | 0.553094 |
test | 1.160200 | 0.991310 | 1.168924 | 1.106350 | 1.132035 | 1.142475 | |
9 | post | 0.607519 | 0.755562 | 0.614723 | 0.771245 | 0.789756 | 0.657574 |
test | 1.109647 | 1.197718 | 1.117669 | 1.251627 | 1.093382 | 1.057321 | |
10 | post | 0.502849 | 0.686854 | 0.573754 | 0.621630 | 0.755898 | 0.590473 |
test | 1.173531 | 1.121304 | 1.111777 | 1.055491 | 1.224844 | 1.201307 | |
11 | post | 0.565770 | 0.796409 | 0.526394 | 0.662049 | 0.652107 | 0.359739 |
test | 1.484344 | 1.101375 | 1.355650 | 1.177220 | 1.111482 | 1.154950 | |
12 | post | 0.825229 | 0.469693 | 0.696174 | 0.586533 | 0.683366 | 0.709453 |
test | 1.210542 | 1.353745 | 1.308511 | 1.068612 | 1.110585 | 1.346530 | |
13 | post | 0.559779 | 0.705641 | 0.750242 | 0.632453 | 0.545109 | 0.643767 |
test | 1.042256 | 1.174450 | 1.087483 | 1.261020 | 1.093822 | 1.178421 | |
14 | post | 0.773779 | 0.695953 | 0.463407 | 0.736323 | 0.898039 | 0.618524 |
test | 1.165927 | 0.999943 | 0.881409 | 0.962234 | 1.271395 | 1.126580 | |
15 | post | 0.565924 | 0.748914 | 0.533429 | 0.657298 | 0.738657 | 0.618882 |
test | 1.191236 | 1.256513 | 1.124252 | 1.064684 | 1.125852 | 1.157209 | |
16 | post | 0.490668 | 0.629922 | 0.749302 | 0.609369 | 0.499990 | 0.540915 |
test | 1.238226 | 1.097461 | 1.201405 | 1.009574 | 1.103565 | 1.120550 | |
17 | post | 0.670037 | 0.806978 | 0.614988 | 0.535507 | 0.548741 | 0.765582 |
test | 1.024884 | 0.862128 | 0.957512 | 1.112053 | 1.056240 | 1.260760 | |
18 | post | 0.505096 | 0.500771 | 0.622511 | 0.731418 | 0.651869 | 0.550223 |
test | 1.139586 | 1.203622 | 1.109744 | 0.940470 | 1.157193 | 1.080493 | |
19 | post | 0.630026 | 0.570403 | 0.553740 | 0.670729 | 0.702371 | 0.539150 |
test | 1.305638 | 1.197644 | 1.057489 | 1.068612 | 1.101864 | 1.059130 | |
20 | post | 0.554464 | 0.518257 | 0.706713 | 0.674570 | 0.482590 | 0.630534 |
test | 1.202076 | 1.084012 | 1.095869 | 1.086556 | 1.095437 | 1.197563 | |
21 | post | 0.663010 | 0.674987 | 0.773452 | 0.651549 | 0.766257 | 0.589517 |
test | 1.196278 | 1.307578 | 1.218819 | 1.160398 | 1.183626 | 0.818807 | |
22 | post | 0.802519 | 0.522100 | 0.677609 | 0.607053 | 0.410310 | 0.653369 |
test | 1.158462 | 1.082238 | 1.187047 | 1.128951 | 1.184014 | NaN | |
23 | post | 0.449296 | 0.664222 | 0.672239 | 0.811811 | 0.687664 | 0.722097 |
test | 1.234483 | 1.000650 | 1.247278 | 1.373307 | 0.971880 | 1.083474 | |
24 | post | 0.727099 | 0.669498 | 0.544538 | 0.780169 | 0.539558 | 0.512699 |
test | 1.146860 | 1.318709 | 1.013310 | 1.179416 | 1.116052 | 1.426554 | |
25 | post | 0.667413 | 0.749908 | 0.674152 | 0.510016 | 0.624600 | NaN |
test | 1.172720 | 1.189687 | 1.082873 | 1.142845 | 1.271394 | 1.181100 | |
26 | post | 0.750412 | 0.672995 | 0.430227 | 0.855770 | 0.761978 | 0.597756 |
test | 1.163184 | 1.116851 | 1.337332 | 0.907621 | 1.154835 | 0.949386 | |
27 | post | 0.569827 | 0.647740 | 0.602968 | 0.666668 | 0.573412 | 0.675750 |
test | 1.209065 | 1.085824 | 0.975633 | 1.254210 | 1.124254 | 0.977895 | |
28 | post | 0.823674 | 0.549932 | 0.501510 | 0.616241 | 0.904126 | 0.566966 |
test | 1.261421 | 0.984909 | 1.358329 | 1.026843 | 1.057895 | 1.168361 | |
29 | post | 0.609549 | 0.651714 | 0.866044 | 0.385282 | 0.588151 | 0.628211 |
test | 1.183397 | 1.097331 | 1.084794 | 1.112685 | 1.186779 | 1.221239 | |
30 | post | 0.499469 | 0.769347 | 0.680312 | 0.449260 | 0.617595 | 0.606390 |
test | 1.144440 | 1.246675 | 1.208878 | 1.333720 | 1.123456 | 1.048837 |
Then we can use another built-in function of pandas dataframes: .corr(), which computes the correlation between all columns:
plt.figure(figsize=(8,6))
sns.heatmap(df_all_part_wide.corr(), xticklabels=False, cmap='rocket')
<Axes: xlabel='stim_file', ylabel='stim_file'>

Nice! How does the crest palette look?
plt.figure(figsize=(10,7))
sns.heatmap(df_all_part_wide.corr(), xticklabels=False, cmap='crest')
<Axes: xlabel='stim_file', ylabel='stim_file'>

Also fantastic! However, it’s easy to get fooled by beautiful graphics, so maybe think about adding information to your plot whenever possible. For example, we could change the heatmap to a clustermap and adjust the colormap!
plt.figure(figsize=(6,4))
sns.clustermap(df_all_part_wide.corr(), xticklabels=False, cmap='mako', center=0)
<seaborn.matrix.ClusterGrid at 0x12f609600>
<Figure size 600x400 with 0 Axes>

However, to be on the safe side, please also check your graphics for the mentioned points, e.g. via tools like Color Oracle that let you simulate color vision deficiencies!

Make use of amazing resources like the python graph gallery, the data-to-viz project and the colormap decision tree in Crameri et al. 2020, which in combination allow you to find and use the best graphic and colormap for your data!
And NEVER USE JET!

While the things we briefly explored were already super cool and a lot, we cannot conclude the data visualization section without at least mentioning the up-and-coming next-level graphics: raincloudplots, as they combine various aspects of the things we’ve talked about! In python they are available via the ptitprince library.
from ptitprince import PtitPrince as pt
f, ax = plt.subplots(figsize=(10, 7))
pt.RainCloud(data=df_images, x='stim_file', y='response_time',
             ax=ax, orient='h', hue='session', dodge=True, alpha=.65, bw=0.2, move=.2)
<Axes: xlabel='response_time', ylabel='stim_file'>

You want to go even further? Web-based and interactive plots are a must? Say no more, python of course has that covered as well. Great examples of such libraries are plotly, bokeh and holoviews. Let’s have a brief look at the first one, plotly.
a python visualization engine that outputs directly to the web
lets you generate interactive web-based visualizations in pure python (!)
you get interactivity for free, and can easily customize many different aspects
works seamlessly in Jupyter notebooks
How about we re-create some of the graphics from before, but as interactive versions? Let’s start with the heatmap that displays the correlations between the response_times for the different stimuli.
Depending on the graphic you want to build and its complexity, plotly offers different solutions. The easiest is to use plotly.express, which lets you create a lot of different plots in a straightforward manner.
import plotly.express as px
fig = px.imshow(df_all_part_wide.corr(), x=df_all_part_wide.corr().columns, y=df_all_part_wide.corr().index)
fig.update_layout(width=500,height=500)
# run this when going through the notebook
#fig.show()
### This part is required to include the graphics in the built Jupyter-Book
### It can be ignored during the session and should not be run
from plotly.offline import init_notebook_mode, plot
from IPython.core.display import display, HTML
init_notebook_mode(connected=True)
plot(fig, filename = 'heatmap_cor.html')
display(HTML('heatmap_cor.html'))
As with the other packages before, we can easily update the different graphic properties, including changing the colormap and axis titles.
fig = px.imshow(df_all_part_wide.corr(), x=df_all_part_wide.corr().columns, y=df_all_part_wide.corr().index,
color_continuous_scale=px.colors.sequential.Viridis,
labels=dict(x="Stimulus", y="Stimulus"))
fig.update_layout(width=500,height=500)
# run this when going through the notebook
#fig.show()
### This part is required to include the graphics in the built Jupyter-Book
### It can be ignored during the session and should not be run
from plotly.offline import init_notebook_mode, plot
from IPython.core.display import display, HTML
init_notebook_mode(connected=True)
plot(fig, filename = 'heatmap_cor_vir.html')
display(HTML('heatmap_cor_vir.html'))