Code testing#

Looking back at our script from the previous section, i.e. the one you just brought up to code with respect to comments and formatting based on guidelines, there is at least one other big question we have to address: how do we know that the code does what it is supposed to do?
You might think “What do you mean? It runs the things I wrote down, done.” However, the reality looks different…
Background#
Generally, there are two major reasons why it is of the utmost importance to check and evaluate code:
Mistakes made while coding
Code instability and changes
Let’s have a quick look at both.
Mistakes made while coding#
It is very, very easy to make mistakes when coding. A single misplaced character can cause a program’s or script’s output to be entirely wrong or to vary tremendously from its expected behavior. This can happen because a plus sign should have been a minus, or because one piece of code works in one unit while a piece of code written by another researcher works in a different unit. Everyone makes mistakes, but the results can be catastrophic: careers can be damaged or ended, vast sums of research funds can be wasted, and valuable time may be lost to exploring incorrect avenues.




Code instabilities and changes#
The second reason is also challenging, but in a different way: the code you are using and writing is affected by underlying numerical instabilities and by changes during development.
Regarding the first, there are intrinsic numerical errors and instabilities that may lead unstable functions towards distinct local minima. This phenomenon is only aggravated by prominent differences between operating systems.
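As a small, self-contained illustration of such numerical error (a sketch, not part of our pipeline), consider plain floating-point arithmetic in Python, which is why numerical tests should compare with a tolerance rather than exact equality:
import math

print(0.1 + 0.2)                     # 0.30000000000000004, not 0.3
print(0.1 + 0.2 == 0.3)              # False: exact comparison is fragile
print(math.isclose(0.1 + 0.2, 0.3))  # True: compare within a tolerance instead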

Concerning the second, a lot of the code you are using is going to be part of packages and modules that are developed and maintained by other people. Along this process, the code you are using, e.g. a function, is going to change, more or less prominently. It could be a change in rounding or a complete change of inputs and outputs. Either way, the effects on your application and/or pipeline might be significant and, most importantly, unbeknownst to you.
This is why software and code tests are vital: to ensure the expected outcome and to check and evaluate changes along the development process.
Motivation#
Even if problems in a program are caught before research is published, it can be difficult to figure out which results are contaminated and must be re-done. This represents a huge loss of time and effort. Catching these problems as early as possible minimises the amount of work it takes to fix them, and for most researchers time is by far their most scarce resource. You should not skip writing tests because you are short on time, you should write tests because you are short on time. Researchers cannot afford to have months or years of work go down the drain, and they can’t afford to repeatedly manually check every little detail of a program that might be hundreds or hundreds of thousands of lines long. Writing tests to do it for you is the time-saving option, and it’s the safe option.

As researchers write code they generally do some tests as they go along, often by adding in print statements and checking the output. However, these tests are often thrown away as soon as they pass and are no longer present to check what they were intended to check. It is comparatively very little work to place these tests in functions and keep them so they can be run at any time in the future. The additional labour is minimal, the time saved and safeguards provided are invaluable. Further, by formalising the testing process into a suite of tests that can be run independently and automatically, you provide a much greater degree of confidence that the software behaves correctly and increase the likelihood that defects will be found.

Testing also affords researchers much more peace of mind when working on or improving a project. After changing their code a researcher will want to check that their changes or fixes have not broken anything. Providing researchers with a fail-fast environment allows the rapid identification of failures introduced by changes to the code. The alternative, of the researcher writing and running whatever small tests they have time for, is far inferior to a good testing suite which can thoroughly check the code.

Another benefit of writing tests is that it typically forces a researcher to write cleaner, more modular code as such code is far easier to write tests for, leading to an improvement in code quality. Good quality code is far easier (and altogether more pleasant) to work with than tangled rat’s nests of code I’m sure we’ve all come across (and, let’s be honest, written). This point is expanded upon in the section Unit Testing.

Research software#
As well as advantaging individual researchers, testing also benefits research as a whole. It makes research more reproducible by answering the question “how do we even know this code works?”. If tests are never saved, just run and deleted, the proof cannot be reproduced easily.
Testing also helps prevent valuable grant money from being spent on projects that may be partly or wholly flawed due to mistakes in the code. Worse, if mistakes are not found and the work is published, any subsequent work that builds upon the project will be similarly flawed.
Perhaps the cleanest expression of why testing is important for research as a whole can be found in the Software Sustainability Institute slogan: better software, better research.


General guidance and good practice for testing#
There are several different kinds of testing which each have best practice specific to them (see Types of Testing). Nevertheless, there is some general guidance that applies to all of them, which will be outlined here.
Write Tests - Any Tests!#
Starting the process of writing tests can be overwhelming, especially if you have a large code base. Further to that, as mentioned, there are many kinds of tests, and implementing all of them can seem like an impossible mountain to climb. That is why the single most important piece of guidance in this chapter is as follows: write some tests. Testing one tiny thing in a code base that’s thousands of lines long is infinitely better than testing nothing in a code base that’s thousands of lines long. You may not be able to do everything, but doing something is valuable.
Make improvements where you can, and do your best to include tests with new code you write, even if it’s not feasible to write tests for all the code that’s already written.

Run the tests#
The second most important piece of advice in this chapter: run the tests. Having a beautiful, perfect test suite is of no use if you rarely run it. Leaving long gaps between test runs makes it more difficult to track down what has gone wrong when a test fails, because a lot of the code will have changed. Also, if it has been weeks or months since the tests were last run and they fail, it is difficult or impossible to know which results obtained in the meantime are still valid and which have to be thrown away because they could have been impacted by the bug.

It is best to automate your testing as far as possible. If each test needs to be run individually, that boring, painstaking process is likely to get neglected. This can be done by making use of a testing framework (discussed later). Ideally, set your tests up to run at regular intervals, possibly every night.
Consider setting up continuous integration (discussed in the continuous integration session) on your project. This will automatically run your tests each time you make a change to your code and, depending on the continuous integration software you use, will notify you if any of the tests fail.
Consider how long it takes your tests to run#
Some tests, like unit tests (see Unit Testing), only test a small piece of code and so are typically very fast. However, other kinds of tests, such as system tests (see System Testing), which test the entire code from end to end, may take a long time to run depending on the code. As such, it can be obstructive to run the entire test suite after each little bit of work.

In that case it is better to run lighter-weight tests such as unit tests frequently, and longer tests only once per day, overnight. It is also good to scale the number of each kind of test you have in relation to how long they take to run. You should have a lot of unit tests (or other types of tests that are fast) but far fewer tests which take a long time to run.
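One way to put this into practice with pytest is to mark the long-running tests and deselect them during day-to-day work. The sketch below assumes a custom "slow" marker, which is a naming choice of ours and would need to be registered in your pytest configuration:
import time
import pytest

def test_quick_sanity_check():
    # Fast, unit-style test: cheap enough to run after every change.
    assert sum([1, 2, 3]) == 6

@pytest.mark.slow
def test_long_running_check():
    # Stand-in for an expensive end-to-end run; reserve this for nightly runs.
    time.sleep(2)
    assert True
Running pytest -m "not slow" then skips the long tests during development, while a nightly job can simply run pytest to include everything.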
Document the tests and how to run them#
It is important to provide documentation that describes how to run the tests, both for yourself in case you come back to a project in the future, and for anyone else who may wish to build upon or reproduce your work.

This documentation should also cover subjects such as:
any resources, such as test dataset files, that are required
any configuration/settings adjustments needed to run the tests
what software (such as testing frameworks) needs to be installed
Ideally, you would provide scripts to set up and configure any resources that are needed.
Test Realistic Cases#
Make the cases you test as realistic as possible. If, for example, you have dummy data to run tests on, you should make sure that data is as similar as possible to the actual data. If your actual data is messy with a lot of null values, so should your test dataset be. A sketch of such a dummy dataset is shown below.
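For instance, a minimal sketch of a dummy dataset built with pandas that mirrors messy real data (the column names here are just illustrative, not taken from our pipeline) could look like this:
import numpy as np
import pandas as pd

# Dummy data that deliberately contains null values, like the real data would.
dummy_data = pd.DataFrame({
    "participant_id": ["sub-01", "sub-02", "sub-03", "sub-04"],
    "age": [23, np.nan, 31, 27],                  # missing age, as in real data
    "response_time": [0.41, 0.39, np.nan, 0.52],  # missing response, as in real data
})

def test_handles_missing_values():
    # The analysis step under test should cope with NaNs instead of crashing.
    assert dummy_data["response_time"].dropna().between(0, 10).all()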
Use a Testing Framework#
There are tools available to make writing and running tests easier; these are known as testing frameworks. Find one you like, learn about the features it offers, and make use of them. A very common testing framework for Python is pytest.
Coverage#
Code coverage is a measure of how much of your code is “covered” by tests. More precisely, it is a measure of how much of your code is run when tests are conducted. For example, if you have an if statement but only test cases where that if statement evaluates to False, then none of the code in the if block will be run. As a result your code coverage would be below 100%. Code coverage doesn’t include documentation like comments, so adding more documentation doesn’t affect your percentages.
Aim for coverage as close to 100% as possible; high coverage is ideal, though any testing is beneficial. Various tools and bots measure coverage across programming languages, e.g. coverage.py and the pytest-cov plugin for Python. Beware the illusion of good coverage, however: thorough testing involves exercising the same code under multiple scenarios, and testing smaller code chunks allows more precise validation of the logic. Testing the same code in multiple ways is encouraged for a comprehensive assessment.
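As a sketch of the if-statement example above (the module and function names are hypothetical), a test suite that only exercises one branch leaves the other uncovered:
# toy_module.py (hypothetical): a function with two branches.
def absolute_value(x):
    if x < 0:
        return -x   # this block is only covered if a negative input is tested
    return x

# test_toy_module.py (hypothetical)
def test_absolute_value_positive():
    assert absolute_value(3) == 3    # covers only the x >= 0 path

def test_absolute_value_negative():
    assert absolute_value(-3) == 3   # without this test, coverage stays below 100%
With the pytest-cov plugin installed, running pytest --cov=toy_module reports which lines were actually executed.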
Use test doubles/stubs/mocking where appropriate#
Use test doubles like stubs or mocks to isolate code in tests. Tests should make it easy to pinpoint failures, which can be hard when the code depends on external factors like internet connections or other objects. For example, a web-interaction test might fail due to internet issues, not code bugs. Similarly, a test involving an object might fail because of the object itself, which should have its own tests. Eliminate these dependencies with test doubles, which come in several types:
Dummy objects are placeholders that aren’t actually used in testing beyond filling method parameters.
Fake objects have simplified, functional implementations, like an in-memory database instead of a real one.
Stubs provide partial implementations that respond only to specific test cases and might record call information.
Mocks simulate interfaces or classes, with predefined outputs for method calls, often recording interactions for test validation.
Test doubles replace real dependencies, making tests more focused and reliable. Mocks can be hand-coded or generated with mock frameworks, which allow dynamic behavior definition. A common mock example is a data provider, where a mock simulates the data source to ensure consistent test conditions, in contrast to the real data source used in production. A sketch of this is shown below.
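As a minimal sketch of that data-provider case, Python’s built-in unittest.mock can stand in for the real source; the function and method names below are made up for illustration:
from unittest.mock import Mock

def mean_response_time(data_source):
    # Unit under test: averages whatever the data source provides.
    times = data_source.get_response_times()
    return sum(times) / len(times)

def test_mean_response_time():
    # The mock stands in for a real database/API, so the test never touches it
    # and always sees the same, controlled values.
    fake_source = Mock()
    fake_source.get_response_times.return_value = [1, 2, 3]

    assert mean_response_time(fake_source) == 2
    fake_source.get_response_times.assert_called_once()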
Overview of Testing Types#
There are a number of different kinds of tests, which will be briefly discussed in the following.
Firstly, there are positive tests and negative tests. Positive tests check that something works, for example testing that a function that multiplies some numbers together outputs the correct answer. Negative tests check that something generates an error when it should. For example, nothing can go quicker than the speed of light, so a plasma physics simulation code may contain a test that an error is raised if there are any particles faster than this, as it indicates there is a deeper problem in the code. A minimal sketch of both kinds is shown below.
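Here is a brief sketch of a positive and a negative test written with pytest; the multiply function is hypothetical and deliberately simple:
import pytest

def multiply(a, b):
    if a < 0 or b < 0:
        raise ValueError("only non-negative inputs are supported")
    return a * b

def test_multiply_positive():
    # Positive test: the function produces the correct result.
    assert multiply(2, 3) == 6

def test_multiply_negative():
    # Negative test: the function raises the expected error for invalid input.
    with pytest.raises(ValueError):
        multiply(-1, 3)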
In addition to these two kinds of tests, there are also different levels of tests, which test different aspects of a project. These levels are outlined below, and both positive and negative tests can be present at any of these levels. A thorough test suite will contain tests at all of these levels (though some levels will need very few).
However, before we check out the different test options, we have to talk about one aspect that is central to all of them: assert.
Assert - a test’s best friend#
In order to check and evaluate whether a certain piece of code is doing what it is supposed to do, in a reliable manner, we need a way of asserting what the “correct output” should be and testing the outcome we get against it.
In Python, assert is a statement used to test whether a condition is true. If the condition is true, the program continues to execute as normal. If the condition is false, the program raises an AssertionError exception and can optionally display an accompanying message. The primary use of assert is for debugging and testing purposes, where it helps to catch errors early by ensuring that certain conditions hold at specific points in the code.
Syntax#
The basic syntax of an assert statement is:
assert condition, "Optional error message"
condition: This is the expression to be tested. If the condition evaluates to True, nothing happens and the program continues to execute. If it evaluates to False, an AssertionError is raised.
"Optional error message": This is the message that is shown when the condition is false. This message is optional, but it’s helpful for understanding why the assertion failed.
Usage in Testing#
In the context of testing, assert statements are used to verify that a function or a piece of code behaves as expected. They are a simple yet powerful tool for writing test cases, where you check the outcomes of various functions under different inputs. Here’s how you might use assert in a test:
Checking Function Outputs: to verify that a function returns the expected value.
Validating Data Types: to ensure that variables or return values are of the correct type.
Testing Invariants: to check conditions that should always be true in a given context.
Comparing Data Structures: to ensure that lists, dictionaries, sets, etc., contain the expected elements.
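A brief sketch illustrating these four uses with made-up values:
def test_assert_varieties():
    result = 2 + 3

    # Checking Function Outputs
    assert result == 5, "Expected 2 + 3 to be 5"

    # Validating Data Types
    assert isinstance(result, int), "Expected an integer result"

    # Testing Invariants
    assert result >= 0, "Result should never be negative here"

    # Comparing Data Structures
    assert sorted([3, 1, 2]) == [1, 2, 3], "List was not sorted as expected"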
A simple example#
Here’s a simple example demonstrating how assert might be used in a test case:
def add(a, b):
    return a + b

# Test case for the add function
def test_add():
    result = add(2, 3)
    assert result == 5, "Expected add(2, 3) to be 5"

test_add()  # This will pass silently since 2 + 3 is indeed 5
If add(2, 3) did not return 5, the assert statement would raise an AssertionError with the message "Expected add(2, 3) to be 5".
Best Practices#
Use for Testing: Leverage assert primarily in testing frameworks or during the debugging phase, not as a mechanism for handling runtime errors in production code.
Clear Messages: Include clear, descriptive messages with assert statements to make it easier to identify the cause of a test failure.
Test Precisely: Each assert should test one specific aspect of your code’s behavior to make diagnosing issues straightforward.
Runtime testing#
Runtime tests are tests that run as part of the program itself. They may take the form of checks within the code, as shown below.
For example, we could use the following runtime test to test the first block of our analysis_pipeline.py script:
import requests, zipfile
from io import BytesIO

url = 'https://gitlab.com/julia-pfarr/nowaschool/-/raw/main/school/materials/CI_CD/crtt.zip?ref_type=heads'
extract_to_path = '/Users/peerherholz/Desktop/'

req = requests.get(url)
if req.status_code == 200:
    print('Downloading Completed')
    with zipfile.ZipFile(BytesIO(req.content)) as zfile:
        zfile.extractall(extract_to_path)
else:
    print('Download failed.')
Downloading Completed
Advantages of runtime testing:#
run within the program, so they can catch problems caused by logic errors or edge cases
makes it easier to find the cause of the bug by catching problems early
catching problems early also helps prevent them escalating into catastrophic failures; it minimises the blast radius
Disadvantages of runtime testing:#
tests can slow down the program
what is the right thing to do if an error is detected? How should this error be reported? Exceptions are a recommended route to go with this; a sketch is shown below.
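For instance, a minimal sketch of reporting the download failure above via an exception rather than a print statement:
import requests

url = 'https://gitlab.com/julia-pfarr/nowaschool/-/raw/main/school/materials/CI_CD/crtt.zip?ref_type=heads'

req = requests.get(url)
if req.status_code != 200:
    # Raising an exception stops the pipeline immediately and reports the cause,
    # instead of printing a message and silently carrying on.
    raise RuntimeError(f"Download failed with status code {req.status_code}")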
Smoke tests#
Very brief initial checks that ensure the basic requirements needed to run the project hold. If these fail, there is no point proceeding to additional levels of testing until they are fixed.
For example, we could use the following smoke test to test the first block of our analysis_pipeline.py script:
import requests
from zipfile import ZipFile, BadZipFile
from io import BytesIO
import os

def test_download_and_extraction(url, extraction_path):
    """
    Test downloading a ZIP file from a URL and extracting it to a specified path.

    Args:
    - url (str): URL of the ZIP file to download.
    - extraction_path (str): The filesystem path where the ZIP file contents will be extracted.
    """
    # 1. URL Accessibility
    response = requests.head(url)
    assert response.status_code == 200, "URL is not accessible or does not exist"
    #assert 'application/zip' in response.headers['Content-Type'], "URL does not point to a ZIP file"

    # 2. Successful Download
    response = requests.get(url)
    assert response.status_code == 200, "Failed to download the file"

    # 3. Correct File Type and Extraction
    try:
        with ZipFile(BytesIO(response.content)) as zipfile:
            zipfile.extractall(extraction_path)
        assert True  # If extraction succeeds
    except BadZipFile:
        assert False, "Downloaded file is not a valid ZIP archive"

    # 4. Check Extracted Files
    extracted_files = os.listdir(extraction_path)
    assert len(extracted_files) > 0, "No files were extracted"

    print(f"Test passed: Downloaded and extracted ZIP file to {extraction_path}")

# Example usage
url = 'https://gitlab.com/julia-pfarr/nowaschool/-/raw/main/school/materials/CI_CD/crtt.zip?ref_type=heads'
extraction_path = '/Users/peerherholz/Desktop/'
test_download_and_extraction(url, extraction_path)
Test passed: Downloaded and extracted ZIP file to /Users/peerherholz/Desktop/
Unit tests#
A level of the software testing process where individual units of a software are tested. The purpose is to validate that each unit of the software performs as designed.
For example, we could use the following unit test to test the second block of our analysis_pipeline.py script:
import pandas as pd

def test_data_conversion():

    def process_data(df, columns_select):
        # Assuming df is the DataFrame before conversion
        data_loaded_sub_part = df[columns_select]
        # Insert more DF operations if needed
        return data_loaded_sub_part

    # Load the raw data (before conversion)
    raw_data_path = '/Users/peerherholz/Desktop/choice_rtt/sourcedata/sub-01/ses-post/01_post_crtt_exp_2024-02-02_09h43.24.388.csv'  # Update this path
    raw_data_df = pd.read_csv(raw_data_path, delimiter=',')

    # Columns to select and any other processing details
    columns_select = ['participant_id', 'age', 'left-handed', 'Do you like this session?', 'session', 'TargetImage', 'keyboard_response.corr', 'trialRespTimes']

    # Process the raw data
    processed_data_df = process_data(raw_data_df, columns_select)

    # Load the expected data (after conversion) for comparison
    expected_data_path = '/Users/peerherholz/Desktop/choice_rtt/sub-01/ses-post/beh/sub-01_ses-post_task-ChoiceRTT_beh.tsv'  # Update this path
    expected_data_df = pd.read_csv(expected_data_path, delimiter='\t')

    # Assertions
    assert list(processed_data_df.columns) == list(expected_data_df.columns), "Columns do not match"
    assert processed_data_df.shape == expected_data_df.shape, "DataFrame shapes do not match"

    # Compare the first row as dicts
    processed_first_row = processed_data_df.iloc[0].to_dict()
    expected_first_row = expected_data_df.iloc[0].to_dict()
    for key in processed_first_row:
        if isinstance(processed_first_row[key], float):
            assert abs(processed_first_row[key] - expected_first_row[key]) < 1e-5, f"Row values do not match for column {key}"
        else:
            assert processed_first_row[key] == expected_first_row[key], f"Row values do not match for column {key}"
Unit Testing Tips#
Many testing frameworks have tools specifically geared towards writing and running unit tests; pytest does as well.
Isolate the development environment from the test environment.
Write test cases that are independent of each other. For example, if a unit A utilises the result supplied by another unit B, you should test unit A with a test double rather than actually calling unit B. If you don’t do this, your test failing may be due to a fault in either unit A or unit B, making the bug harder to trace. A minimal sketch of this is shown after this list.
Aim at covering all paths through a unit; pay particular attention to loop conditions.
In addition to writing cases to verify the behaviour, write cases to ensure the performance of the code. For example, if a function that is supposed to add two numbers takes several minutes to run, there is likely a problem.
If you find a defect in your code, write a test that exposes it. Why? First, you will later be able to catch the defect if you do not fix it properly. Second, your test suite is now more comprehensive. Third, you will most probably be too lazy to write the test after you have already fixed the defect.
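Here is a minimal sketch of the unit A / unit B situation (both functions are hypothetical), using unittest.mock.patch as the test double so the test only exercises unit A’s logic:
from unittest.mock import patch

def load_scores():
    # "Unit B": in reality this might read a file or query a database.
    raise NotImplementedError

def mean_score():
    # "Unit A": depends on unit B.
    scores = load_scores()
    return sum(scores) / len(scores)

def test_mean_score():
    # Replace unit B with a double so a failure here can only come from unit A.
    with patch(__name__ + ".load_scores", return_value=[1, 2, 3]):
        assert mean_score() == 2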
Integration tests#
A level of software testing where individual units are combined and tested as a group. The purpose of this level of testing is to expose faults in the interaction between integrated units.
Integration Testing Approaches#
There are several different approaches to integration testing.
Big Bang: an approach to integration testing where all or most of the units are combined together and tested in one go. This approach is taken when the testing team receives the entire software in a bundle. So what is the difference between Big Bang integration testing and system testing? The former tests only the interactions between the units, while the latter tests the entire system.
Top Down: an approach to integration testing where top-level sections of the code (that themselves contain many smaller units) are tested first, and lower-level units are tested step by step after that.
Bottom Up: an approach to integration testing where integration between bottom-level sections is tested first and upper-level sections step by step after that. Again, test stubs should be used, in this case to simulate inputs from higher-level sections.
Sandwich/Hybrid: an approach to integration testing which is a combination of the Top Down and Bottom Up approaches.
Which approach you should use will depend on which best suits the nature/structure of your project.
For example, we could use the following integration test to test the first and second blocks of our analysis_pipeline.py script:
import requests
import zipfile
from io import BytesIO
import os
import time
import pandas as pd
from glob import glob
from tempfile import TemporaryDirectory

def download_and_extract_data(url, extract_to_path):
    print('Downloading started')
    response = requests.get(url)
    print('Downloading Completed')
    with zipfile.ZipFile(BytesIO(response.content)) as zfile:
        zfile.extractall(extract_to_path)

def convert_data(source_dir, target_dir):
    data_files = glob(os.path.join(source_dir, '*'))
    columns_select = ['participant_id', 'age', 'left-handed', 'Do you like this session?', 'session', 'TargetImage', 'keyboard_response.corr', 'trialRespTimes']
    for index, participant in enumerate(data_files):
        print(f'Working on {participant}, file {index+1}/{len(data_files)}')
        data_loaded_part = pd.read_csv(participant, delimiter=',')
        data_loaded_sub_part = data_loaded_part[columns_select]
        # Additional processing...
        # Save converted data
        output_file = os.path.join(target_dir, os.path.basename(participant))
        data_loaded_sub_part.to_csv(output_file, sep='\t', index=False)

def test_download_and_data_conversion():
    with TemporaryDirectory() as tmp_dir:
        download_dir = os.path.join(tmp_dir, "download")
        os.makedirs(download_dir, exist_ok=True)
        convert_dir = os.path.join(tmp_dir, "convert")
        os.makedirs(convert_dir, exist_ok=True)
        test_url = 'https://example.com/test_data.zip'

        # Block 1: Download and extract
        download_and_extract_data(test_url, download_dir)
        extracted_files = os.listdir(download_dir)
        assert extracted_files, "Download or extraction failed."

        # Example additional check: Verify the extracted file names or types
        # This step assumes you know what files you're expecting
        expected_files = ['data1.csv', 'data2.csv']  # Example expected files
        assert all(file in extracted_files for file in expected_files), "Missing expected files after extraction."

        # Block 2: Convert data
        convert_data(download_dir, convert_dir)
        converted_files = os.listdir(convert_dir)
        assert converted_files, "Data conversion failed."

        # Content validation for one of the converted files
        # This assumes you know the structure of the converted data
        sample_converted_file = os.path.join(convert_dir, converted_files[0])
        df = pd.read_csv(sample_converted_file, sep='\t')

        # Check if specific columns are present in the converted file
        expected_columns = ['participant_id', 'age', 'left-handed', 'Do you like this session?', 'session', 'stim_file', 'response', 'response_time']
        assert all(column in df.columns for column in expected_columns), "Converted file missing expected columns."

        # Basic content check: Ensure no empty rows for key columns
        assert df['participant_id'].notnull().all(), "Null values found in 'participant_id' column."
        assert df['session'].notnull().all(), "Null values found in 'session' column."

        # Example of performance metric (very basic)
        start_time = time.time()
        convert_data(download_dir, convert_dir)
        end_time = time.time()
        assert (end_time - start_time) < 60, "Conversion took too long."
Integration Testing Tips#
Ensure that you have a proper detailed design document where the interactions between each unit are clearly defined. It is difficult or impossible to perform integration testing without this information.
Make sure that each unit is unit tested and fix any bugs before you start integration testing. If there is a bug in the individual units, the integration tests will almost certainly fail even if there is no error in how they are integrated.
Use mocking/stubs where appropriate.
System tests#
A level of the software testing process where a complete, integrated system is tested. The purpose of this test is to evaluate whether the system as a whole gives the correct outputs for given inputs.
System Testing Tips#
System tests, also called end-to-end tests, run the program, well, from end to end. As such, these are the most time-consuming tests to run. Therefore you should only run them once all the lower-level tests (smoke, unit, integration) have passed. If they haven’t, fix the issues they have detected before spending time running system tests.
Because of their time-consuming nature it will also often be impractical to have enough system tests to trace every possible route through a program, especially if there are a significant number of conditional statements. Therefore you should consider the system test cases you run carefully and prioritise:
the most common routes through a program
the most important routes for a program
cases that are prone to breakage due to structural problems within the program (though ideally it’s better to just fix those problems, cases exist where this may not be feasible)
Because system tests can be time consuming it may be impractical to run them very regularly (such as multiple times a day after small changes to the code). Therefore it can be a good idea to run them each night (and to automate this process) so that if errors are introduced that only system testing can detect, the developer(s) will be made aware of them relatively quickly.
For example, we could use the following system test to test our analysis_pipeline.py script:
import pytest
import subprocess
from tempfile import TemporaryDirectory
import os
import pandas as pd

def run_pipeline(script_path, data_dir, output_dir):
    """Executes the analysis pipeline script."""
    subprocess.run(['python', script_path, '--data-dir', data_dir, '--output-dir', output_dir], check=True)

@pytest.mark.system
def test_analysis_pipeline():
    with TemporaryDirectory() as tmp_dir:
        data_dir = os.path.join(tmp_dir, "data")
        os.makedirs(data_dir, exist_ok=True)
        output_dir = os.path.join(tmp_dir, "output")
        os.makedirs(output_dir, exist_ok=True)

        script_path = 'analysis_pipeline.py'  # Update if your script is in a different location
        run_pipeline(script_path, data_dir, output_dir)

        # Specific check 1: Verify the structure of output CSV files
        output_files = [f for f in os.listdir(output_dir) if f.endswith('.csv')]
        assert output_files, "No CSV output files found after running the pipeline."
        for output_file in output_files:
            df = pd.read_csv(os.path.join(output_dir, output_file))
            expected_columns = ['participant_id', 'age', 'left-handed', 'session', 'stim_file', 'response', 'response_time', 'trial_type', 'trial']
            assert all(column in df.columns for column in expected_columns), f"Missing expected columns in {output_file}."

            # Specific check 2: Verify data integrity, e.g., non-negative ages and response times
            assert (df['age'] >= 0).all(), f"Negative values found in 'age' column of {output_file}."
            assert (df['response_time'] >= 0).all(), f"Negative values found in 'response_time' column of {output_file}."

        # Additional check: Verify summary statistics in analysis_results.txt (if applicable)
        summary_file_path = os.path.join(output_dir, 'analysis_results.txt')
        if os.path.exists(summary_file_path):
            with open(summary_file_path, 'r') as file:
                summary_contents = file.read()
            # Example: Check for a specific summary statistic
            assert "Mean response time:" in summary_contents, "Expected summary statistic 'Mean response time' not found in analysis results."
            # Further parsing and checks can be added based on the expected format and content of the summary statistics
Acceptance and regression tests#
A level of the software testing process where a system is tested for acceptability. The purpose of this test is to evaluate the system’s compliance with the project requirements and assess whether it is acceptable for its purpose.
Acceptance testing#
Acceptance tests are among the last types of tests performed on software prior to delivery. Acceptance testing is used to determine whether a piece of software satisfies all of the requirements from the user’s perspective: does this piece of software do what it needs to do? These tests are sometimes built against the original specification.
Because research software is typically written by the researcher who will use it (or at least with significant input from them), acceptance tests may not be necessary.
Regression testing#
Regression testing checks for unintended changes by comparing new test results to previous ones, ensuring updates don’t break the software. It is critical because even unrelated code changes can cause issues. Suitable for all testing levels, it is especially vital in system testing and can automate tedious manual checks. Tests are created by recording outputs for specific inputs, then retesting and comparing the results to detect discrepancies. It is essential for team projects, and also crucial for solo work to catch self-introduced errors.
Regression testing approaches differ in their focus. Common examples include:
Bug regression: retest a specific bug that has allegedly been fixed.
Old fix regression testing: retest several old bugs that were fixed, to see if they are back. (This is the classical notion of regression: the program has regressed to a bad state.)
General functional regression: retest the project broadly, including areas that worked before, to see whether more recent changes have destabilized working code.
Conversion or port testing: the program is ported to a new platform and a regression test suite is run to determine whether the port was successful.
Configuration testing: the program is run with a new device, on a new version of the operating system, or in conjunction with a new application. This is like port testing, except that the underlying code hasn’t been changed, only the external components that the software under test must interact with.
For example, we could use the following regression test to test our analysis_pipeline.py script:
import pytest
import subprocess
from tempfile import TemporaryDirectory
import os
import pandas as pd
import filecmp
import difflib

SCRIPT_PATH = 'path/to/your/analysis_pipeline.py'  # Update this path
BASELINE_DIR = 'path/to/your/baseline_data'  # Directory containing baseline results

def run_pipeline(script_path, data_dir, output_dir):
    """Executes the analysis pipeline script."""
    subprocess.run(['python', script_path, '--data-dir', data_dir, '--output-dir', output_dir], check=True)

def compare_files(file1, file2):
    """Compares two files line by line."""
    with open(file1, 'r') as f1, open(file2, 'r') as f2:
        diff = difflib.unified_diff(
            f1.readlines(), f2.readlines(),
            fromfile='baseline', tofile='current',
        )
        diff_list = list(diff)
    if diff_list:
        print('Differences found:\n', ''.join(diff_list))
    return not diff_list

@pytest.mark.regression
def test_pipeline_against_baseline():
    with TemporaryDirectory() as tmp_dir:
        data_dir = os.path.join(tmp_dir, "data")
        output_dir = os.path.join(tmp_dir, "output")
        os.makedirs(data_dir, exist_ok=True)
        os.makedirs(output_dir, exist_ok=True)

        # Assuming the input data is prepared in data_dir
        run_pipeline(SCRIPT_PATH, data_dir, output_dir)

        # Compare each output file against its baseline counterpart
        for baseline_file in os.listdir(BASELINE_DIR):
            baseline_path = os.path.join(BASELINE_DIR, baseline_file)
            current_path = os.path.join(output_dir, baseline_file)
            assert os.path.exists(current_path), f"Expected output file {baseline_file} not found in current run."

            # Compare files (could be CSV, TXT, etc.)
            assert compare_files(baseline_path, current_path), f"File {baseline_file} does not match baseline."
Testing frameworks#
Testing frameworks are essential in the software development process, enabling developers to ensure their code behaves as expected. These frameworks facilitate various types of testing, such as unit testing, integration testing, functional testing, regression testing, and performance testing. By automating the execution of tests, verifying outcomes, and reporting results, testing frameworks help improve code quality and software stability.
Key Features of Testing Frameworks#
Test Organization: Helps structure and manage tests effectively.
Fixture Management: Supports setup and teardown operations for tests.
Assertion Support: Provides tools for verifying test outcomes.
Automated Test Discovery: Automatically identifies and runs tests.
Mocking and Patching: Allows isolation of the system under test.
Parallel Test Execution: Reduces test suite execution time.
Extensibility: Offers customization through plugins and hooks.
Reporting: Generates detailed reports on test outcomes.
pytest#
pytest is a powerful testing framework for Python that is easy to start with but also supports complex functional testing. It is known for its simple syntax, detailed assertion introspection, automatic test discovery, and a wide range of plugins and integrations.
Running pytest#
There are various options to run pytest. Let’s start with the easiest one: running all tests written in a specific directory.
First, you need to ensure pytest is installed in your computational environment. If not, install it using:
%%bash
pip install pytest
Additionally, you have to make sure that all your tests are placed in a dedicated directory and that their filenames follow one of these patterns: test_*.py or *_test.py.
Following the structure we discussed in the RDM session, they should ideally be placed in the code directory. Thus, let’s create a respective tests directory there.
import os
os.makedirs('/Users/peerherholz/Desktop/choice_rtt/code/tests', exist_ok=True)
Next, we will create our test files and save them in the tests directory.
%%writefile /Users/peerherholz/Desktop/choice_rtt/code/tests/test_download.py
import requests
from zipfile import ZipFile, BadZipFile
from io import BytesIO
import os

def test_download_and_extraction():
    """
    Test downloading a ZIP file from a URL and extracting it to a specified path.
    """
    url = 'https://gitlab.com/julia-pfarr/nowaschool/-/raw/main/school/materials/CI_CD/crtt.zip?ref_type=heads'
    extraction_path = '/Users/peerherholz/Desktop/'

    # 1. URL Accessibility
    response = requests.head(url)
    assert response.status_code == 200, "URL is not accessible or does not exist"
    #assert 'application/zip' in response.headers['Content-Type'], "URL does not point to a ZIP file"

    # 2. Successful Download
    response = requests.get(url)
    assert response.status_code == 200, "Failed to download the file"

    # 3. Correct File Type and Extraction
    try:
        with ZipFile(BytesIO(response.content)) as zipfile:
            zipfile.extractall(extraction_path)
        assert True  # If extraction succeeds
    except BadZipFile:
        assert False, "Downloaded file is not a valid ZIP archive"

    # 4. Check Extracted Files
    extracted_files = os.listdir(extraction_path)
    assert len(extracted_files) > 0, "No files were extracted"

    print(f"Test passed: Downloaded and extracted ZIP file to {extraction_path}")
Writing /Users/peerherholz/Desktop/choice_rtt/code/tests/test_download.py
%%writefile /Users/peerherholz/Desktop/choice_rtt/code/tests/test_conversion.py
import pandas as pd

def test_data_conversion():

    def process_data(df, columns_select):
        # Assuming df is the DataFrame before conversion
        data_loaded_sub_part = df[columns_select]
        # Insert more DF operations if needed
        return data_loaded_sub_part

    # Load the raw data (before conversion)
    raw_data_path = '/Users/peerherholz/Desktop/choice_rtt/sourcedata/sub-01/ses-post/01_post_crtt_exp_2024-02-02_09h43.24.388.csv'  # Update this path
    raw_data_df = pd.read_csv(raw_data_path, delimiter=',')

    # Columns to select and any other processing details
    columns_select = ['participant_id', 'age', 'left-handed', 'Do you like this session?', 'session', 'TargetImage', 'keyboard_response.corr', 'trialRespTimes']

    # Process the raw data
    processed_data_df = process_data(raw_data_df, columns_select)

    # Load the expected data (after conversion) for comparison
    expected_data_path = '/Users/peerherholz/Desktop/choice_rtt/sub-01/ses-post/beh/sub-01_ses-post_task-ChoiceRTT_beh.tsv'  # Update this path
    expected_data_df = pd.read_csv(expected_data_path, delimiter='\t')

    # Assertions
    assert list(processed_data_df.columns) == list(expected_data_df.columns), "Columns do not match"
    assert processed_data_df.shape == expected_data_df.shape, "DataFrame shapes do not match"

    # Compare the first row as dicts
    processed_first_row = processed_data_df.iloc[0].to_dict()
    expected_first_row = expected_data_df.iloc[0].to_dict()
    for key in processed_first_row:
        if isinstance(processed_first_row[key], float):
            assert abs(processed_first_row[key] - expected_first_row[key]) < 1e-5, f"Row values do not match for column {key}"
        else:
            assert processed_first_row[key] == expected_first_row[key], f"Row values do not match for column {key}"
Writing /Users/peerherholz/Desktop/choice_rtt/code/tests/test_conversion.py
Now, navigate to your test directory (provide the path to it) and run pytest via:
os.chdir('/Users/peerherholz/Desktop/choice_rtt/code/tests')
%%bash
pytest
============================= test session starts ==============================
platform darwin -- Python 3.7.0, pytest-7.4.4, pluggy-1.2.0
rootdir: /Users/peerherholz/Desktop/choice_rtt/code/tests
plugins: anyio-3.5.0
collected 2 items
test_conversion.py F [ 50%]
test_download.py . [100%]
=================================== FAILURES ===================================
_____________________________ test_data_conversion _____________________________
def test_data_conversion():
def process_data(df, columns_select):
# Assuming df is the DataFrame before conversion
data_loaded_sub_part = df[columns_select]
# Insert more DF operations if needed
return data_loaded_sub_part
# Load the raw data (before conversion)
raw_data_path = '/Users/peerherholz/Desktop/choice_rtt/sourcedata/sub-01/ses-post/01_post_crtt_exp_2024-02-02_09h43.24.388.csv' # Update this path
raw_data_df = pd.read_csv(raw_data_path, delimiter=',')
# Columns to select and any other processing details
columns_select = ['participant_id', 'age', 'left-handed', 'Do you like this session?', 'session', 'TargetImage', 'keyboard_response.corr', 'trialRespTimes']
# Process the raw data
processed_data_df = process_data(raw_data_df, columns_select)
# Load the expected data (after conversion) for comparison
expected_data_path = '/Users/peerherholz/Desktop/choice_rtt/sub-01/ses-post/beh/sub-01_ses-post_task-ChoiceRTT_beh.tsv' # Update this path
expected_data_df = pd.read_csv(expected_data_path, delimiter='\t')
# Assertions
> assert list(processed_data_df.columns) == list(expected_data_df.columns), "Columns do not match"
E AssertionError: Columns do not match
E assert ['participant...etImage', ...] == ['participant...im_file', ...]
E At index 5 diff: 'TargetImage' != 'stim_file'
E Right contains 2 more items, first extra item: 'trial_type'
E Use -v to get more diff
test_conversion.py:27: AssertionError
=============================== warnings summary ===============================
../../../../anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/__init__.py:10
/Users/peerherholz/anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/__init__.py:10: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
_nlv = LooseVersion(_np_version)
../../../../anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/__init__.py:11
/Users/peerherholz/anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/__init__.py:11: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
_np_version_under1p16 = _nlv < LooseVersion("1.16")
../../../../anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/__init__.py:12
/Users/peerherholz/anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/__init__.py:12: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
_np_version_under1p17 = _nlv < LooseVersion("1.17")
../../../../anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/__init__.py:13
/Users/peerherholz/anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/__init__.py:13: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
_np_version_under1p18 = _nlv < LooseVersion("1.18")
../../../../anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/__init__.py:14
/Users/peerherholz/anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/__init__.py:14: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
_np_version_under1p19 = _nlv < LooseVersion("1.19")
../../../../anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/__init__.py:15
/Users/peerherholz/anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/__init__.py:15: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
_np_version_under1p20 = _nlv < LooseVersion("1.20")
../../../../anaconda3/envs/neuro_ai/lib/python3.7/site-packages/setuptools/_distutils/version.py:351
/Users/peerherholz/anaconda3/envs/neuro_ai/lib/python3.7/site-packages/setuptools/_distutils/version.py:351: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)
../../../../anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/function.py:125
../../../../anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/function.py:125
/Users/peerherholz/anaconda3/envs/neuro_ai/lib/python3.7/site-packages/pandas/compat/numpy/function.py:125: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
if LooseVersion(_np_version) >= LooseVersion("1.17.0"):
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED test_conversion.py::test_data_conversion - AssertionError: Columns do not match
=================== 1 failed, 1 passed, 9 warnings in 2.36s ====================
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
Cell In[52], line 1
----> 1 get_ipython().run_cell_magic('bash', '', 'pytest\n')
File ~/anaconda3/envs/nowaschool/lib/python3.10/site-packages/IPython/core/interactiveshell.py:2517, in InteractiveShell.run_cell_magic(self, magic_name, line, cell)
2515 with self.builtin_trap:
2516 args = (magic_arg_s, cell)
-> 2517 result = fn(*args, **kwargs)
2519 # The code below prevents the output from being displayed
2520 # when using magics with decorator @output_can_be_silenced
2521 # when the last Python token in the expression is a ';'.
2522 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):
File ~/anaconda3/envs/nowaschool/lib/python3.10/site-packages/IPython/core/magics/script.py:154, in ScriptMagics._make_script_magic.<locals>.named_script_magic(line, cell)
152 else:
153 line = script
--> 154 return self.shebang(line, cell)
File ~/anaconda3/envs/nowaschool/lib/python3.10/site-packages/IPython/core/magics/script.py:314, in ScriptMagics.shebang(self, line, cell)
309 if args.raise_error and p.returncode != 0:
310 # If we get here and p.returncode is still None, we must have
311 # killed it but not yet seen its return code. We don't wait for it,
312 # in case it's stuck in uninterruptible sleep. -9 = SIGKILL
313 rc = p.returncode or -9
--> 314 raise CalledProcessError(rc, cell)
CalledProcessError: Command 'b'pytest\n'' returned non-zero exit status 1.
pytest will automatically discover tests within any files that match the patterns described above in the given directory and its subdirectories.
You can also run specific tests, e.g. tests from a specific file or those matching a certain pattern.
You can specify the file like so:
%%bash
pytest /Users/peerherholz/Desktop/choice_rtt/code/tests/test_download.py
============================= test session starts ==============================
platform darwin -- Python 3.7.0, pytest-7.4.4, pluggy-1.2.0
rootdir: /Users/peerherholz/Desktop/choice_rtt/code/tests
plugins: anyio-3.5.0
collected 1 item
test_download.py . [100%]
============================== 1 passed in 0.90s ===============================
or run tests matching a name pattern like so:
pytest -k "pattern"
You can also run tests marked with a custom marker: if you’ve used custom markers to decorate your tests (e.g., @pytest.mark.regression), you can run only the tests with that marker:
pytest -m markername
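As a minimal sketch (the "regression" marker name and the pytest.ini registration below are assumptions about your project setup):
# Registering the marker, e.g. in pytest.ini, avoids "unknown marker" warnings:
#
#   [pytest]
#   markers =
#       regression: long-running regression tests
import pytest

@pytest.mark.regression
def test_pipeline_against_baseline_stub():
    # Placeholder for a real regression check.
    assert True
Running pytest -m regression would then execute only the tests carrying that marker.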
Writing Basic Tests with pytest#
Tests in pytest are simple to write. You start with plain test functions and, as your tests grow, pytest provides a rich set of features for more complex scenarios.
def test_example():
    assert 1 + 1 == 2
Using Fixtures for Setup and Teardown#
pytest fixtures define setup and teardown logic for tests, ensuring tests run under controlled conditions.
import pytest

@pytest.fixture
def sample_data():
    return [1, 2, 3, 4, 5]

def test_sum(sample_data):
    assert sum(sample_data) == 15
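The fixture above only covers setup. A minimal sketch of a fixture that also handles teardown, using yield and a temporary file (the file and test names are made up for illustration), could look like this:
import os
import tempfile
import pytest

@pytest.fixture
def temp_results_file():
    # Setup: create a temporary file for the test to use
    handle, path = tempfile.mkstemp(suffix=".txt")
    os.close(handle)
    yield path
    # Teardown: runs after the test has finished, even if it failed
    os.remove(path)

def test_write_results(temp_results_file):
    with open(temp_results_file, "w") as f:
        f.write("done")
    assert os.path.getsize(temp_results_file) > 0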
Parameterizing Tests#
pytest allows running a single test function with different inputs using @pytest.mark.parametrize.
import pytest

@pytest.mark.parametrize("a,b,expected", [(1, 1, 2), (2, 3, 5), (3, 3, 6)])
def test_addition(a, b, expected):
    assert a + b == expected
Conclusion#
Embracing testing and testing frameworks like pytest, and incorporating a comprehensive testing strategy, are essential steps towards achieving high-quality software development. These frameworks not only automate the testing process but also provide a structured approach to addressing a wide spectrum of testing requirements. By leveraging their capabilities, researchers and software developers can ensure thorough test coverage, streamline debugging, and maintain high standards of software quality and performance.
Task for y’all!
Remember our script from the beginning? You already went through it a couple of times and brought it up to code (get it?). Now, we would like to add some tests for our script to ensure its functionality.
Add tests that check if the dataset was downloaded and unzipped properly, as well as if the DataFrames have the correct shape. (Make sure to look at the Intro to data handling section again.)
Add tests that check if the DataFrame has the right number and types of columns after the conversion and if the first few columns contain the expected values.
You have 40 min.