Project and Data Organization#
Objectives📍
the TONIC Template
BIDS
Note: Most of the content before the BIDS section was copied from The Turing Way Handbook under a CC-BY 4.0 licence.
Project Organisation#
To organize your data, use a clear folder structure so that you can find your files. Multiple templates for this already exist. Within the NOWA project, Thorsten Arendt co-developed the comprehensive TONIC template, which can be found and downloaded here. The template is made to organize multiple small projects in one overall project folder. For example, it is well suited if you have just started your PhD and will work on three projects over the next years: you can use it to organize all of them. The folder numbering connects the projects, while components such as experiment or analysis have their own sections. Check out the documentation website of TONIC.

Fig. 6 The Turing Way project illustration by Scriberia. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807.#
If you don’t find any template that suits your needs (which I doubt…), make sure you follow these general suggestions on organization of folders:
Make sure you have enough (sub)folders so that files can be stored in the right folder and are not scattered in folders where they do not belong, or stored in large quantities in a single folder.
Use a clear folder structure. You can structure folders based on the person that has generated the data/folder, chronologically (month, year, sessions), per project (as done in the example below), or based on analysis method/equipment or data type.
Avoid overlapping or vague folder names, and do not use personal data in folder/file names.
Project Organisation: Other Examples#
This folder structure by Nikola Vukovic
You can pull/download folder structures using GitHub: This template by Barbara Vreede, based on cookiecutter, follows recommended practices for scientific computing by Wilson et al. (2017).
See this template by Chris Hartgerink for file organisation on the Open Science Framework.
How to Organize Your Digital Files by Melanie Pinola.
More Information on Project Organisation#
How to organise your data and code by Rene Bekkers.
File Naming Conventions#
Structure your file names and set up a template for this. For example, it may be advantageous to start naming your files with the date each file was generated. This will sort your files chronologically and create a unique identifier for each file. The utility of this process is apparent when you generate multiple files on the same day that may need to be versioned to avoid overwriting.
Some other tips for file naming include:
Use the date or date range of the experiment: YYYY-MM-DD
Use the file type
Use the researcher’s name/initials
Use the version number of the file (v001, v002) or the language used in the document (ENG)
Do not make file names too long (this can complicate file transfers)
Avoid special characters (?!@*%{[<>) and spaces
Avoid personal data in file names
You can explain the file naming convention in a README file so that it will also become apparent to others what the file names mean.
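As a rough illustration of the tips above, a file name can be assembled from separate fields in the terminal. This is only a sketch: the description `pilot-reaction-times`, the version number and the `.csv` extension are made-up examples, and the order of the fields is one possible convention, not a rule.

```shell
# Assemble a file name from three fields: date, description, version.
# All field values here are invented examples.
date_field=$(date +%Y-%m-%d)          # today's date as YYYY-MM-DD
description="pilot-reaction-times"    # hyphens between words within a field
version=$(printf 'v%03d' 2)           # zero-padded version number: v002

# Underscores separate the fields from each other.
filename="${date_field}_${description}_${version}.csv"
echo "$filename"
```

Whatever pattern you settle on, write it down in your README so that others can decode it.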
Jenny Bryan’s ‘naming things’ presentation (also available as a 5-minute summary video on YouTube) gives very concrete and intuitive recommendations and examples. Here’s the main content of her talk for you:
Don’t do this | Do this instead |
---|---|
myabstract.docx | 2022-09-24_abstract-for-normconf.docx |
Jane’s Filenames Use “Spaces” & Punctuation ;).xlsx | janes-filenames-are-getting-better.xlsx |
figure 1.png | fig01_scatterplot-talk-length-vs-interest.png |
JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt | 1986-01-28_raw-data-from-challenger-o-rings.txt |
Good file names are:
machine readable
human readable
sorted in a useful way
Machine readable
Wikipedia: globbing = “glob patterns specify sets of filenames with wildcard characters. For example, the Unix Bash shell command mv *.txt textfiles/ moves all files with names ending in .txt from the current directory to the directory textfiles. Here, * is a wildcard and *.txt is a glob pattern. The wildcard * stands for ‘any string of any length including empty, but excluding the path separator characters (/ in unix and \ in windows)’.”
This means: use an underscore (_) to delimit fields. For example, when you have multiple .csv files that contain data of one type of observation and need to be parsed into one data frame in the end, put the observation name in the file name and separate it from the rest of the file name with an underscore. This way you can easily find the files with the ls command and easily code the read-in for data analysis. At the same time, use a hyphen (-) to delimit words within fields.
Example: If you have a list of files named like this: 2022-09-24_Plasmid-Cellline-100-1MutantFraction_A01.csv, 2022-06-26_Plasmid-Cellline-100-1MutantFraction_H02.csv, 2022-06-26_Plasmid-Cellline-100-1MutantFraction_H03.csv, you can do multiple machine operations with them:
You can find those files among all the other files in your folder by simply typing
ls *Plasmid*
in the terminal.
You can read the different parts of the file name into separate columns of your data frame by coding a delimiter rule, e.g., in R with code like:
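library(tidyr)    # provides separate_wider_delim()
library(stringr)  # provides regex()
# assumes `files` is a data frame whose column `filenames` holds the file names
separate_wider_delim(
  files,
  cols = filenames,
  delim = regex("[_\\.]"),
  names = c("date", "assay", "well", NA)
)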
leads to an output of:
 | date | assay | well |
---|---|---|---|
1 | 2022-09-24 | Plasmid-Cellline-100-1MutantFraction | A01 |
2 | 2022-06-26 | Plasmid-Cellline-100-1MutantFraction | H02 |
3 | 2022-06-26 | Plasmid-Cellline-100-1MutantFraction | H03 |
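The same idea also works directly in the terminal. The following is a minimal sketch, not the R approach from above: it creates empty stand-ins for the three example files in a temporary folder (plus a made-up `notes.txt`), finds them with a glob pattern, and splits each name at the underscores using shell parameter expansion. The output file name `fields.tsv` is an arbitrary choice.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Empty stand-ins for the three example files, plus one unrelated file.
touch 2022-09-24_Plasmid-Cellline-100-1MutantFraction_A01.csv \
      2022-06-26_Plasmid-Cellline-100-1MutantFraction_H02.csv \
      2022-06-26_Plasmid-Cellline-100-1MutantFraction_H03.csv \
      notes.txt

# The glob *Plasmid* matches only the assay files, not notes.txt.
for f in *Plasmid*; do
    base=${f%.csv}            # drop the extension
    date_field=${base%%_*}    # text before the first underscore
    well=${base##*_}          # text after the last underscore
    assay=${base#*_}          # drop the date field ...
    assay=${assay%_*}         # ... and the well field
    printf '%s\t%s\t%s\n' "$date_field" "$assay" "$well"
done > fields.tsv

cat fields.tsv
```

This only works so smoothly because the underscore is used for nothing but separating fields; an underscore inside the assay name would break the split.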
Human readable
Make sure that at least you yourself are able to decode from the filename what is in it. Try also to make it easy for others to guess what something is.
Don’t do this | Do this instead |
---|---|
 | 01_marshal-data.md |
01.R | 01_marshal-data.R |
 | 02_pre-dea-filtering.md |
02.R | 02_pre-dea-filtering.R |
 | 03_dea-with-limma-voom.md |
03.R | 03_dea-with-limma-voom.R |
 | 04_explore-dea-results.md |
04.R | 04_explore-dea-results.R |
 | 90_limma-model-term-name-fiasco.md |
90.R | 90_limma-model-term-name-fiasco.R |
Dates
To be able to sort files in chronological order, it is always a good idea to include a date in the file name. For this you should respect ISO 8601, which states that dates should be written in the YYYY-MM-DD format. Don’t let the US convince you to use MM-DD-YYYY… they’re really the only ones using this format.
Sorted in a useful way
plan for alphanumeric sorting
put something numeric-ish first-ish
use the ISO 8601 standard for dates
left pad numbers with zeros. Otherwise the file starting with 10 will be shown above the file starting with 2 in the folder, which is confusing.
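To see the effect of left padding, here is a small sketch that sorts three made-up script names with and without zero padding. `LC_ALL=C` is set so that `sort` compares plain characters; your file manager may sort differently, so treat the unpadded order as an example of what can go wrong.

```shell
# Plain character-by-character sorting of unpadded numbers: 10 lands first.
unpadded=$(printf '%s\n' 1_import.R 2_clean.R 10_plot.R | LC_ALL=C sort)

# Left-pad the numeric prefix to two digits, then sort again.
padded=$(
    for name in 1_import.R 2_clean.R 10_plot.R; do
        number=${name%%_*}    # numeric prefix before the first underscore
        rest=${name#*_}       # everything after it
        printf '%02d_%s\n' "$number" "$rest"
    done | LC_ALL=C sort
)

echo "Unpadded:"; echo "$unpadded"
echo "Padded:";   echo "$padded"
```

With padding, the order on disk matches the order in which the scripts are meant to be run.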
File renaming tools#
If you want to change your file names, you have the option to use bulk renaming tools. Be careful with these tools, because a bulk rename can change more than you intended if you do not check the result carefully!
Some bulk file renaming tools include:
Bulk Rename Utility, WildRename, and Ant Renamer (for Windows)
Renamer (for MacOS)
PSRenamer (for MacOS, Windows, Unix, Linux)
Backups#
To avoid losing your data, you should follow good backup practices.
You should have 2 or 3 copies of your files, stored on
at least 2 different storage media,
in different locations.
Backups are ideally done automatically and should take into consideration your institute’s guidelines. The more important the data and the more often the datasets change, the more frequently you should back them up. If your files take up a large amount of space and backing up all of them proves to be challenging or expensive, you may want to create a set of criteria for when you back up the data. This can be part of your Data Management Plan.
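As a very small illustration of a manual backup, the sketch below copies a project folder into a folder named after today's date and then checks that the copy is complete. Both folders are temporary stand-ins created just for this example, and `data.txt` is a made-up file; a real backup should go to a different storage medium and follow your institute's guidelines.

```shell
# Throwaway directories standing in for a project and a second storage medium.
project=$(mktemp -d)
backup_root=$(mktemp -d)
echo "reaction times" > "$project/data.txt"

# Copy the whole project into a folder named after today's date.
stamp=$(date +%Y-%m-%d)
mkdir -p "$backup_root/$stamp"
cp -R "$project/." "$backup_root/$stamp/"

# Verify the copy: diff -r prints nothing and exits with 0 if both trees match.
if diff -r "$project" "$backup_root/$stamp"; then
    echo "backup verified"
fi
```

Doing this by hand is easy to forget, which is why automatic backups are preferable.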
Watch this video on Safe data storage and backup from the TU Delft Open Science MOOC.
Research Data Organization: BIDS#
The Brain Imaging Data Structure (BIDS) was initially created to describe and organize neuroimaging data (Gorgolewski et al., 2016). Because it succeeded in being intuitive, simple, and comprehensive at the same time (the “bidsy way”), there are now specifications for a lot of other data modalities in the fields of neuroscience and psychology. You can see all the published specifications in their handbook, and you can check out the currently proposed BIDS extensions to see if there will be a specification for your modality in the near future.
There are a lot of presentations and tutorials about BIDS, so I will focus on the main components here.
Main Principles of BIDS#
BIDS…
modularizes data
specifies a folder structure
names files in a human AND machine friendly way
uses standard interoperable file formats
documents metadata
minimizes duplication (inheritance principles)
follows the FAIR principles
From this list you can already guess the benefits of BIDS compared to “just a folder structure”: BIDS is not only a folder structure, it also specifies which metadata your project should contain and how to name and organize this metadata. Plus, it tells you which file formats to use and which not, ensuring easier collaboration and reproducible results. Because BIDS is developed as a community effort, the focus is on minimizing complexity and maximizing adoption and flexibility. And because BIDS is now so popular and used by a large community, a lot of software has been developed specifically for handling BIDS-compliant data. There are converters that bring your sourcedata into the BIDS standard, there are BIDS-Apps that automatically (pre-)process your data if it’s in the BIDS format, and there are soooo many other tools that make your daily research work easier.
Modularization#
BIDS differentiates between three stages of data:
sourcedata (= what comes out of your recording device; usually very unstructured and some special software by the provider of the device is required to read it)
raw data (= when the data is already a bit more organized and in a reusable format)
derivatives (= output data of analyses)
BIDS is mostly concerned with your raw data. It doesn’t tell you how your sourcedata should be organized, and it has only very light specifications for how derivative data should be organized. This is because BIDS concentrates mainly on the principles of interoperability and reusability. The sourcedata, i.e., the data that comes out of the device, is barely interoperable because it often comes in file formats that can only be read by the specific software that comes with the device. The sourcedata only becomes interoperable when it is turned into raw data, hence this is the kind of data people want to reuse and need in order to reproduce published results. Derivatives are the results of analysis pipelines and therefore also the product of reproduction. Of course, sometimes we also want to reuse data that was processed by a specific pipeline, so BIDS is making an effort to organize this, too.
 | Sourcedata | Raw Data | Derivatives |
---|---|---|---|
MRI | dcm | nifti | GMV |
Eyetracking | edf | eye coordinates | amplitudes |
Folder Structure#
The folder structure in BIDS has the following levels:
study level
subject level
session level
modality level

Fig. 7 The bidsy folder structure. The main folder is on the study level. The next level is organized by subjects. If you have multiple sessions, then you should include a session folder under each subject folder, too. Lastly, there is one folder per modality within each subject (or session) folder.#
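The four levels can be sketched in the terminal as follows. This is only an illustration under a few assumptions: `my_study` is a made-up study name, two subjects with two sessions each are invented, and `anat` and `func` are used as example modality folders. The sketch builds everything in a temporary location.

```shell
# Throwaway location; "my_study" is a made-up study name.
study="$(mktemp -d)/my_study"

# study level > subject level > session level > modality level
for sub in 01 02; do
    for ses in 01 02; do
        mkdir -p "$study/sub-$sub/ses-$ses/anat" \
                 "$study/sub-$sub/ses-$ses/func"
    done
done

# List the resulting directories.
find "$study" -type d | LC_ALL=C sort
```

If your study has only one session per subject, the session loop and the `ses-*` level can simply be left out.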
File Naming#

Fig. 8 File naming the bidsy way.#
Tabular Data#
All your tabular data has to be in the tsv file format (tab-separated values). The reason is simply that everyone can read a tsv file: all you need is a simple text editor, which comes pre-installed on practically every device. The headers in your table have to be written in snake_case.

Fig. 9 Tabular data in BIDS needs to be a .tsv file with headers using snake_case. Example: participants.tsv file.#
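To get a feeling for the format, the sketch below writes a tiny participants.tsv by hand and runs two simple checks on it. The columns `age` and `handedness` and all values are invented for illustration, and the checks are a rough sanity test, not a replacement for a proper BIDS validator.

```shell
tsv="$(mktemp -d)/participants.tsv"

# Header row: snake_case column names separated by tab characters.
printf 'participant_id\tage\thandedness\n' > "$tsv"
# Invented example rows, one participant per line.
printf 'sub-01\t24\tright\n' >> "$tsv"
printf 'sub-02\t31\tleft\n' >> "$tsv"

# Check 1: report any header that is not snake_case (lowercase words joined by "_").
if head -n 1 "$tsv" | tr '\t' '\n' | grep -Eqv '^[a-z0-9]+(_[a-z0-9]+)*$'; then
    echo "found a header that is not snake_case"
else
    echo "all headers are snake_case"
fi

# Check 2: every row must have as many tab-separated fields as the header.
awk -F '\t' 'NR == 1 { n = NF } NF != n { print "line " NR ": wrong number of fields" }' "$tsv"

cat "$tsv"
```

Because the file is plain text, you can open and edit it in any text editor or spreadsheet program.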
Task
Create a folder for our research project inside your Desktop/NOWA_School folder according to the BIDS specification (we will generate the raw data folders (sub-01, sub-02, …) automatically on Thursday when we do our data analysis).
Try doing it through the terminal. Your folder structure should look like this in the end:
choice_rtt/
README.md
code/
experiment/
stimuli/
sourcedata/
derivatives/
Answer
New directories are created with
mkdir name-of-new-directory
You can navigate into this new directory with
cd name-of-new-directory
If you want to go up one folder again, type
cd ..
You can create new files with
cat > filename.filetype
or
touch filename.filetype
or
nano filename.filetype
The file will be created in the directory you’re currently in.
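Put together, the whole task can be done in a few lines. This sketch makes two assumptions: it works in a temporary folder as a stand-in for Desktop/NOWA_School so that it is safe to try out, and it places all five folders directly under choice_rtt/ (adjust the paths if your structure nests them differently).

```shell
base=$(mktemp -d)    # stand-in for Desktop/NOWA_School
cd "$base" || exit 1

# -p creates parent folders as needed and does not complain if a folder exists.
mkdir -p choice_rtt/code choice_rtt/experiment choice_rtt/stimuli \
         choice_rtt/sourcedata choice_rtt/derivatives
touch choice_rtt/README.md    # empty README to fill in later

ls -p choice_rtt              # -p marks directories with a trailing slash
```

To do this for real, replace the first line with the path to your NOWA_School folder.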
Metadata#
BIDS also provides you with a description of metadata. What metadata is and what it looks like in BIDS, you’ll find out in the next section.
Versioning#
We will learn about versioning on Wednesday in the Git & GitLab course.