Data Storage and Sharing#

Objectives📍

  • Different storage providers

  • What to consider when sharing data

  • Licences

  • Copyright

Note: Most of the content was copied from The Turing Way Handbook under a CC-BY 4.0 licence.

Data loss can be catastrophic for your research project and can happen often. You can prevent data loss by picking suitable storage solutions and backing your data up frequently.
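As an illustration of backing up data, here is a minimal Python sketch that copies a file to a backup location and checks the copy against a SHA-256 checksum. The function names `sha256_of` and `backup_file` and the example file names are assumptions made for this sketch; it does not replace the backup tools or policies of your institution.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and verify the copy by checksum."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Backup of {source} does not match the original")
    return target


if __name__ == "__main__":
    # Hypothetical example data in a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "measurements.csv"
        original.write_text("sample,value\n1,0.5\n")
        copy = backup_file(original, Path(tmp) / "backup")
        print("Verified backup written to", copy)
```

Comparing checksums catches silent corruption during the copy, which a plain copy would not report. A backup on the same disk as the original does not protect against hardware failure, so in practice the backup directory should sit on a different device or service.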

Where to Store Data#

  • Most institutions will provide a network drive that you can use to store data.

  • Portable storage media such as memory sticks (USB sticks) are riskier and more vulnerable to loss and damage.

  • Cloud storage provides a convenient way to store, back up and retrieve data. You should check the terms of use before using it for your research data.

Especially if you are handling personal or sensitive data, you need to ensure the cloud option is compliant with any data protection rules the data is bound by. To add an extra layer of security, you should encrypt devices and files where needed.

Your institution might provide local storage solutions, as well as policies or guidelines restricting what you can use. We therefore recommend that you familiarise yourself with your local policies and recommendations. Please see the next section for our local storage solution, the TAM data hub.

When you are ready to release the data to the wider community, you can also search for the appropriate databases and repositories in FAIRsharing, according to your data type, and type of access to the data.

Data Repositories#

A repository is a place where digital objects can be stored and shared with others (see also this repository definition).

Data repositories provide access to academic outputs that are reliably accessible to any web user (see the OpenDOAR inclusion criteria). Repositories must earn the trust of the communities they intend to serve and demonstrate that they are reliable and capable of appropriately managing the data they hold (Lin et al. (2020)).

Long-term archiving repositories are designed for secure and permanent storage of data, ensuring data preservation over extended periods. This differs from platforms like GitHub and GitLab which primarily serve as collaborative development tools, facilitating version control and project management in a more dynamic and transient environment. Platforms such as GitHub and GitLab do not assign persistent identifiers to repositories, and their preservation policies are more flexible compared to those of data repositories.

Repositories and FAIR#

Selecting an appropriate repository for your research outputs has many benefits:

  • It helps make your Research Objects more FAIR. This is achieved through:

    • Repositories assign a Persistent Identifier to your Research Objects, which makes them findable and citable. The most commonly used persistent identifier for Research Objects is the Digital Object Identifier, usually abbreviated to DOI.

    • Repositories use metadata standards in describing your Research Object, which ensures that other people can find it using search engines.

    • Repositories add a licence to the Research Objects. A licence tells potential reusers of your work what they are allowed to do with it.

    • Repositories provide documentation for Research Objects. This can be in the form of READMEs and/or wikis that provide a description of your project and why it might be relevant to people.

    • Repositories encourage widely used file formats. Many repositories restrict the file formats that can be used, to ensure the sustainability of Research Objects. Some file formats (especially proprietary ones with a limited user base) can become deprecated.

  • It allows you to determine the levels of access to Research Objects. There are good reasons not to make all Research Objects completely open. However, it’s still worthwhile to at least open the metadata and provide an option for people to obtain access to the actual Research Objects if they have certain credentials or if they have been given explicit access. That way, your work will still be FAIR (because the metadata are findable and there is an access procedure in place), as well as secure (because you can control who has access).

    • Restricting access and storing data on European servers can help to manage sensitive data.
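To illustrate how a persistent identifier makes a Research Object citable, here is a small Python sketch that turns a DOI written in several common forms into a link through the `https://doi.org/` resolver. The function name `doi_to_url` and the DOI `10.1234/example` are hypothetical and used only for illustration. The sketch does not check that the DOI is actually registered, and it does not URL-encode special characters.

```python
def doi_to_url(doi: str) -> str:
    """Normalise a DOI string into a https://doi.org/ link."""
    cleaned = doi.strip()
    lowered = cleaned.lower()
    # Strip the prefixes a DOI is commonly written with.
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if lowered.startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
            break
    # Every DOI starts with "10." and separates prefix and suffix with "/".
    if not cleaned.startswith("10.") or "/" not in cleaned:
        raise ValueError(f"Not a recognisable DOI: {doi!r}")
    return "https://doi.org/" + cleaned


if __name__ == "__main__":
    # Hypothetical DOI, for illustration only.
    print(doi_to_url("doi:10.1234/example"))  # https://doi.org/10.1234/example
```

Because the link goes through the resolver rather than to the repository's own web address, it keeps working even if the repository later reorganises its website.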

Why not the supplemental materials?#

Supplemental materials do not follow the FAIR principles: no separate DOI is assigned to them, which makes them difficult to retrieve. Beyond this misalignment with the FAIR principles, there are other reasons why a data repository is a better solution:

  • Data control: Supplementary materials cannot be updated, unlike materials available at data repositories.

  • Interoperability: If publishers only allow text and PDF formats it hampers data sharing and it will be difficult to reuse the data.

  • Availability: Supplementary materials are difficult to access if the article is behind a paywall, and links to supplementary materials can break (since they do not have their own persistent identifier).

  • Impact: Data and code should be a primary research output instead of being hidden in the supplementary materials.

  • Publisher requirements: Some publishers recommend using a data repository instead.

  • Size limits: There may be limits on the size or number of supplementary materials that can be shared.

Selecting an appropriate repository#

This chapter outlines some of the crucial functionalities that you should look out for when picking where to share your data, code, methods, hardware, slides, or any other Research Object.

Data should be submitted to a domain- or discipline-specific, community-recognised repository where possible. A general-purpose repository can be used when there is no suitable discipline-specific repository. Discipline-specific data repositories are likely to offer more functionality for the type of data that you would like to share, as well as community standards that you can follow to make the data more FAIR.

The choice of repository can depend on multiple factors:

  • Your discipline

  • Type of digital output

  • File size

  • Policies/requirements from institutions, national policies, funding agencies

  • Access restrictions

You can search for relevant repositories on re3data and FAIRsharing. However, a search will likely result in a long list of repositories, which you will need to narrow down. The following questions may help you with that:

  • Is the data repository discipline-specific and community-recognised? Does it use the recognised standards in my discipline?

  • Is the data repository known by the research community?

  • Are others using the data repository to share their data?

  • Has a data repository been specified by my funder/publisher/institution?

  • What are the file size requirements and limitations?

  • What are the costs for data sharing?

  • What data formats are allowed? Will it take the data that you want to share?

  • Does it provide a persistent identifier, for example a Digital Object Identifier (DOI)?

  • Does it provide the right type of access control to suit the sharing conditions of the data (restricted access, embargoes)?

  • Is there support available on how to curate the data/metadata?

See the ARDC’s Guide to choosing a data repository or the DCC checklist for evaluating data repositories for more information.
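Some of the questions above can be recorded in a structured way when comparing candidates. The following Python sketch shows one possible approach; the `Repository` fields, the `shortlist` function and the repositories "Repo A" to "Repo D" are all hypothetical, and the properties of a real repository have to be looked up in its own documentation.

```python
from dataclasses import dataclass


@dataclass
class Repository:
    """A few checklist answers for one candidate (hypothetical fields)."""
    name: str
    assigns_doi: bool
    max_file_size_gb: float
    supports_restricted_access: bool
    discipline_specific: bool


def shortlist(candidates, needed_size_gb, need_restricted_access):
    """Keep repositories that meet the hard requirements.

    Discipline-specific repositories are listed first.
    """
    suitable = [
        repo for repo in candidates
        if repo.assigns_doi
        and repo.max_file_size_gb >= needed_size_gb
        and (repo.supports_restricted_access or not need_restricted_access)
    ]
    # sorted() is stable, so the original order is kept within each group.
    return sorted(suitable, key=lambda repo: not repo.discipline_specific)


if __name__ == "__main__":
    # Made-up candidates, for illustration only.
    candidates = [
        Repository("Repo A", True, 50, False, True),
        Repository("Repo B", True, 5, True, False),
        Repository("Repo C", False, 100, True, True),
        Repository("Repo D", True, 100, True, False),
    ]
    for repo in shortlist(candidates, needed_size_gb=10,
                          need_restricted_access=False):
        print(repo.name)
```

Questions such as community recognition, costs and curation support are harder to reduce to a yes/no field and still need a manual judgement.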

Data Sharing#

Motivations For Sharing Data#

There are many reasons to share your research data publicly.

  1. To allow the possibility to fully reproduce a scientific study.

  2. To prevent duplicate efforts and speed up scientific progress. Large amounts of research funds and careers of researchers can be wasted by only sharing a small part of research in the form of publications.

  3. To facilitate collaboration and increase the impact and quality of scientific research.

  4. To make results of research openly available as a public good, since research is often publicly funded.

Data Availability Statement#

Once you have made your data available, it is important to ensure that people can find it when they read the associated article. You should cite your dataset directly in the paper wherever it is relevant, include a citation in your reference list, and include a Data Availability Statement at the end of the paper (similar to the acknowledgement section).
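As a rough illustration, a Data Availability Statement can be assembled from the repository name, the DOI and the licence of the dataset. The function `data_availability_statement`, the wording of the sentence and the example values below are assumptions made for this sketch; journals often prescribe their own wording, so check the author guidelines first.

```python
def data_availability_statement(repository: str, doi: str, licence: str) -> str:
    """Build a one-sentence statement pointing readers to the dataset."""
    return (
        f"The data that support the findings of this study are available in "
        f"{repository} at https://doi.org/{doi}, under a {licence} licence."
    )


if __name__ == "__main__":
    # Hypothetical repository name and DOI, for illustration only.
    print(data_availability_statement(
        repository="Example Repository",
        doi="10.1234/example",
        licence="CC-BY 4.0",
    ))
```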

Privacy And Data Protection#

Many fields of research involve working with sensitive personal data, with medical research being the most obvious example.

Additional resources on data sharing#