Data Storage and Sharing#
Objectives📍
different storage providers
what to consider when sharing data
Licenses
Copyright
Note: Most of the content was copied from The Turing Way Handbook under a CC-BY 4.0 licence.
Data loss can be catastrophic for your research project and can happen often. You can prevent data loss by picking suitable storage solutions and backing your data up frequently.
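As a minimal sketch of a verified backup, assuming the data sit in an ordinary directory and the backup target is another mounted location (for example a network drive), the following Python copies each file and compares SHA-256 checksums of the source and the copy. The function names `sha256_of` and `back_up` are illustrative and not part of any institutional tooling.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source_dir: Path, backup_dir: Path) -> dict[str, str]:
    """Copy every file under source_dir to backup_dir and verify each copy.

    Returns a mapping of relative file path -> checksum of the verified copy.
    Raises OSError if a copy does not match its source.
    """
    checksums = {}
    # Collect the file list first so that new copies are never re-scanned.
    sources = sorted(p for p in source_dir.rglob("*") if p.is_file())
    for source in sources:
        relative = source.relative_to(source_dir)
        target = backup_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        copy_checksum = sha256_of(target)
        if sha256_of(source) != copy_checksum:
            raise OSError(f"Backup of {relative} does not match the original")
        checksums[relative.as_posix()] = copy_checksum
    return checksums
```

Storing the returned checksums next to the backup makes it possible to re-check the copies later for silent corruption.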
Where to Store Data#
Most institutions will provide a network drive that you can use to store data.
Portable storage media such as memory sticks (USB sticks) are more risky and vulnerable to loss and damage.
Cloud storage provides a convenient way to store, back up and retrieve data. You should check the provider’s terms of use before using it for your research data.
Especially if you are handling personal or sensitive data, you need to ensure the cloud option is compliant with any data protection rules the data is bound by. To add an extra layer of security, you should encrypt devices and files where needed.
Your institution might provide local storage solutions and policies or guidelines restricting what you can use. Thus, we recommend you familiarise yourself with your local policies and recommendations. Please see the next section for our local storage solution, the TAM Data Hub.
When you are ready to release the data to the wider community, you can also search for the appropriate databases and repositories in FAIRsharing, according to your data type, and type of access to the data.
Data Repositories#
A repository is a place where digital objects can be stored and shared with others (see also this repository definition).
Data repositories provide access to academic outputs that are reliably accessible to any web user (see the OpenDOAR inclusion criteria). Repositories must earn the trust of the communities they intend to serve and demonstrate that they are reliable and capable of appropriately managing the data they hold (Lin et al. (2020)).
Long-term archiving repositories are designed for secure and permanent storage of data, ensuring data preservation over extended periods. This differs from platforms like GitHub and GitLab which primarily serve as collaborative development tools, facilitating version control and project management in a more dynamic and transient environment. Platforms such as GitHub and GitLab do not assign persistent identifiers to repositories, and their preservation policies are more flexible compared to those of data repositories.
Repositories and FAIR#
Selecting an appropriate repository for your research outputs has many benefits:
It helps make your Research Objects more FAIR. This is achieved through:
Repositories assign a Persistent Identifier to your Research Objects, which makes them findable and citable. The most commonly used persistent identifier for research objects is the Digital Object Identifier, usually abbreviated to DOI.
Repositories use metadata standards in describing your Research Object, which ensures that other people can find it using search engines.
Repositories add a licence to the Research Objects. A licence describes to potential reusers of your work what they are allowed to do with it.
Repositories provide documentation for Research Objects. This can be in the form of READMEs and/or wikis that provide a description of your project and why it might be relevant to people.
Encouraging widely-used file formats. Many repositories have restrictions on the file formats used to ensure the sustainability of Research Objects. Some file formats (especially proprietary ones with a limited user base) can become deprecated.
It allows you to determine the levels of access to Research Objects. There are good reasons not to make all Research Objects completely open. However, it’s still worthwhile to at least open the metadata and provide an option for people to obtain access to the actual Research Objects if they have certain credentials or if they have been given explicit access. That way, your work will still be FAIR (because the metadata are findable and there is an access procedure in place), as well as secure (because you can control who has access).
Restricting access and storing data on European servers can help to manage sensitive data.
Why not the supplemental materials?#
Supplemental materials do not follow the FAIR principles: no separate DOI is assigned to them, which makes them difficult to retrieve. Beyond this lack of alignment with the FAIR principles, there are other reasons why a data repository is a better solution:
Data control: Supplementary materials cannot be updated, unlike materials available at data repositories.
Interoperability: If publishers only allow text and PDF formats it hampers data sharing and it will be difficult to reuse the data.
Availability: Supplementary materials are difficult to access if the article is behind a paywall, and links to supplementary materials can break (since they do not have their own persistent identifier).
Impact: Data and code should be a primary research output instead of being hidden in the supplementary materials.
Publisher requirements: Some publishers recommend using a data repository instead.
Size limits: There may be limits on the size or number of supplementary materials that can be shared.
Selecting an appropriate repository#
This chapter outlines some of the crucial functionalities that you should look out for when picking where to share your data, code, methods, hardware, slides, or any other Research Object.
Data should be submitted to a domain- or discipline-specific, community-recognised repository where possible. A general-purpose repository can be used when there are no suitable discipline-specific repositories. Discipline-specific data repositories are likely to have more functionalities for the type of data that you would like to share, as well as community standards that you can adhere to in order to make the data more FAIR.
The choice of repository can depend on multiple factors:
Your discipline
Type of digital output
File size
Policies/requirements from institutions, national policies, funding agencies
Access restrictions
You can search for relevant repositories on re3data and FAIRsharing. However, a search will likely result in a long list of repositories, which you will need to narrow down. The following questions may help you with that:
Is the data repository discipline-specific and community-recognised? Does it use the recognised standards in my discipline?
Is the data repository known by the research community?
Are others using the data repository to share their data?
Has a data repository been specified by my funder/publisher/institution?
What are the file size requirements and limitations?
What are the costs for data sharing?
What data formats are allowed? Will it take the data that you want to share?
Does it provide a persistent identifier, for example a Digital Object Identifier (DOI)?
Does it provide the right type of access control that suits the sharing conditions of the data? (restricted access/embargoes)
Is there support available on how to curate the data/metadata?
See the ARDC’s Guide to choosing a data repository or the DCC checklist for evaluating data repositories for more information.
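Several of the questions above (file size limits, accepted formats) are easier to answer with an inventory of the dataset in hand. The following Python is a rough sketch under stated assumptions: the size limit and the set of accepted extensions passed to `fits_repository` are hypothetical values that you would look up in the policy of the repository you are considering, and both function names are illustrative.

```python
from collections import Counter
from pathlib import Path


def summarise_dataset(root: Path) -> dict:
    """Report file count, total size, largest file and file extensions under root."""
    files = [p for p in root.rglob("*") if p.is_file()]
    sizes = {p: p.stat().st_size for p in files}
    largest = max(sizes, key=sizes.get) if sizes else None
    return {
        "file_count": len(files),
        "total_bytes": sum(sizes.values()),
        "largest_file": largest.relative_to(root).as_posix() if largest else None,
        "largest_bytes": sizes[largest] if largest else 0,
        # Files without an extension are counted under "(none)".
        "extensions": dict(Counter(p.suffix.lower() or "(none)" for p in files)),
    }


def fits_repository(summary: dict, max_total_bytes: int,
                    allowed_extensions: set[str]) -> list[str]:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if summary["total_bytes"] > max_total_bytes:
        problems.append("dataset exceeds the total size limit")
    for extension in sorted(set(summary["extensions"]) - allowed_extensions):
        problems.append(f"format not accepted: {extension}")
    return problems
```

A check like this only covers size and format; the remaining questions (costs, access control, curation support) still need to be answered from the repository's documentation.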
Types of repositories
If your discipline does not have a discipline-specific repository, you can make use of several general repositories. Below follows a (non-exhaustive) list of these different types of repositories:
General purpose repositories
Project repositories
CRAN for R-Packages
Generic data repositories
Repositories for neuroscience data
DANDI
OpenNeuro
Institutional or National repositories
Many countries and/or institutions also provide access to repositories that you could use (see below “The TAM Data Hub”). Check with your local Research Data Management support to see if this is available at your institute, or try to search for such a national repository using re3data and FAIRsharing.
Data Sharing#
Motivations For Sharing Data#
There are many reasons to share your research data publicly.
To allow the possibility to fully reproduce a scientific study.
To prevent duplicate efforts and speed up scientific progress. Large amounts of research funds and careers of researchers can be wasted by only sharing a small part of research in the form of publications.
To facilitate collaboration and increase the impact and quality of scientific research.
To make results of research openly available as a public good, since research is often publicly funded.
Steps To Share Your Data
Step 1: Select what data you want to share
Not all data can be made openly available, due to ethical and commercial concerns (see the Open Data section), and you may decide that some of your intermediate data is too large to share. As such, you first need to decide which data you need to share for others to be able to reproduce your research.
Step 2: Choose a data repository or other sharing platform
Data should be shared in a formal, open, and indexed data repository where possible so that it will be accessible in the long run. Suitable data repositories by subject, content type or location can be found at Re3data.org, and in FAIRsharing, where you can also see which standards (metadata and identifier) the repositories implement and which journals and publishers recommend them. If possible, use a repository that assigns a DOI, a digital object identifier, to make it easier for others to cite your data. Have a look at the Citing Research Objects section to see how to share and cite your data and other research objects. The cm-citable-linking section explains several options for linking your data and other research objects.
A few public data repositories are Zenodo, Figshare, Harvard Dataverse, 4TU.ResearchData, and Dryad. See the NIH list of Generalist Repositories for more data repositories.
Step 3: Choose a licence and link to your paper and code
So that others know what they can do with your data, you need to apply a licence to your data. The most commonly used licences are Creative Commons, Open Government Licence, or an Open Data Commons Attribution License. To get maximum value from data sharing, make sure that your paper and code both link to your data, and vice versa, to allow others to understand your project better.
Step 4: Upload your data and documentation
In line with the FAIR principles, upload the data in open formats as much as possible and include sufficient documentation and metadata so that someone else can understand your data. It is also essential to think about the file formats in which the information is provided. Data should be presented in structured and standardised formats to support interoperability, traceability, and effective reuse. In many cases, this will include providing data in multiple standardised formats, so that it can be processed by computers and used by people.
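To illustrate Step 4, here is a minimal sketch that writes tabular data to CSV (an open format) together with a small machine-readable description of the columns. The metadata layout (`description`, `file`, `columns`) and the file names are ad hoc choices made for this example, not a community metadata standard; where your discipline has a standard, use that instead.

```python
import csv
import json
from pathlib import Path


def write_dataset(directory: Path, rows: list[dict], description: str,
                  units: dict[str, str]) -> None:
    """Write rows to data.csv and describe the columns in metadata.json."""
    directory.mkdir(parents=True, exist_ok=True)
    columns = list(rows[0])  # column order follows the first row
    with (directory / "data.csv").open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    metadata = {
        "description": description,
        "file": "data.csv",
        "columns": [{"name": name, "unit": units.get(name, "")} for name in columns],
    }
    (directory / "metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )
```

Keeping the description in a separate plain-text file means both people and programs can read it without opening the data itself.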
Data Availability Statement#
Once you have made your data available, it is important to ensure that people can find it when they read the associated article. You should cite your dataset directly in the paper in places where it is relevant, and include a citation in your reference list, as well as include a Data Availability Statement at the end of the paper (similar to the acknowledgement section).
Privacy And Data Protection#
Many fields of research involve working with sensitive personal data, with medical research being the most obvious example.
Data Privacy Strategies
There are a number of strategies that you can adopt to safeguard the privacy of your research subjects:
1. Data minimisation
If personal information isn’t needed, don’t collect it.
Periodically review whether you are retaining unnecessary identifying information.
When identifying information is no longer needed, safely remove, delete or destroy it.
2. Data retention limits
Decide how long you will retain identifiable data before removing direct identifiers, applying more complex anonymisation techniques, or deleting the data altogether.
When deleting sensitive data you need to be aware that standard methods for deleting files (for example moving files to the recycle bin and emptying it) are not secure. These deleted files may be recovered. Use software like BleachBit (Linux, Windows), BC Wipe, DeleteOnClick and Eraser (Windows) or Permanent Eraser or ‘secure empty trash’ (Mac) to safely delete the data. An alternative is the physical destruction of the storage media. Degaussing disturbs the magnetic alignment of magnetic storage media (such as hard drives and tapes) and may render those unusable. If you encrypted the data (see point 4 below), you can also delete the encryption key.
3. Secure data transfer
Before deciding to transfer personal data, you should consider whether the transfer of identifiable data is necessary. For example, can data be de-identified or anonymised?
If data cannot be made unidentifiable then you must ensure you have authority to transfer the personal data, and that there are appropriate safeguards in place to protect the data before, during and after transit.
Keeping data in one place is safer than transferring it elsewhere. Consider whether it is possible to provide access to the data, instead of transferring them outside of your institution.
Often your university or institute will provide solutions for secure file transfer. Contact your research data, privacy or IT support team for guidance.
4. Encryption
Encryption provides protection by ensuring that only someone with the relevant encryption key (or password) will be able to access the contents.
Protect on disk level: BitLocker for Windows, FileVault for macOS
Protect on “container” level (a folder containing multiple files): VeraCrypt (or Archive for macOS)
Portable storage: BitLocker
File level / Exchange information:
Simple method: use 7-Zip, and pack with a password
More complicated to set up: use PGP tooling (can also be used to securely send email)
See the Ghent University Encryption for Researchers manual for more details and step-by-step guides
5. Access permissions
Control who has access to which parts of the data, and which type of permissions they have, such as “read” vs. “write” access.
Deny access to sensitive data if that access is no longer needed.
Password protection.
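On a single machine, the “read” vs. “write” distinction above corresponds to file permission bits. The sketch below assumes a POSIX system (Linux or macOS); on Windows `os.chmod` only toggles the read-only flag, and on shared or network storage access is normally managed through the storage provider rather than with code like this. The function names are illustrative.

```python
import os
import stat
from pathlib import Path


def make_owner_read_only(path: Path) -> None:
    """Let only the file owner read the file; remove all write access."""
    os.chmod(path, stat.S_IRUSR)  # equivalent to mode 0o400


def permissions(path: Path) -> str:
    """Return the permission bits as an octal string such as '0o400'."""
    return oct(stat.S_IMODE(path.stat().st_mode))
```

Calling `make_owner_read_only` on a raw data file protects it from accidental overwriting by your own scripts as well as from access by other accounts on the same machine.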
6. Anonymisation
Anonymisation is a process by which identifying information in a dataset is removed. It is used primarily to allow data to be shared or published without revealing the confidential information it contains.
Where possible, direct identifiers (such as names, addresses, telephone numbers and account numbers) should be removed as soon as the identifying information is no longer needed. You can delete the data or replace it with pseudonyms. For qualitative data you should replace or generalise identifying characteristics when transcribing interviews.
De-identified data that can be re-identified using a linkage file (for example, information linking data subjects to identifiable individuals) is known as pseudonymised data. NOTE: In this instance, the linkage file should be encrypted and stored securely and separately from the de-identified research data.
Identification of individuals in pseudonymised or de-identified data may still be possible using combinations of indirect identifiers (such as age, education, employment, geographic area and medical conditions). Further, data and outputs containing small cell counts may be potentially disclosive, particularly where samples are drawn from small populations or include cases with extreme values or relatively rare characteristics.
As such, when intending to share potentially identifiable data or the outputs generated from the data, you may need to consider more advanced anonymisation techniques such as statistical disclosure control (SDC, see this handbook for more information).
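The pseudonymisation and small-cell checks described above can be sketched as follows. This is an illustration under assumptions, not a substitute for statistical disclosure control: the field names, the `P-` prefix and the default threshold of 5 are hypothetical choices, and the linkage table returned by `pseudonymise` still has to be encrypted and stored separately from the research data.

```python
import secrets
from collections import Counter


def pseudonymise(records: list[dict],
                 identifier_fields: list[str]) -> tuple[list[dict], dict[str, dict]]:
    """Split records into de-identified rows and a linkage table.

    Each record receives a random pseudonym that is not derived from the
    person's identity. The linkage table maps pseudonym -> removed identifiers.
    """
    research_data, linkage = [], {}
    for record in records:
        pseudonym = "P-" + secrets.token_hex(8)
        while pseudonym in linkage:  # extremely unlikely, but keep pseudonyms unique
            pseudonym = "P-" + secrets.token_hex(8)
        linkage[pseudonym] = {field: record[field] for field in identifier_fields}
        kept = {k: v for k, v in record.items() if k not in identifier_fields}
        research_data.append({"participant_id": pseudonym, **kept})
    return research_data, linkage


def small_cells(records: list[dict], indirect_identifiers: list[str],
                threshold: int = 5) -> dict[tuple, int]:
    """Return combinations of indirect identifiers shared by fewer than `threshold` records."""
    counts = Counter(
        tuple(record[field] for field in indirect_identifiers) for record in records
    )
    return {combo: count for combo, count in counts.items() if count < threshold}
```

Any combination reported by `small_cells` marks records that may be re-identifiable and that may need generalising (for example wider age bands) or suppressing before sharing.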
For more information about anonymisation
Watch a presentation on Amnesia – Data Anonymisation Made Easy or a webinar on Amnesia - a tool to make anonymisation easy
Or read an explanation by the Finnish social science data archive
Citing Research Objects
You should cite research objects directly in the paper in places where they are relevant. This is a commonly practised way of citing publications and is valid for citing other research components like data and software. A citation includes the following information:
Author
Title
Year of publication
Publisher (for data, this is often the data repository where it is housed)
Version (if indicated)
Access information (a URL or DOI)
A citation style is a specific arrangement, order and formatting of information necessary for a citation. For instance, the MLA style was developed by the Modern Language Association (originally used in the humanities) and the APA style was developed by the American Psychological Association (originally used in psychology and the social sciences).
MLA citation style uses the following format:
Author. "Title of the Source." Title of the Container, Other contributors, Version, Number, Publisher, Publication date, Location.
APA citation style uses the following format:
Author. (Year). Title of data set (Version number). [Retrieved from] OR [DOI]
See Scribbr Citation Styles Guide. See also FORCE11 resource.
Citing Data
When sharing a dataset, use the assigned DOI (from the data repository) and add this to your data availability statement at the end of the paper (similar to the acknowledgement section). It is important to also cite your dataset in the references themselves, as only the citations in the reference section will contribute to citation counts. Data citation is important because it facilitates access, transparency and potentially reproducibility, reuse, and credit for researchers. It also provides recognition and visibility for the repositories that share data.
You can find examples of these statements in the publishers’ (research data) author policies.
Data availability statement examples:
Using the Digital Object Identifier (DOI): “The data that support the findings of this study are openly available in [repository name] at http://doi.org/[doi].”
If no DOI is issued:
“The data that support the findings of this study are openly available in [repository name] at [URL], reference number [reference number].”
When there is an embargo period you can reserve your DOI and still include a reference to the dataset in your paper:
“The data that support the findings will be available in [repository name] at [URL / DOI] following a [6 month] embargo from the date of publication to allow for the commercialisation of research findings.”
When data cannot be made available:
“Restrictions apply to the data that support the findings of this study. [Explain nature of restrictions, for example, if the data contains information that could compromise the privacy of research participants] Data are available upon reasonable request by contacting [name and contact details] and with permission of [third party name].”
“The data that support the findings of this study are available upon request. Access conditions and procedures can be found at [URL to restricted access repository such as EASY].”
When code is shared:
Data and code to reproduce the results shown in the paper can be obtained from The Turing Way (2023) at Zenodo (https://zenodo.org/doi/10.5281/zenodo.3233853) and GitHub (the-turing-way/the-turing-way). We used R version 4.2.2 (use citation() to check the suggested citation) and the following R packages: ggplot2 (Wickham 2016), another example (and citation added to the references!).
More Data Availability Statement examples:
You can find more examples on Manchester’s Data Access Statements page, the Cambridge Data Availability Statement examples, the AMS Data Availability Statement examples, or Nature’s Tips for writing a dazzling Data Availability Statement.
Citing Software
A software citation has many of the same elements as a data citation, described above; these are described in more detail in the Software Citation Principles. When using others’ software, it is vital to cite and attribute it properly. See also How to Cite R and R Packages for more information.
To make your code citable, you can use the integration between Zenodo and GitHub.
Create a CITATION.cff file to tell people how to cite your software. Use this handy guide to format the file.
Link your GitHub account with a Zenodo account. This guide explains how.
You can tell Zenodo what information or metadata you want to include with your software by converting your CITATION.cff file to .zenodo.json.
pip install cffconvert
cffconvert --validate
cffconvert --format zenodo --outfile .zenodo.json
Add .zenodo.json to your repository.
On Zenodo, flip the switch to the ‘on’ position for the GitHub repository you want to release.
On GitHub, click the Create a new release button. Zenodo should automatically be notified and should make a snapshot copy of the current state of your repository (just one branch, without any history), and should also assign a persistent identifier (DOI) to that snapshot.
Use the DOI in any citations of your software and tell any collaborators and users to do the same!
To make your code citable through automated publication of your GitLab repository to Zenodo:
Create a CITATION.cff file to tell people how to cite your software. Use this handy guide to format the file.
Convert your CITATION.cff file to .zenodo.json. This file tells Zenodo what information or metadata you want to include with your software.
pip install cffconvert
cffconvert --validate
cffconvert --format zenodo --outfile .zenodo.json
Add .zenodo.json to your repository.
Use the gitlab2zenodo package to publish a snapshot of your repository to your Zenodo instance. By following the installation and setup instructions of this package, you will get a workflow on your CI that will take care of the publication to Zenodo.
Use the DOI in any citations of your software and tell any collaborators and users to do the same!
Note
If you don’t have a Zenodo record for your software yet when you attempt to publish it for the first time, you may encounter an error due to the undefined ID. To address this issue, we recommend manually creating a record on Zenodo and updating the value of the CI variable zenodo_record. Detailed instructions for this process can be found in the gitlab2zenodo installation and setup instructions.
Licensing
‘Intellectual Property (IP)’ law is a complex subject. However, some understanding of it is important for anyone producing creative works governed by it, including software, datasets, graphics and more. This is true irrespective of the nature of your project: closed commercial projects building on open tooling; commercial projects maintaining an open resource; open community-driven and/or non-profit projects. Each of these may need to make slightly different licensing choices from the beginning of their projects to be compatible with their goals.
This chapter aims to give a brief summary of relevant intellectual property laws (enough to be able to read most software and related licenses), explain free and open source software licensing, and explain how combining software from different sources works from a legal perspective. Decisions about licensing made at the inception of a project can have long-lasting and significant ramifications. The choices that you make about how your work is licensed shape who can and cannot legally use your work and for what purpose. Consequently, this chapter will feature some discussion of the ethical ramifications of licensing choices. It aims to be informative about the implications of licensing choices for the use of your work but not to prescribe a specific ethic, as there are divergent schools of thought on the ethics of different licensing choices.
Many of the concepts which apply to the licensing of software, data, AI/ML models, hardware and other creative works such as visuals are shared, and these common concepts will be covered here. We will address the specifics of licensing each of these types of output in their own sub-chapters, as well as a separate sub-chapter on license compatibility.
Intellectual property is an umbrella term that refers to a number of distinct areas of law, primarily these three: copyright, patents, and trademarks.
What these have in common is the attempt to extend property rights to intangible goods, meaning their use by others can be prevented or licensed. Governments with such laws effectively create a limited grant of monopoly over these goods for their creators, and other holders of these rights. This is generally done with the ostensible intent to incentivise the creation and improvement of such goods, but can in practice result in perverse incentives which fail to do so.
NOTE: It is important to consider that copyright, licenses, and patents are all legal concepts. As such, they are subject to what the law prescribes, which may change over time and space. Simply put, different countries have different laws, and follow different procedures with regard to enforcing them. The content provided here is broadly based on American and European law and legal traditions. It might not be applicable or relevant - might even be contraindicated - in your particular context. However, most nations are signatories to international treaty agreements which somewhat harmonise these laws, notably the Berne Convention, the TRIPS Agreement, and others under the World Intellectual Property Organization (WIPO). Whilst international efforts have sought to harmonise copyright enforcement, the real world is a messy place.
Good legal advice is timely, specific, and given by an expert; this chapter is none of these. It was written by engineers & scientists, not by lawyers, and it is a heavily simplified overview of a very complex field. The intent is to give you an overview of the basics so that you will know when to check whether something you want to do has potential legal ramifications. Do not make any important decisions based solely on the contents of this chapter.
So do not take the descriptions provided or viewpoints shared as legal advice; they are not. This document is not intended to be used in that manner. Consult a legal expert to provide actual legal advice for your case.
Perhaps the most relevant part of intellectual property law for software, data and other creative works is copyright. We will dispense quickly with patents and trademarks here, so we can move on to the main topic of copyright.
Patents
The most important difference between patent and copyright to be immediately aware of is that by default all rights are retained by the author on works made public under copyright, whereas patents must be registered before their content is publicly disclosed. Thus, if you want to patent something, you must do so prior to sharing it publicly. The precise details of what constitutes a disclosure and the strictness of the application of this rule can vary by jurisdiction.
Patents on processes and software rather than specific inventions are a matter of contention in US law and explicitly not recognised in EU law (at time of writing). Unlike copyright, you generally have to pay to register and maintain a patent. You must also do so in each jurisdiction in which you want this patent to apply, though some have reciprocal agreements for recognising patents from other jurisdictions. To ensure that patents held by the authors of software do not impact on the freedom to use and distribute open software, some licenses specifically include permission to use any applicable patents (for example section 3 of the Apache 2.0 license), though this cannot protect against patents held by third parties.
Trademarks
Trademarks are a brand, symbol, or identifying mark associated with a project, product or service. Trademarks differ from copyright and patents in that their primary function is consumer protection. They prevent bad actors from impersonating recognisable brands and deceiving consumers into purchasing products that are not being offered by who they think they are. They, like patents, must also be registered, but unlike patents, this can be done after they have been made public.
Registering a trademark generally comes with an administrative fee, but is not as costly as maintaining a patent. Trademarks generally only apply within a specific sector, as people are unlikely to confuse brands which do completely different things. They can be relevant in the context of the name and logo of a software project, especially when a project changes hands or is forked, in which case the fork may not be able to use the original name of the project even if that project is no longer maintained. Open source projects not associated with a company which have trademarks may have these held by a legal entity such as a non-profit, through which they might also take donations and pay for project infrastructure. It can be valuable for open source projects to register for trademarks as their work can easily be cloned, modified and re-distributed with ill intent. Examples of modified open source tools distributed with malware added have been documented, and trademark enforcement could in some cases help to prevent or deter this. Nextcloud, for example, has a very comprehensive guide to the use of their marks with excellent explanations for the restrictions that they place on their use.
Copyright
By default, if you make a work publicly available, you retain the copyright to that work and all rights that this gives you over it. Anyone wishing to re-use that work must seek to license the right to do so from you, or open themselves to the possibility of a lawsuit for infringing on your copyright. Irrespective of how you choose to license your work, however, there are some generally accepted exceptions to the protections of copyright that permit the re-use of works (or parts of works) without the consent of the copyright holder, under certain circumstances. These are known as ‘fair use’ or ‘fair dealing’ exceptions. Under the ‘fair use’ standard originating in the USA, the following criteria are considered on a case-by-case basis to decide if a use constitutes an infringement of copyright:
From 17 U.S.C. § 107
the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
the nature of the copyrighted work;
the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
the effect of the use upon the potential market for or value of the copyrighted work.
The ‘fair dealing’ standard, originating in British law, generally includes more explicitly enumerated exceptions but with similar intent. Notably, disputes over what constitutes fair use are not easily administrable and can require protracted court proceedings to settle definitively.
For anyone wishing to circulate their work and grant others the right to re-use, remix, or re-distribute that work free of charge, coming to individual licensing arrangements with everyone who might want to do this is obviously impractical. To address this, there exist numerous pre-made ‘off-the-shelf’ licenses that you can apply to your work. Which of these you choose shapes how and under what circumstances others are permitted to re-use your work without infringing on your copyright.
Pre-made licenses exist that are tailored to the differences between different types of works. For example, there are licenses intended to be used for software and licenses intended to be used for other creative works such as images, prose (text), as well as hardware & designs.
In addition, there are now licenses tailored for machine learning or artificial intelligence models, as these comprise several parts, including training data, code, and model weights. Each of these parts may be licensed differently, and there is even some dispute as to whether model weights are subject to copyright at all under current law.
This is an area which is likely to see (by legal standards) rapid changes in the near future, given recent developments in the commercialisation of AI/ML models.
There are some general principles which apply to licenses across the different types of entity that they try to license. Licenses can generally be placed on a spectrum from proprietary, through permissive, to ‘share alike’ or ‘copyleft’ (the opposite of copyright). This spectrum is something of an oversimplification, and there are some extensions and caveats we’ll get to later.
What are ‘Usage Restricting’ Licenses?
Usage restricting licenses seek to affirmatively protect users or others affected by the use of the work by placing specific restrictions on its use. This curtails freedom 0, the freedom to use software ‘for any purpose’, by prohibiting the use of the software, or other system, for unethical purposes. Both ‘Ethical source’ and ‘Responsible AI’ licenses are examples of this approach and seek to place restrictions on the uses to which licensees can put the software or machine learning systems licensed in this fashion. Consequently, these licenses would not be considered free or open source licenses under the classical definitions of free and open source software from the FSF and OSI. They do, however, generally resemble free and open source licenses in the other three criteria of the definition. Their merits versus conventional open source licenses have been the subject of some debate, and their adoption has thus far been relatively limited.
Even an attribution requirement (the BY in CC-BY) can in some cases be considered a usage restriction. For example, the Debian project found the Common Public Attribution License (CPAL) to be incompatible with their free-software guidelines for this reason, whilst it is approved by the Open Source Initiative. In the case of academic works, attribution requirements can serve to reinforce the citation convention with the force of copyright law.
Where to find open licenses for different types of work
Code
The Open Source Initiative (OSI) maintains a list of approved open source licenses
The Free Software Foundation maintains a list of GPL-Compatible Free Software Licenses
choosealicense.com provides a tool to guide you through the license choice process.
The Organisation for Ethical Source maintains a list of ethical source licenses
Prose, Images, Audio, Video, Datasets, and similar
Machine Learning (ML) / Artificial Intelligence (AI) systems
Creative Commons and software licenses can be applied to different parts of ML/AI systems: Creative Commons licenses to training data and weights, and software licenses to the code used in training and deployment.
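As a rough illustration of licensing the parts of an ML/AI system separately, the sketch below records one license per component and groups the components by license. The component names and the specific license choices (`CC-BY-4.0`, `Apache-2.0`) are hypothetical examples chosen for illustration, not recommendations from this guide.

```python
from typing import Dict, List

# Hypothetical manifest: component names and license choices are illustrative only.
COMPONENT_LICENSES: Dict[str, str] = {
    "training_data": "CC-BY-4.0",
    "model_weights": "CC-BY-4.0",
    "training_code": "Apache-2.0",
    "deployment_code": "Apache-2.0",
}


def licenses_in_use(manifest: Dict[str, str]) -> Dict[str, List[str]]:
    """Group component names by the license identifier applied to them."""
    grouped: Dict[str, List[str]] = {}
    for component, license_id in manifest.items():
        grouped.setdefault(license_id, []).append(component)
    return grouped
```

Keeping such a record alongside the project makes it easier to see at a glance which license text needs to accompany which part when the system is shared.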
Licensing enforcement
There have been a number of successful legal cases brought in defence of the terms of copyleft licenses, obliging the parties abusing the terms of these licenses to appropriately release their code. But such violations can be hard to discover, as it is not immediately obvious from looking at a black box proprietary end product whether copyleft code has been used.
Organisations which take legal action in defence of free software, and which can provide information and resources for anyone else seeking to do the same, include:
Contributor License Agreements
The holder of the copyright on a copyleft project can still re-license that project or dual-license that project under a different license, for example to grant exclusive rights to commercially distribute that project with proprietary extensions or to make future versions proprietary. In a large community developed project, this would require the consent of all contributors, as they each own the copyright to their contributions. To get around this, some copyleft projects developed by companies that commercially license proprietary extensions to these projects ask their contributors to sign contributor license agreements (CLAs) which may assign the contributor’s copyright to the company, or include other provisions so that they can legally dual-license the project.
How and where to add licenses
Wherever you share your project, it is likely to be organised in a hierarchy of directories. Place a plain text file containing the license in the top level directory of your project.
If it is a git project that is shared on a git forge like GitHub or GitLab, for example, using a standard file name like LICENSE will allow your license to be picked up by the host and displayed on your project.
If the license that you have used has a standardised short name from SPDX, then this will be displayed as a small icon on your project's home page by these hosts.
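A minimal sketch of checking that a project has a license file in its top level directory is shown below. The list of candidate file names is an assumption for illustration; hosts may recognise other names, so check your host's documentation.

```python
from pathlib import Path
from typing import Optional

# Assumed set of conventional names; hosts may recognise others.
CANDIDATE_NAMES = ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING")


def find_license_file(project_root: Path) -> Optional[Path]:
    """Return the first conventionally named license file in the
    top level directory of the project, or None if there is none."""
    for name in CANDIDATE_NAMES:
        candidate = project_root / name
        if candidate.is_file():
            return candidate
    return None
```

For example, `find_license_file(Path("."))` run from the project root returns the path to the license file if one of the assumed names is present.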
It can also be useful to include license information in the form of standard strings at the top of each text file in your project.
Useful tools which automate this are available from REUSE, a project from the FSFE, which developed the specification.
Per-file license information is especially helpful if your project contains material that is licensed in multiple different ways, or if a part of your project is being used in someone else’s project which uses a different (compatible) license.
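The per-file license strings described above can be sketched as follows. This is an illustrative helper, not the REUSE tool itself: the function names, the `#` comment prefix, and the five-line search window are assumptions, while `SPDX-License-Identifier:` is the standard SPDX tag. For real projects the REUSE tooling is likely the better choice, since it also handles copyright lines and many comment styles.

```python
from pathlib import Path

SPDX_TAG = "SPDX-License-Identifier:"


def has_spdx_header(text: str, max_lines: int = 5) -> bool:
    """Return True if an SPDX license tag appears in the first few lines."""
    return any(SPDX_TAG in line for line in text.splitlines()[:max_lines])


def add_spdx_header(path: Path, license_id: str, comment_prefix: str = "#") -> bool:
    """Prepend an SPDX tag as a comment line; return True if the file changed."""
    text = path.read_text(encoding="utf-8")
    if has_spdx_header(text):
        return False  # already tagged, leave the file untouched
    header = f"{comment_prefix} {SPDX_TAG} {license_id}\n"
    if text.startswith("#!"):
        # Keep a shebang line first so scripts stay executable.
        first, _, rest = text.partition("\n")
        text = f"{first}\n{header}{rest}"
    else:
        text = header + text
    path.write_text(text, encoding="utf-8")
    return True
```

Calling `add_spdx_header(Path("analysis.py"), "MIT")` on a hypothetical file `analysis.py` would add the line `# SPDX-License-Identifier: MIT` at the top, and calling it a second time would leave the file unchanged.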
Additional resources on data sharing#
‘How can you make research data accessible?’: a blog that contains five steps to make your data more accessible
The European Commission’s data guidelines
Videos on Data sharing and reuse & Data Preservation and Archiving from the TU Delft Open Science MOOC.
Coursera Videos from Research Data Management and Sharing on the Benefits of Sharing, Why Archive Data?, and Why is Archiving Data Important?
Blog: Ask not what you can do for open data; ask what open data can do for you