Submission

Caltech Library

introduction

title

Naming data files descriptively for easier reuse

short description

A worksheet for creating file naming conventions to label research data descriptively and consistently

Submission Form - Scored Questions

Please complete all questions related to your project. All questions in this section are scored.

Category of submission

Data reuse

Overview / Abstract

In an ideal world, research data is always accompanied by either research notes or, for shared data, metadata. But in reality, researchers often waste time digging through files on their computer trying to remember either what data each file contains or which file is the one the researcher actually needs. This is where consistent and descriptive file names make or break the reusability of research data by helping researchers easily understand what is in a file and how it differs from its neighbors. File naming conventions are one of the most basic ways to help organize files, richly describe them, and differentiate between related files – all factors that enable data reuse – yet researchers are rarely taught how to create a good file naming system. This results in poorly named files that tell the researchers who created them very little about what they contain, let alone anyone else looking to reuse that data. This recipe for creating a file naming convention guides researchers through how to create a custom file naming convention in eight steps. The recipe also works through an example file name for microscopy images. By following this recipe, researchers will gain a consistent, descriptive, and documented file naming convention to use on a group of related files. Using this convention will help researchers organize their files, find them easily, and be able to tell at a glance what each file contains so that files may be reused, even years after they are created.

Data sharing or reuse recipe title

A worksheet for creating file naming conventions to label research data descriptively and consistently

Data Sharing or Reuse Practices

Researchers often forget details about their data as time passes after the data is collected. This is why good research practices emphasize documentation for data collection and metadata for data sharing. These methods are not foolproof, however, as digital research data can easily be separated from written research notes and shared data divided from any repository metadata. For these reasons, it is useful to include metadata within the files themselves in the form of descriptive file names. This small but mighty practice can make all of the difference in the ability to find a specific file and understand what it contains without wasting huge amounts of precious research time trying to find the exact file a researcher needs.

This recipe for creating a file naming convention (Briney 2020) consists of eight steps: six questions to answer and two steps for documentation. The questions, as follows, walk a researcher through the key decisions that need to be made to create a consistent file naming scheme:

What group of files will this naming convention cover?
What information (metadata) is important about these files and makes each file distinct?
Do you need to abbreviate any of the metadata or encode it?
What is the order for the metadata in the file name?
What characters will you use to separate each piece of metadata in the file name?
Will you need to track different versions of each file?

Each step in the recipe is accompanied by clarifying instructions and applied to an example naming convention for a set of microscopy files. The recipe itself is formatted as a worksheet that can be worked through again and again for various groups of files and projects. The included example demonstrates the utility of working through this process – the final example name “P1-MUS023_20200229_051_raw.tif” (meaning the data is from: project 1; mouse #23; 51st image taken on Feb 29, 2020; image is direct from the microscope) is a much richer and more useful file name than a generic “mydata.tif”.
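As a minimal sketch, the example name above could also be assembled by a short script so that every file in the group follows the same pattern. The function name `build_image_name`, its parameters, and the zero-padding widths are hypothetical; they are inferred from the example name and are not part of the worksheet itself.

```python
from datetime import date


def build_image_name(project: int, mouse: int, taken: date,
                     image_number: int, stage: str, ext: str = "tif") -> str:
    """Assemble a name shaped like the worksheet's microscopy example.

    Field order, separators, and padding are assumptions inferred from
    "P1-MUS023_20200229_051_raw.tif"; adapt them to your own convention.
    """
    subject = f"P{project}-MUS{mouse:03d}"   # project and mouse, joined by a dash
    day = taken.strftime("%Y%m%d")           # ISO 8601 basic date, YYYYMMDD
    sequence = f"{image_number:03d}"         # zero-padded so names sort in order
    return f"{subject}_{day}_{sequence}_{stage}.{ext}"


print(build_image_name(1, 23, date(2020, 2, 29), 51, "raw"))
# prints: P1-MUS023_20200229_051_raw.tif
```

Generating names this way is optional; the point is that once the six questions are answered, the convention is regular enough to be written down as a rule.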

File naming conventions are a small but important part of the larger data sharing and reuse ecosystem. They act as flexible metadata, occupying the space between unstructured research notes and highly structured metadata for shared datasets. While file naming conventions rarely utilize existing metadata standards, this recipe does leverage the date standard ISO 8601 (YYYY-MM-DD or YYYYMMDD), which allows dates in file names to sort chronologically. The goal of consistent file naming is to balance flexibility and structure, allowing consistent naming conventions to fill an important niche between other resources and tools for data sharing and reuse.
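The sorting benefit of ISO 8601 dates can be seen in a small sketch. The file names below are invented for illustration; the comparison uses plain alphabetical sorting, which is roughly what a file browser does.

```python
# Hypothetical file names that differ only in how the date is written.
iso_names = ["20200229_scan.tif", "20191105_scan.tif", "20200103_scan.tif"]
us_names = ["02292020_scan.tif", "11052019_scan.tif", "01032020_scan.tif"]

# YYYYMMDD: alphabetical order is also chronological order.
print(sorted(iso_names))
# ['20191105_scan.tif', '20200103_scan.tif', '20200229_scan.tif']

# MMDDYYYY: the 2019 file sorts last even though it is the oldest.
print(sorted(us_names))
# ['01032020_scan.tif', '02292020_scan.tif', '11052019_scan.tif']
```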

Despite being so beneficial, file naming conventions are often overlooked as something necessary to teach to research trainees. While fields such as computer science do cover how to name data (though usually when naming variables within code instead of naming files), most often researchers are left to figure out file naming all on their own – or not at all. The file naming convention worksheet submitted for this prize was designed to make the process of creating a file naming convention clear, with discrete steps to follow. It can be used by those who have never used a naming convention before as well as researchers who want to design a better one.

The file naming convention worksheet supports the FAIR principles of data sharing and reuse. It leverages the “Findable” portion of the FAIR principles by directing researchers to incorporate rich metadata into the file names themselves, augmenting any metadata that may accompany files. File naming conventions also create unique file names for individual files, making it easier to identify which dataset is which. Putting metadata directly into the file name also makes the data “Accessible”, as there is no reliance on a proprietary protocol to gain information on what a file contains. Consistent file naming, particularly when used by large groups, also makes data “Interoperable”, as anyone within the group can understand what a file contains when they know the naming scheme. All of these attributes support the “Reusable” ideal of FAIR by providing rich metadata about files, even in the absence of a research notebook or metadata in a repository record.

The file naming convention worksheet is openly available under a CC BY license in the Caltech institutional repository, allowing for use by researchers seeking to create file naming conventions as well as modification by instructors teaching file naming. It has been used in many data management workshops at Caltech and adopted by others beyond its campus.

Briney, K. A. (2020). File Naming Convention Worksheet. https://doi.org/10.7907/894q-zr22

Impact

File naming conventions may seem trivial, but the alternative – having to replicate data that cannot be found or understood – is a huge waste of resources. While using consistent file naming does not automatically mean researchers will always find their needed files, the metadata in file names does make searching easier, as system searches can leverage such metadata. When researchers do find the needed files, the metadata in descriptive file names also aids in understanding that data, which facilitates reuse. The alternative is that researchers cannot find or understand their files, sometimes requiring data to be recollected. Using consistent and descriptive file naming means more time is spent on actual research instead of file management, searching for specific files, or – in the worst case – costly data replication.
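As a rough illustration of how searches can leverage metadata in file names, a wildcard pattern is enough to pull out a subset of files once names are consistent. The list of names is hypothetical (only the first and last appear in this submission), and the pattern assumes the field order of the microscopy example.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing; the field order follows the microscopy example.
names = [
    "P1-MUS023_20200229_051_raw.tif",
    "P1-MUS023_20200229_052_raw.tif",
    "P1-MUS024_20200301_001_raw.tif",
    "mydata.tif",
]

# Every image of mouse 23 in project 1 taken in February 2020.
matches = [n for n in names if fnmatchcase(n, "P1-MUS023_202002*")]
print(matches)
# ['P1-MUS023_20200229_051_raw.tif', 'P1-MUS023_20200229_052_raw.tif']
```

The generic name "mydata.tif" cannot be found by any such query, because it carries no metadata to match against.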

One of the best parts about creating and using a file naming convention is that it doesn’t cost anything! A few minutes spent creating a file naming convention at the beginning of a project can have huge downstream impacts on data’s findability and usability during a project and even well after it ends. The investment is upfront, in creating the naming convention, and in the few seconds it takes to name files as data is collected. This is a trivial amount of time compared to time wasted trying to find and understand your old files, let alone another researcher’s files. This is the ultimate example of greater results with fewer resources.

File naming conventions are a small practice that has an especially large impact for collaborative research. When teams agree to name files consistently with a shared convention, it is significantly easier to find and use one another’s data files because everyone understands what each file contains simply by looking at the file name. Consistent file naming can also act as part of an audit trail to see which data the team has collected and what remains to be collected. Many large team projects have already discovered the benefits of consistent file naming, one example of which is documented by Briney, Goben, & Jones (2022). Briney et al. used consistent file naming to organize files collected at eight different institutions, track files as they went through the data analysis process, and differentiate files that resulted from the three distinct research phases of the project. For example, the file name “SHA_NW03_20180222_Audio.mp3” informed all team members that the file contained the audio recording (“Audio”) for Northwestern University’s third student interview (“NW03”) on the theme of data sharing (“SHA”), a dataset which was collected on February 22, 2018 (“20180222”); this file could easily be linked to comparable student interviews on the same theme collected at other universities, as all related file names started with the “SHA” code. All of the file naming conventions for the project were created using the heuristic submitted for this prize and are documented in the research team’s data management plans (Briney, Jones, et al. 2022). Without these naming conventions, managing the team’s files would have been chaotic, preventing data from moving seamlessly through the analysis pipeline.
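Because every field sits in a fixed position, a name like the one above can be split back into its metadata by a script as well as by a reader. The sketch below is an assumption-laden illustration: the regular expression, the field labels, and the field widths (a three-letter theme, a two-letter site code, a two-digit interview number) are inferred from this single example and are not taken from the project's data management plans.

```python
import re

# Pattern inferred from one example name: THEME_SITEnn_YYYYMMDD_Type.ext
PATTERN = re.compile(
    r"^(?P<theme>[A-Z]{3})_(?P<site>[A-Z]{2})(?P<interview>\d{2})"
    r"_(?P<date>\d{8})_(?P<kind>[A-Za-z]+)\.(?P<ext>\w+)$"
)


def parse_name(name):
    """Return the metadata fields in a file name, or None if it does not fit."""
    match = PATTERN.match(name)
    return match.groupdict() if match else None


fields = parse_name("SHA_NW03_20180222_Audio.mp3")
print(fields["theme"], fields["site"], fields["interview"], fields["date"])
# prints: SHA NW 03 20180222

print(parse_name("mydata.mp3"))
# prints: None
```

A check like this can also flag files that drift from the convention, which supports the audit-trail use described above.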

While many large research teams have already discovered the benefit of consistent and descriptive file naming conventions, such conventions also aid projects that generate a large number of related files, “small” research, and individual researchers. We often forget details about our research data as time goes by and file naming conventions are one more tool to help record this information. Descriptive and consistent file naming makes it easier to find, understand, and reuse data files – even years after they were generated – without wasting extra time trying to remember which file contains what information. It is to everyone’s benefit (especially our own, as we are most likely to reuse our own data) to clearly label what files contain, so that they may be easily identified and reused.

File naming conventions have the largest impact on the teams that use them daily, but consistent and descriptive file naming also benefits entire organizations. For example, descriptive file naming helps organizations identify and understand data from researchers who have left an institution, especially where the institution is still responsible for data retention and sharing. File naming conventions can also benefit whole disciplines, such as by using consistent file names in a disciplinary database to augment disciplinary metadata and file formatting standards. No matter the scale, it’s to everyone’s benefit to use descriptive file naming.

Briney, K., Goben, A., & Jones, K. M. L. (2022). Data Management Planning for an Eight-Institution, Multi-Year Research Project. International Journal of Digital Curation, 17(1), Article 1. https://doi.org/10.2218/ijdc.v17i1.799

Briney, K., Jones, K. M. L., et al. (2022, May 6). Data Doubles Data Management Plans. OSF. https://doi.org/10.17605/OSF.IO/JE7QP

How to learn from this project

This recipe’s steps for creating file naming conventions are intentionally basic and foundational, providing lots of leeway for applying this recipe to any research practice. Creating a file naming convention is a process and the worksheet is a navigational aid, rather than an explicit set of directions to reach a particular destination. This navigation viewpoint is key to its replicability, as it allows the research context to change yet the principles of file naming to remain the same. Focusing the worksheet on the specific decisions that the researcher must make about their file names keeps the recipe replicable and broadly applicable.

The recipe is intentionally formatted as a worksheet that can be worked through repeatedly for different groups of files. It has space to write notes and hash out details. A worksheet is meant to be a work in progress rather than a final rule; it’s messy and sometimes researchers need to brainstorm, but eventually they will reach the best outcome.

There is no one correct way to name files descriptively and consistently. By following this maxim and not being prescriptive, the file naming convention worksheet works for many types of researchers.

Adoption of practice by peers

The file naming convention worksheet has been tested and used in many data management workshops at Caltech, demonstrating its applicability to many fields of scientific research. It has also been adapted by other data management experts, such as Harvard Medical School’s Data Services (HMS Research Data Management, 2023), to teach file naming in the medical research context. Given its broad adoption by many types of researchers, we expect this recipe to be readily replicated by researchers in a range of disciplines to create new file naming conventions.

There are two audiences for more direct outreach about the file naming convention worksheet submitted for this prize: researchers creating a file naming convention and data management experts who teach about file naming. Beyond direct messaging to researchers, sharing this recipe with data management experts will help bring this recipe to researchers, with the benefit that this outreach avenue provides local support for file naming and other data management concerns. This can be done through outreach, presentations, and publications to existing data management communities such as the Research Data Access & Preservation Association (RDAP) and the International Association for Social Science Information Service and Technology (IASSIST).

Finally, the naming convention worksheet is licensed under a permissive Creative Commons Attribution license (CC BY), allowing for it to be adapted to other research disciplines. Such adaptations could include providing a discipline-specific file naming example or incorporating discipline-specific metadata standards into file names. Outreach to data management communities, as mentioned above, will encourage such adaptation.

HMS Research Data Management. (2023). File Naming Conventions. Retrieved June 14, 2023, from https://datamanagement.hms.harvard.edu/plan-design/file-naming-conventions

Team

Caltech Library is represented by Kristin Briney and Tom Morrell. Kristin and Tom comprise the Library’s Data Services team, which supports the research data management needs of the California Institute of Technology. The team works together to provide data management education, guidance on data management plans, and tools for open data sharing. Kristin and Tom are also key support for the journal microPublication Biology in Caltech Library’s role as the journal’s publisher and, with several biology researchers, published on improving storage infrastructure for biology research.

Dr. Briney is a researcher and the Biology & Biological Engineering Librarian at Caltech, where she specializes in biological literature and information resources. Kristin led the Caltech Library’s response to the recent NIH Data Management & Sharing Policy by creating guidance for biologists on creating Data Management and Sharing Plans (DMSPs) and helping campus meet this new data sharing requirement. Dr. Briney conducts research on research data policy and research data sharing, as well as investigating management of library data. Kristin has studied the prevalence of university data policy, published on teaching data management skills in cheminformatics, and recently examined the continued availability of thousands of shared, public datasets resulting from Caltech research.

Tom Morrell is a researcher and the Research Data Specialist at Caltech. Tom runs the CaltechDATA institutional repository, which empowers researchers to share their data and software. He also collaborates with Dr. Briney on researcher outreach, training, and consultations including support for NIH DMSPs. Tom has managed a number of biological data projects, including transfer of the Caltech Tomogram Archive and publication of “The Atlas of Bacterial & Archaeal Cell Structure” through the CaltechDATA repository. He researches approaches for efficiently creating metadata for both data and scientific software and is co-chair of the SciCodes Consortium.

Key principles

file naming; research data management; metadata; data documentation

Video: How to learn from this practice

Research disciplines

Microscopy; Bioinformatics; Genomics and proteomics; Biological research; Medical research

Supporting Documentation

Include links to relevant and publicly accessible website page(s), up to three relevant publications that resulted from using your data sharing or reuse recipe or which were integral to the development of these practices, and/or up to three relevant resources.

Supporting Documentation 1

https://doi.org/10.7907/894q-zr22

Supporting documentation 2

https://doi.org/10.17605/OSF.IO/JE7QP

Supporting documentation 3

https://datamanagement.hms.harvard.edu/plan-design/file-naming-conventions

Team information - Not Scored

Please respond to these questions related to team participation in the challenge. Your responses to questions in this section will not be scored by judges.

Entity Participation

My team is participating in this challenge on behalf of an entity

Entity Name

California Institute of Technology Library

IDeA State Status (not scored)

no, my entity is not within an IDeA state

Minority Serving Institution (not scored)

no, my entity is not a Minority Serving Institution

Participation in 2022 DataWorks! Prize

yes, our team captain or majority of our team participated in the 2022 DataWorks! Prize

2022 DataWorks! Prize Team name

Caltech Library

Eligibility Requirements

yes, I have read and understand the eligibility requirements
