Top resources on FAIR, summed up in an understandable manner.
Here is a challenge for you.
Being an expert in the field of diet research, you come to the point in your career where you need to run a controlled clinical trial, include thousands of people who are willing to turn their eating habits upside down and convince the government to give you hundreds of thousands of dollars to do it.
Dr Ramsden, a clinical investigator at the NIH, was facing that challenge.
Until he realized that such a study had already been done, in one of the most rigorous diet trials ever conducted: the Minnesota study.
A study that had been forgotten for over a quarter of a century. A study that held the crucial data he was after.
“I’ve heard the possibility that there might be some very interesting data in your father’s basement”, Dr Ramsden said in a phone call to Robert Frantz, a cardiologist at the Mayo Clinic.
Thankfully, Ivan Frantz, who was in charge of the study, had prioritized immaculate record keeping.
The Minnesota study was part of a massive undertaking in the early 1960s that included a hundred thousand people in five different states. Ivan Frantz died thinking his study was a failure, because the final results contradicted the outcome he had expected. Decades later, those same results proved indispensable for further research, after his son found the records and passed them on. Malcolm Gladwell’s The Basement Tapes goes into more detail on the story.
The takeaway message is – the data was findable, accessible and reusable.
And that is the point of FAIR.
To maximize the impact of our research and to build upon published research, the FAIR principles state that data and metadata related to scholarly publications need to be findable, accessible, interoperable and reusable, by both humans and machines (i.e. our computers).
“FAIR means thinking about the people who could benefit from your data”, Lambert Heller, leader of the Open Science Lab at TIB, the German National Library of Science and Technology, explains in Nature Index.
FAIR starts with FINDABLE
Findable means that it should be easy for people and computers to find the data related to your published research. The same applies to the metadata (metadata can be understood as data about your data). What we are talking about here are hyperlinks, DOIs and other unique identifiers. You want the link to your data to be unique: no two sources can share the same identifier. You also want the link to be persistent, a long-lasting reference to a digital data source. What we want to avoid are broken pages, old links that don’t work and resources that can’t be found anymore.
“It also means presenting the data in a standardized way so it’s machine readable”, Heller elaborates.
That is the purpose behind data repositories that assign a globally unique and persistent identifier (a PID, such as a DOI) to your deposited data. If someone is using or building upon your data, they can use the identifier to cite it.
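The two properties described above, uniqueness and persistence, can be sketched in a few lines of Python. This is only a toy illustration, not how any real repository or DOI resolver is implemented: the `IdentifierRegistry` class, the identifier string and the URLs are all made up for the example.

```python
class IdentifierRegistry:
    """Toy registry: each identifier points to exactly one dataset location."""

    def __init__(self):
        self._records = {}

    def register(self, identifier, location):
        # Uniqueness: no two sources may share the same identifier.
        if identifier in self._records:
            raise ValueError(f"identifier already in use: {identifier}")
        self._records[identifier] = location

    def move(self, identifier, new_location):
        # Persistence: the data may move, but the identifier stays the same.
        if identifier not in self._records:
            raise KeyError(identifier)
        self._records[identifier] = new_location

    def resolve(self, identifier):
        return self._records[identifier]


# Hypothetical identifier and URLs, for illustration only.
registry = IdentifierRegistry()
registry.register("10.1234/example.dataset.1", "https://old-server.example.org/data/1")
registry.move("10.1234/example.dataset.1", "https://new-server.example.org/archive/1")

# A citation that uses the identifier still leads to the data after the move.
print(registry.resolve("10.1234/example.dataset.1"))
```

The design point is that people cite the identifier rather than the location, so a server migration does not turn old citations into broken links.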
Kate LeMay, senior research data specialist at the ARDC, adds that besides the big picture, there are both altruistic and selfish reasons for researchers to make their data FAIR. “Most people get into research because they want to make a difference,” she says. “That includes making your data as useful as possible.” FAIR can also be good for career advancement, particularly for early-career researchers. “FAIR helps you demonstrate the impact of your research when people re-use and cite your dataset,” LeMay says in the article. “It gets your name out there and can lead to new collaborations.”
Which takes us to the next point – what if the data can’t be shared and should be kept private?
ACCESSIBLE but not OPEN
Let’s talk about rhinos, for example.
Say you invested 20 years of your life in anti-poaching work. Your data is of great value, but you cannot disclose anything that could pinpoint the locations of threatened species.
The same applies to national security data, defence research, or sensitive medical data that could identify or reveal information about individuals.
Data can be FAIR even if it’s not open. It can be kept under mediated access controls.
“Once the user finds the required data, she/he needs to know how can they be accessed, possibly including authentication and authorization” – GO FAIR.
Sensitive data can still be highly FAIR, as long as the process for accessing it is clearly defined, even if certain data cannot be disclosed.
“As open as possible, as closed as necessary”, Heller sums up the guiding principle.
Accessible doesn’t mean open; it means that the exact conditions under which the data can be accessed are stated. Even heavily protected private data can be FAIR. You would be able to see that the data isn’t open, and the exact steps needed to get access to it would be clear. These steps can be as rigorous and complex as they need to be.
Once access has been granted, the data should be retrievable through the authentication and authorization procedure. More details on this are available in Martin Schweitzer’s talk on FAIR.
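One way to picture “accessible but not open” is a record whose metadata is always visible, while the data itself is released only to authorized users. The sketch below is a simplified assumption of how that could look. The `DatasetRecord` class, its field names and the user names are hypothetical, and a real system would rely on a proper authentication and authorization service.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    identifier: str
    title: str
    access_level: str          # "open" or "mediated"
    access_procedure: str      # published steps for requesting access
    authorized_users: set = field(default_factory=set)


def public_metadata(record):
    """Metadata anyone can see, even when the data itself is restricted."""
    return {
        "identifier": record.identifier,
        "title": record.title,
        "access_level": record.access_level,
        "access_procedure": record.access_procedure,
    }


def fetch_data(record, user):
    """Return the data only to users who have been granted access."""
    if record.access_level == "open" or user in record.authorized_users:
        return f"<contents of {record.identifier}>"
    raise PermissionError(f"Access is mediated. {record.access_procedure}")


# Hypothetical record for the rhino example.
rhino = DatasetRecord(
    identifier="hypothetical-id-001",
    title="Rhino sightings (locations withheld)",
    access_level="mediated",
    access_procedure="Apply to the data custodian and sign a data use agreement.",
)
rhino.authorized_users.add("approved_researcher")

print(public_metadata(rhino)["access_procedure"])  # visible to everyone
print(fetch_data(rhino, "approved_researcher"))    # released after approval
```

Anyone else who calls `fetch_data` gets a `PermissionError` whose message repeats the access procedure, so even a refusal tells the user what to do next.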
INTEROPERABLE to boost knowledge discovery
“Imagine the ability to link data in the Framingham Heart Study (NHLBI) with Alzheimer’s health data (NIA) to understand correlative effects in cardiovascular health with ageing and dementia. Imagine the ability to quickly obtain access to data, and related information, from published articles. Imagine the ability to link electronic health care records with personal data and with clinical and basic research data.” These are the promises and aims of the 2019 NIH Strategic Plan for Data Science.
The main aim of enabling a FAIR data ecosystem is to improve knowledge discovery.
The point here is to make sure people and computers can look everything up. Interoperability is about the details: it prioritizes precision in communication, which means standardizing the vocabulary and putting data into a recognizable format, to save us valuable time.
Martin Schweitzer describes the following use case in his talk:
You have two different files. In the first file, a table has two columns: country names and exports.
In the second file, another table contains country names, country codes, and latitude and longitude measurements. You want to consolidate the two.
However, the country names are written slightly differently in the two tables (e.g. Congo Democratic Republic vs Congo (DRC)). The computer won’t recognize those two names as being the same. International country codes, however, are universal. So the country codes had to be added to the file that didn’t contain them, i.e. the names were replaced with country codes. That allowed the two datasets to be consolidated.
The point is, if the authors had used a standard vocabulary that is itself findable, accessible, interoperable and reusable, which in this case means the international country codes, the task would have been done much more quickly.
Currently, substantial amounts of time are being spent on consolidating data sets that use different vocabularies.
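The use case above can be reproduced in a short Python sketch. The export figures and coordinates below are invented for illustration, and the codes are assumed to be ISO 3166-1 alpha-3 codes. The talk itself may have used different data or tools.

```python
# File 1: country names and exports (illustrative numbers).
exports = [
    {"country": "Congo Democratic Republic", "exports": 100},
    {"country": "Kenya", "exports": 50},
]

# File 2: country names, country codes and approximate coordinates.
locations = [
    {"country": "Congo (DRC)", "code": "COD", "lat": -4.0, "lon": 21.8},
    {"country": "Kenya", "code": "KEN", "lat": -0.02, "lon": 37.9},
]

# Naive join on the free-text name: only exact spellings match.
by_name = {row["country"]: row for row in locations}
matched_by_name = [row for row in exports if row["country"] in by_name]
print(len(matched_by_name))  # 1, the Congo row is silently dropped

# The manual fix: map each name in file 1 to its country code.
# This mapping is the time-consuming step a shared vocabulary would avoid.
name_to_code = {"Congo Democratic Republic": "COD", "Kenya": "KEN"}

# Join on the code instead of the name.
by_code = {row["code"]: row for row in locations}
merged = []
for row in exports:
    code = name_to_code[row["country"]]
    loc = by_code[code]
    merged.append(
        {"code": code, "exports": row["exports"], "lat": loc["lat"], "lon": loc["lon"]}
    )

print(len(merged))  # 2, both countries are consolidated
```

Had the first file carried country codes from the start, the `name_to_code` mapping would not have been needed and the join would have been a one-step lookup.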
“The focus on assisting machines in their discovery and exploration of data through application of more generalized interoperability technologies and standards at the data/repository level, becomes a first-priority for good data stewardship”, Wilkinson et al. elaborate.
However, interoperability of data means that both data and metadata will have to be standardized so that datasets can be merged for further research. “The challenge is that different research fields have different cultures and requirements for data and metadata. There needs to be community ownership of these data standards,” LeMay says. “We can’t just impose them on researchers.”
REUSABLE for the benefit of authors and the progress of science
Many scientific journals and research funders now require scientists to share their data openly.
As described on the ANDS webpage, a great source of everything FAIR related, “reusable data should maintain its initial richness. For example, it should not be diminished to explain the findings in one particular publication. It needs a clear machine-readable license and provenance information on how the data was formed. It should also have discipline-specific data and metadata standards to give it rich contextual information that will allow for reuse.”
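A simple way to act on that description is to check a dataset’s metadata for the reuse-related fields before depositing it. The sketch below is an assumption about what such a check could look like: the field names and the example records are hypothetical, and `CC-BY-4.0` is used only as an example of a machine-readable license identifier.

```python
# Hypothetical field names; a real repository defines its own schema.
REQUIRED_FOR_REUSE = ("license", "provenance", "metadata_standard")


def missing_reuse_fields(metadata):
    """Return the reuse-related fields that are absent or empty."""
    return [name for name in REQUIRED_FOR_REUSE if not metadata.get(name)]


complete = {
    "title": "Example dataset",
    "license": "CC-BY-4.0",
    "provenance": "Collected by the (hypothetical) Example Lab; cleaned with script v1.2",
    "metadata_standard": "discipline-specific standard goes here",
}
incomplete = {"title": "Spreadsheet from my laptop", "license": ""}

print(missing_reuse_fields(complete))    # []
print(missing_reuse_fields(incomplete))  # ['license', 'provenance', 'metadata_standard']
```

An empty list means the record at least states how the data may be reused and where it came from. It says nothing about whether those statements are accurate.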
But are researchers even motivated to share their data? Do they want it to be reused?
The State of Open Data 2018 report shows that early-career researchers are focused on the credit they receive for making data available, with regard to career progression opportunities.
However, “to provide true credit for good data practice, published, citable datasets need to be viewed as research outputs on a par with a research article in terms of career advancement and assessment. Realistically, routine inclusion of datasets, their citations and impact in grant assessments and CV evaluation is probably still years away”, Grace Baynes, VP, Research Data and New Product Development, Open Research, Springer Nature, writes in the report. “Researchers would share data more routinely, and more openly, if they genuinely believed they would get proper credit for their work that counted in advancing their academic standing and success in career development and grant applications, and for subsequent work that builds on their data”, Baynes explains.
Asked about the practical challenges of sharing data, 46% of researchers named “Organizing data in a presentable and useful way” as the most prominent one.
Organizing and managing research data in a presentable way – how can electronic lab notebooks (ELNs) help?
The purpose of electronic lab notebooks extends beyond record-keeping and is evolving towards project, data and team management platforms. Since all aspects of lab work can be unified within one platform, all data can be interconnected for better traceability. This allows researchers to organize all data related to their studies within an easy-to-use interface. Good data management is becoming a priority.
“Science funders, publishers and governmental agencies are beginning to require data management and stewardship plans for data generated in publicly funded experiments. Beyond proper collection, annotation, and archival, data stewardship includes the notion of ‘long-term care’ of valuable digital assets, with the goal that they should be discovered and re-used for downstream investigations, either alone, or in combination with newly generated data. The outcomes from good data management and stewardship, therefore, are high-quality digital publications that facilitate and simplify this ongoing process of discovery, evaluation, and reuse in downstream studies”, Wilkinson et al. elaborate.
The European Commission’s Guidelines on FAIR Data Management in Horizon 2020 state that “good research data management is not a goal in itself, but rather the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse”.
ELNs can contribute to better data management through an organized collection of all your research data in one place.
“The importance of using an electronic lab notebook like SciNote cannot be overstated. As a neuroscientist, I am aware that the field is currently facing a reproducibility crisis. Additionally, large funding bodies like the National Institutes of Health require indices of rigour and data management. SciNote makes it easier to address these issues because everything from the approach to the end result is all in one place”, says Jonathan Fadok, assistant professor and principal investigator, The Fadok Lab, Tulane University, USA (for more use cases and reviews, visit the Stories from Laboratories).
SciNote offers a set of functionalities that enable researchers to manage their data efficiently and that prioritize data protection:
- Keeping all research-related data in one place, annotated, traceable and searchable
- All records are timestamped and all changes within the system are recorded
- Inventory management allows information on every sample to be assigned and connected to the experiments
- Comprehensive reports are generated automatically
- It is possible to export all data in a readable format, organized by folders
- Collaboration with internal members of the team and external community or partners by inviting them and assigning them data viewing or even greater permissions when needed
- Every action by every user is automatically recorded within the system
- Activities management allows easy filtering to gain insight and overview into every detail of activities within the lab
- Entire experiments, workflows and processes can be saved as templates, cloned and re-used
- Protocol repository enables lab members to manage and share protocols
- Archiving and backups of all research data
Here is a challenge for you.
A quarter of a century from now, someone is looking for a crucial piece of information that was reported in your published paper.
Will they be able to find, access and reuse your data?
By Tea Pavlek