
Why an Understanding of Data Stewardship Is So Important


The next generation of data scientists needs to be aware of FAIR Data Principles

Ted Slater

Much of our world today is fueled by data, whether we are monitoring climate change, predicting financial markets, or tracking health trends. In scientific research and development, the use of data holds considerable potential to accelerate breakthroughs in drug development. Today, we are seeing how valuable scientific data is in the urgent global search for safe and effective therapies and vaccines for COVID-19, as well as in modeling the predicted progression of the virus. At the same time, the technologies that both create and utilize data are developing at a breakneck pace, with new trends and innovations emerging all the time.

In recent years, we’ve seen the term “Big Data” arise and then drop out of fashion, and “AI” (artificial intelligence) become common parlance (often as a misnomer for “deep learning”). But even a year or two ago, “deep learning” would have been an unfamiliar term to many. For the most part, these aren’t just buzzwords; there is a technology shift underneath the hype. But it underlines something important: it’s exceptionally difficult for scientists to keep up with the pace of change and learn the skills needed to master technology when those needs change all the time.

Even the latest graduates entering the workplace will find their skills quickly need updating. The required blend of tech savvy and scientific nuance can be a difficult mix to achieve. But despite all these variables and the rapid development of technology, there is one common capability all those working with data need: an understanding of data stewardship. A vital aspect of this is knowing and applying practical guidelines for data stewardship, which brings us to that sense of FAIR play.

FAIR-aware from education to employment

The FAIR Data Principles (Findability, Accessibility, Interoperability, and Reusability) are a set of guidelines for the management of scholarly data, published in 2016. I was one of the authors of the original paper, which marked the first formal publication of the Principles. We need guidelines like FAIR because data scientists are already the unicorns of enterprise, with rare combinations of valuable skills that allow them to gain scientific insights from data. However, most spend up to 80 percent of their time finding and wrangling data rather than analyzing it. Decreasing the time data scientists spend just getting data ready for analysis will compress time-to-insight.

In the four years since publication, FAIR has picked up momentum; more than 120,000 people have accessed the paper and it has more than 1,500 citations. To solidify and further this progress, we need our next generation of scientists to be ‘FAIR-aware’ from the get-go, because while the technology will change, our demand for ‘good’ data won’t. Students need to be taught the value of best data practices early on, by educators following curricula that center on data management methods. Once they make it to the workplace, as employees they need to be armed with tools and techniques that ensure data are ‘born’ FAIR. Retroactively FAIR-ifying data is a further waste of time and resources. The aim is to create a generation of committed data stewards who recognize the value of industry standards.

Developing an aptitude for data stewardship ultimately leads to better science. When we talk about making data FAIR, we’re talking about the bigger picture of knowledge engineering: in essence, how knowledge-based systems are built and used. Researchers need to understand how best, for instance, to represent the concept of a gene in a computer. Oncology is one example; cancer is gene-based, and the various mutations, protein products, and metabolic pathways involved are simply too much to hold in one’s head. Understanding how to represent those data on a computer so they can be shared and reused will improve R&D outcomes.
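To make the idea of representing a gene in a computer a little more concrete, here is a minimal sketch in Python. It is not drawn from the FAIR paper or from any particular standard: the Gene class, its field names, and the same_entity and to_json helpers are all hypothetical, and the example identifier values are for illustration only and should be checked against the relevant public registry before use. The point is simply that a record keyed on a persistent identifier, rather than on a free-text name, is easier to find, compare, and reuse.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Gene:
    """Hypothetical minimal gene record; field names are illustrative only."""
    symbol: str            # human-readable symbol, which may be ambiguous
    identifier: str        # persistent identifier from a public registry (assumed format)
    taxon: str             # species identifier (assumed format)
    synonyms: tuple = ()   # alternative names used in the literature


def same_entity(a: Gene, b: Gene) -> bool:
    """Compare records by persistent identifier, not by symbol or synonym."""
    return a.identifier == b.identifier


def to_json(gene: Gene) -> str:
    """Serialize to plain JSON so other tools can read the record."""
    return json.dumps(asdict(gene), sort_keys=True)


# Illustrative values; verify against the source registry before relying on them.
record_a = Gene("TP53", "HGNC:11998", "NCBITaxon:9606", synonyms=("p53",))
record_b = Gene("p53", "HGNC:11998", "NCBITaxon:9606")

print(same_entity(record_a, record_b))
print(to_json(record_a))
```

Two labs that label the same gene differently can still recognize it as the same entity here, because the comparison relies on the shared identifier rather than on the name each lab happened to use.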

Building a FAIRer future

Good data stewardship is also the responsibility of organizations themselves. They need to create a culture that encourages good data hygiene and instills FAIR as the best practice for making data reusable, from making sure critical safety data aren’t stored in a PowerPoint deck to enabling data-sharing between colleagues. A central repository that makes data findable and interoperable is key. It’s also important to invest in tools, applications, and instruments that generate FAIR data (and metadata) from the beginning. This will also involve a period of change; established scientists are often wedded to existing processes and may need some encouragement to adopt new ways.

There are resources that can help with FAIR implementation. The Pistoia Alliance, a non-profit group that works to lower barriers to innovation in life sciences R&D, recently launched a freely accessible FAIR Toolkit containing method tools, training resources, and use cases, allowing organizations to learn from industry successes. For any organization seeking to build a FAIR environment, one important point to remember is that FAIR is a continuum with no fixed end point. You can decide how FAIR your data need to be for your purposes, and some elements will be more important to you than others; ‘findable’ might be your priority to begin with. The small steps you take toward FAIR will help you make great strides toward better data management, and will help to develop a generation of scientists who are also data stewards.