David Dooling, Ph.D., is the assistant director of informatics at the Genome Institute at Washington University, where he oversees the Laboratory Information Management Systems (LIMS) and Information Systems. He has contributed to building one of the most advanced and powerful data-tracking systems at the institute and is now investigating methods to more efficiently store, compare, and operate on sequencing data. Here he discusses with Tanuja Koppal, Ph.D., contributing editor for Lab Manager, the challenges associated with tackling vast amounts of data in large sequencing labs and shares his experiences in building and utilizing an advanced LIMS for data tracking and storage.
Q: What are some of the data-related challenges you face in your projects?
A: We are a large-scale sequencing center, and thus, data really drives what we do. The main projects that we’re working on now are cancer and human health related. We are also sequencing pools of microbes, fungi, and bacteria for the Human Microbiome Project, using DNA captured from various body sites in humans. We have several hundred individual samples from about 15 body sites, and we are trying to figure out what microbes are there and how they’re interacting with the human body, whether it’s in the mouth, in the gut, or on the skin. As far as our LIMS goes, it has increased in complexity as the scale of sequencing has increased over the last three or four years. There are many more samples and many more projects that we now need to track. Previously the data coming off the sequencers was on the order of kilobytes and megabytes, and now it’s on the order of terabytes. We have about 50 sequencing machines, the runs are about 10 days long, and each run generates somewhere around 1-2 terabytes of data. So every 10 days we’re generating about 50 to 100 terabytes’ worth of data.
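As a quick check on those figures, the short sketch below redoes the arithmetic. It assumes that the 1-2 terabytes is the output of one machine over one 10-day run, which is the reading that matches the quoted total of 50 to 100 terabytes. The constant and function names are illustrative only.

```python
# Back-of-the-envelope check of the data volumes quoted in the interview.
# Assumption: 1-2 TB is what ONE machine produces over ONE 10-day run.
MACHINES = 50
RUN_DAYS = 10
TB_PER_RUN = (1, 2)  # low and high estimates per machine per run


def cycle_volume_tb(machines, tb_per_run):
    """Total terabytes produced when every machine completes one run."""
    return machines * tb_per_run


low, high = (cycle_volume_tb(MACHINES, tb) for tb in TB_PER_RUN)
print(f"Per {RUN_DAYS}-day cycle: {low}-{high} TB")
print(f"Per day, averaged: {low / RUN_DAYS:.0f}-{high / RUN_DAYS:.0f} TB")
```

Under that assumption the center as a whole averages roughly 5 to 10 terabytes of new data per day.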
Q: Can you give us some details about the LIMS that you’re working with? Is it homegrown or purchased from a commercial source?
A: Our LIMS is completely developed in-house and maintained by a staff of about 15 developers. We have designed our LIMS to be very high performance and highly flexible, and there is nothing comparable available on the market, certainly not at the scale that we operate at. Ours is a very dynamic system, and our design is really driven by abstraction. We have very few things hard-coded in the software; it is all driven by metadata that is stored in the database. So we don’t have a very elaborate software infrastructure that’s tied into how we do things. Because the environment is dynamic, the software reads the information from the database and dynamically creates the constraints, user interfaces, and reports that are needed to execute and track samples and data as they flow through the system.
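To make the metadata-driven idea concrete, here is a minimal sketch. It is not the Genome Institute's code: the step names, field names, and the in-memory dictionary standing in for database rows are all hypothetical. The point is only that generic code can derive both the form fields and the constraints from stored metadata, so adding a new lab step means adding data rather than writing new software.

```python
from dataclasses import dataclass

# Stand-in for rows read from a metadata table. In a real LIMS these would
# come from the database; every step and field name here is hypothetical.
STEP_METADATA = {
    "library_prep": {
        "accepts_from": ["dna_extraction"],
        "fields": {"volume_ul": float, "operator": str},
    },
    "sequencing_run": {
        "accepts_from": ["library_prep"],
        "fields": {"flow_cell_barcode": str, "operator": str},
    },
}


@dataclass
class Sample:
    barcode: str
    last_step: str


def form_fields(step):
    """List the inputs a generated user interface would ask for."""
    return sorted(STEP_METADATA[step]["fields"])


def validate(sample, step, values):
    """Return constraint violations derived purely from the metadata."""
    meta = STEP_METADATA[step]
    errors = []
    # Constraint 1: the sample must arrive from an allowed previous step.
    if sample.last_step not in meta["accepts_from"]:
        errors.append(f"{sample.barcode}: {step} cannot follow {sample.last_step}")
    # Constraint 2: every declared field must be present with the right type.
    for name, expected_type in meta["fields"].items():
        if name not in values:
            errors.append(f"missing field: {name}")
        elif not isinstance(values[name], expected_type):
            errors.append(f"{name} must be {expected_type.__name__}")
    return errors


sample = Sample(barcode="S-0001", last_step="dna_extraction")
print(form_fields("library_prep"))
print(validate(sample, "library_prep", {"volume_ul": 25.0, "operator": "jdoe"}))
print(validate(sample, "sequencing_run", {"operator": "jdoe"}))
```

The last call reports two violations, because the sample skipped library preparation and the flow cell barcode was not supplied. Neither rule is written into the code itself; both come from the metadata.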
Q: Can you discuss some of the pros and cons of doing it yourself versus buying a system from a vendor?
A: I think for most people if you can buy something from a vendor it makes a lot of sense, at least from a cost perspective. If you have workflows that can be modeled with an existing piece of software or with only slight modifications, then that is going to be a more economical solution. The disadvantage in getting something off the shelf is that it is off the shelf and was not designed for what you’re doing. Depending on what you’re doing and how much that deviates from norms of lab processing at other places, you can be left high and dry unless you’re able to get in and customize the software yourself. And obviously that would incur increased costs. So it really depends on where you’re at in the adoption curve, and as a part of our directive, we are an early adopter of technologies. For our requirements, because we’re out on that leading edge, there’s not going to be something that a commercial center could provide that would suit our needs. And it doesn’t make sense for them to do so either, because we’re out on the curve and there aren’t a lot of institutions that are doing what we’re doing.
Q: Any advice for lab managers as far as buying LIMS or upgrading their LIMS and how to avoid challenges associated with poor implementation?
A: The main thing is you want to involve the stakeholders that are going to be using the LIMS, and you want to have them involved in the process of specifying what’s needed and evaluating what’s out there. A lot of times these purchasing decisions are actually driven by a lot of other things than what’s actually needed on the ground, in the lab. So I think it’s important to get a real sense for what’s going on in the lab and how they’re doing things and how bringing in any system would affect people, whether it is an upgrade or a new system. There’s always going to be informal collection of data and informal reporting no matter how much you automate and how much you track. There’s always going to be a spreadsheet somewhere that someone is using to aggregate information in a way that they like to do. Obviously, you want to capture as much as you can in the LIMS, but you don’t want to stifle people’s creativity, because a lot of times, those things can lead to better reporting and tracking and they give you an insight into how people are actually using the tool and what sort of information you’re not giving them that you could. The other important thing is, talk to other labs that have used the system. Before you buy anything, I would strongly recommend you get a list of current customers from the company and follow up with those people and talk to them about what their experience has been as far as implementing the system, getting it up and running, the support, and all that sort of stuff.
Q: What do you perceive to be the biggest challenge with proper implementation of LIMS?
A: I think it would probably be different depending on the situation. For us, the challenges are largely the dynamic nature of our environment and keeping pace with all of the new ways that samples are handled, processed, and sequenced in our labs. Certainly with next-generation sequencing, the scale of the operation, the amount of disk space required, and the amount of computing power required are challenging for a lot of people. And then the other thing is just communication. You have to make sure that people on both sides really want to get a good solution that works for everybody and are willing to voice their opinions and work together in an honest and constructive way.
Q: How has LIMS evolved in the ten years that you’ve been here at the Genome Institute, and where would you predict the field will go in the next four to five years?
A: I think the main change has been in the scale of operations, moving from tracking hundreds to thousands to tens of thousands of samples, and obviously the change is in the sequencing itself, with the reads going from thousands to billions to trillions. So the scale creates a challenge because there’s still a need to track information at a detailed level, but you also need to be able to synthesize all that into useful, high-level metrics so people can look inside the operation to see where things are and how far along a project is. That’s a big challenge for people right now, to reconcile these two needs, the low-level tracking and the high-level visualization, which are kind of at odds with each other.
Q: As far as trends go, has anything caught your attention in recent months that you think is going to change how things are done?
A: Within the last few years there is definitely a trend toward more Web-based interfaces. I think that will continue with mobile computing device interfaces. We’ve been kicking around the idea of replacing touch screens and bar code scanners with tablet devices that can scan bar codes and read them in with a camera on them. There are all sorts of things on the horizon that right now are in the investigatory phase. As far as IT infrastructure in the cloud, I think you’re going to see that trend continue. I think it will be a relatively slow uptake at least for large-scale sequencing because of the data volumes at that scale. Once you’ve done the primary analysis and you’ve reduced the amount of data to a sequencing value, it becomes a lot more manageable to upload, but for data from sequencing runs where you’re generating a terabyte of information, it can be expensive and slow to rely on a cloud.
Q: How do you justify a return on investment for LIMS?
A: You could hire an army of people to track everything and generate reports, but really it would have to be hundreds, if not thousands, of people who would have to be tracking all these things and trying to synthesize the results into some cohesive story. The other thing is repeatability, reproducibility, and elimination of errors. Because of the constraints you’re able to place on the software, whether they are hard-coded in the software or stored as information in the database, you’re able to prevent people from mixing up plates or doing something that shouldn’t be done. Being able to constrain the system that way is important in high-throughput production, like in a sequencing environment. I think those are the main two things that drive ROI: the reduction of errors and the ability to actually do things without having hundreds or thousands of people trying to manage it.