Joel Dudley, Ph.D., Assistant Professor of Genetics and Genomic Sciences and Director of Biomedical Informatics at Mount Sinai School of Medicine, talks to contributing editor Tanuja Koppal, Ph.D., about the current changes impacting bioinformatics. While data generation gets simpler and less expensive, data search and interpretation remain a formidable challenge. Lack of standardization in data nomenclature and analysis tools continues to spur innovation in the creation of custom software programs and services. Since there exists no easy, one-stop shop for data management, Dudley advises lab managers to leverage their core facilities for bioinformatics, to look out for emerging software companies, and to get creative with informatics tools when tackling complex integrative biology.
Q: Can you tell us about your work and the types of data you handle on a regular basis?
A: The work that we are doing is around drug discovery and clinical genomics, trying to understand complex diseases through integrative and multi-scale biology. Integrative implies combining different types of data, taking a holistic view of the disease, and using all the molecular measurements available to us. The multi-scale piece is critical as well, where we figure out how to use informatics to connect everything, from the genomics at a cellular level to what is going on in different tissues, at a broader physiological level. We look at various data types and at different scales of resolution and relevance to the disease.
Q: Where do you see the biggest challenges in your work?
A: Today, data generation is a place where you can compete easily, as it’s not really expensive and most techniques can be outsourced. Data interpretation and integration are where the big wins are going to come from. A lot of labs are now trying to figure out how to leverage next-generation sequencing (NGS) data. Initially we thought that by generating huge amounts of data, we could process and compute the DNA variants by comparing them to the reference genome. What’s clear though is that we need to store the terabytes of raw data so that as the science and methods improve we can go back and reanalyze the genome data and update the variant profile. That’s definitely a challenge. Integrating the data is another challenge. There are no easy off-the-shelf tools for doing that. We have to write our own custom software programs because the field is changing so fast. However, there are some standard tools available. For instance, Cytoscape is a robust, open-source platform for doing complex network analysis where you are looking at how things are connected at a systems level.
Q: What other software programs and resources do you find helpful?
A: There are some good community efforts for dealing with sequencing data. There is a project called Galaxy, which is an open-source platform for managing NGS data and analysis. What’s nice is that you can define analysis pipelines and workflows, save them as live protocols, and share them as they are very reproducible. NextBio is another interesting tool that provides a Google-like search interface, making it easy to mine all the public data available. Leveraging all the published data to compare how your experiments relate to other findings is hard to do. Finding what is out there is one thing, but then being able to use the data is another issue. The other aspect is that with the explosion of data you can’t get away with working on your laptop for these types of analysis. You need a server-based system, or you need to leverage informatics software services or figure out how to use tools to deploy data in the cloud so you can scale up. But cloud computing is not trivial either, and without the right resources it could take weeks to put the data into the system. DNAnexus is a cloud-based system for NGS data storage and analysis. There is an interesting project called GenomeSpace coming out of the Broad Institute; it is a cloud-based centralized infrastructure for connecting existing tools and communities to do scale-up, data sharing, and computation. Sage Bionetworks has developed the Synapse platform, which is an open-source project trying to provide a reproducible and user-friendly infrastructure for managing and sharing data and computational workflows. What’s nice is that it is not aimed at the power user, but it is making it very approachable for nontechnical folks.
Q: What do you think is contributing to the problems associated with data integration and interpretation?
A: The biggest challenge, honestly, is getting the right data you need. There is a lot of interesting data, but it’s difficult to free it up from its source or to integrate across datasets in a systematic way. The terms used to describe the data in one dataset can be very different from the ones used in another dataset. There is very poor use of a common language or nomenclature to connect across various datasets.
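As an illustration of the nomenclature problem described above, here is a minimal Python sketch of mapping dataset-specific disease labels onto one shared term before combining records. The synonym table, field names, and sample records are hypothetical and invented for this example; in practice a lab would more likely draw on a curated ontology than a hand-written dictionary.

```python
from typing import Dict, List

# Hypothetical synonym table: maps labels seen in different datasets
# to a single shared term. Real projects would likely use a curated ontology.
SYNONYMS: Dict[str, str] = {
    "t2d": "type 2 diabetes",
    "type ii diabetes": "type 2 diabetes",
    "niddm": "type 2 diabetes",
    "type 2 diabetes": "type 2 diabetes",
}


def normalize(term: str) -> str:
    """Lowercase, collapse whitespace, then look up the shared term.

    Unknown labels are returned in their cleaned form rather than dropped.
    """
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key, key)


def group_by_term(records: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group sample IDs under the normalized disease label."""
    grouped: Dict[str, List[str]] = {}
    for record in records:
        grouped.setdefault(normalize(record["disease"]), []).append(record["sample"])
    return grouped


# Two made-up datasets that describe the same disease with different labels.
dataset_a = [{"sample": "A1", "disease": "T2D"}]
dataset_b = [{"sample": "B7", "disease": "Type II  diabetes"}]

merged = group_by_term(dataset_a + dataset_b)
print(merged)
```

Without the normalization step, the two samples would land under separate labels even though they describe the same condition, which is the kind of mismatch that makes systematic integration across datasets difficult.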
Q: What specific informatics-related challenges do the students and post-docs in your lab face?
A: People in my lab have both computational and biological expertise, and we bring in clinical researchers to collaborate with us. Getting students up to speed with computational tools in the informatics space is tough because there is a lack of easy-to-use tools for this type of work. It’s almost inevitable that at some point they are going to have to write their own software program and get familiar with working with large databases. There is definitely no Microsoft Office equivalent for integrative genomics, although there are a lot of software companies trying to accomplish that. There are some companies emerging that are providing user-friendly tools for doing large-scale computing, so hopefully things will change in the future.
Q: What are your perspectives on the cost and speed of computing?
A: Cloud computing is certainly interesting when it comes to cost and speed. What’s really expensive today is people. Computing is relatively inexpensive. In the past people spent time optimizing algorithms to make computers run faster, whereas today, time is better spent launching a number of servers in the cloud and running your algorithms on a bunch of different computers. It is oftentimes more economical than paying someone to optimize your software. So now you go from capital investments to operational costs. Instead of buying the big servers, computers, and software packages, cloud computing, like electricity, becomes an operational cost. It’s a different model, but it can be very cost-effective. With cloud computing, the analysis can be done in a few weeks for a few thousand dollars. However, there are not too many user-friendly tools to do that. I often have to write my own software programs for cloud computing projects. But that is changing, and I see it moving more toward informatics software services with less focus on desktop-installed software packages.
Q: What is your advice to lab managers who don’t have adequate resources to tackle their informatics problems?
A: I would advise lab managers to leverage their core facilities for bioinformatics at their university. I would ask them to go to conferences to discover the emerging software companies in sequencing and other areas. Unfortunately, there is no one-stop shop. Some companies are good at data storage, some at data visualization, and others at interpretation. I am not aware of any one company that can do it all in a turnkey fashion. It’s still a lot of piecing things together, and you have to get creative when looking at the various options for doing complex integrative biology. On a positive note, there are many companies looking to deal with the data deluge in biology and provide user-friendly tools. Instrumentation companies are also investing in informatics companies to provide a complete solution for data interpretation and downstream analysis.
Q: Are you concerned at all about data security?
A: I am very much into open source, but there is definitely a need for data security. Cloud computing doesn’t necessarily make it secure. You still need to secure the servers that are loaded into the cloud. This is a big issue for medical centers and labs that handle large sets of patient data. They are not comfortable pushing the data outside the walls of their institution and into the hands of software service companies. People are very concerned about security and not comfortable with the paradigm of sending the data outside. The solution is not very clear, but there are some technologies trying to solve this issue. There are some Federal Information Security Management Act (FISMA)-compliant servers that are certified to have some level of security. But again, there are not many off-the-shelf systems for use.
Dr. Joel Dudley is Assistant Professor of Genetics and Genomic Sciences and Director of Biomedical Informatics at Mount Sinai School of Medicine. His current research is focused on solving key problems in genomic and systems medicine through the development and application of translational and biomedical informatics methodologies. Dr. Dudley’s published research covers topics in bioinformatics, genomic medicine, and personal and clinical genomics, as well as drug and biomarker discovery. His recent work with co-authors describing a novel, systems-based approach for computational drug repositioning was featured in the Wall Street Journal and earned designation as the NHGRI Director’s Genome Advance of the Month. He is also co-author (with Konrad Karczewski) of the forthcoming book Exploring Personal Genomics. Dr. Dudley received a B.S. in microbiology from Arizona State University and an M.S. and Ph.D. in biomedical informatics from Stanford University School of Medicine.