Marina Sirota, PhD, assistant professor at the Institute for Computational Health Sciences at the University of California, San Francisco, talks to contributing editor Tanuja Koppal, PhD, about how she is utilizing big data to address her research questions. She discusses trends in data analysis and tools for data storage and security, and she advises lab managers on how they can tackle some of the challenges associated with big data.
Q: How would you define big data?
A: Over the past couple of decades, a lot of data has been generated using different types of measurements and technologies, such as genomic data, DNA sequencing data, gene expression data, and more. In terms of biomedical science, big data is really a collection of lots of different types of information gathered by measuring various kinds of molecular entities. Another source of big data is electronic medical records. A lot of medical information and records are now provided through computer systems, and that provides another avenue to look at big data. Big data also includes a lot of personal information gathered from mobile devices, such as global positioning system (GPS) coordinates or activity levels. Big data is really the intersection of all these diverse types of data and the use of that data to ask and answer all sorts of interesting questions. This gives us an opportunity to develop and apply computational techniques to the data, and I am particularly interested in using these tools to ask new questions about diseases. I want to see how we can use these computational methods on very diverse types of data to better understand disease and to develop better diagnostic and therapeutic strategies.
Q: How do you go about asking the right questions so you can get more information from the data that is available?
A: In terms of asking the right questions, you first have to figure out what the problems or bottlenecks in the field are. Then, you have to figure out how to use the data and the data sources that you have to solve those problems. In biomedical science, that may involve reading the literature to find out what questions remain unanswered and what the next steps are. In other areas, it may involve using the data that is available to eliminate a certain bottleneck in a process or analysis. Data-driven approaches can be used to address any kind of question and can be very useful.
Q: How are you using big data to answer the questions you are asking?
A: I can give you an example of a project that I am working on right now that involves studying preterm birth. The idea is to use all different types of data to identify risk factors for individuals who might be at risk for preterm birth. Here we are using genomic approaches to study different populations of individuals who were born preterm versus healthy controls who weren’t [in order] to try to identify specific genetic variants that may be associated with preterm birth.
We are also looking at environmental factors that may be contributing to preterm birth, and we are trying to pull together genetic and environmental data to see how these factors may interact. In addition, we are looking at other types of data, such as microbiome and immune [system] measurements. The idea is to put all these different types of data together to see whether there are any interactions between the contributing risk factors, and whether we can use that to identify the populations that are more at risk for preterm birth.
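To make the idea of interacting risk factors concrete, here is a minimal sketch in Python. It is an illustration only, not the method used in the project described above. The records, the `make_stratum` helper, and the two binary factors (carrying a hypothetical variant, having a hypothetical environmental exposure) are all invented assumptions. The sketch computes the preterm-birth rate in each genotype-by-exposure group and then an additive interaction contrast, which is positive when the two factors together carry more risk than the sum of their separate effects.

```python
from collections import defaultdict


def make_stratum(carrier, exposed, n, n_preterm):
    """Build n hypothetical records for one group; the first n_preterm are preterm."""
    return [(carrier, exposed, i < n_preterm) for i in range(n)]


def stratum_rates(records):
    """Return the preterm-birth rate for each (carrier, exposed) group."""
    totals = defaultdict(int)
    cases = defaultdict(int)
    for carrier, exposed, preterm in records:
        key = (carrier, exposed)
        totals[key] += 1
        cases[key] += int(preterm)
    return {key: cases[key] / totals[key] for key in totals}


def interaction_contrast(rates):
    """Additive interaction: risk with both factors, minus each factor alone,
    plus the baseline risk. Zero means the effects simply add up."""
    return (rates[(True, True)] - rates[(True, False)]
            - rates[(False, True)] + rates[(False, False)])


# Invented counts, for illustration only.
records = (
    make_stratum(False, False, 10, 1)   # neither factor: 1 of 10 preterm
    + make_stratum(True, False, 10, 2)  # variant only: 2 of 10
    + make_stratum(False, True, 10, 2)  # exposure only: 2 of 10
    + make_stratum(True, True, 10, 6)   # both factors: 6 of 10
)

rates = stratum_rates(records)
contrast = interaction_contrast(rates)
```

With these made-up counts the group carrying both factors has a much higher rate than the separate effects would predict. A real analysis would use a statistical model with confidence intervals and would adjust for population differences.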
Q: What challenges do you face when working with big data analysis, and what improvements are you hoping for?
A: There are some challenges in terms of pulling these different types of data together, whether we do the analysis separately for each modality and then pool the results, or pull all the data together first and then do a comprehensive analysis. For our project, the data is collected on different populations and we try to use the genotype to bring them all together, and there are challenges associated with that. There are certain normalization processes that need to be put in place so that we know that what we are finding is real. We also do a lot of validation once we identify certain parameters. Validation is the other challenge, but it is something that has to happen. Computational analysis is a great tool to generate hypotheses, but we have to go back to the biology to understand what is going on, and experimental validation needs to be done.
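One simple form that such a normalization step can take is sketched below in Python. This is an assumption for illustration, not the procedure used in the interviewee's project: each cohort's measurements are centered and scaled separately (a z-score), so that cohorts measured on different scales become comparable before they are combined. The cohort names and values are hypothetical.

```python
from statistics import mean, pstdev


def zscore_within_cohort(values):
    """Center and scale one cohort's measurements to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: nothing to scale, so report zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def normalize_cohorts(cohorts):
    """Normalize every cohort separately so no cohort's scale dominates the pooled data."""
    return {name: zscore_within_cohort(values) for name, values in cohorts.items()}


# Hypothetical measurements: the same pattern recorded on two different scales.
cohorts = {
    "cohort_a": [10.0, 12.0, 14.0],
    "cohort_b": [100.0, 120.0, 140.0],
}

normalized = normalize_cohorts(cohorts)
```

After this step the two hypothetical cohorts have matching values, so a difference found in the pooled data is less likely to be an artifact of how each cohort was measured. Real genomic data usually needs more specialized methods, such as batch-effect correction.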
Q: Can these challenges ever be fully tackled, or are they intrinsic when working with big data?
A: For us, there are challenges with bridging different types of data and with the assumptions we have to make, because the data is not all collected from the same population. If we can pull data from very well-characterized disease populations, where we have all the genomic data, the electronic medical records, and mobile platform information on a single individual, then we can reduce some of the challenges associated with data analysis and integration. But probably more challenges will arise as we develop new technologies. It’s always important to understand the assumptions that come with generating each type of data, regardless of the challenges that exist.
Q: Is it challenging to find and train the people to do the analysis?
A: There are three ways to get into the bioinformatics field. You can start as a biologist and then pick up computer science and programming. You can start as a computer scientist and learn the biology later, or you can start doing both at the same time. It is easier if you can do both at the same time. It is not so common or easy to pick up the computational techniques later. If I were to advise those looking to get into bioinformatics, I would advise them to do both biology and computer science together or get the computational training first. As we generate more data in all different fields, having a statistical or computational background will be very useful.
Q: Are you seeing any trends in the tools available for data analysis or visualization?
A: We do a lot of our analysis using the R statistical software. It’s not very difficult to learn, and there have been a lot of methods implemented already. In terms of the machine learning techniques, there is an area called deep learning that is gaining popularity. It has been applied extensively in the field of imaging to perform object recognition using image analysis, and it has been extremely successful. Applications of deep learning in other fields are just starting, and that’s something I will be looking out for. We need to understand what kinds of data it can be applied to. It may prove useful for medical imaging and for analyzing genomic and other types of data.
Q: Is there anything to look for in terms of data security or data storage?
A: We are moving a lot of our analysis into the cloud, and that’s a trend that will likely continue. There are certainly concerns around data security when it comes to genomic and medical data, but there are groups working specifically on that. Due to the ease of access, more data analysis and storage will likely move to cloud platforms. With more collaborative efforts under way, more people are going to want a cloud-based platform to share data and also their analyses.
Q: Would you recommend the use of open source software?
A: We tend to use a lot of open source software, and the software and methods we develop are also shared and made available to the research community. In terms of the academic community, that’s the way forward for sharing data and methodologies. We also do a lot of our analysis on publicly available data. For instance, we regularly use the Gene Expression Omnibus (GEO), which is a database that contains data from over 1.6 million microarray experiments. Whenever a new microarray experiment is published, the data has to be made publicly available through this database. There are many such publicly available databases, and mining all that data is extremely valuable. I strongly believe in open source software, both for analytical methods and for data. When we do our analysis, we use and mix several different approaches on the same dataset. Some include methods that are already well developed, and others we have developed ourselves, and we then look for concordance. Ultimately validation and understanding of the biology are very important for the work that we do.
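The idea of running several approaches on the same dataset and looking for concordance can be sketched in a few lines of Python. This is a hedged illustration, not the lab's actual pipeline. The method names and gene identifiers are placeholders, and "concordant" is assumed here to mean "reported by at least a chosen number of methods".

```python
from collections import Counter


def concordant_hits(results_by_method, min_methods=2):
    """Return the hits reported by at least min_methods of the methods, sorted by name."""
    counts = Counter()
    for hits in results_by_method.values():
        counts.update(set(hits))  # count each hit once per method
    return sorted(hit for hit, n in counts.items() if n >= min_methods)


def jaccard(a, b):
    """Overlap between two hit lists: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Placeholder results from three hypothetical methods run on the same dataset.
results = {
    "method_a": ["GENE1", "GENE2", "GENE3"],
    "method_b": ["GENE2", "GENE3", "GENE4"],
    "method_c": ["GENE3", "GENE5"],
}

supported_by_two = concordant_hits(results, min_methods=2)
supported_by_all = concordant_hits(results, min_methods=3)
overlap_a_b = jaccard(results["method_a"], results["method_b"])
```

Hits supported by several independent methods are natural first candidates for the experimental validation described above. Agreement between methods does not replace that validation.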
Q: How would you advise lab managers looking to evaluate and use bioinformatics tools for their work?
A: I would like to stress the importance of thinking about computational methods and learning what questions can be asked and answered using the data that is available to them. There are two ways of looking at it. One is to look at the data and see what types of questions can be addressed using the dataset. The other is to look at the intrinsic problems that exist and find the ones that can be solved using the data that’s out there. Start looking at how data science has impacted other fields and see whether you can do the same for your field. Try to think of disruptive technologies in other fields and ways in which your industry can be transformed in similar ways.