Today, visualizing large, diverse datasets requires new approaches and tools for viewing and computing on complex data. Jeremy Goecks, PhD, and Tommy Dang, PhD, discuss some of these new tools, and comment on upcoming trends in web-based visualization, visual analytics, and custom visualization that could become game-changers in the near future.
Q: What are some of the challenges in data visualization today?
A: Two key challenges that my colleagues and I encounter regularly are heterogeneity and scalability. There are so many different types of biomedical data today, from molecular data such as genomics and proteomics to imaging at different scales to clinical data from electronic health records. Each data type can be visualized in many different ways. Trying to develop visualization tools, and especially dashboards with multiple visualizations for working with collections of different data, is challenging. Due to technological advances, the size of biomedical datasets is growing substantially. Even after significant processing, today’s genomic and imaging datasets are very large and can include data from thousands of samples. Visualizing increasingly large datasets requires new approaches for viewing and making sense of so many data points, and more powerful supporting computing infrastructure is often necessary as well.
Q: What improvements are being made in terms of new tools and applications for data visualization?
Q: How is data visualization going to impact big data analytics?
A: Data visualization will play a key role in quality assessment and understanding as larger analyses are done. It can help in determining whether an analysis produced high-quality results. For example, visualization can help in finding spurious correlations that an analysis may have picked up but that do not provide real insight into relationships in a dataset. Another example where visualization can help is in selecting a threshold to produce the best trade-off between true-positive and false-positive rates. Visualizing the trade-offs at a set of thresholds can simplify this decision. Data visualization will also help in understanding how statistical and machine-learning models are making decisions. For instance, a visualization can show which parts of an image were most important for classifying the image, or highlight the customer attributes used to decide what offers to make to the customer.
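To make the threshold example concrete, here is a minimal sketch in Python. It is an illustration only, not a method described in the interview: the function names, the scores and labels, and the use of the difference between the two rates (Youden's J statistic) as the selection rule are all assumptions. It computes the true-positive and false-positive rates at a few candidate thresholds and prints a small text chart of the trade-off.

```python
from typing import List, Tuple

Row = Tuple[float, float, float]  # (threshold, true-positive rate, false-positive rate)


def rates_at_thresholds(scores: List[float], labels: List[int],
                        thresholds: List[float]) -> List[Row]:
    """Compute TPR and FPR when every score >= threshold is called positive."""
    positives = sum(labels)
    negatives = len(labels) - positives
    rows = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        rows.append((t, tpr, fpr))
    return rows


def best_threshold(rows: List[Row]) -> float:
    """Pick the threshold maximizing TPR - FPR (an assumed selection rule)."""
    return max(rows, key=lambda r: r[1] - r[2])[0]


# Hypothetical classifier scores and ground-truth labels (1 = positive).
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

rows = rates_at_thresholds(scores, labels, [0.35, 0.50, 0.75])

# Crude text "visualization": one bar per rate at each threshold.
for t, tpr, fpr in rows:
    print(f"threshold {t:.2f}  TPR {'#' * round(tpr * 20):<20}  FPR {'#' * round(fpr * 20)}")

print("suggested threshold:", best_threshold(rows))
```

In practice the same rates would more likely be drawn as a ROC curve with a plotting library; the text bars here only keep the sketch free of dependencies.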
Q: Can you describe some of the work that you are doing in this field and how you see that changing in the near future?
A: I am a principal investigator for Galaxy (https://galaxyproject.org/), an open source, web-based platform for analyzing large biomedical datasets that is used daily by thousands of scientists across the world. Scientists can use Galaxy from a web browser to upload their genomic, proteomic, and other biomedical datasets, and run complex analysis tools and even entire workflows. Over the past several years, we have extended Galaxy so that visualizations can be added to Galaxy and scientists can use them in their web browsers just as they use other tools. Galaxy users now have access to more than 50 visualizations. These visualizations include basic numerical visualizations such as histograms and scatterplots, as well as biomedically focused visualizations including genome browsers, Circos plots, phylogenetic trees, and metagenomic viewers. Looking forward, we are focusing on two exciting data visualization advances in Galaxy. First, we are bringing Jupyter and RStudio (https://www.rstudio.com/) to Galaxy so that scientists can create their own visualizations using these platforms. Second, we are developing Galaxy visualization dashboards so that users can combine visualization and analysis together for custom visualization and visual analytics.
Q: Any advice or recommendations in terms of evaluating or investing in data visualization tools? Are there any useful resources you can recommend for additional information?
A: Surprisingly, I don’t have a set of go-to resources for learning about data visualization trends and advances. I think this may be because so much of visualization is tied to the particular computing platform being used. That said, some of the very best cutting-edge but also practical visualization work comes from the University of Washington Interactive Data Lab (https://idl.cs.washington.edu/), which develops the web-based visualization tool kits Vega-Lite (vega.github.io/vega-lite/) and Vega (vega.github.io/vega). Another great academic group is the Visualization Design Lab at the University of Utah (http://vdl.sci.utah.edu/team/). I prefer Bokeh when working in Python and ggplot2 (http://docs.ggplot2.org/current/) when working in R. The best commercial visualization tool that I use is Tableau (https://www.tableau.com/).
Jeremy Goecks, PhD, is an assistant professor in the Department of Biomedical Engineering and the Computational Biology Program at Oregon Health and Science University. His research centers on developing computational tools and infrastructure for analyzing large biomedical datasets. Dr. Goecks is a lead investigator for Galaxy, a web-based biomedical data analysis platform. He also leads data management and integration efforts for several cancer informatics initiatives at the university. Dr. Goecks received his PhD in computer science from Georgia Tech and did postdoctoral research in genomics at Emory University.
Q: What do you see as some of the gaps or challenges in data visualization?
A: Data visualization is about analyzing and presenting data in an intuitive graphical manner in order to highlight insights based on large amounts of data. Some predominant challenges in data visualization include the following:
Q: What do you see as big improvements in data visualization?
A: In the past couple of years, many data visualization tools have been proposed, and improvements are arriving more frequently. One significant improvement in data visualization is the introduction of web-based visualization frameworks (e.g., d3.js, three.js, vis.js) to visualize data in the web browser. For 2D data visualization and visual analysis, d3.js has become quite popular. For graphics and virtual reality / augmented reality, three.js is a good starting point. There are also powerful commercial tools such as Tableau, Microsoft Power BI, IBM Watson, and more. Many of these packages are offered in a plug-and-play form, which makes visualizing complex datasets much easier for novice users.
Q: Can you describe some of the data visualization work that you are doing?
A: Currently, our team is working on several data visualization projects such as high-dimensional data analysis using visual features, underground water visualization, and biological network analysis. For underground water visualization, we created a system to monitor the groundwater and supply capability of wells in the Ogallala Aquifer, which is one of the world’s largest aquifers. The tool assists in detecting unusual patterns in the water level such as a sudden increase or decrease. We will be bringing virtual reality into this research to navigate through the different aquifers to see and understand the water quality and how it changes over time.
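As a rough illustration of what flagging a sudden increase or decrease in a water level can look like, here is a minimal Python sketch. It is not the team's system or algorithm: the function name, the readings, and the fixed step limit are all hypothetical, and a production tool would likely use more robust statistics. It simply compares each reading with the previous one and reports changes larger than a chosen limit.

```python
from typing import List, Tuple


def sudden_changes(levels: List[float], max_step: float) -> List[Tuple[int, float]]:
    """Return (index, change) for each reading that differs from the
    previous reading by more than max_step, in either direction."""
    flagged = []
    for i in range(1, len(levels)):
        delta = levels[i] - levels[i - 1]
        if abs(delta) > max_step:
            flagged.append((i, delta))
    return flagged


# Hypothetical water-level readings for one well, in arbitrary units.
levels = [30.0, 30.5, 30.0, 24.0, 24.5, 31.0, 31.5]

# Assumed limit: any change of more than 2.0 units between readings is unusual.
flagged = sudden_changes(levels, max_step=2.0)

for index, delta in flagged:
    direction = "increase" if delta > 0 else "decrease"
    print(f"reading {index}: sudden {direction} of {abs(delta):.1f}")
```

The flagged indices could then be highlighted on a time-series chart so that an analyst sees the unusual readings in context.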
Tommy Dang, PhD, is an assistant professor of computer science at Texas Tech University, where he directs the interactive Data Visualization Lab (iDVL). His research relates to big data visualization and visual analytics, but he also has special interests and skills in 3D modeling, virtual reality, and computer animation. Dr. Dang was previously a postdoc on a DARPA-funded project on biological network visualization at the Electronic Visualization Lab at the University of Illinois at Chicago.