Trends in Data Visualization

Jeremy Goecks, PhD, and Tommy Dang, PhD, talk about the key challenges in data visualization.

by Tanuja Koppal, PhD

Today, visualizing large, diverse datasets requires new approaches and tools for viewing and computing on complex data. Jeremy Goecks, PhD, and Tommy Dang, PhD, discuss some of these new tools, and comment on upcoming trends in web-based visualization, visual analytics, and custom visualization that could become game-changers in the near future.


Q: What are some of the challenges in data visualization today?

A: Two key challenges that my colleagues and I encounter regularly are heterogeneity and scalability. There are so many different types of biomedical data today, from molecular data such as genomics and proteomics to imaging at different scales to clinical data from electronic health records. Each data type can be visualized in many different ways. Trying to develop visualization tools, and especially dashboards with multiple visualizations for working with collections of different data, is challenging. Due to technological advances, the size of biomedical datasets is growing substantially. Even after significant processing, today’s genomic and imaging datasets are very large and can include data from thousands of samples. Visualizing increasingly large datasets requires new approaches for viewing and making sense of so many data points, and more powerful supporting computing infrastructure is often necessary as well.


Q: What improvements are being made in terms of new tools and applications for data visualization?

A: Visualization is increasingly being integrated with analysis methods for visual analytics. In visual analytics, visualization and analyses are done side by side in the same interface/dashboard so the results can be visualized immediately. This tight connection between visualization and analysis tools offers powerful and immediate data exploration and understanding. Web-based analysis notebooks such as Jupyter (http://jupyter.org/) are a great example of a game-changing platform for interactive visual analysis. On the technical side, toolkits for creating high-performance, web-based visualizations are becoming quite mature. I especially like Bokeh (http://bokeh.pydata.org), where Python can be used to create simple, publication-quality visualizations for the web or complex dashboards with interactive visualizations that dynamically query large databases. With Bokeh, I can write all my code in Python but run it on the web without having to write any JavaScript.
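To make that workflow concrete, here is a minimal sketch of the kind of Bokeh script described above; the dataset and labels are invented for illustration, and the point is only the Python-to-interactive-HTML path, not any real analysis.

```python
# Minimal Bokeh sketch: pure Python in, interactive HTML out.
# The sample/expression values below are invented for illustration.
from bokeh.plotting import figure, output_file, show

samples = [1, 2, 3, 4, 5]
expression = [2.1, 3.4, 2.9, 5.6, 4.8]

output_file("expression.html")  # Bokeh writes a standalone web page

p = figure(title="Expression by sample",
           x_axis_label="Sample", y_axis_label="Expression level")
p.circle(samples, expression, size=10)

show(p)  # opens the interactive plot in a browser; no JavaScript written
```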

Q: How is data visualization going to impact big data analytics?

A: Data visualization will play a key role in quality assessment and understanding as larger analyses are done. It can help in determining whether an analysis produced high-quality results. For example, visualization can help in finding spurious correlations that an analysis may have picked up but that do not reflect real relationships in the dataset. Another example where visualization can help is in selecting a threshold that produces the best trade-off between true- and false-positive rates; visualizing the trade-offs at a set of thresholds can simplify this decision (a small sketch follows). Data visualization will also help in understanding how statistical and machine-learning models make decisions. To understand these models, data visualization can show which parts of an image were most important for classifying it, or highlight the customer attributes used to decide what offers to make to a customer.
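As a hedged illustration of that threshold trade-off, the sketch below computes true- and false-positive rates across candidate thresholds with scikit-learn's roc_curve and plots them with Bokeh; the labels and model scores are invented.

```python
# Illustrative only: invented labels/scores. Computes the TPR/FPR
# trade-off at each candidate threshold, then plots the curve.
from sklearn.metrics import roc_curve
from bokeh.plotting import figure, output_file, show

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                     # hypothetical ground truth
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.55]  # hypothetical model scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)

output_file("threshold_tradeoff.html")
p = figure(title="True- vs. false-positive trade-off",
           x_axis_label="False-positive rate",
           y_axis_label="True-positive rate")
p.line(fpr, tpr, line_width=2)
p.circle(fpr, tpr, size=8)  # one marker per candidate threshold
show(p)
```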

Q: Can you describe some of the work that you are doing in this field and how you see that changing in the near future?

A: I am a principal investigator for Galaxy (https://galaxyproject.org/), an open-source, web-based platform for analyzing large biomedical datasets that is used daily by thousands of scientists across the world. Scientists can use Galaxy from a web browser to upload their genomic, proteomic, and other biomedical datasets, and run complex analysis tools and even entire workflows. Over the past several years, we have extended Galaxy so that visualizations can be added to it and scientists can use them in their web browsers just as they use other tools. Galaxy users now have access to more than 50 visualizations. These include basic numerical visualizations such as histograms and scatterplots, as well as biomedically focused visualizations including genome browsers, Circos plots, phylogenetic trees, and metagenomic viewers. Looking forward, we are focusing on two exciting data visualization advances in Galaxy. First, we are bringing Jupyter and RStudio (https://www.rstudio.com/) to Galaxy so that scientists can create their own visualizations using these platforms. Second, we are developing Galaxy visualization dashboards so that users can combine visualization and analysis for custom visualization and visual analytics.
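Galaxy is used through the browser as described, but for completeness, there is also a community Python client, BioBlend, for scripting against a Galaxy server's API. The sketch below is a hedged illustration: the server URL, API key, and file name are placeholders, not part of the interview.

```python
# Hedged sketch of scripted access to a Galaxy server via BioBlend,
# a community Python client for the Galaxy API. URL/key are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://usegalaxy.org", key="YOUR_API_KEY")

# Create a fresh analysis history and upload a (hypothetical) dataset to it.
history = gi.histories.create_history(name="visualization-demo")
gi.tools.upload_file("variants.vcf", history["id"])
```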

Q: Any advice or recommendations in terms of evaluating or investing in data visualization tools? Are there any useful resources you can recommend for additional information?

A: First, I am a big believer in open-source software because it is available to anyone and can be extended and improved by a community of developers. Investing in technologies and tools, whether in data visualization or other domains, is about selecting the tool that not only performs well now but is also likely to be useful in one to three years. With this in mind, I choose technologies and tools that are packaged well and easy to deploy, such as through npm for JavaScript (https://www.npmjs.com/) and Conda (http://conda.pydata.org) for Python. I also choose technologies that are being actively worked on, as evidenced by a source code repository (e.g., on GitHub) that shows ongoing work.

Surprisingly, I don’t have a set of go-to resources for learning about data visualization trends and advances. I think this may be because so much of visualization is tied to the particular computing platform being used. That said, some of the best cutting-edge yet practical visualization work comes from the University of Washington Interactive Data Lab (https://idl.cs.washington.edu/), which develops the web-based visualization toolkits Vega-Lite (https://vega.github.io/vega-lite/) and Vega (https://vega.github.io/vega/). Another great academic group is the Visualization Design Lab at the University of Utah (http://vdl.sci.utah.edu/team/). I prefer Bokeh when working in Python and ggplot2 (http://docs.ggplot2.org/current/) when working in R. The best commercial visualization tool that I use is Tableau (https://www.tableau.com/).

Jeremy Goecks, PhD, is an assistant professor in the Department of Biomedical Engineering and the Computational Biology Program at Oregon Health and Science University. His research centers on developing computational tools and infrastructure for analyzing large biomedical datasets. Dr. Goecks is a lead investigator for Galaxy, a web-based biomedical data analysis platform. He also leads data management and integration efforts for several cancer informatics initiatives at the university. Dr. Goecks received his PhD in computer science from Georgia Tech and did postdoctoral research in genomics at Emory University.

Tommy Dang, PhD

Q: What do you see as some of the gaps or challenges in data visualization?

A: Data visualization is about analyzing and presenting data in an intuitive graphical form to surface insights from large amounts of data. Some predominant challenges in data visualization include the following:

  • Real-time scalability of dynamic data: Social media sites such as Facebook and Twitter produce huge amounts of diverse data every hour. Visualizing so much data in real time is not a straightforward task (see the streaming sketch after this list).
  • Combining data visualization with other modern techniques: Machine-learning techniques are playing a vital role in big data analysis. Having smart underlying algorithms that quickly highlight trends/patterns is very useful, as users are typically interested in a very small (but important) subset of the larger dataset.
  • Scalability across heterogeneous screen sizes: As datasets keep growing, visualizing them on a standard screen is no longer enough. Large, high-resolution (cluster) displays are highly desirable for many applications, such as medical image analysis or weather forecasting. Large displays also promote a collaborative environment that connects researchers in different geographic locations.
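As a hedged illustration of the real-time point above, Bokeh's ColumnDataSource supports incremental streaming; in the sketch below, random values stand in for a live feed, and the update interval, rollover limit, and field names are all invented.

```python
# Hedged sketch of real-time streaming with Bokeh (run with `bokeh serve`).
# Random values stand in for a live data feed such as a social-media stream.
import random
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, curdoc

source = ColumnDataSource(data={"t": [], "value": []})

p = figure(title="Live feed (illustrative)",
           x_axis_label="t", y_axis_label="value")
p.line(x="t", y="value", source=source)

t = 0
def update():
    global t
    t += 1
    # Append one new point; keep only the last 200 so memory stays bounded.
    source.stream({"t": [t], "value": [random.random()]}, rollover=200)

curdoc().add_periodic_callback(update, 1000)  # once per second
curdoc().add_root(p)
```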

Q: What do you see as big improvements in data visualization?

A: In the past couple of years, many data visualization tools have been introduced, and improvements are arriving more frequently. One significant improvement is the introduction of web-based visualization frameworks (e.g., d3.js, three.js, vis.js) that visualize data directly in the browser. For 2D data visualization and visual analysis, d3.js [1] has become quite popular. For graphics and virtual/augmented reality, three.js [2] is a good starting point. There are also powerful commercial tools such as Tableau, Microsoft Power BI, IBM Watson, and more. These packages come in “plug and play” form, which makes visualizing complex datasets much easier for novice users.

  1. Bostock, Michael. D3.js: Data-Driven Documents. 2012.
  2. Danchilla, Brian. “Three.js Framework.” Beginning WebGL for HTML5. Apress, 2012. 173–203.

Q: Can you describe some of the data visualization work that you are doing?

[Figure: A doctoral student in the iDVL lab experiencing a virtual model of the Southern High Plains Aquifer of Texas using Google Cardboard. More details are available on the iDVL website at http://www.myweb.ttu.edu/tnhondan/iDVL]

A: Currently, our team is working on several data visualization projects, such as high-dimensional data analysis using visual features, underground water visualization, and biological network analysis. For underground water visualization, we created a system to monitor the groundwater and supply capability of wells in the Ogallala Aquifer, one of the world’s largest aquifers. The tool assists in detecting unusual patterns in the water level, such as a sudden increase or decrease. We will be bringing virtual reality into this research to navigate through the different aquifers and to see and understand the water quality and how it changes over time.
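As a hedged sketch of the “sudden increase or decrease” detection described above, a simple rolling z-score flag over a water-level series might look like the following; the readings, window, and threshold are invented, not taken from the actual monitoring system.

```python
# Illustrative only: flag abrupt jumps in a water-level series by comparing
# each reading against the mean/std of a rolling window of recent readings.
import statistics

def flag_sudden_changes(levels, window=7, z_threshold=3.0):
    """Return indices where a reading deviates sharply from the recent trend."""
    flagged = []
    for i in range(window, len(levels)):
        recent = levels[i - window:i]
        mean = statistics.mean(recent)
        sd = statistics.stdev(recent)
        if sd > 0 and abs(levels[i] - mean) / sd > z_threshold:
            flagged.append(i)
    return flagged

# Hypothetical daily water-level readings (meters) with one abrupt drop.
levels = [10.2, 10.1, 10.3, 10.2, 10.2, 10.1, 10.3, 10.2, 7.5, 10.1]
print(flag_sudden_changes(levels))  # -> [8]
```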


Tommy Dang, PhD, is an assistant professor of computer science at Texas Tech University, where he directs the Interactive Data Visualization Lab (iDVL). His research relates to big data visualization and visual analytics, but he also has special interests and skills in 3D modeling, virtual reality, and computer animation. Dr. Dang was previously a postdoctoral researcher on a DARPA-funded project on biological network visualization at the Electronic Visualization Laboratory at the University of Illinois at Chicago.