Trends in Data Visualization
Jeremy Goecks, PhD, and Tommy Dang, PhD, talk about the key challenges in data visualization.
Today, visualizing large, diverse datasets requires new approaches and tools for viewing and computing complex data. Jeremy Goecks, PhD, and Tommy Dang, PhD, discuss some of these new tools and comment on upcoming trends in web-based visualization, visual analytics, and custom visualization that could become game-changers in the near future.
Q: What are some of the challenges in data visualization today?
A: Two key challenges that my colleagues and I encounter regularly are heterogeneity and scalability. There are so many different types of biomedical data today, from molecular data such as genomics and proteomics to imaging at different scales to clinical data from electronic health records. Each data type can be visualized in many different ways. Trying to develop visualization tools, and especially dashboards with multiple visualizations for working with collections of different data, is challenging. Due to technological advances, the size of biomedical datasets is growing substantially. Even after significant processing, today’s genomic and imaging datasets are very large and can include data from thousands of samples. Visualizing increasingly large datasets requires new approaches for viewing and making sense of so many data points, and more powerful supporting computing infrastructure is often necessary as well.
Q: What improvements are being made in terms of new tools and applications for data visualization?
A: Visualization is increasingly being integrated with analysis methods for visual analytics. In visual analytics, visualization and analyses are done side by side in the same interface/dashboard so the results can be visualized immediately. This tight connection between visualization and analysis tools offers powerful and immediate data exploration and understanding. Web-based analysis notebooks such as Jupyter (http://jupyter.org/) are a great example of how interactive visual analysis can be realized as a game-changing platform. On the technical side, tool kits for creating high-performance, web-based visualizations are becoming quite mature. I especially like Bokeh (http://bokeh.pydata.org), where Python can be used to create simple, publication-quality visualizations for the web or complex dashboards with interactive visualizations that dynamically query large databases. With Bokeh, I can write all my code in Python but run it on the web without having to write any JavaScript.
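The Python-only workflow described above can be sketched with a minimal Bokeh script that writes a standalone interactive HTML page, no JavaScript required. The data points and file name here are illustrative assumptions, not from any real analysis.

```python
# Minimal Bokeh sketch: build a scatterplot in pure Python and save it
# as a self-contained, interactive HTML page for the web.
from bokeh.plotting import figure, output_file, save

# Illustrative data: hypothetical sample counts vs. a measured quantity.
x = [10, 50, 100, 500]
y = [30.0, 28.5, 27.2, 26.1]

output_file("demo_scatter.html", title="Demo scatter")

p = figure(title="Samples vs. coverage", width=400, height=300,
           x_axis_label="samples", y_axis_label="coverage")
p.scatter(x, y, size=8)

save(p)  # writes demo_scatter.html; open it in any browser
```

Opening the resulting HTML file in a browser gives pan, zoom, and hover interactivity out of the box, which is the "no JavaScript" convenience being described.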
Q: How is data visualization going to impact big data analytics?
A: Data visualization will play a key role in quality assessment and understanding as larger analyses are done. It can help in determining whether an analysis produced high-quality results. For example, visualization can help in finding spurious correlations that an analysis may have picked up but that do not reflect real relationships in the dataset. Another example where visualization can help is in selecting a threshold to produce the best trade-off between true-positive and false-positive rates. Visualizing the trade-offs at a set of thresholds can simplify this decision. Data visualization will also help in understanding how statistical and machine-learning models are making decisions. To understand these models, data visualization can show which parts of an image were most important for classifying the image, or highlight the customer attributes used to decide which offers to make to a customer.
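The threshold trade-off mentioned above can be made concrete with a short sketch that computes true- and false-positive rates at several candidate thresholds, the numbers one would plot as an ROC-style curve. The scores and labels are made-up illustration data.

```python
# Sketch: compute the true-/false-positive trade-off at candidate score
# thresholds, the raw numbers behind an ROC-style visualization.

def roc_points(scores, labels, thresholds):
    """Return (threshold, TPR, FPR) for each candidate threshold.

    A prediction counts as positive when its score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, tp / positives, fp / negatives))
    return points

# Hypothetical classifier scores and true labels (1 = positive class).
scores = [0.95, 0.90, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

for t, tpr, fpr in roc_points(scores, labels, [0.3, 0.5, 0.7, 0.9]):
    print(f"threshold={t:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Plotting FPR against TPR for these points (with any charting library) yields the curve an analyst would inspect to pick the operating threshold.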
Q: Can you describe some of the work that you are doing in this field and how you see that changing in the near future?
A: I am a principal investigator for Galaxy (https://galaxyproject.org/), an open source, web-based platform for analyzing large biomedical datasets that is used daily by thousands of scientists across the world. Scientists can use Galaxy from a web browser to upload their genomic, proteomic, and other biomedical datasets, and run complex analysis tools and even entire workflows. Over the past several years, we have extended Galaxy so that visualizations can be added to Galaxy and scientists can use them in their web browsers just as they use other tools. Galaxy users now have access to more than 50 visualizations. These visualizations include basic numerical visualizations such as histograms and scatterplots, as well as biomedically focused visualizations including genome browsers, Circos plots, phylogenetic trees, and metagenomic viewers. Looking forward, we are focusing on two exciting data visualization advances in Galaxy. First, we are bringing Jupyter and RStudio (https://www.rstudio.com/) to Galaxy so that scientists can create their own visualizations using these platforms. Second, we are developing Galaxy visualization dashboards so that users can combine visualization and analysis together for custom visualization and visual analytics.
Q: Any advice or recommendations in terms of evaluating or investing in data visualization tools? Are there any useful resources you can recommend for additional information?
A: First, I am a big believer in open source software because it is available to anyone and can be extended and improved by a community of developers. Investing in technologies and tools, whether in data visualization or other domains, is about selecting tools that not only perform well now but are also likely to be useful in one to three years. With this in mind, I choose technologies/tools that are packaged well and easy to deploy, such as through npm for JavaScript (https://www.npmjs.com/) and Conda (http://conda.pydata.org) for Python. I also choose technologies that are being actively worked on, as evidenced by a source code repository (e.g., on GitHub) that shows ongoing work.
Surprisingly, I don’t have a set of go-to resources for learning about data visualization trends and advances. I think this may be because so much of visualization is tied to the particular computing platform being used. That said, some of the best cutting-edge yet practical visualization work comes from the University of Washington Interactive Data Lab (https://idl.cs.washington.edu/), which develops the web-based visualization tool kits Vega-Lite (https://vega.github.io/vega-lite/) and Vega (https://vega.github.io/vega/). Another great academic group is the Visualization Design Lab at the University of Utah (http://vdl.sci.utah.edu/team/). I prefer Bokeh when working in Python and ggplot2 (http://docs.ggplot2.org/current/) when working in R. The best commercial visualization tool that I use is Tableau (https://www.tableau.com/).
Jeremy Goecks, PhD, is an assistant professor in the Department of Biomedical Engineering and the Computational Biology Program at Oregon Health and Science University. His research centers on developing computational tools and infrastructure for analyzing large biomedical datasets. Dr. Goecks is a lead investigator for Galaxy, a web-based biomedical data analysis platform. He also leads data management and integration efforts for several cancer informatics initiatives at the university. Dr. Goecks received his PhD in computer science from Georgia Tech and did postdoctoral research in genomics at Emory University.