Q: What is big data?
A: From a computer science perspective, it’s more accurate to call it “unstructured data.” Big data is not the pristine data usually found in tables. Unstructured data is the messy, dirty, incomplete data collected from many different sources. So the idea is that, rather than throwing away this messy data that does not fit into any relational database, it is useful to keep it and process it in nontraditional ways. Big data often means large databases with terabytes or petabytes of information, but it can also refer to small amounts of information that are hard to process with traditional software tools and relational databases. Big data is often complex, but that has to do with the incompleteness and inconsistency of the data rather than its size.
Q: Why should lab managers care about big data? Does big data impact all labs?
A: Presumably all labs get information from various data sources that they then have to store. So the question is, are they then able to analyze, compare, and correlate those datasets and results, possibly over time, to gain useful insights? The fundamental argument underlying big data is that, if only we have the right tools to process and analyze all the data that we have, then we can get “gold” (from the data) to drive future discoveries.
Q: What can we truly expect to gain if we make all the right investments in big data?
A: It depends on the type of lab. If it’s a research lab and you are able to pore through all the data, then big data promises to uncover some behaviors and patterns that would be the indicator of some new phenomenon. If it’s a production-oriented lab, then the potential benefit of big data would be to find ways to improve the lab processes by monitoring data from several different processes and machines.
Q: How have you exploited big data in your lab?
A: The work that we do involves processing genetic information to study cancer. As computer scientists we are building tools, particularly open source software, that people can use to process the data more accurately and faster. If people suspect that they need to process their unstructured data in new ways to make discoveries, then open source software like Hadoop’s MapReduce can help those without an extensive computer science background pore through their data and make interesting insights. The MapReduce framework handles some tasks well, but not others. So to complement it we have built another open source framework here at UC Berkeley, called Spark. MapReduce is good for going through all the data in one pass, while Spark is better at iterative processes where you tend to go through the data several times to process it. That’s a good way to separate the two programming frameworks.
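To make the one-pass programming model concrete, here is a toy, single-machine sketch of the map/shuffle/reduce pattern in plain Python. This is only an illustration of the idea, not the Hadoop API: the user supplies a mapper and a reducer, and the framework handles grouping values by key in between.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the user-defined mapper to every input record,
    emitting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the user-defined reducer to each key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word-count example: one pass over the data.
lines = ["big data is messy", "big data is big"]
mapper = lambda line: [(word, 1) for word in line.split()]
reducer = lambda word, counts: sum(counts)

counts = reduce_phase(shuffle(map_phase(lines, mapper)), reducer)
print(counts["big"])  # 3
```

In a real cluster, each phase runs in parallel across many machines; the appeal of the model is that the user writes only the mapper and reducer.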
Q: Is the data processing done in real time, or does the data have to be stored and available in certain formats?
A: There are two approaches here. One is based on processing stored data, while the other is a streaming version. Spark has a standard version for processing stored data, and there is another one for streaming data. But the industry has phenomenal amounts of storage capacity, and it’s getting cheaper to store the data. So the hope is that people will store their data and compare it over time, to gain the insights I was alluding to earlier.
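The two approaches can be sketched in plain Python (a hypothetical toy, not the Spark API): the batch version sees the whole stored dataset at once, while the streaming version updates a running result as each new record arrives, so answers are available before the stream ends.

```python
from collections import Counter

def batch_word_count(stored_lines):
    """Batch style: the whole dataset is available up front,
    and the result is computed in one pass at the end."""
    counts = Counter()
    for line in stored_lines:
        counts.update(line.split())
    return counts

class StreamingWordCount:
    """Streaming style: state is updated incrementally as data
    arrives, so a current result exists at every point in time."""
    def __init__(self):
        self.counts = Counter()

    def on_record(self, line):
        self.counts.update(line.split())
        return self.counts  # result so far, mid-stream

stored = ["sample one", "sample two"]
print(batch_word_count(stored)["sample"])  # 2

stream = StreamingWordCount()
for line in ["sample one", "sample two"]:
    latest = stream.on_record(line)
print(latest["sample"])  # 2
```

Both arrive at the same answer here; the difference is when the answer becomes available and how much data must be held before processing starts.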
Q: Do the users need to be technically proficient, or will some amount of training suffice?
A: It would depend on what they are doing and using now to process their data. The original MapReduce was created by Google to be used by new employees to pore through tons of data, so it was designed to be easy to use. These days there are free online courses available that include tutorials on how to use programs like MapReduce. I would encourage lab managers to review those courses to see if the lab personnel can handle the data processing or if they need to bring in experts. There are also many companies that are now offering commercial support for MapReduce, Spark, and other software.
Q: What changes need to be implemented in the lab for the correct use of big data?
A: The first thing is to figure out how to correctly and affordably store the data obtained from the various instruments and sources. Then you need to ask how likely you are to go back and do a time series analysis of the data to find new insights. There is no point storing the data if you are not going to be processing it. Next, you need to find out if there are people in your lab who would enjoy writing some programs in these easier programming frameworks to answer some of these questions. The third step would be to try machine-learning algorithms that are intended to go through lots of data and find interesting insights. For instance, such algorithms can classify data into different buckets based on its characteristics. So in this way you are increasing the level of sophistication of your data analysis and gaining valuable insights, which starts a positive feedback loop that gets you to collect, store, and process more data.
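As a concrete sketch of that third step, here is a toy one-dimensional k-means clustering in plain Python (an illustration with made-up readings, not a production library) that sorts measurements into buckets by value:

```python
def kmeans_1d(values, k, iterations=20):
    """Toy 1-D k-means: repeatedly assign each value to its nearest
    center, then move each center to the mean of its bucket."""
    # Seed centers with evenly spaced values from the sorted data.
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iterations):
        buckets = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(v - centers[i]))
            buckets[nearest].append(v)
        # Recompute each center; keep the old one if its bucket is empty.
        centers = [sum(b) / len(b) if b else c
                   for b, c in zip(buckets, centers)]
    return centers, buckets

# Hypothetical instrument readings with two obvious groups.
readings = [1.0, 1.2, 0.9, 10.1, 9.8, 10.4]
centers, buckets = kmeans_1d(readings, k=2)
print(sorted(len(b) for b in buckets))  # [3, 3]
```

Real lab data would have many dimensions and need a proper library, but the principle is the same: the algorithm finds the buckets from the data itself, without anyone labeling them in advance.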
Q: What are some of the concerns associated with big data?
A: Let’s talk about data security first. Putting the data on the Internet is very different from keeping it within your lab. If you put your data sources out on the Internet, people are likely to discover them, and hence it’s usually a good idea to have it all encrypted. You also want to limit the number of people who have access to it. In terms of putting patient information or other types of private data on the Internet, there are strong legal and ethical rules in place, such as HIPAA [the Health Insurance Portability and Accountability Act] in the United States. Another thing to consider would be storing the data in the cloud rather than on servers in your organization. Companies like Amazon, Google, and Microsoft offer cloud-based computing services, and as part of that offering they have best practices in place for keeping the data secure. People should also always have a backup plan for their data in case there is an operator accident or an equipment failure.
Q: Any advice for lab managers based on your experiences and expertise?
A: Talking to other lab managers and people who are at the cutting edge of the technology to find out what’s worked for them would be good for everyone trying something new. In the IT industry there is a lot of enthusiasm around big data because there is the belief that if we mine the data, we can find gold. Many companies, like Walmart, are able to change the way that they are doing business based on the data that they are collecting. So it seems like common sense to be collecting the data that you are generating and mining it to improve your position in the marketplace.
Q: When and how do you come to the realization that big data is not working for you?
A: It’s a learning curve. It’s not something that you can do in three months and declare that it was a failure. You would need to give yourself at least 18 months to get people trained and get the data collected and processed. There are courses and resources out there to help you learn about the tools and technologies. There are also consultants who can tell you what you need to do. You can start small and relatively cheaply. Get a storage system that is relatively inexpensive or use the cloud, and then look into processing the data. Another idea is to reach out to some students at universities who have the necessary skill sets in computer science or statistics and hire them for the summer to work in your lab, which is a low-risk, low-cost option to get this done.
Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. (Source: Hadoop) More on MapReduce at http://hadoop.apache.org/docs/stable1/mapred_tutorial.html
Spark is an open source cluster computing system that aims to make data analytics fast—both fast to run and fast to write. To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce. (Source: AMP Lab, UC Berkeley) More on Spark at http://spark.incubator.apache.org/research.html
David Patterson joined the University of California at Berkeley in 1977 after receiving all his degrees from the University of California at Los Angeles. His most successful projects have been Reduced Instruction Set Computers (RISC), Redundant Arrays of Inexpensive Disks (RAID), and Network of Workstations (NOW). All three projects helped lead to multibillion-dollar industries. This research led to many papers and six books, with the best-known book being Computer Architecture: A Quantitative Approach, co-authored by John Hennessy, and the most recent book being Engineering Software as a Service, co-authored by Armando Fox. His current research is centered on cancer genomics for UC Berkeley’s AMP and ASPIRE Labs. In the past, he served as director of the Parallel Computing Lab (Par Lab), director of the Reliable And Distributed Systems Lab (RAD Lab), chair of UC Berkeley’s CS Division, chair of the Computing Research Association (CRA), and president of the Association for Computing Machinery (ACM). This work resulted in 35 honors, some shared with colleagues. His research awards include election to the National Academy of Engineering, the National Academy of Sciences, and the Silicon Valley Engineering Hall of Fame along with being named Fellow of ACM, the Computer History Museum, IEEE, and both AAAS organizations. His teaching honors include the ACM Karlstrom Outstanding Educator Award, the IEEE Mulligan Education Medal, and the UC Berkeley Distinguished Teaching Award. He has also received Distinguished Service Awards from ACM, CRA, and SIGARCH.