Launching a Big Data Project

Start Small and Take Your Time

Keep in mind that a big data project is not the same as a business intelligence project. Eric Brown¹ offers a thumbnail description: business intelligence helps find answers to questions you know; big data helps you find the questions you don’t know you want to ask. While the differences are actually more complex, this description does illustrate the main difference in focus.

A good place to start is with an examination of what the term “big data” means. Unfortunately, this is somewhat like stepping off solid ground into quicksand. If you ask people in different organizations what they mean by big data, you are likely to get radically different answers from each of them. There is no real consensus on what the term means. Reading through a variety of papers brings to mind Lewis Carroll’s book Through the Looking-Glass, where Humpty Dumpty states, “When I use a word, it means just what I choose it to mean, neither more nor less.”

Part of the reason for this is that big data is such a large umbrella term that it encompasses projects with very different goals. For the purpose of this article, the following criteria apply:

Data is generally complex and unstructured.
Data is frequently dirty and must be cleaned.
Data is difficult to process with existing tools.

While exploring the possible benefits of big data, keep in mind that a big data solution is a technology while a data warehouse is an architecture. When talking with vendors, you may come across some who claim that a data warehouse is unnecessary if you have a big data solution. Because the terms refer to different types of things, having one does not eliminate the need for the other, though there are a number of ways that they can be used together.² The purpose of a data warehouse is generally to ensure that everyone in the company is using the same data.

There are many ways to launch a big data project. However, most of the papers that I’ve examined start with the same piece of initial advice: start small. By starting with a smaller subset of your data, or even with one of the external test data sets available, you allow your big data team to become familiar with the various tools, reducing stress and minimizing the risk of errors.
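
As a rough illustration of what "a smaller subset of your data" can mean in practice, the Python sketch below draws a fixed-size random sample from a stream of records without loading everything into memory. It uses reservoir sampling, which is one common choice and not something the papers cited here prescribe; the function name sample_subset and its parameters are hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def sample_subset(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k records chosen uniformly at random (reservoir sampling).

    The input is read once, so it can be a file, a database cursor,
    or any other stream too large to hold in memory.
    """
    rng = random.Random(seed)  # fixed seed so a pilot run can be repeated
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    # Stand-in for a large data source; real records would come from your own files.
    pilot = sample_subset(range(1_000_000), k=1_000)
    print(len(pilot))
```

A pilot set drawn this way is small enough to explore on a single workstation while the team is still learning the tools.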

When selecting people to staff and head the actual project, a good place to start looking is among your existing personnel. While the person heading this team should ideally be someone who is very computer literate and has an understanding of statistics, in most cases you are better off with someone who is a generalist, as opposed to a specialist in a single field. While some might find it surprising, the personality characteristic that is frequently cited as a requirement for driving a big data analysis project to a successful resolution is that of a philosopher. As Darin Bartik³ argues, it is the application of the Socratic Method, the answering of a question with a question to push for hidden answers, that is imperative to the success of a big data program.

Depending on the size of your organization, you might have an internal Information Technology (IT) group. If you do, their expertise can be invaluable. On the other hand, I’ve observed a tendency for some IT groups to attempt to hijack internal informatics projects simply because they were, well, computer related. While these groups may well have superior expertise in networking, database creation, or server support, they generally have no expertise in critical fields such as chemistry or pharmaceuticals, nor any clinical experience. This domain expertise is critical to being able to tease useful information out of your potential mountains of raw data. In the best of cases, all of the groups come together with the goal of making the project successful, with no attempts to extend an existing fiefdom. When such attempts do occur, it may well fall to the lab manager to quell any departmental infighting, which can be a major challenge. Good communication lubricates a lot of things, and for this reason alone it is important that the designated team leader keep good communication going with the laboratory manager. The expectation is not that the lab manager will resolve every problem that inevitably occurs, but that the lab manager can recognize and act on any higher-level issues that the team leader might not have been made aware of.

There are a variety of ways that a big data project might be implemented. The three most common approaches are:

Contract to outsource the project completely.
Hire a consultant to work with internal staff.
Handle the entire project internally.

While turning the project over to an external company can be less disruptive to normal laboratory operations, it is fraught with hidden risk. The biggest risk is the same as allowing the company’s IT team to take over the project: the odds that the external company’s staff has sufficient domain expertise to successfully run the project to completion are small.

Hiring a consultant to assist in designing and installing the system definitely has its benefits, particularly if your staff resources are relatively small. This process does include the potential risk of your becoming dependent on the consultant and having their expertise leave with them. For this reason, you should maximize the value of the consultant by having the project staff work closely with them to learn as much about the process as they can.

Handling the whole project internally is definitely possible, but it comes with a number of caveats. Most important, the people selected for this team, particularly their leader, must understand that this implementation is their primary task, and this must be enforced. In other words, they should not be expected to also perform their normal assignments. It is critical that they be allowed to focus on this project and be protected from other people trying to divert them to other projects, even part time. Expecting them to handle both is unrealistic and generally leads to burnt-out personnel and failed projects.

Once the team has been selected, there is a natural tendency to want to figure out what tools they will be using and get familiar with them, but this would be a mistake. Picking your informatics solution first invites the law of the instrument: if the only tool you have is a hammer, everything looks like a nail. Instead of starting by identifying the tools, the better approach is to identify the business problem you are attempting to solve or the business opportunity you are attempting to address. This is obviously a challenge, as one of the goals of a big data project is to identify previously unknown relationships.

If it doesn’t already exist, now would be a good time to set up an information/data governance policy to manage your big data. Data governance can be defined as “…the business-driven policy-making and oversight of data. As defined, data governance applies to each of the six preceding stages of big data delivery. [Collect; Process; Manage; Measure; Consume; and Store.] By establishing processes and guiding principles, it sanctions behaviors around data. And big data needs to be governed according to its intended consumption, otherwise the risk is disaffection of constituents, not to mention overinvestment.”⁴

This plan describes how the data is collected, processed, managed, consumed, and stored. Among other things, this document defines who has access to the data. While the company might have intellectual property concerns over the data, there are a variety of ethical⁵ and regulatory concerns about it as well, with one of the major ones being privacy. More than 80 countries currently have data privacy laws in place. In the U.S., one must be concerned not only with federal regulations, such as Sarbanes-Oxley and the Health Insurance Portability and Accountability Act (HIPAA), but individual state laws as well. Forrester Consulting advocates creating multiple governance zones, instead of “a single set of standards, policies, and practices,” which their research indicates “stifles the value that can be achieved from big data investment and insights.”
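
One way to picture multiple governance zones is as a small table of who may read what. The Python sketch below is only an illustration under stated assumptions: the zone names, role names, and the can_access helper are all hypothetical and do not come from Forrester or from any particular product.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class GovernanceZone:
    """A hypothetical governance zone: a named area of data with its own access rules."""
    name: str
    allowed_roles: FrozenSet[str]
    contains_personal_data: bool


# Hypothetical zones: a loosely governed exploration area and a tightly governed one.
ZONES: Dict[str, GovernanceZone] = {
    "exploration": GovernanceZone(
        name="exploration",
        allowed_roles=frozenset({"analyst", "data_steward"}),
        contains_personal_data=False,
    ),
    "regulated": GovernanceZone(
        name="regulated",
        allowed_roles=frozenset({"data_steward"}),
        contains_personal_data=True,
    ),
}


def can_access(role: str, zone_name: str) -> bool:
    """Return True only if the zone exists and explicitly lists the role."""
    zone = ZONES.get(zone_name)
    return zone is not None and role in zone.allowed_roles


if __name__ == "__main__":
    print(can_access("analyst", "exploration"))
    print(can_access("analyst", "regulated"))
```

Unknown zones and unlisted roles are denied by default, which is a design choice made for this sketch and not a requirement stated in the governance literature cited here.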

In conjunction with this data governance policy, you should also have your team review the state of the current data to maximize its usefulness. I wish it went without saying, but if you have been deleting or purging data, stop! Almost as important, your data only has value if you know what it is. As big data is normally unstructured, or at best semi-structured, the only way to do this reliably is to make sure that you include metadata regarding your data (in other words, data about the data). This is even more important if you are pulling data in from a variety of sources, such as satellite labs.
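
To make "data about the data" concrete, the Python sketch below builds a small metadata record that could be stored alongside a raw data file. Every field name here (source, instrument, units, and so on) is a hypothetical example and not a standard; your own governance policy would decide what must be recorded.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict, Union


def describe_dataset(payload: bytes, source: str, instrument: str,
                     units: str) -> Dict[str, Union[str, int]]:
    """Build a metadata record for one raw data payload.

    All field names are illustrative assumptions, not a published schema.
    """
    return {
        "source": source,            # e.g. which satellite lab sent the file
        "instrument": instrument,    # what produced the measurements
        "units": units,              # units of the recorded values
        "size_bytes": len(payload),
        # Checksum lets you detect later corruption or silent edits.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def to_sidecar_json(meta: Dict[str, Union[str, int]]) -> str:
    """Serialize the record so it can be saved next to the data file."""
    return json.dumps(meta, indent=2, sort_keys=True)


if __name__ == "__main__":
    raw = b"time,absorbance\n0.0,0.12\n0.5,0.15\n"  # made-up sample content
    meta = describe_dataset(raw, source="satellite-lab-2",
                            instrument="HPLC", units="mAU")
    print(to_sidecar_json(meta))
```

Writing the record at the moment the data arrives is usually easier than reconstructing it later, particularly when files come from several sites.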

Eventually, tools will need to be selected and the team trained in their use. Fortunately, this is somewhat easier than it sounds, as there are multiple online courses on data analytics and the various big data tools available. Many of these courses are free or inexpensive, though some can cost thousands of dollars. While it focuses on using Apache’s Hadoop in a Linux/Unix environment, Russell Jurney’s book Agile Data Science⁶ provides a good tutorial regarding the approaches to handling big data and setting up the required software environment.

Nor does the decision regarding which tools to use need to be made in the dark. Many vendors have trial versions of their wares available for evaluation, so that your team can see which ones best meet their needs. In some cases this may be a pre-integrated and configured virtual machine that you can just install and run. In other cases, they may provide online instructions regarding how to download, configure, and run the applications. Most of the cloud providers also offer free online access to their systems for evaluation. In almost all cases, there are associated webcasts available to assist in evaluating and using the packages. Many of these software “stacks” are based on Hadoop, but other options are increasingly available.

Despite what some vendors might indicate, your team will not be up and analyzing data within a half hour. Unfortunately, the learning curve for big data tools, while flattening, is still fairly steep. It will take significant time to just evaluate both the tools and your data to determine the best match. As Lab Manager has previously indicated, you should allow approximately 18 months before declaring a big data project to be a success or a failure.

References

1. Brown, Eric D., “What’s the difference between Business Intelligence and Big Data?,” 2014; published online June 5. http://ericbrown.com/whats-difference-business-intelligence-big-data.htm (accessed Jan. 31, 2015).

2. Torr, Mark, “Three ways to use a Hadoop data platform without throwing out your data warehouse.” SAS Inst. 2014; published online Oct. http://blogs.sas.com/content/sascom/2014/10/13/adopting-hadoop-as-a-data-platform (accessed Jan. 21, 2015).

3. Bartik, Darin, “How Data Analytics and the Socratic Method Can Help Take Your Business to the Next Level.” Dell. 2014; published online July 17. http://en.community.dell.com/dell-blogs/direct-2dell/b/direct2dell/archive/2014/07/17/how-data-analytics-andthe-socratic-method-can-help-take-your-business-to-the-next-level (accessed Jan. 28, 2015).

4. “The Intersection of Big Data, Data Governance and MDM.” SAS Inst. http://www.sas.com/reg/gen/corp/2244936 (accessed Jan. 27, 2015).

5. A Unified Ethical Frame for Big Data Analysis. http://www.privacyconference2014.org/media/17388/Plenary5-Martin-Abrams-Ethics-Fundamental-Rights-and-BigData.pdf (accessed Jan. 21, 2015).

6. Jurney, Russell, “Agile Data Science: Building Data Analytics Applications with Hadoop,” O’Reilly Media, 2013. http://shop.oreilly.com/product/0636920025054.do