Address the needs of your applications at each step in the life cycle.
Do you like Pringles™ potato chips? Do you use skin lotion? Have you noticed all the new planets outside our solar system that have been recently discovered? When you’ve flown, have you noticed the funny tails that are on the wing tips? Have you seen the new cancer-fighting drugs that are tailored to individuals and their tumors? All these things and many others are made possible by HPC (high performance computing). HPC is quickly becoming an important tool in many companies, research institutes, and universities.
There is no real definition for HPC; generally speaking, HPC is considered to be any form of computation that is faster than your laptop or desktop. But even this traditional definition is changing. Instead of only running things faster than your desktop or laptop, HPC now also encompasses the need to run 25,000 copies of the same application at the same time (kind of difficult to do on your dual-core laptop). Some of the largest systems in the world now have over 1,000,000 cores, such as the Sequoia system at Lawrence Livermore National Laboratory.
Equally important is the recent trend of using nontraditional computing systems such as GPUs (graphics processing units), DSPs (digital signal processors), and specialized processors with over 50 small CPU cores for running applications. Generically, these specialized processors are called “accelerators” because they can accelerate traditional processing. Accelerators offer the potential for very large increases in computational performance for applications that can take advantage of the architecture. You can get performance that is between two and 12 times better than what you can get on a single server today.
The intersection of many CPU cores and accelerators means that we need faster storage to keep up with the larger number of computations, and we need a great deal more storage because we are generating results at a much faster rate. A very conservative assertion in the HPC world is that the size of our models or problems doubles every year and the amount of storage we need also doubles every year. This means that in three years you will need eight times the storage you have today and your models or problems will be eight times larger. Unfortunately, you don’t see hard drives doubling in capacity every year and you don’t see prices coming down that rapidly either. Furthermore, hard drive speeds are not improving at a pace to keep up with the size of HPC systems, and SSDs (solid state drives) are cost prohibitive at the capacities needed. As you can see, HPC storage can be a complex problem with no easy solutions. In this article, I want to put forth a high-level examination of architecting an HPC storage solution, focusing on two aspects: (1) the applications and (2) the data life cycle.
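To make the doubling arithmetic above concrete, here is a minimal sketch. The 100 TB starting capacity and the function name are illustrative assumptions, not figures from any particular site; only the "doubles every year" rule comes from the text.

```python
def projected_capacity_tb(current_tb: float, years: int, growth_factor: float = 2.0) -> float:
    """Storage needed after `years` if demand multiplies by `growth_factor` each year."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return current_tb * growth_factor ** years


# Assumed starting point: 100 TB today, doubling annually.
for year in range(4):
    print(f"year {year}: {projected_capacity_tb(100.0, year):.0f} TB")
```

With yearly doubling, year 3 comes out at eight times the starting capacity (800 TB for the assumed 100 TB), which is the eightfold growth described above. Changing `growth_factor` lets you test less or more aggressive assumptions for your own site.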
There are many HPC storage options, but the “best” one really depends on your application (you hear that quite a bit in HPC). It all depends on how your application does I/O (input/output). Does it do lots of small reads and writes? Does it write or read a great deal of data? Is the data access pattern primarily sequential, primarily random, or a combination of the two? Knowing these characteristics about your application can help you determine which storage solution is the best in terms of performance, price, price/performance, power, etc. However, measuring these characteristics is not easy and is more of an art than a science. Moreover, you may have hundreds of applications running on your HPC system, which makes the problem even more difficult. If you can’t measure these characteristics, what do you do? My recommendation is to focus on what I call the data life cycle.
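If you do manage to capture a list of an application's reads and writes, the questions above reduce to a few summary numbers. The sketch below assumes a hypothetical trace format (operation kind, byte offset, size) and a hypothetical `summarize_io` helper; it does not reflect the output of any specific tracing tool, and a real analysis would also need to look at per-file and per-process behavior.

```python
from typing import NamedTuple, Sequence


class IOOp(NamedTuple):
    """One I/O request in a hypothetical, already-collected trace."""
    kind: str    # "read" or "write"
    offset: int  # byte offset within the file
    size: int    # request size in bytes


def summarize_io(ops: Sequence[IOOp]) -> dict:
    """Return read/write mix, mean request size, and how sequential the trace is."""
    if not ops:
        raise ValueError("trace is empty")
    reads = sum(1 for op in ops if op.kind == "read")
    total_bytes = sum(op.size for op in ops)
    # A request counts as sequential if it starts where the previous one ended.
    sequential = sum(
        1 for prev, cur in zip(ops, ops[1:]) if cur.offset == prev.offset + prev.size
    )
    transitions = len(ops) - 1
    return {
        "ops": len(ops),
        "read_fraction": reads / len(ops),
        "mean_size_bytes": total_bytes / len(ops),
        "sequential_fraction": sequential / transitions if transitions else 1.0,
    }


# Illustrative trace: mostly 4 KiB writes with one small read far away.
trace = [
    IOOp("write", 0, 4096),
    IOOp("write", 4096, 4096),
    IOOp("read", 1_000_000, 512),
    IOOp("write", 8192, 4096),
]
print(summarize_io(trace))
```

A high sequential fraction with large requests points toward streaming bandwidth as the priority, while a low sequential fraction with small requests suggests that operations per second matter more.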
Figure 1 is a basic diagram illustrating the three main aspects of the data life cycle.
The left side of the circle is where the input data for any application is created (called preprocessing). This can involve creating the input data set or getting the lab data ready for processing. This step usually does not require a great deal of performance or a great deal of capacity. However, it does need to be cost-effective and reliable.
Once the preprocessing is done, the next step in the data life cycle is for the application to use the input data to create some output data (i.e., run the analysis). This step in the life cycle is labeled “Application work space.” Some applications will require a fair amount of I/O performance and some will not, but you might be running tens of thousands of copies of an application, which together generate a great deal of I/O. The focus of this step is to allow the application(s) to run as fast as possible so that results are obtained as quickly as possible. Consequently, this type of storage should be fast and relatively inexpensive. Reliability isn’t much of an issue here because the focus is on running the application as fast as possible.
Once the application is done, postprocessing of the data may be needed to interpret the results and perhaps run a different data set. In this step, the data moves back to the “post-processing” step on the left in Figure 1. The storage needs for postprocessing are fairly similar to the preprocessing step, so I combined the steps into a single label.
Finally, once the data has been used and is of no further immediate use, the data can be archived. This is labeled on the right-hand side of Figure 1 as “Archive.” Data archiving requirements are sort of the antithesis of the application work space. Archive storage requires massive capacity, very good reliability (usually achieved through multiple copies), virtually no performance, and low cost.
Looking at Figure 1 and keeping in mind the storage requirements for each step in the data life cycle, let me ask a simple question: Is there a single solution that can effectively address all three storage aspects? I think the answer is obviously no (if a vendor says they can do everything effectively in a single platform, quickly run from them and don’t look back). You will really need to have different solutions that address each of the three sets of requirements and use software tools to combine the solutions into an HPC storage solution. But the good news is that, depending on your application(s), you may be able to satisfy more than one set of requirements with a single solution.
The real key to success is combining these solutions into a single cohesive storage strategy. But don’t get lost in the details of each solution and don’t feel as though you have to put all three together from the beginning. Rather, focus on your applications first and understand their requirements. Based on this analysis, you can determine which of the three solutions to start with and which ones can be either deferred or rolled into an existing solution. You can also develop milestones for when you might need one of the other solutions. For example, many people add an archive capability at a later stage once they have a large pool of relatively unused data.
The amount of data that can be generated by HPC today is simply astounding. Multiple petabytes are commonplace and that amount still doesn’t appear to be enough for some HPC centers. To keep up with this data deluge, you need HPC storage solutions that may have really great performance but are also very likely to need lots of capacity. It’s not possible to create a single storage solution to satisfy all the data needs during the data life cycle, so the best approach is to have solutions that address the needs of your application(s) at each step in the life cycle. To get there, here are my recommendations:
• Understand the I/O needs of your major application(s).
• Using the I/O requirements from your application(s), develop a modular storage plan and develop milestones or metrics that determine when you need to upgrade or add capability.
• When possible, start with simple storage solutions that are manageable and cost-effective. An example of this is to start with an NFS (Network File System) or clustered NFS solution. For many applications, NFS works well for all steps in the data life cycle.
• Add faster solutions if your applications require them. But be sure to measure the impact of faster solutions because they are usually more expensive.
• Don’t forget that data goes cold after a while. There are tools that can help you measure the last time data has been accessed so you can determine if it’s being used. Seriously consider an archive solution for storing cold data that has not been used in a while.
• Measure, measure, and measure the I/O needs of your application, as this is the key to effective HPC storage solutions.
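As one illustration of the cold-data recommendation above, the following minimal sketch walks a directory tree and lists files whose last-access time is older than a threshold. The function name and the 180-day threshold mentioned afterward are assumptions for illustration. Note also that access times are only as trustworthy as the file system's mount options: with settings such as noatime or relatime they may not be updated on every read, so treat the result as a hint rather than proof that data is unused.

```python
import os
import time


def find_cold_files(root, days, now=None):
    """Return (path, size_in_bytes) for files under `root` not accessed in `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds in a day
    cold = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symbolic links
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if info.st_atime < cutoff:
                cold.append((path, info.st_size))
    return cold


def total_bytes(cold_files):
    """Sum the sizes of the files returned by find_cold_files."""
    return sum(size for _path, size in cold_files)
```

For example, `total_bytes(find_cold_files("/path/to/project", days=180))` (with a placeholder path) gives a rough estimate of how much capacity an archive tier could free up.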
This is just a brief introduction to HPC storage. But I hope it has piqued your interest in how some of these recommendations can be accomplished, how storage impacts the research performed on the HPC system, how you can adjust your applications to better utilize HPC storage, and how the storage solution can be architected.
Dr. Jeff Layton, HPC enterprise technologist, Dell | Research Computing, can be reached at firstname.lastname@example.org or by phone at 678-427-5819.