Putting Out the Fire – Addressing the Hot Spots in Scale-Out NAS
By Doug Rainbolt on July 20, 2011
As news of the Las Conchas wildfires approaching the Los Alamos National Laboratory hit the Internet, many people, me included, paused to think of the possible consequences. While there is probably nothing to worry about in this instance, the news made me think about wildfires, specifically “hot spots,” and the parallels to the dangers that exist in storage architectures. Left unchecked, hot spots can devastate a forest. Left unchecked, a hot spot in storage operations can bring an entire storage cluster to its knees. Last week I had the opportunity to meet with a life sciences storage guru in San Francisco. During the conversation the very words “hot spot” came up in the context of scale-out NAS architectures, specifically as they relate to storage ecosystems heavily tied to Isilon® products.
Isilon, now an integral part of EMC®, has done a remarkable job of gaining a foothold in the world of life sciences, specifically in genomic research. There are three reasons for this. First, according to a number of Isilon’s customers, the company has been quick to listen and respond. This was probably one of the advantages of being a small entrepreneurial company hungry for success. It will be interesting to see if EMC can maintain this image given that there are corporate-wide architectures to consider. Second, Isilon’s scale-out architecture, consisting of clustered nodes interconnected by InfiniBand, can create very large aggregate storage capacities served under a simple Global Namespace. In genomic research, the widespread use of advanced yet increasingly inexpensive laboratory instrumentation for digitizing soft samples has made the need for raw storage capacity skyrocket. And as we move further downstream in the research workflow, storage requirements will continue to grow, almost exponentially.
More incoming files captured from instrumentation can lead to a data explosion similar to the concerns about ‘big data’ that exist in almost all industries. Life science researchers prefer working with a Global Namespace, an easy means for cataloging and accessing all directories and files. In many research institutions, researchers have also become storage administrators. Based on the limitations of previously available hardware, the mantra has been to add capacity quickly and make such additions readily available to the Global Namespace.
The third reason for Isilon’s explosion in the life sciences area is the need to support massively scalable sequential I/O, what some people think of as streaming. The files, especially in the early stages of the workflow where HPC is often applied, are very large. Getting these files off the NAS cluster and onto the compute farm with maximum efficiency is critical.
Life sciences storage administrators’ top concerns include being able to grow capacity with minimal disruption while making use of a Global Namespace, and being able to secure extremely high aggregate data transfers. Now here is the catch: They will sacrifice IOPS performance rather than lose either of the above. It’s a compromise that they are quick to make. If administrators don’t have enough capacity to address tomorrow’s research, things could get ugly. And this is where the hot spots come into the picture. There is without a doubt a mix of small directories and files within the life sciences workflow. In addition there is the issue of NFS metadata. Small files and NFS metadata are often very IOPS hungry, meaning that low latency for data access and retrieval is critical. An Isilon cluster is good at many things, but it is not known for generating high IOPS. When the system encounters these “IOPS hot spots” created by the demands of these small directories and files, overall performance can drop. The respected life sciences storage guru I spoke with last week asked me if the Alacritech ANX 1500 NFS Throughput Acceleration Appliance could be inserted into the storage infrastructure and pointed at directories known to contain files that demand IOPS, in order to eliminate these hot spots. The answer was an unequivocal “Yes.” The ANX 1500 can give IOPS back in a way that maintains the integrity of the Isilon environment, yet dramatically improves performance.
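To make the idea of "directories known to contain files that demand IOPS" a little more concrete, here is a minimal Python sketch of how an administrator might rank candidate hot-spot directories from an operation trace. Everything about the trace is an assumption made for illustration: the tuple format (path, NFS operation name, file size in bytes), the 64 KB small-file threshold, and the function name are all hypothetical, and this is not a description of how the ANX 1500 or Isilon identifies hot spots.

```python
from collections import Counter
from pathlib import PurePosixPath

# NFSv3 procedure names that touch metadata rather than file data.
METADATA_OPS = {"GETATTR", "LOOKUP", "ACCESS", "READDIR"}
DATA_OPS = {"READ", "WRITE"}


def rank_hot_directories(ops, small_file_bytes=64 * 1024, top_n=3):
    """Rank directories by their count of IOPS-hungry operations.

    ops: iterable of (path, op_type, file_size_bytes) tuples.
         This trace format is hypothetical, chosen only for the example.
    An operation counts as IOPS-hungry if it is a metadata operation or
    a read/write against a file no larger than small_file_bytes.
    """
    counts = Counter()
    for path, op_type, file_size_bytes in ops:
        is_metadata = op_type in METADATA_OPS
        is_small_io = op_type in DATA_OPS and file_size_bytes <= small_file_bytes
        if is_metadata or is_small_io:
            # Attribute the operation to the directory that holds the file.
            counts[str(PurePosixPath(path).parent)] += 1
    return counts.most_common(top_n)


# Illustrative trace: one large streaming read plus several small operations.
trace = [
    ("/genomes/run1/reads.bam", "READ", 8 * 1024**3),
    ("/genomes/index/a.idx", "GETATTR", 0),
    ("/genomes/index/a.idx", "READ", 4096),
    ("/genomes/index/b.idx", "LOOKUP", 0),
    ("/genomes/run1/reads.bam", "GETATTR", 0),
]
hot = rank_hot_directories(trace)
print(hot)
```

In this made-up trace the large sequential read is ignored, and the directory full of small index files rises to the top of the list, which is the kind of directory an acceleration layer would be pointed at.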
The typical storage vendor’s response to the quest for IOPS is to recommend purchasing and installing more nodes to spread the compute and storage I/O load. This can be very expensive and inefficient. The vendor may also suggest the use of integrated SSDs as a storage tier or a cache for namespace metadata. But this will only get customers so far. Storage architects make design decisions based upon focused optimizations. Base designs are optimized either for capacity and high sequential I/O, or for low to moderate capacities and screaming IOPS performance. Isilon designers elected to do the former. Given the company’s success, it was clearly the right choice at the time. Everything down to the file system design supports this. Putting SSDs in a cluster to assist with IOPS certainly should result in less latency than going to disk, but it still relies upon the native file system and supported I/O hardware to get data off the SSDs, back through DRAM and out to the network. The SSD “tier” certainly helps, but it doesn’t solve the IOPS hot spot problem. The life sciences expert I talked to in San Francisco knew this and could immediately see the merits of the Alacritech ANX 1500. The ANX 1500 includes cache in the form of both DRAM and SSDs but, more importantly, uses data acceleration to get the data on the wire. This contributed to the record-breaking and market-leading low latency that Alacritech reported in its SPECsfs®2008 results.
What lesson can be learned by all of us trying to eliminate hot spots in big data environments? It’s critical to pick the storage architecture that is right for your specific workload. It’s clear that scale-out NAS has earned a position in the world of life sciences. But you should recognize that this choice does present some challenges for IOPS performance. These small hot spots can, if left unattended, bring a system to its knees because they can consume a disproportionate share of system resources, leaving less for other functions. You don’t need to compromise. By quickly and easily adding a performance layer product to support the scale-out NAS, the hot spots can be contained. The ANX 1500 both caches and accelerates this data, ensuring that users get the response they need and that the storage cluster doesn’t take a big hit. Using the ANX 1500 allows the scale-out NAS to continue operating at its peak performance at a much lower cost than simply piling on nodes.