Blog

Distributed Systems Challenges Demand Different Skill-set

Monday, May 14, 2012 posted by Dave Wright

SolidFire's unique approach to scale-out all-SSD storage for cloud environments involves different engineering challenges than those confronted by traditional storage systems. Rather than focusing on ASICs, buses, and RAID firmware, SolidFire is solving difficult distributed systems problems dealing with scale, latency, reliability, and quality of service. The magnitude of this challenge requires us to continually add new talent with experience in this area. We've added more than a dozen great people to the team so far this year. One recent hire I'd like to highlight is our new Vice President of Engineering, Dan Berg.

In addition to his skills as a leader and manager for our engineering team, Dan has a long history of building the type of complex distributed systems that SolidFire delivers. After a 15-year career at Sun Microsystems, which he concluded as VP of Systems Engineering and Distinguished Engineer, Dan served as CTO of Skype in Europe. At Skype, Dan helped grow the engineering team and significantly broaden its product offerings while increasing platform scale and stability. Following his return to Colorado from Europe, Dan most recently ran R&D for Avaya in the US.

While a P2P VoIP platform like Skype may seem very different from a primary storage system, it represents exactly the type of scale-out, fault-tolerant distributed system at the core of true cloud architectures like SolidFire. Cloud computing is changing not just how IT is deployed, but fundamentally how the underlying infrastructure is built.

I'm pleased to welcome Dan Berg as well as all our other recent hires to the team. If you're excited about the work SolidFire is doing to advance the way the world is using the cloud, I'd encourage you to bookmark our Careers Page and check it regularly!

-Dave Wright, Founder & CEO

Big Players Make Big Plays

Thursday, May 10, 2012 posted by Dave Wright

EMC has made a big play with its announced acquisition of XtremIO for a reported $430 million. In acquiring the scale-out all-flash storage system vendor, EMC has made another aggressive bet in an emerging growth market. When a growth opportunity justifies making a bet, EMC is best in class at getting it done. But to assume EMC spent $430 million simply to double down on its investment in flash is shortsighted.

This deal is not just about flash. This deal is about scale. I suspect EMC's early entry into the flash market was an invaluable learning experience for understanding the opportunities and challenges posed by flash. Somewhere along the way they realized that building flash into an architecture is one thing, but building a true scale-out flash system is a whole different challenge. This is not a challenge to be solved with traditional storage controller technologies that were designed in the hard disk era.

Scale imposes an entirely different set of constraints on a system and its underlying media. Delivering consistent performance at scale, delivering efficiency and data reduction at scale, automating management at scale... each of these challenges is hard enough on its own. Solving them with a completely different medium at the base of the design requires an architectural rethink.

The timing is interesting here. In the flash market, acquisitions at this stage of the game come much earlier than the storage industry traditionally likes to place its chips. However, the urgency with which EMC chose to strike is indicative of the market demand for more than just bolt-on solutions backed by go-to-market heft.

If this deal were just about flash, EMC had a number of different options at its disposal, including staying the course with its evolving portfolio of flash solutions while the market matured. However, the transformative nature of flash necessitated a different approach. Realizing these challenges, EMC made a rich bet, but one that will eventually seem small compared to the opportunities created by scale-out flash storage.

-Dave Wright, Founder & CEO

Our Thoughts Off The Recent Solid State Storage Symposium

Tuesday, May 01, 2012 posted by Jay Prassl

Last week our Founder and CEO Dave Wright attended Tech Field Day's Solid State Storage Symposium (SSSS) in San Jose. At the event he joined a number of other companies from across the flash storage ecosystem for a day full of lively discussions on the best use cases, implementation types and future directions for flash technology.

Dave kicked off the day with a presentation on SolidFire's vision for the future of flash storage that really set the tone for the event. My one-line takeaway from his presentation was this: "Sure, flash is fast, but what good is all that performance without control?" In his talk he expanded the argument to include efficiency and scale. The net of all this is that flash is a means to an end, but without complementary innovations across quality of service, efficiency and automation, the end market is never going to be as big as some industry analysts are predicting.

At SolidFire we believe that our technology and approach to the market are fundamentally advancing the way the world uses the cloud. In his SSSS presentation I think you will find that Dave paints a clear and compelling picture of where flash is headed and what companies like SolidFire are doing to bring this vision to life. You can find the full presentation from the event on SlideShare along with the video posted on Vimeo.

I would also encourage you to check out the panel sessions from the event. You will surely find some useful insights across a number of key trends that are shaping the future of solid state.

Hats off to Stephen Foskett and the fantastic moderators that he brought on board for the day. The content and discussion on these panels are much richer than what you would find at a run-of-the-mill trade show.


- Jay Prassl - VP of Marketing

Why OpenStack Matters

Monday, April 09, 2012 posted by Dave Cahill

OpenStack matters because choice matters. In order for markets, and innovation within these markets, to thrive, consumers must have platform choices. Multiple platform options help to accommodate the varying requirements, skill-sets and risk profiles of different customers. In the cloud context, platform options help service providers right-size cost and quality of service to the unique needs of a subset of customers. Competition between multiple platforms forces all the players to be better. (In this context, Citrix's recent release of CloudStack to the Apache Software Foundation might turn out to be one of the best things to ever happen to OpenStack.)

Despite the fragmentation that competition creates early on, market forces will whittle down the number of platform choices over time. Technology history has taught us that platform markets can sustain only a few dominant players. Oftentimes this includes one proprietary and one open source alternative. The operating system wars that started 20+ years ago are the most frequently cited evidence of this dynamic. The fragmented and proprietary Unix variants eventually lost out to Linux and Windows as the open source and proprietary standards respectively. Server virtualization has seen a similar trajectory with VMware and Xen leading in a race that is still underway. Most recently iOS and Android have created a competitive and rapidly evolving mobile operating system market.

Fast forward to today and history is repeating itself in the cloud "operating system" market. VMware's proprietary stack has become the clear commercial leader. Meanwhile, there is an emerging group of open source platforms vying to become the "Linux" of the cloud data center. Only time will tell how this plays out, but OpenStack has as good a shot as any to become this de facto standard. With the stakes so clear, the question isn't why invest in OpenStack, but rather why wouldn't you?

Despite the magnitude of the opportunity, let's not lose sight of the fact that it is still early days. July of this year marks only the two-year anniversary of the OpenStack effort. In just six short months since the last release, OpenStack has made some big strides. Of course, challenges still persist, but there are more than 150 companies and 2500+ developers working on the problem.

Coinciding with the Essex code release last week, the OpenStack Conference & Design Summit will be held April 16-21 in California. At SolidFire, we have been working hard since the last summit and are proud of our achievements over this period. We will be very active participants throughout the week of the conference. If you are attending, make sure to stop by our booth or come see our panel, "OpenStack & Block Storage...Where to from here?" on Thursday at 1 p.m. PST. We will also be hosting a party with CloudScaling and RightScale on Monday night. Building off the Mirantis reception earlier in the evening, make sure to come hang out with three of the most innovative companies in the cloud ecosystem at 111 Minna Gallery in downtown San Francisco. Details and registration for the party are posted here.

-Dave Cahill, Director of Strategic Alliances

Bringing SSDs to the Cloud (at scale)

Wednesday, March 21, 2012 posted by Dave Wright

At the Cloud Connect Performance Summit back in February I presented the topic "Increasing Storage Performance in a Multi-Tenant Cloud". The way the schedule fell out, I took the stage after Adrian Cockcroft from Netflix. Coincidentally, I borrowed a few quotes from Adrian's prior blogging on the subject to help bring to life the biggest roadblocks to achieving great storage performance in a multi-tenant cloud. In my discussion I called out three key problem areas: the capacity vs IOPS imbalance, handling multi-tenancy, and performance consistency. My discussion centered around the limitations of legacy solutions and how flash storage, if leveraged correctly, can help remedy current cloud performance woes.

Many thanks to Adrian, who continues to be a great straight man for the biggest challenges we are tackling here at SolidFire. In a recent Q&A with ZDNet UK's Jack Clark, Adrian shared some perspectives that we commonly hear from cloud service providers and their customers:

  • "The thing I've been publicly asking for has been better IO in the cloud. Obviously I want SSDs in there. We've been asking cloud vendors to do that for a while."
  • "The instances available from AWS have similar CPU, memory and network capacity to instances available for private datacentre use, but are currently much more limited for disk I/O."
  • "The hard thing to do in the cloud is to do high-performance IO [input-output], but that is starting to change as third-party vendors are figuring out ways of connecting high-performance IO externally, and we've worked around it with our [Cassandra] data store architecture."

Probably the most interesting answer was in response to a question around why it took Amazon so long to roll out an SSD-based offering (referring to DynamoDB). Cockcroft remarked:

"It's purely scale for them. For Amazon to do something they have to do it on a scale that's really mind-boggling. If you think about deploying an infrastructure service with a new type of hardware - if they got it wrong, they can't turn it back out and do it again differently. So they have to over-engineer what they do."

The key point here is that performance (through SSDs) was only part of the problem Amazon had to address. In fact, the bigger challenge for them to overcome was scale. Scale is what differentiates true clouds from small virtualized environments. Everything has to be designed to scale, which imposes a very different set of design considerations and constraints on an architecture. SSD or not, you can't escape this reality. At SolidFire scale is what we do best. There are many options for high-performance storage these days, but only SolidFire is designed for cloud scale. In doing so we are enabling service providers to focus on offering a differentiated portfolio of high performance cloud services and advancing the way we all use the cloud.

-Dave Wright, Founder & CEO

Sorting through the noise (and the bottlenecks)

Tuesday, February 28, 2012 posted by Dave Cahill

The current flash-based storage landscape is filled with many vendors proposing to address different niches of the market with their respective solutions. With flash as the common ground, some of the more easily identifiable differentiators are in areas like host interface, form factor, media support and data protection schemes. The design choices for these specifications are heavily influenced by each vendor's target workload and/or customer set. Of course, there are strengths and weaknesses to every approach. There are bottlenecks to be minimized or altogether avoided if possible. If all goes according to plan, a vendor's target market will play to more of its strengths than weaknesses.

At SolidFire we have taken direct aim at solving the challenges encountered in delivering high performance storage for large-scale multi-tenant cloud environments. For this customer set the objective is not about delivering massive amounts of performance to a single application at any cost. Instead, these providers are focused on cost-effectively delivering consistent performance to thousands of applications at the same time. This use case has shaped many of our early design choices at SolidFire. We believe the most efficient way to achieve the right price/performance balance at scale is through a shared storage architecture.

In the case of shared storage, regardless of how fast the storage system can deliver I/O, there will always be the issue of network latency. Fusion-io has eliminated the network latency issue altogether with its server-resident PCI-based designs. This design works well for DAS topologies serving massive IOPS to extremely performance-hungry applications. However, for the service provider use case referenced above, the price/performance and availability story of server-resident flash misses the mark.

So if network latency is unavoidable, what is the best approach? How do you optimize the storage stack to maximize IOPS and minimize latency to deliver consistent performance to thousands of applications? Sparing you a buzzword-infused tongue twister that distills our approach into as few words as possible (think "RAID-less All-SSD Scale-Out Storage System"), we have instead outlined some of the key enabling features of our design in a more digestible format below:

  • An All-SSD system is the only way to confidently deliver predictable performance across a large number of tenants and applications in a large-scale cloud infrastructure. A tiered approach may suffice in a controlled setting with a few applications. However, the resource intensity and performance variability encountered in larger QoS-sensitive environments make tiering an unsustainable option.
  • Scale-out can mean lots of different things. For SolidFire this means no monolithic storage controllers. It also means a fully distributed design with IO and capacity load evenly balanced across every node in the cluster. At the media layer, data still has to traverse the SAS bus, but ten drives per node are working in tandem to deliver more than enough aggregate performance. Thinking through alternative design choices here, it is important not to lose sight of the fact that any latency encountered at this layer of the stack is an order of magnitude less than what is encountered at the network layer.
  • RAID-less means exactly what you think: no RAID. More than any controller bottleneck, RAID is the biggest performance drag in the storage stack. By rethinking the data protection algorithm you cure a lot of what ails storage system performance today. At SolidFire we have done just that, implementing a replication-based redundancy algorithm where data is distributed throughout the cluster. The result is a significant improvement in write performance and a drastic acceleration of rebuilds after a failure, without performance impact.
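
To make the replication idea in the last bullet a little more concrete, here is a minimal sketch in Python. It is not SolidFire's actual algorithm: the function names, the two-copy default and the hash-based placement (a form of rendezvous hashing) are all assumptions chosen for illustration. The point it tries to show is that when replicas are spread across the whole cluster, every surviving node supplies a share of the data needed to rebuild after a failure.

```python
import hashlib
from collections import Counter

def replica_nodes(block_id, nodes, copies=2):
    """Pick `copies` distinct nodes for a block by ranking nodes on a per-block hash.

    Hypothetical placement rule (rendezvous hashing), used only for illustration.
    """
    ranked = sorted(
        nodes,
        key=lambda n: hashlib.sha256(f"{block_id}:{n}".encode()).hexdigest(),
    )
    return ranked[:copies]

def rebuild_sources(failed, blocks, nodes):
    """Count how many blocks each surviving node must supply after `failed` is lost."""
    sources = Counter()
    for block in blocks:
        placement = replica_nodes(block, nodes)
        if failed in placement:
            survivor = next(n for n in placement if n != failed)
            sources[survivor] += 1
    return sources

# Hypothetical five-node cluster holding 10,000 blocks.
nodes = [f"node{i}" for i in range(5)]
blocks = [f"block{i}" for i in range(10_000)]
print(rebuild_sources("node0", blocks, nodes))
```

In this sketch the rebuild reads are shared by all four surviving nodes rather than funneled through a single RAID group, which is one plausible reason a distributed replication scheme can rebuild quickly.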

Sure, our storage system does a heck of a lot more than these three things. You can read all about the software innovations embedded in our Element OS on our site. But the three concepts we highlight above are critically important design choices that we made early on. They are foundational components of our architecture that make the rest of the story possible. They are also three fairly tangible concepts to help you differentiate one vendor from the next in the flash-based storage market. Good luck, it's noisy out there!

-Dave Cahill, Director of Strategic Alliances

Extending The Storage Disruption Cycle

Thursday, January 26, 2012 posted by Dave Cahill

"There comes a time when a storage company needs to define itself by what it does for customers and not by the machinery it uses to do so."

Chris Mellor, "How to tell if your biz will do a Kodak", The Register

The Register's Chris Mellor penned a great article the other day reflecting on the continuous cycles of innovation and disruption that have come to characterize the storage media industry. He uses Kodak to paint the picture of an incumbent getting capsized by a media transition. He goes on to cite other examples across tape and optical media where incumbents failed to manage the transition to the next generation media.

As the storage industry has transitioned through different media types, there have always been opportunistic stopgap innovations that have bridged the gap from one generation to the next. Virtual Tape Library (VTL) technology is a great example of an innovation serving as a transitional bridge between the tape and disk eras. Once applications were written with the capability to natively interface with disk, deduplication and compression drove down solution costs quickly, making disk an effective bulk storage medium. Once disk was financially viable, the floodgates opened and tape was relegated to deep archive. Today we are seeing flash-based caching and tiering technologies form a similar transitional bridge while the $/GB economics of flash converge with, and eventually eclipse, those of disk.

So with history as a guide for how this plays out, why will the disk to flash media transition be any different than the ones before it? Well, I suspect this cloud thing might have something to do with it.

In the enterprise IT sector, systems always seem to consume features over time. At its core, the cloud is a massive infrastructure system that, when used properly, is an extension of existing IT. However, cloud infrastructures will increasingly chip away at the incumbent IT footprint by rapidly incorporating new innovations into their architectures. These enabling innovations allow cloud providers to continually expand their portfolio of cloud services. Over time the IT use cases applicable to this medium naturally expand as applications and interfaces catch up, performance improves and the economic value proposition can no longer be ignored.

So what does this mean? From our perspective, the cloud adds a third leg to the innovation sequence we have witnessed in the past. New component level technologies will continue to enable new architectures. But where it gets interesting is when these new architectures drive the performance and economics to enable new cloud services.

In storage, the media innovations that Mellor refers to, and their related price/performance value proposition, are a powerful enabling force behind new storage architectures. Applied to traditional IT cost centers, these architectures are interesting; applied to profit-driven cloud services, they are game changing. Amazon's recently announced DynamoDB service is an early instantiation of this extended innovation sequence, where component-level technologies (SSDs) enable new architectures that drive new services. Fortunately for the end-customers, the economics of flash are only getting better from here. Now it is up to the storage industry to innovate on top of this medium, delivering next generation systems that can extend the reach of cloud hosted services to an even wider range of application workloads.

-Dave Cahill, Director of Strategic Alliances

Inefficiency & Unpredictability...A Service Provider's Worst Enemy

Tuesday, January 24, 2012 posted by Dave Wright

In our first two posts on storage tiering we talked through the difference between capacity-centric vs. performance-centric approaches and also exposed some of the hidden costs of an automated tiering implementation. Closing out this mini-series I wanted to touch on a few other deficiencies inherent to an automated tiering solution.

Within a storage infrastructure it is IOPS, not capacity, that are the most expensive and limited resource. In a tiered architecture, SSDs are inserted into the equation to try to improve the balance between IOPS and capacity. However, while an SSD tier may reduce performance issues for well-placed data, the usage of this expensive tier remains inefficient. This inefficiency stems from a lack of granularity in the data movement of a tiered system. If a sub-LUN tiering system has to move hot data in chunks of anywhere from 32MB to 1GB, it will likely promote a lot of cold data in the process. This overhead forces sub-optimal utilization of the premium SSD capacity.
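
A quick back-of-the-envelope calculation shows how coarse chunks dilute the SSD tier. The Python sketch below assumes a purely hypothetical workload (100 hot 4 KiB blocks spaced 64 MiB apart on one LUN) and a tiering engine that promotes every chunk containing at least one hot block; the function name and the workload are illustrative, not taken from any vendor's implementation.

```python
KiB, MiB, GiB = 1024, 1024**2, 1024**3

def promoted_bytes(hot_offsets, chunk_size):
    """Bytes moved to SSD when every chunk containing a hot block is promoted whole."""
    chunks = {offset // chunk_size for offset in hot_offsets}
    return len(chunks) * chunk_size

# Hypothetical workload: 100 hot 4 KiB blocks, one every 64 MiB.
hot_offsets = [i * 64 * MiB for i in range(100)]
hot_bytes = len(hot_offsets) * 4 * KiB

for chunk_size in (32 * MiB, 1 * GiB):
    moved = promoted_bytes(hot_offsets, chunk_size)
    print(
        f"chunk size {chunk_size // MiB} MiB: promoted {moved // MiB} MiB, "
        f"of which {hot_bytes / moved:.4%} is actually hot"
    )
```

Under these assumptions only a tiny fraction of the promoted SSD capacity holds hot data at either chunk size. Real workloads will differ, but the mechanism is the same: the coarser the chunk, the more cold data rides along.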

Another potential problem area from tiering, specifically in a multi-tenant environment, is dealing with IO density - that is, how IO is distributed across a range of disk space. Applications whose IOs are concentrated within close proximity to each other (IO dense) will gain greater benefit from sub-LUN tiering than those whose IOs are spread more evenly over the entire logical block address space (IO sparse). Because tiering mechanisms measure data usage at the chunk level, an application that has more hits within a small number of chunks is more likely to be promoted than an application that spreads the same number of IOPS across more chunks. From an array performance perspective this approach is reasonable, as you get more performance within the same resource footprint. However, in a multi-tenant setting with data distributed across many distinct applications, this leads to serious problems with fairness and performance consistency across workloads.
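
The fairness problem can be illustrated with a small simulation. The Python sketch below assumes a simplified tiering engine that counts hits per chunk and promotes the hottest chunks; the tenant names, the 32 MiB chunk size, and the two-slot SSD tier are hypothetical. Both tenants issue exactly the same number of IOs, yet only the IO-dense tenant gets promoted.

```python
from collections import Counter

CHUNK = 32 * 1024**2  # hypothetical 32 MiB tiering chunk

def promote_hottest(io_log, chunk_size, ssd_slots):
    """Rank (tenant, chunk index) pairs by hit count and return the hottest ones."""
    heat = Counter((tenant, offset // chunk_size) for tenant, offset in io_log)
    return [chunk for chunk, _hits in heat.most_common(ssd_slots)]

io_log = []
for i in range(1000):
    # "dense" tenant: 1000 IOs concentrated in just two chunks.
    io_log.append(("dense", (i % 2) * CHUNK + (i * 4096) % CHUNK))
    # "sparse" tenant: 1000 IOs spread evenly over 200 chunks.
    io_log.append(("sparse", (i % 200) * CHUNK))

promoted = promote_hottest(io_log, CHUNK, ssd_slots=2)
print(promoted)
```

Every SSD slot goes to the dense tenant because each of its chunks sees 500 hits, while each of the sparse tenant's chunks sees only 5, even though the two tenants generate identical load.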

We originally discussed the performance implications of tiering in July of last year. In a multi-tenant setting this performance variability exposure is magnified. Customers are continually exposed to the risk that the promotion of another customer's hot data will result in the demotion of their own. The order of magnitude difference in latencies and IOPS between the different tiers makes it practically impossible for a service provider to guarantee performance to an individual application (or tenant) under these conditions.

In recognition of the deficiencies of a tiered architecture, SolidFire sought a better way. Our Performance Virtualization technology decouples storage performance from capacity, resulting in a far more precise allocation of IOPS and capacity on a volume-by-volume basis regardless of issues such as IO density. Instead of best-guess efforts as to the size and tiers of media required to meet customer performance requirements, a service provider can now dial in IOPS and capacity individually at the volume level from cluster-wide independent pools of capacity and performance. These allocations can also be dynamically adjusted over time as application requirements change. All things considered, Performance Virtualization is a far more efficient way to address IOPS scarcity, without exposing customers to the inefficiency and unpredictable performance inherent in an automated tiering architecture.
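
As a rough mental model of provisioning capacity and performance from independent pools, consider the Python sketch below. It is only an illustration: the `Cluster` and `Volume` classes, their fields and the `set_volume` method are hypothetical names and do not represent the Element OS API. The sketch shows that a small volume can be given many IOPS, a large volume can be given few, and an allocation can be changed later without touching the other dimension.

```python
from dataclasses import dataclass, field

@dataclass
class Volume:
    size_gb: int
    iops: int

@dataclass
class Cluster:
    capacity_gb: int  # cluster-wide capacity pool
    iops: int         # cluster-wide performance pool
    volumes: dict = field(default_factory=dict)

    def free(self, skip=None):
        """Return (free GB, free IOPS), optionally ignoring one volume."""
        used_gb = sum(v.size_gb for n, v in self.volumes.items() if n != skip)
        used_iops = sum(v.iops for n, v in self.volumes.items() if n != skip)
        return self.capacity_gb - used_gb, self.iops - used_iops

    def set_volume(self, name, size_gb, iops):
        """Create or resize a volume; capacity and IOPS are checked independently."""
        free_gb, free_iops = self.free(skip=name)
        if size_gb > free_gb or iops > free_iops:
            raise ValueError("request exceeds free cluster capacity or IOPS")
        self.volumes[name] = Volume(size_gb, iops)

# Hypothetical cluster and tenants.
cluster = Cluster(capacity_gb=10_000, iops=200_000)
cluster.set_volume("archive", size_gb=2_000, iops=500)  # big but slow
cluster.set_volume("db", size_gb=100, iops=20_000)      # small but fast
cluster.set_volume("db", size_gb=100, iops=40_000)      # later: more IOPS, same size
print(cluster.free())
```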

-Dave Wright, Founder & CEO

Amazon launches DynamoDB...We like what we see!

Wednesday, January 18, 2012 posted by Dave Wright

Amazon launched a new service today: DynamoDB. It's a scalable NoSQL database service that will run in the AWS cloud. It is akin to a hosted version of Cassandra or MongoDB with unlimited scalability. The most notable section of Werner Vogels' blog announcing the new service is worth repeating:

Cloud-based systems have invented solutions to ensure fairness and present their customers with uniform performance, so that no burst load from any customer should adversely impact others. This is a great approach and makes for many happy customers, but often does not give a single customer the ability to ask for higher throughput if they need it.

As satisfied as engineers can be with the simplicity of cloud-based solutions, they would love to specify the request throughput they need and let the system reconfigure itself to meet their requirements. Without this ability, engineers often have to carefully manage caching systems to ensure they can achieve low-latency and predictable performance as their workloads scale. This introduces complexity that takes away some of the simplicity of using cloud-based solutions.

The number of applications that need this type of performance predictability is increasing: online gaming, social graphs applications, online advertising, and real-time analytics to name a few. AWS customers are building increasingly sophisticated applications that could benefit from a database that can give them fast, predictable performance that exactly matches their needs.

Looking under the covers a bit further here there are two really interesting enabling components of the DynamoDB service that deserve highlighting:

  1. All-SSD - The service is deployed using 100% SSDs to provide consistent high performance at a very large scale. This is notable in that it is AWS' first use of SSDs in their cloud architecture.
  2. Guaranteed Throughput - The DynamoDB service includes a concept called "Provisioned Throughput". This is essentially a guaranteed QoS model, where a customer can purchase reserved capacity (measured in queries per second), rather than paying for the actual queries run. Applied to a storage service, this would be akin to paying based on guaranteed IOPS. Amazon EBS's current pricing model is based on actual IO operations with no guaranteed throughput or latency.

Amazon DynamoDB is a strong endorsement of several of SolidFire's key principles. The first is that the cloud needs Solid-State Drives (SSDs) to adequately support the evolving performance demands of multi-tenant storage. The second is the idea that as more of these performance-sensitive applications make their way to the cloud, there is a clear requirement for guaranteed QoS controls that can dynamically support performance requirements at a much more granular level. Finally, and building off the first two, is the validation that when armed with the enabling architecture to confidently and economically deliver performance-based services, service providers can stand up cloud service offerings based on committed performance.

Amazon is a great indicator of the pulse and direction of the industry. The broader implications here for running performance-sensitive applications in a cloud environment are intriguing to think about. Here at SolidFire, the continued innovations around the enabling architectures required to make this a reality are what get us really excited.

-Dave Wright, Founder & CEO

The Diseconomies of Tiering

Tuesday, January 17, 2012 posted by Dave Wright

In the initial post of our series on tiering we covered the merits of a proactive performance-driven approach to tiering relative to the more traditional capacity-centric discussions. Today we take a closer look at some of the less obvious cost implications of "automated" tiering. On the surface, the promise of tiering looks like a clear win - SSD performance with spinning disk capacity and cost. However, the true economics of this type of solution are not nearly as compelling as some vendors would lead you to believe. Considered in the context of the unique burdens faced by cloud service providers, the proposed value proposition is even less appealing.

To start with, the "SSD performance" promise part of the catchy tagline above must be caveatted by the fact that this only proves to be the case if the data is actually residing in the SSD tier. Easier said than done. The ability to guarantee SSD performance in a tiered architecture requires a substantial SSD tier and/or extremely accurate data placement algorithms. Rightsizing the former skews the proposed economics of a tiered solution substantially, while the latter has been long on promise but short on delivery for at least three generations of marketing executives. Before the industry marketed this functionality as Automated Tiering it was known as Information Lifecycle Management (ILM) and a few years before that it was Hierarchical Storage Management (HSM). Regardless of what you call it, tiering has always been impaired by the inability to accurately predict and automate the movement of data between tiers. In the context of cloud environments the significant scale requirements and extremely low application-level visibility make solving this challenge even more difficult.

It's also important to consider the flash media requirements of a tiered solution. The write patterns in the flash layer of a tiered architecture require a higher grade flash solution to withstand the impact of write amplification and churn. Vendors are forced to use the most expensive SLC flash to ensure adequate media endurance. The cost of even modest amounts of SLC flash destroys the economic advantage of a tiered architecture relative to an all-MLC design. In many examples we've seen, the "combined" $/GB of a storage solution that incorporates SLC flash, 15k SAS and SATA is actually higher than that of an all-flash MLC solution with similar raw capacity. Importantly, this price advantage for MLC over tiered storage is achieved before factoring in the favorable impact of compression and deduplication for the all-flash solution, making the flash design even more compelling.
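
For readers who want to run this comparison against their own quotes, the arithmetic is simple enough to sketch in Python. The function names are made up for illustration, and every capacity, price and data reduction ratio below is a placeholder rather than real market data, so the printed numbers say nothing about which design is cheaper in practice. The sketch only shows how the "combined" $/GB of a tiered system and the effective $/GB of an all-flash system after data reduction are calculated.

```python
def blended_cost_per_gb(tiers):
    """Capacity-weighted $/GB for a list of (raw_capacity_gb, price_per_gb) tiers."""
    total_gb = sum(gb for gb, _price in tiers)
    total_cost = sum(gb * price for gb, price in tiers)
    return total_cost / total_gb

def effective_cost_per_gb(raw_cost_per_gb, reduction_ratio):
    """$/GB of stored data after compression and deduplication (ratio of 2.0 means 2:1)."""
    return raw_cost_per_gb / reduction_ratio

# Placeholder inputs only; substitute real vendor quotes before drawing conclusions.
tiered = blended_cost_per_gb([
    (50, 20.0),   # SLC flash tier: 50 GB at $20/GB (placeholder)
    (250, 4.0),   # 15k SAS tier: 250 GB at $4/GB (placeholder)
    (700, 1.0),   # SATA tier: 700 GB at $1/GB (placeholder)
])
all_mlc_raw = 3.0  # placeholder raw $/GB for an all-MLC system
all_mlc_effective = effective_cost_per_gb(all_mlc_raw, reduction_ratio=2.0)

print(f"tiered blended $/GB: {tiered:.2f}")
print(f"all-MLC effective $/GB: {all_mlc_effective:.2f}")
```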

Tiering also hurts capacity utilization and controller performance. In order to ensure data is in the right place at the right time, it is constantly being promoted and demoted between the flash and disk tiers. There needs to be a certain capacity buffer to accommodate this movement. There is also a controller processing cost to keep up with all this activity. Most legacy systems have limited CPU and controller memory relative to their overall capacity, making the overhead of tiered storage processing one more burden for them to manage. Even complex tiering requires only a fraction of the processing power and memory needed for in-line data reduction features like compression and deduplication, which is why those features are seldom found on legacy primary storage controllers. A recent article from TechWorld references a Forrester Research report by Andrew Reichman (@ReichmanIT) that expands on the data management burden of a tiered storage topology.

The issues outlined above are just a few examples of the hidden costs embedded in an "automated" tiering solution. In some cases these deficiencies may be acceptable in smaller IT environments. However, in a large scale multi-tenant cloud infrastructure the capital and management costs of these shortcomings are magnified. The hyper-competitive nature of the service provider business model necessitates a more efficient approach.

-Dave Wright, Founder & CEO