The term “Cloud” refers to a software development and delivery methodology that consists of decomposing applications into multiple services (a.k.a. “micro-services”) such that each service can be made resilient and scaled horizontally by running multiple instances of it. “Cloud” also implies a set of methodologies for application delivery (how to host the application) and application management (how to ensure that each component is behaving appropriately). The name “Cloud” seems to capture only the delivery piece of the equation, which is the proverbial tip of the iceberg.
An example helps break down the jargon into something a bit more concrete. Consider a simple web application that lets a user add entries to a database table (e.g. “customers”) and perform some simple queries over this table. This is what is known as a CRUD application, from the initials of Create, Read, Update, Delete.
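As a minimal sketch of the four CRUD operations on a “customers” table, the following uses Python's built-in `sqlite3` module with an in-memory database. The column names (`name`, `city`) and the function names are assumptions made for illustration; the web forms and validation are left out.

```python
import sqlite3


def open_db():
    # In-memory database, so the sketch leaves nothing behind on disk.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
    )
    return conn


def create_customer(conn, name, city):
    # Create: insert a row and return its generated id.
    cur = conn.execute(
        "INSERT INTO customers (name, city) VALUES (?, ?)", (name, city)
    )
    conn.commit()
    return cur.lastrowid


def read_customers(conn, city=None):
    # Read: list all customers, optionally filtered by city.
    if city is None:
        rows = conn.execute("SELECT id, name, city FROM customers ORDER BY id")
    else:
        rows = conn.execute(
            "SELECT id, name, city FROM customers WHERE city = ? ORDER BY id",
            (city,),
        )
    return rows.fetchall()


def update_customer(conn, customer_id, city):
    # Update: change the city of one customer.
    conn.execute(
        "UPDATE customers SET city = ? WHERE id = ?", (city, customer_id)
    )
    conn.commit()


def delete_customer(conn, customer_id):
    # Delete: remove one customer by id.
    conn.execute("DELETE FROM customers WHERE id = ?", (customer_id,))
    conn.commit()
```

A web framework would wrap each of these functions in a form or page; the database logic itself stays this small.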
The “classic” version of this application would be a VisualBasic/Access (in the pre-Web days), .NET/SQLServer or Ruby On Rails/MySQL application. The software component is responsible for generating a set of forms or web pages in which the user enters data, running some validation and accessing the database. In most application development frameworks (e.g. RoR), this example can be made to work in hours or a few days.
One minor issue with our example above is that, typically, not every user in an organization has the same level of access. Even when that is the case, there is often a need to audit who does what. Thus our application also needs a “user” table and some sort of access control. Not a big deal: a couple more forms and a new database table of users are created.
That works until someone else creates another CRUD application (e.g. to manage inventory) that also needs users and access control rules. Clearly, components that are common to multiple applications should be shared. Let's assume that our developers built a simple web API that can use an LDAP backend for authentication and manages a set of access control list rules. Both our CRUD applications can use this authentication service to go from username/password to a cookie, and then query the authorization information to go from that cookie to the access permissions within the application.
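The two steps of the shared authentication service (credentials to cookie, then cookie to permissions) can be sketched as follows. This is an illustrative sketch only: the class and method names are hypothetical, an in-memory dictionary stands in for the LDAP backend, passwords are kept in plain text for brevity, and the web layer is omitted.

```python
import hmac
import secrets


class AuthService:
    """Hypothetical shared service used by both CRUD applications."""

    def __init__(self, users, acl):
        # users: username -> password (stand-in for an LDAP lookup)
        # acl:   username -> {application name -> set of permissions}
        self._users = users
        self._acl = acl
        self._sessions = {}  # cookie -> username

    def login(self, username, password):
        # Step 1: username/password -> cookie (None if rejected).
        expected = self._users.get(username)
        if expected is None:
            return None
        # Constant-time comparison of the two passwords.
        if not hmac.compare_digest(expected.encode(), password.encode()):
            return None
        cookie = secrets.token_hex(16)
        self._sessions[cookie] = username
        return cookie

    def permissions(self, cookie, application):
        # Step 2: cookie -> permissions within one application.
        username = self._sessions.get(cookie)
        if username is None:
            return frozenset()
        return frozenset(self._acl.get(username, {}).get(application, ()))
```

The “customers” application would call `login` once per session and `permissions(cookie, "customers")` before each operation; the “inventory” application would do the same with its own application name.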
By now we have a reasonable description of what a simple “classic” application looks like from a development standpoint. In our example, each of our CRUD applications and the authentication service consists of a single VM built manually by the development team. These VMs are then handed off to the system administration group, which configures monitoring and network access.
The above roughly describes the state of the art in many enterprises, except that the number and complexity of the applications are significantly larger. Also, “customer” and “inventory” applications are typically not developed in house; they are often components of CRM software suites built by third parties. They only serve as examples in our story.
The key issues with our “classic” application are scale, reliability and agility.
Of these three, scale is often not the major concern unless the application is being delivered to a large audience. That is the case in both consumer and SaaS markets but less so in the enterprise. We can think of scale in terms of the number of concurrent sessions that the application needs to serve. In many cases this number is low and does not warrant a significant investment.
Reliability comes from two different vectors: the correctness of the software itself (which we can think of as a function of the test coverage) and the availability of the infrastructure.
The “classical” approach to infrastructure availability has been to try to make the infrastructure as resilient to failure as possible. A significant factor behind this approach is the handoff point between software and infrastructure management. If those responsible for running the application (infrastructure teams) are not aware of the design or availability requirements of the application, they can only assume the worst-case scenario.
Completely masking power, network, server and disk failures in the infrastructure, without understanding the application semantics, is so prohibitively expensive as to be practically impossible. Still, infrastructure teams are typically measured in terms of uptime. They attempt to mask single disk failures and single server failures with virtual machine restart, which does have an impact on the application. Network card or switch failures can also be masked with server link redundancy. It is common to have a goal of 99.999% availability.
That raises the question of what happens the other 0.001% of the time, which corresponds to roughly 5 minutes per year per server (a 99.9% target would leave almost 9 hours). The problem with statistical averages is that as the number of application servers increases, so does the number of failures. Assuming 1000 application servers and evenly distributed, non-overlapping failures, some server is down for roughly 88 hours per year at 99.999%; at 99.9%, one can expect about one server to be down at any particular point in time. This is despite the significant resource and performance cost of infrastructure-based availability.
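The arithmetic behind availability targets is easy to check. The sketch below assumes a year of 365 days and independent servers that each meet the same availability figure; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability):
    # Hours per year during which a single server is unavailable.
    return (1.0 - availability) * HOURS_PER_YEAR


def expected_servers_down(availability, servers):
    # Average number of servers that are down at any given moment.
    return (1.0 - availability) * servers


for target in (0.999, 0.99999):
    hours = downtime_hours_per_year(target)
    down = expected_servers_down(target, 1000)
    print(f"{target:.3%}: {hours * 60:.1f} min/year per server, "
          f"{down:.2f} of 1000 servers down on average")
```

Under these assumptions, 99.9% leaves about 8.76 hours of downtime per server per year, which with 1000 servers averages out to one server down at any moment, while 99.999% leaves about 5.3 minutes per server per year.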
It turns out that masking failures also makes the impact of a failure worse from a software reliability perspective. Events that happen less frequently may not be tested, which may then lead to catastrophic failures such as an application losing transactions or leaving data in an invalid state.
The “cloud” approach to availability is to expose a (micro)service directly to infrastructure failures but hide them at the (micro)service level. This means that the authentication service in our example above would be responsible for serving its APIs successfully regardless of data-center, power, server or disk failures. It goes further: it stipulates that one should purposely trigger failures on the production infrastructure in order to verify that the services are still operational.
Google, for instance, simulates large-scale disasters yearly in order to ensure that its services remain operational in the event of an earthquake or other major natural disaster that could affect its infrastructure. Netflix created a software tool called “Chaos Monkey” whose job is to randomly kill production servers as well as produce other types of havoc.
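The idea of hiding infrastructure failures at the service level, and of verifying this by triggering failures on purpose, can be illustrated with a toy simulation. This is a sketch under stated assumptions and not how Netflix's tool works: the replicas are plain in-process objects, and all class and function names are hypothetical.

```python
import random


class Replica:
    """One instance of a service; it can be killed at any time."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class ReplicatedService:
    """Hides individual replica failures from the caller."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def call(self, request):
        # Try each replica in turn; the caller only sees an error
        # when every replica has failed.
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("all replicas are down")


def kill_random_replica(service, rng):
    # Deliberately injected failure: take down one live replica.
    alive = [r for r in service.replicas if r.alive]
    if not alive:
        return None
    victim = rng.choice(alive)
    victim.alive = False
    return victim
```

A test would kill a replica and then assert that `service.call(...)` still succeeds, which is the property the production exercise is meant to verify.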
This is not as crazy as it seems: users care about the total availability of the system, of which software reliability is the most important component. Application software is more complex in terms of functionality than the infrastructure it runs upon and thus more prone to failure.
The financial crisis of 2008 highlighted the “black swan” effect: the consequences of events with very low probability but catastrophic effects, which tend to disappear in statistical risk models such as 99.999% availability. The “cloud” philosophy is to increase the probability of failure in order to avoid “black swans”.
One reasonable criticism of this approach is that it creates more complex software architectures with additional failure modes.
Perhaps instead of discussing “cloud” one should focus on modern software engineering practices. These have the goal of taking software from an ad hoc artisan mindset and transforming it into a first-class engineering discipline. Modern software engineering has the dual goals of maximizing agility and reliability; its cornerstone is testing. And testing requires a very different hand-off and service model between developers and infrastructure.
Modern software engineering practices typically imply release cycles ranging from 2 weeks to 3 months. The intent is to release to production incremental sets of changes that are tested and from which one can gather real-world feedback.
Software is expected to be tested in:
- unit test
- integration test
- system test
- Q/A and staging
- A/B testing
While unit and integration tests happen on developer workstations (or in a cloud application that pre-verifies all proposed commits), system test, staging, A/B testing and troubleshooting require the ability to create production-like application environments that exactly mirror the production configuration. Testing against triggered infrastructure failures is typically a requirement of both system and Q/A testing.
A software release cycle implies a carousel of application execution environments that run simultaneously. If release X is the stable release in production, there may be release X+1 soaking in production for A/B testing, potentially multiple environments running release X+2 for Q/A with simulated traffic, and system test environments on the pre-release version X+3. Development may also need to go back and create a system test environment for an arbitrary version earlier than X in order to do troubleshooting.
This development methodology requires that all interactions with the infrastructure are based on version-controlled deployment templates (e.g. CloudFormation, Heat, etc.) that exercise an API. Trouble tickets or GUIs are not desirable ways to interact with the infrastructure because they do not provide a repeatable and version-controlled method to describe the resources that are in play.
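To make the point about repeatability concrete, here is a small sketch in which an environment is described as data and rendered deterministically for any release. The schema is hypothetical and is not CloudFormation or Heat syntax; the service names come from the running example, and the function name is an assumption.

```python
import json


def render_environment(release, purpose, instances):
    # Describe one application environment as plain data. The same
    # inputs always produce the same text, so the output can be
    # committed to version control and replayed through an API.
    template = {
        "release": release,
        "purpose": purpose,  # e.g. "production", "qa", "system-test"
        "services": {
            "customers": {"instances": instances},
            "inventory": {"instances": instances},
            "authentication": {"instances": instances},
        },
    }
    return json.dumps(template, indent=2, sort_keys=True)


# One description per stop on the release carousel.
environments = [
    render_environment("X", "production", 3),
    render_environment("X+1", "ab-test", 1),
    render_environment("X+2", "qa", 2),
]
```

Because the description is a pure function of its inputs, recreating a system test environment for an older release is a matter of calling the same function with that release's identifier, which a trouble ticket or a GUI session cannot guarantee.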
In summary, cloud is the result of applying modern software engineering practices that aim to improve the reliability and agility of software. The main driver for adopting a cloud infrastructure is to serve a community of application developers and their requirements.