Recently, I’ve heard several people suggest that the advent of IPv6 changes the requirements for data-center virtual network solutions, for instance by claiming that network overlays are no longer necessary. The assumption is that once an instance has a globally unique IP address, all requirements are met.

In my view, this analysis fails in two dimensions:

  • In the assumption that it is desirable to give instances direct internet access (via a globally routed address);
  • In the assumption that overlay solutions are deployed to solve address translation related problems;

Neither of these assumptions holds when examined in detail.

While there are IaaS users who just want to fire up a single virtual machine and use it as a personal server, the most interesting use case for IaaS or PaaS platforms is deploying applications.

These applications serve content for a specific virtual IP address registered in the DNS and/or global load-balancers; that doesn’t mean that this virtual IP should be associated with any specific instance. There is a layer of load-balancing that maps the virtual IP to the specific instance(s) serving the content. Typically this is done with a load-balancer in proxy mode.
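As a minimal sketch of that mapping, the snippet below keeps a pool of instance addresses per virtual IP and hands out one backend per incoming connection. The class name, the addresses and the round-robin policy are all assumptions made for illustration; a real proxy-mode load-balancer would also terminate the client session and open a new one towards the chosen instance.

```python
import itertools
from typing import Dict, Iterator, List


class ProxyLoadBalancer:
    """Toy model of a load-balancer that maps a virtual IP to instances.

    Hypothetical illustration only: round-robin is an assumed policy.
    """

    def __init__(self, pools: Dict[str, List[str]]) -> None:
        # One endless round-robin iterator per virtual IP.
        self._cycles: Dict[str, Iterator[str]] = {
            vip: itertools.cycle(backends) for vip, backends in pools.items()
        }

    def pick_backend(self, vip: str) -> str:
        """Return the instance address that should serve the next connection."""
        return next(self._cycles[vip])


# The virtual IP is what gets registered in the DNS; the instance
# addresses behind it never need to be visible to clients.
lb = ProxyLoadBalancer({"203.0.113.10": ["10.0.1.5", "10.0.1.6"]})
```

Clients only ever see the virtual IP, so instances can be added to or removed from the pool without touching the DNS.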

As an aside, enabling IPv6 in the load-balancer is typically the best way to make an application IPv6-ready. While IPv6 doesn’t add functionality to IPv4 (other than more addresses), it does add a burden in terms of manageability and troubleshooting; dual-stack applications or dual-stack networks carry twice the operational load, with no benefit compared to terminating the IPv6 session on the load-balancer.

Back to our application: it is typically implemented as a set of multiple instances, often with specialized functions: front-ends (web-page generating); business-logic and caches; databases. The reasonable default, from a data security perspective, is to disallow internet access to and from these instances. RFC 1918 addresses are a real benefit here, since they prevent someone from accidentally routing traffic directly to these instances.
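To make the RFC 1918 point concrete, here is a small sketch that checks whether an address falls within one of the three private IPv4 ranges that RFC 1918 defines. The function name is hypothetical; the check uses only Python's standard `ipaddress` module.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if the address is an IPv4 address in an RFC 1918 range."""
    ip = ipaddress.ip_address(address)
    if ip.version != 4:
        # RFC 1918 only covers IPv4; an IPv6 address never qualifies.
        return False
    return any(ip in network for network in RFC1918_NETWORKS)
```

An audit script could run a check like this over the addresses assigned to back-end instances to confirm that none of them is globally routable.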

In several IaaS platforms that I work closely with, as soon as the platform is operational, there is an incident with VMs generating DDoS attacks from the cloud (usually benefiting from high-bandwidth internet connectivity). My current recommendation is for applications to use LBaaS for inbound traffic plus a jump host with a SOCKS proxy for outbound access. That still leaves the LB and jump host as threat vectors, but it is easier to monitor and manage than having direct internet connectivity from the VMs with either floating IP (bi-directional) or source NAT.

Thus, to me at least, globally routed addresses are hardly a feature. Let’s now examine the rationale for using overlays in data-center networks.

The first thing we need to notice is that overlays are replacing VLAN-based designs (based on IEEE 802.1D) that already support the main functionality of the overlay: the separation between the identifier and the “locator”. The identifier is the IP address of the instance; the locator is the Ethernet address in the case of 802.1D, or an IP address in the case of an IP-based overlay.

This separation is needed because transport sessions, configurations and policies are tied to IP addresses. Assigning an IP address based on the server that is currently executing an instance doesn’t work for a multitude of reasons:

  • Instances need to exist independently of their mapping to servers;
  • Many configuration-based systems are tied to IP addresses rather than DNS names;
  • Network-based policies are tied to IP addresses;

Operators deploy applications by defining a set of relationships between instances; this is typically done in an orchestration tool such as AWS CloudFormation or OpenStack Heat. IP addresses are assigned when instances are defined, typically before they are scheduled and executed; these IP addresses are then used in configuration files and access control lists.

When an instance is scheduled, it is assigned temporarily to a server. Even in scenarios where live migration of the instance is not supported, the “logical” concept of the Nth cache server for application “foo” must be able to outlive its scheduler assignment. The server can die, for instance, or the instance may have to be rescheduled because of interference with other workloads.
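The identifier/locator split can be sketched as a simple table that maps the instance IP address (the identifier) to the server currently running it (the locator). Everything here is hypothetical and for illustration only: the class name, the addresses and the ACL entry are assumptions, not part of any real overlay implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class OverlayMap:
    """Toy identifier-to-locator table (hypothetical, for illustration)."""

    # Maps instance IP (identifier) to server IP (locator).
    locators: Dict[str, str] = field(default_factory=dict)

    def schedule(self, instance_ip: str, server_ip: str) -> None:
        """Record, or update, the server that currently runs the instance."""
        self.locators[instance_ip] = server_ip

    def lookup(self, instance_ip: str) -> str:
        """Return the server that traffic for this instance is tunnelled to."""
        return self.locators[instance_ip]


overlay = OverlayMap()

# Policy is written against the identifier, before scheduling happens.
acl: Set[Tuple[str, int]] = {("10.0.1.5", 11211)}

overlay.schedule("10.0.1.5", "192.168.100.11")  # initial placement
overlay.schedule("10.0.1.5", "192.168.100.27")  # rescheduled after a failure
```

Only the locator changes when the instance is rescheduled; the ACL entry and any configuration file that mentions 10.0.1.5 remain valid.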

Of course, it is possible to construct a different mapping between identity and location, one that is name based, for instance. One example would be for every service (in the SOA sense of the word) to publish its transient address/port in a directory; given the limitations of the default Docker networking implementation, some operators are doing just that. The drawback is that it forces one to tweak every single application to use this directory for all resolution and policy enforcement. This provides functionality equivalent to what the network overlay is doing.
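A name-based directory of this kind might look roughly like the sketch below. The class, method names and service names are hypothetical, and a real deployment would need a replicated, highly available store rather than an in-process dictionary.

```python
from typing import Dict, Tuple


class ServiceDirectory:
    """Toy name-to-endpoint directory (hypothetical, for illustration)."""

    def __init__(self) -> None:
        # Maps service name to its current (address, port).
        self._entries: Dict[str, Tuple[str, int]] = {}

    def publish(self, name: str, address: str, port: int) -> None:
        """Register, or overwrite, the transient endpoint of a service."""
        self._entries[name] = (address, port)

    def resolve(self, name: str) -> Tuple[str, int]:
        """Return the current endpoint; raises KeyError if unknown."""
        return self._entries[name]


directory = ServiceDirectory()
directory.publish("foo-cache-1", "172.17.0.4", 49153)

# After a restart the container gets a new address and port; clients
# that resolve by name pick up the change on their next lookup.
directory.publish("foo-cache-1", "172.17.0.9", 49160)
```

The cost shows up on the client side: every application has to call something like `resolve()` instead of relying on a fixed IP address, which is exactly the per-application tweaking described above.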

There is at least one very, very large cloud provider that does just that. They have a directory that provides both name resolution and authorization. The tradeoff is that all services must speak the same RPC protocol, and no externally developed application can be easily deployed inside the infrastructure. The RPC protocol has become what most of us think of as the TCP/IP layer: the common interoperability layer.

Overall, it doesn’t seem to me that IPv6 changes the landscape; it just creates an additional manageability burden in the case of dual-stack deployment all the way through the application. If one does terminate the IPv6 session at the load-balancer, however, it should be mostly a NOP.


2014 in review

Looking back at 2014, it feels like a lot of progress was achieved in the past year in both the cloud infrastructure and NFV infrastructure markets. Some of that progress is technical, some is in terms of increased understanding of the key business and technical aspects. This post is my attempt to capture some changes I’ve observed from my particular vantage point.

This December marks the second anniversary of the acquisition of Contrail Systems by Juniper Networks. In the last year the Contrail team managed to deploy the Contrail network virtualization solution at several marquee customers; to solidify the image of the OpenContrail project as a production-ready implementation of the AWS VPC functionality; and, probably more importantly, to help transform attitudes at Juniper (and in the industry) regarding NFV.

In the late 90s and early 2000s, the carrier wireline business went through a significant change with the deployment of provider-managed virtual networks (using BGP L3VPN). From a business perspective, this was essentially outsourcing the network connectivity for distributed enterprises. Instead of a mesh of frame relay circuits managed by the enterprise, carriers provide a managed service that includes the circuit but also the IP connectivity. This is a service that has proven to be fairly profitable for carriers and a key technology for networking vendors such as Juniper. The company’s best-selling product (the MX) earned its stripes in this application.

Carrier wireline is set to go through a similar change in the next few years, this time outsourcing network-based services that are still present at either the branch office or central offices: security (firewalls), VPN access, NAT, wireless controllers, etc. The business case is very similar to managed connectivity. I believe it became clear during 2014 that these virtualized services are going to run on OpenStack clusters and that the OpenStack network implementation is going to play a role similar to the one the L3VPN PE had in connectivity.

The timeline for this transformation also seems to be accelerating. Some carriers with a more competitive outlook for their wireline business are planning on trials with live traffic for Q1/Q2 2015; some of these projects are in a rather advanced stage.

Across all the major carriers, I see a demand for a production-ready OpenStack networking implementation now. The consensus seems to be that there are two alternatives available on the NFV market: OpenContrail and Alcatel’s Nuage. Not surprisingly, both of these solutions were built on the technologies that made connectivity outsourcing successful: BGP L3VPN and EVPN. Technical experience provides an advantage in delivering production-capable solutions.

While OpenStack seems to be solidifying its position in both the public cloud space and in NFV solutions, 2014 was the year that containers (which Docker greatly helped popularize) went mainstream for SaaS/enterprise developers. A significant percentage of the operators considering Docker/container solutions that I spoke with recently are still uncertain which orchestration system they will use. Most are considering something other than OpenStack. I expect that this space will keep changing very rapidly in 2015.
For the OpenContrail project this implies the need to integrate with multiple additional orchestration systems.

The “enterprise switching” space has also advanced significantly in 2014 when it comes to perception. I often explain to network engineers that OpenContrail implements the functionality that was traditionally present in the aggregation switch of a 3-tier design. This is where access control policies and network-based services were applied to traffic crossing administrative domains.

The need to transition from an aggregation switch to a solution such as Contrail comes from the fact that increased bandwidth requirements force network engineers to opt for a Clos fabric design. As the fabric bandwidth increases, it is important to simplify the role of the fabric switch node. These switches are becoming increasingly commoditized, to the point where most switch vendors offer pretty much the same product, with the variation being the software. Often network engineers attempt to reduce the functionality running in that software to the minimum.

2015 is likely to bring additional movement towards switch commoditization. There is still space for “premium” switching solutions, but my understanding is that most industry observers would expect this to fall into an 80-20 rule, with 80% of the market preferring an OCP-like switch.

Against this backdrop, the Contrail product is finally starting to excite the Juniper sales teams. Taken on a per-server basis, the potential revenue of the Juniper Contrail solution is in line with selling switch ports, when taken over a 3-year time interval. Software has better margins, and Contrail is a differentiated product with one viable competitor in each of the markets it plays in. My expectation is that we will see in 2015 a much greater interest from the parent company in the Contrail business unit; <smirk> which will undoubtedly be a mixed blessing </smirk>.

Simultaneously, I believe that in 2015 OpenContrail will become much less of a Juniper project and much more of a partnership of different vendors and cloud operators. From the perspective of Juniper’s commercial interests that is not a bad thing. It is much preferable to have a smaller share of a bigger pie than 100% of a small one.

OpenContrail seems to be at a juncture where it’s starting to attract significant interest from people who have become disillusioned with other approaches that lack its problem statement and execution focus. The challenge will be to retain the latter properties while creating a “bigger tent” where others can meaningfully participate and achieve both their technical and their business goals. It promises to be both challenging and a great learning opportunity.