Kubernetes and OpenContrail

I’ve been working over the last couple of weeks on integrating OpenContrail as a networking implementation for Kubernetes, and I’ve reached the point where I have a prototype working with a multi-tier application example.

Kubernetes provides 3 basic constructs used in deploying applications:

  • Pod
  • Replication Controller
  • Service

A Pod is a container environment that can execute one or more applications; each Pod executes on a host as one (typically) or more Docker processes that share the same environment, including networking. A Replication Controller (RC) is a collection of Pods with the same execution characteristics. RCs ensure that the specified number of replicas are executing for a given Pod template.

Services are collections of Pods that are consumable as a service through a single IP endpoint, typically load-balanced across multiple backends.
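
To make the relationship concrete, here is a small sketch using the Kubernetes Python client (purely illustrative, not part of the prototype); the service name “frontend” and the “default” namespace are assumptions:

```python
# Illustrative only: a Service is a label-selected set of Pods behind a single VIP.
from kubernetes import client, config

config.load_kube_config()                    # assumes a reachable cluster/kubeconfig
v1 = client.CoreV1Api()

svc = v1.read_namespaced_service("frontend", "default")   # hypothetical service name
print("service VIP:", svc.spec.cluster_ip)                 # the "Portal IP"

# The backends are simply the Pods that currently match the service's label selector.
selector = ",".join(f"{k}={v}" for k, v in (svc.spec.selector or {}).items())
for pod in v1.list_namespaced_pod("default", label_selector=selector).items:
    print("backend pod:", pod.metadata.name, pod.status.pod_ip)
```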

Kubernetes comes with several application deployment examples. For the purpose of prototyping, I decided to use the K8PetStore example. It deploys a 4-tier application: load-generator, frontend, redis-master and redis-slave. Each of these tiers, except for the redis-master, can be deployed as multiple instances.

With OpenContrail, we decided to create a new daemon that listens to the kubernetes API using the kubernetes controller framework. This daemon creates virtual networks on demand for each application tier and connects them together using the “labels” present in the deployment template.
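
A minimal sketch of that idea is below, assuming the Kubernetes Python client for the watch loop; create_virtual_network is a hypothetical placeholder for the OpenContrail provisioning call the real daemon would make, not an actual Contrail function:

```python
# Sketch: watch replication controllers and create one virtual network per collection.
from kubernetes import client, config, watch

def create_virtual_network(name, cidr="10.0.0.0/16"):
    # Placeholder for the OpenContrail provisioning call.
    print(f"would create virtual network {name} addressed out of {cidr}")

config.load_kube_config()
v1 = client.CoreV1Api()

for event in watch.Watch().stream(v1.list_replication_controller_for_all_namespaces):
    rc = event["object"]
    if event["type"] == "ADDED":
        # Name the network after the collection's "name" label (tier), if present.
        labels = rc.spec.template.metadata.labels or {}
        create_virtual_network(labels.get("name", rc.metadata.name))
```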

A plugin script running on the minion then connects the container veth-pair to the OpenContrail vrouter rather than the docker0 bridge.
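
Roughly, the plugin has to do something like the following (a sketch assuming iproute2; the interface names and the attach_to_vrouter helper are illustrative placeholders, not the actual plugin or vrouter-agent API):

```python
# Sketch: plumb a pod's networking into the vrouter instead of the docker0 bridge.
import subprocess

def sh(*cmd):
    subprocess.check_call(cmd)

def attach_to_vrouter(ifname):
    # Placeholder for registering the host-side interface with the vrouter agent.
    print(f"would register {ifname} with the local vrouter agent")

def plumb_pod(container_pid, host_if="veth-host0", pod_if="veth-pod0"):
    # Create a veth pair on the host.
    sh("ip", "link", "add", host_if, "type", "veth", "peer", "name", pod_if)
    # Move one end into the container's network namespace (instead of docker0).
    sh("ip", "link", "set", pod_if, "netns", str(container_pid))
    sh("ip", "link", "set", host_if, "up")
    # Hand the host-side end to the vrouter (placeholder above).
    attach_to_vrouter(host_if)
```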

The network manager daemon logic does the following:

  • For each collection (i.e. a group of Pods managed by an RC), it creates a virtual network. All these virtual networks are addressed out of the cloud private space (10.0.0.0/16 in my example).
  • Each Pod is assigned a unique address in the private space (10.0.x.x) and by default can only communicate with other Pods in the same collection.
  • When a service is defined over a collection of Pods, that service implies the creation of a new virtual network in the services space (a.k.a. the Portal network in kubernetes).
  • Each pod in a service is assigned the floating-ip address corresponding to the PortalIP (i.e. the service VIP); thus traffic sent to the service will be equal-cost load-balanced across the multiple backends.
  • In the k8petstore example, the collections use the kubernetes labels “name” and “uses” to specify which tiers communicate with each other; the network manager automatically creates network access control policies that allow the respective Pods to communicate (sketched after this list). The network policies are provisioned such that when a collection X has a deployment annotation that it “uses” collection Y, then X is allowed to communicate with Y’s virtual IP address.
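
The label-to-policy and floating-ip steps above can be sketched as follows; the helper functions (create_policy, attach_policy, assign_floating_ip) are hypothetical names standing in for OpenContrail API calls, not real Contrail functions:

```python
# Sketch of the policy and floating-ip logic; helpers are illustrative placeholders.
def create_policy(name):
    print(f"would create network policy {name}")
    return name

def attach_policy(policy, networks):
    print(f"would attach {policy} to networks {networks}")

def assign_floating_ip(pod_name, vip):
    print(f"would assign floating ip {vip} to {pod_name}")

def connect_collections(rc):
    labels = rc.spec.template.metadata.labels or {}
    src = labels.get("name")        # this collection's tier, e.g. "frontend"
    dst = labels.get("uses")        # the tier it consumes, e.g. "redis-master"
    if src and dst:
        # Collection X "uses" collection Y: allow X to reach Y's service network.
        attach_policy(create_policy(f"{src}-to-{dst}"),
                      networks=[src, f"{dst}-service"])

def expose_service(service, backend_pods):
    # Every backend pod gets the Portal IP as a floating ip; the vrouters then
    # load-balance (ECMP) traffic sent to that VIP across the backends.
    for pod in backend_pods:
        assign_floating_ip(pod.metadata.name, service.spec.cluster_ip)
```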

The current prototype is very interesting in highlighting how a tool like kubernetes makes application deployment easy, fast and reproducible, and how network micro-segmentation can be added in a way that is transparent to the application owner while providing isolation and access control.

The OpenContrail kubernetes network-manager can automate the deployment of the network since it is exposed to the collection (RC) and service definitions. While advanced users may want to customize the settings, the defaults can be more useful and powerful when compared with an API such as AWS VPC or OpenStack Neutron.

One important difference from a usage perspective versus traditional OpenStack + OpenContrail deployments is that in the kubernetes prototype the system simply allocates private IP addresses that are unique within the cloud while isolating each collection of pods in its own virtual network. For instance, in our example, if the frontend Pod has the address 10.0.0.2, the redis-master Pod has the private address 10.0.0.3, and the redis-master service has the VIP (a.k.a. Portal IP) 10.254.42.1, the topology is set up such that:

  • The frontend network contains 10.0.0.2 (but is unable to forward traffic directly to 10.0.0.3);
  • The frontend network is connected to the redis-master service network (which contains the floating-ip address 10.254.42.1).
  • The redis-master network contains 10.0.0.3.

Traffic from the frontend to the service VIP is forwarded to the OpenContrail vrouter on the minion where the service is executing (with an ECMP decision if multiple replicas are running). The destination IP is then translated to the private IP of the redis-master instance. Return traffic flows through the redis-master service network, which has routing table entries for the frontend network.

With OpenContrail, the kubernetes cloud no longer uses the “kube-proxy”. The default kubernetes deployment uses a TCP proxy between the host and the container running on a private address on the docker0 bridge. This creates a need for the service/pod definition to allocate host ports and prevents two docker containers that want the same host port from executing on the same minion. OpenContrail removes this restriction; the host network is completely out of the picture of the virtual networks used by the pods and services.
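
As an illustration of the coupling that goes away: under the default model, anything that must be reachable from outside the node reserves a hostPort in the container spec, and two pods asking for the same hostPort cannot land on the same minion. The snippet below uses today's Kubernetes Python client objects and a made-up image name, purely for illustration:

```python
# Illustrative only: exposing a container via a host port under the default networking.
from kubernetes import client

container = client.V1Container(
    name="frontend",
    image="example/frontend",   # hypothetical image
    ports=[client.V1ContainerPort(container_port=8080, host_port=8080)],
)
# With OpenContrail, host_port is unnecessary: the pod has a routable address on its
# own virtual network and the service is reached through its VIP.
```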

I’m looking forward to rewriting the prototype of the network-manager daemon over the next few weeks and adding functionality such as the ability to deploy source NAT for optional outbound access from the containers to the internet, as well as LBaaS for scenarios where fine-grained control of the load-balancing decision is desirable.