Authenticated access to Kubernetes pods

When running a micro-services style application in a public cloud, one of the problems to solve is how to provide access to debug information. At Laserlike, we run our application stack on GKE. Most of the stack consists of golang Pods that run an HTTP listener that serves /debug and /metrics handlers.

For metrics scraping we use Prometheus, with Grafana for visualization. Our Grafana server is a NodePort service behind a GCE load balancer that uses OAuth2-based authentication for access.
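
For reference, a minimal Prometheus scrape job for this kind of setup could look like the sketch below. This is only an illustration using Prometheus’ Kubernetes pod service discovery; the job name and the prometheus.io/scrape annotation convention are assumptions, not our exact production configuration.

# Hypothetical scrape job: discover pods via the Kubernetes API and scrape
# only those that opt in with a prometheus.io/scrape: "true" annotation.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Record namespace and pod name as labels on the scraped series.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod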

This still leaves a gap in terms of access to pod debug information such as /debug/vars or /debug/pprof. In order to address this gap, we created a simple HTTP proxy for kubernetes services and endpoints. We deploy this proxy behind an OAuth2 authenticator, which is then exposed via an external load balancer.

The service proxy uses the kubernetes client library to consume annotations on the service objects. For example, the following annotation instructs the service proxy to expose the debug port of the endpoints of the specified service:

metadata:
  annotations:
    k8s-svc-proxy.local/endpoint-port: "8080"
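
For context, the annotation is set on an ordinary Service object. A hypothetical manifest would look something like the following; the service name, selector and port numbers are made up for illustration:

apiVersion: v1
kind: Service
metadata:
  name: frontend            # hypothetical service name
  annotations:
    # Ask the service proxy to expose port 8080 (the debug port) of this
    # service's endpoints.
    k8s-svc-proxy.local/endpoint-port: "8080"
spec:
  selector:
    app: frontend           # hypothetical label
  ports:
    - name: http
      port: 80
      targetPort: 8080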

The landing page on the proxy then displays the set of proxied endpoints.


k8s + opencontrail on AWS

For anyone interested in running a testbed with Kubernetes and OpenContrail on AWS, I managed to boil down the install steps to the minimum:

  • Use AWS IAM to create a user and download a file “credentials.csv”
  • Checkout the scripts via `git clone https://github.com/Juniper/container-networking-ansible.git`
  • Change to the “test/ec2-k8s” directory.
  • Set up environment variables with your AWS IAM credentials. The script credentials.sh can be used for this purpose.
  • Follow the steps in the test script:
    • ansible-playbook -i localhost playbook.yml
    • Extract the deployer hostname from the file `cluster.status`; this is the inventory entry in the `[management]` group.
    • Login into the deployer hostname and execute:
      • ansible-playbook -i src/contrib/ansible/inventory src/contrib/ansible/resolution.yml
      • ansible-playbook -i src/contrib/ansible/inventory src/contrib/ansible/cluster.yml
      • ansible-playbook -i src/contrib/ansible/inventory src/contrib/ansible/validate.yml
      • ansible-playbook -i src/contrib/ansible/inventory src/contrib/ansible/examples.yml

This will:

  • Create 5 VMs in a VPC on AWS;
  • Run the ansible provisioning script that installs the cluster;
  • Run a minimal sanity check on the cluster;
  • Launch an example;
  • Fetch the status page of the example app in order to check whether it is running successfully.

 

Container networking: To overlay or not to overlay

One of the key decisions in designing a compute infrastructure is how to handle networking.

For platforms that are designed to deliver applications, it is now common knowledge that application developers need a platform that can execute and manage containers (rather than VMs).

When it comes to networking, however, the choices are less clear. In what scenarios are designs based on a single network layer preferable to overlay networks?

The answer to this question is not a simplistic one based on “encapsulation overhead”; while there are overlay networking projects that do exhibit poor performance, production-ready solutions such as OpenContrail have performance characteristics, in both throughput and PPS, similar to the Linux kernel bridge implementation. When not using an overlay, it is still necessary to use an internal bridge to demux the container virtual-ethernet interface pairs.

The key aspect to consider is operational complexity!

From a bottom-up perspective, one can build an argument that a network design with no encapsulation that simply uses an address prefix per host (e.g. a /22) provides the simplest possible solution to operate. And that is indeed the case if one assumes that discovery, failover and authentication can be handled completely at the “session” layer (OSI model).

I’m familiar with a particular compute infrastructure where this is the case: all communication between containers uses a common “RPC” layer which provides discovery, authentication and authorization and a common presentation layer.

In the scenario I’m familiar with, this works well because every single application component was written from scratch.

The key takeaway for me from observing this infrastructure operate was not really whether an overlay is used or not. The key lesson, in my mind, is that it is possible to operate the physical switching infrastructure independently of the problems of discovery, application failover and authentication. I believe that those with a background in operations in traditional enterprise networking environments can fully appreciate how decoupling the switching infrastructure from these problems can lead not only to simpler operations but also to driving up the throughput of the network fabric.

In environments where not all application components use the same common RPC layer or where a “belt and suspenders” approach is desirable, discovery, authorization and failover are a function that the infrastructure is expected to deliver to the applications.

Whenever that is the case, using an overlay network design provides the simplest (from an operations standpoint) way to implement this functionality because it offers a clear separation between the application layer and the physical infrastructure.

In the work we’ve been doing with OpenContrail around kubernetes/openshift, OpenContrail provides:

  •  access control between application tiers and services (a.k.a. micro-segmentation);
  •  service discovery and failover, by mapping the service’s virtual IP address (aka ClusterIP) to the instances (Pods) that implement the service;
  •  traffic monitoring and reporting on a per app-tier basis;

While OpenContrail introduces its own software stack, which has an inherent operational complexity, this functionality can be operated independently of the switching infrastructure, and it is totally agnostic to it: it operates the same way on a private cloud environment or on any public cloud.

Using an overlay allows OpenContrail to carry the “network context” with the packet, for purposes of both access control and traffic accounting and analysis. It also provides a clean separation between the virtual IP address space and the physical infrastructure.

For instance, address pools do not have to be pre-allocated on a per-node basis. And while IPv6 can bring almost infinite address pool sizes, it doesn’t address the problem of discovery and failover for the virtual IP addresses associated with a service (a.k.a. ClusterIPs in k8s).

In addition to this, standards-based overlays such as OpenContrail can carry the information of the virtual network segment across multiple clusters or different infrastructure. A simple example: one can expose multiple external services (e.g. an Oracle DB and a DB2 system) to a single cluster while maintaining independent access control (that doesn’t depend on listing individual IP addresses).

From an operational perspective, it is imperative to separate the physical infrastructure, which is probably different on a cluster-by-cluster basis, from the services provided by the cluster. When discovery, authorization and failover are not a pure session-layer implementation, it makes sense to use a network overlay.

kubernetes + opencontrail install

In this post we walk through the steps required to install a 2 node cluster running kubernetes that uses opencontrail as the network provider. In addition to the 2 compute nodes, we use a master and a gateway node. The master runs both the kubernetes api server and scheduler as well as the opencontrail configuration management and control plane.

OpenContrail implements an overlay network using standards-based network protocols such as BGP for the control plane and MPLS over UDP for encapsulation.

This means that, in production environments, it is possible to use existing network appliances from multiple vendors that can serve as the gateway between the un-encapsulated network (a.k.a. underlay) and the network overlay. However, for the purposes of a test cluster, we will use an extra node (the gateway) whose job is to provide access between the underlay and overlay networks.

For this exercise, I decided to use my MacBook Pro, which has 16G of RAM. However, all the tools used are also supported on Linux; it should be relatively simple to reproduce the same steps on a Linux machine or on a cloud such as AWS or GCE.

The first step in the process is to obtain the binaries for kubernetes release-1.1.1. I unpacked the tar file into ~/tmp and then extracted the linux binaries required to run the cluster using the command:


cd ~/tmp;tar zxvf kubernetes/server/kubernetes-server-linux-amd64.tar.gz

In order to create the 4 virtual machines required for this scenario I used VirtualBox and Vagrant. Both are trivial to install on OSX.

In order to provision the virtual machines we use Ansible, which can be installed via “pip install ansible”. I then created a default ansible.cfg that enables the pipelining option and disables ssh connection sharing. The latter was required to work around failures on tasks that use “delegate_to” and run concurrently (i.e. run_once is false). From a cursory internet search, it appears that the openssh server that ships with ubuntu 14.04 has a concurrency issue when handling multiplexed sessions.

 

[defaults]
pipelining=True

[ssh_connection]
ssh_args = -o ControlMaster=no -o ControlPersist=60s

With Ansible and Vagrant installed, we can proceed to create the VMs used by this testbed. The vagrant configuration for this example is available on GitHub. The servers.yaml file lists the names and resource requirements for the 4 VMs. Please note that if you are adjusting this example to run with a different vagrant provider, the Vagrantfile needs to be edited to specify the resource requirements for that provider.
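
As a rough sketch, servers.yaml contains one entry per VM with its resource requirements, along the lines of the following; the exact keys and the memory values shown here are assumptions, and the file in the repository is authoritative:

# Hypothetical sketch of servers.yaml; see the repository for the real file.
- name: k8s-master
  memory: 2048
- name: k8s-gateway
  memory: 1024
- name: k8s-node-01
  memory: 4096
- name: k8s-node-02
  memory: 4096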

After checking out this directory (or copying over the files) the VMs can be created by executing the command: vagrant up

Vagrant will automatically execute config.yaml which will configure the hostname on the VMs.

The Vagrantfile used in this example will cause vagrant to create VMs with 2 interfaces: a NAT interface (eth0) used for the ssh management sessions and external access, and a private network interface (eth1) providing a private network between the host and the VMs. OpenContrail will use the private network interface; the management interface is optional and may not exist in other configurations (e.g. AWS, GCE).

After vagrant up completes, it is useful to add entries to /etc/hosts on all the VMs so that names can be resolved. For this purpose I used another ansible script, invoked as:

ansible-playbook -u vagrant -i .vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory resolution.yaml

This step must be executed independently of the ansible configuration performed by vagrant, since vagrant invokes ansible for one VM at a time, while this playbook expects to be invoked for all hosts.

The command above depends on the inventory file that vagrant creates automatically when configuring the VMs. We will use the contents of this inventory file to provision kubernetes and opencontrail as well.

With the VMs running, we need to check out the ansible playbooks that configure kubernetes + opencontrail. While an earlier version of the playbook is available upstream in the kubernetes contrib repository, the most recent version is in a development branch on a fork of that repository. Check out the repository via:


git clone -b opencontrail https://github.com/pedro-r-marques/contrib.git

The branch HEAD commit id, at the time of this post, is 15ddfd5.

UPDATE: The OpenContrail ansible playbook is now at https://github.com/Juniper/container-networking-ansible.

I will work to upstream the updated opencontrail playbook to both the kubernetes and openshift provisioning repositories as soon as possible.

With the ansible playbook available in the contrib/ansible directory, it is necessary to edit the file ansible/group_vars/all.yml to replace the network provider:

# Network implementation (flannel|opencontrail)
networking: opencontrail

We then need to create an inventory file:

[opencontrail:children]
masters
nodes
gateways

[opencontrail:vars]
localBuildOutput=/Users/roque/tmp/kubernetes/server/bin
opencontrail_public_subnet=100.64.0.0/16
opencontrail_interface=eth1

[masters]
k8s-master ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2222 ansible_ssh_private_key_file=/Users/roque/k8s-provision/.vagrant/machines/k8s-master/virtualbox/private_key

[etcd]
k8s-master ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2222 ansible_ssh_private_key_file=/Users/roque/k8s-provision/.vagrant/machines/k8s-master/virtualbox/private_key

[gateways]
k8s-gateway ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2200 ansible_ssh_private_key_file=/Users/roque/k8s-provision/.vagrant/machines/k8s-gateway/virtualbox/private_key

[nodes]
k8s-node-01 ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2201 ansible_ssh_private_key_file=/Users/roque/k8s-provision/.vagrant/machines/k8s-node-01/virtualbox/private_key
k8s-node-02 ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2202 ansible_ssh_private_key_file=/Users/roque/k8s-provision/.vagrant/machines/k8s-node-02/virtualbox/private_key

This inventory file does the following:

  • Declares the hosts for the roles masters, gateways, etcd and nodes; the ssh information is derived from the inventory created by vagrant.
  • Declares the location of the kubernetes binaries downloaded from the github release;
  • Defines the IP address prefix used for ‘External IPs’ by kubernetes services that require external access;
  • Instructs opencontrail to use the private network interface (eth1); without this setting the opencontrail playbook defaults to eth0.

Once this file is created, we can execute the ansible playbook by running the script "setup.sh" in the contrib/ansible directory.

This script will run through all the steps required to provision kubernetes and opencontrail. It is not unusual for the script to fail to perform some of the network-based operations (downloading the repository keys for docker, for instance, or downloading a file from github); the ansible playbook is meant to be declarative (i.e. define the end state of the system) and is supposed to be re-run if a network-based failure is encountered.

At the end of the script we should be able to login to the master via the command “vagrant ssh k8s-master” and observe the following:

  • kubectl get nodes
    This should show two nodes: k8s-node-01 and k8s-node-02.
  • kubectl --namespace=kube-system get pods
    This command should show that the kube-dns pod is running; if this pod is in a restart loop, that usually means that the kube2sky container is not able to reach the kube-apiserver.
  • curl http://localhost:8082/virtual-networks | python -m json.tool
    This should display the list of virtual-networks created in the opencontrail api.
  • netstat -nt | grep 5269
    We expect 3 established TCP sessions for the control channel (xmpp) between the master and the nodes/gateway.

On the host (OSX), one should be able to access the diagnostic web interface of the vrouter agent running on each of the compute nodes; it displays the information regarding the interfaces attached to each pod.

Once the cluster is operational, one can start an example application such as “guestbook-go”. This example can be found in the kubernetes examples directory. In order for it to run successfully the following modifications are necessary:

  • Edit guestbook-controller.json, in order to add the labels “name” and “uses”, as in:
    "spec": {
      [...]
      "template": {
        "metadata": {
          "labels": {
            "app": "guestbook",
            "name": "guestbook",
            "uses": "redis"
          }
        },
        [...]
      }
    }
  • Edit redis-master-service.json and redis-slave-service.json in order to add a service name. The following is the configuration for the master:
    "metadata": {
      [...]
      "labels": {
        "app": "redis",
        "role": "master",
        "name": "redis"
      }
    }
  • Edit redis-master-controller.json and redis-slave-controller.json in order to add the “name” label to the pods, as in:
    "spec": {
      [...]
      "template": {
        "metadata": {
          "labels": {
            "app": "redis",
            "role": "master",
            "name": "redis"
          }
        },
        [...]
      }
    }

After the example is started the guestbook service will be allocated an ExternalIP on the external subnet (e.g. 100.64.255.252).

In order to access the external IP network from the host, one needs to add a route to the external subnet (100.64.0.0/16) via 192.168.1.254 (the gateway address). Once that is done, you should be able to access the application in a web browser at http://100.64.255.252:3000.

Ansible inventory, role variables and facts

I’ve been struggling a bit to understand how to use inventory, role variables and facts in the playbooks I’ve been working on (mostly around provisioning opencontrail on top of kubernetes/openshift-origin). I finally came up with a model that made sense to me. This is probably well understood by everyone else, but I couldn’t quite grok it until I worked out the following example.

User configuration options should be set:
  • In group_vars/all.yml for settings that affect all hosts;
  • In the inventory file, for host and group variables.

As in this example:

[test]
localhost flag_a="foo"
[test:vars]
flag_b="bar"

It is useful to establish a convention for variables that are specific to the deployment (e.g. user-settable variables). In this case I’m using flag_<var> as the convention for deployment-specific variables.

Most of these would have defaults. In order to set the defaults, the playbook defines a role variable (flag_user_<var> in this example). The playbook role then uses flag_user_<var> rather than the original flag_<var>.

Role variables can use jinja template logic operations as well as filters. The most common operation is to use a “default” filter, as in the example playbook below. But more complex logic can be built using {% if <expression> %}{% endif %} blocks.

Facts can then be used for variables that depend on the result of command execution. While it may seem useful to use the set_fact action to set variables that have no dependency on task execution, since it supports the when statement while role variables do not (although include_vars does), it helps to establish a simple convention that a “fact” is the result of a task observation.

---
- name: inventory, task variables and facts
  hosts:
    - test
  tasks:
    - name: example task
      command: hostname
      register: hostname_var
    - name: facts can be determined by output of tasks
      set_fact:
        flag_task_a: "{{ hostname_var.stdout }}"
      when: hostname_var.rc == 0
    - debug: var=flag_task_a
    - debug: var=flag_task_b
    - debug: var=flag_user_a
    - debug: var=flag_user_b
    - debug: var=flag_user_c
  vars:
    flag_task_a: "a"
    flag_task_b: "b"
    flag_user_a: "{{ flag_a | default('a') }}"
    flag_user_b: "{{ flag_b | default('b') }}"
    flag_user_c: "{{ flag_c | default('c') }}"

My conclusion is that, for the playbooks that I write/maintain, I’m going to try to establish a set of rules before starting to actually write the tasks.

  1. Naming convention for user settable variables.
  2. Naming convention for role variables (i.e. user setting + default value).
  3. Limit set_fact to variables that depend on the outcome of task execution.

Please leave a comment if you have a different suggestion that improves maintainability of a playbook’s role specifications.

Kubernetes networking with OpenContrail

OpenContrail can be used to provide network micro-segmentation to kubernetes, providing both network isolation as well as the ability to attach a pod to a network that may have endpoints implemented using different technologies (e.g. bare-metal servers on VLANs or OpenStack VMs).

This post describes how the current prototype works and how packets flow between pods. For illustration purposes we will focus on 2 tiers of the k8petstore example on kubernetes: the web frontend and the redis-master tier that the frontend uses as a data store.

The OpenContrail integration works without modifications to the kubernetes code base (as of v1.0.0 RC2). An additional daemon, by the name of kube-network-manager, is started on the master. The kubelets are executed with the option “--network_plugin=opencontrail”, which instructs the kubelet to execute the command /usr/libexec/kubernetes/kubelet-plugins/net/exec/opencontrail/opencontrail. The source code for both the network-manager and the kubelet plugin is publicly available.

When using OpenContrail as the network implementation the kube-proxy process is disabled and all pod connectivity is implemented via the OpenContrail vrouter module which implements an overlay network using MPLS over UDP as encapsulation. OpenContrail uses a standards based control plane in order to distribute the mapping between endpoint (i.e. pod) and location (k8s node). The fact that the implementation is standards compliant means that it can interoperate with existing network devices (from multiple vendors).

The kube-network-manager process uses the kubernetes controller framework to listen for changes in objects that are defined in the API and adds annotations to some of these objects. It then creates a network solution for the application using the OpenContrail API to define objects such as virtual-networks, network interfaces and access control policies.

The kubernetes deployment configuration for this example application consists of a replication controller (RC) and a service object for the web-server and a pod and service object for the redis-master.

The web frontend RC contains the following metadata:

"labels": {
  "name": "frontend",
  "uses": "redis-master"
}

This metadata information is copied to each pod replica created by the kube-controller-manager. When the network-manager sees these pods it will:

  • Create a virtual-network with the name <namespace:frontend>
  • Connect this network with the network for the service <namespace:redis-master>
  • Create an interface per pod replica with a unique private IP address from a cluster-wide address block (e.g. 10.0/16).

The kube-network-manager also annotates the pods with the interface uuid created by OpenContrail as well as the allocated private IP address (and a mac-address). These annotations are then read by the kubelet.
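
The sketch below shows the general shape of these annotations on a pod object; the annotation key names and the values are hypothetical, since the actual names are defined by the kube-network-manager code:

apiVersion: v1
kind: Pod
metadata:
  name: frontend-x1y2z        # one of the frontend replicas (hypothetical name)
  annotations:
    # Hypothetical keys; values are written by kube-network-manager and read
    # by the kubelet plugin.
    opencontrail.org/vif-uuid: "9a6f1c2e-0d4b-4c1a-9a1e-2b3c4d5e6f70"
    opencontrail.org/ip-address: "10.0.1.17"
    opencontrail.org/mac-address: "02:9a:6f:1c:2e:0d"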

When the pods are started, the respective kubelet invokes the plugin script. This script removes the veth-pair associated with the docker0 bridge and assigns it to the OpenContrail vrouter kernel module executing on each node. The same script notifies the contrail-vrouter-agent of the interface uuid associated with the veth interface and configures the IP address inside the pod’s network namespace.

At this stage each pod has a unique IP address in the cluster but can only communicate with other pods within the same virtual-network. Subnet broadcast and IP link-local multicast packets will be forwarded to the group of pods that are present in the same virtual-network (defined by the “labels.name” tag).

OpenContrail assigns a private forwarding table to each pod interface. The veth-pair associated with the network namespace used by docker is mapped into a table which has routing entries for each of the other pod instances that are defined within the same network or networks this pod has authorized access to. The routing tables are computed centrally by the OpenContrail control-node(s) and distributed to each of the compute nodes where the vrouter is running.

The deployment defines a service associated with web frontend pods:

  "kind": "Service",
  "metadata": {
    "name": "frontend",
    "labels": {
      "name": "frontend"
    }
  },
  "spec": {
    "ports": [{
      "port": 3000
    }],
    "deprecatedPublicIPs":["10.1.4.89"],
    "selector": {
      "name": "frontend"
    }
  }

The “selector” tag specifies the pods that belong to the service. The service is then assigned a “ClusterIP” address by the kube-controller-manager. The ClusterIP is a unique IP address that can be used by other pods to consume the service. This particular service also allocates a PublicIP address that is accessible outside the cluster.

When the service is defined, the kube-network-manager creates a virtual-network for the service (with the name of <namespace:service-frontend>) and allocates a floating-ip address with the ClusterIP specified by kubernetes. The floating-ip address is then associated with each of the replicas.

In the k8petstore example, there is a load-generator tier defined by an RC with the following metadata:

        "labels": {
          "name": "bps",
          "uses": "frontend"
        }

The network-manager process interprets the “uses” tag as an implicit authorization for the “bps” network to access the “service-frontend” network which contains the ClusterIP. That is the mechanism that causes the ClusterIP address to be visible in the private routing tables that are associated with the load-generator pods.

When traffic is sent to this ClusterIP address, the sender has multiple feasible paths available (one per replica). It chooses one of these based on a hash of the 5-tuple of the packet (IP source, IP destination, protocol, source port, destination port). Traffic is sent encapsulated to the destination node such that the destination IP address of the inner packet is the ClusterIP. The vrouter kernel module in the destination node then performs a destination NAT operation on the ClusterIP and translates this address to the private IP of the specific pod.

A packet sent by a load-generator pod to the ClusterIP of the web frontend goes through the following steps:

  1. Packet is sent by the IP stack in the container with SourceIP=”load-gen private IP”, DestinationIP=ClusterIP. This packet is sent to eth0 inside the container network namespace, which is a Linux veth-pair interface.
  2. The packet is delivered to the vrouter kernel module; a route lookup is performed for the destination IP address (ClusterIP) in the private forwarding table “bps”.
  3. This route lookup returns an equal cost load balancing next-hop (i.e. a list of available paths). The ECMP algorithm selects one of the available paths and encapsulates the traffic such that an additional IP header is added to the packet with SourceIP=”sender node address”, DestinationIP=”destination node address”; additionally an MPLS label is added to the packet corresponding to the destination pod.
  4. Packet travels in the underlay to the destination node.
  5. The destination node strips the outer headers and performs a lookup on the MPLS label and determines that the destination IP address is a “floating-ip” address and requires NAT translation.
  6. The destination node creates a flow-pair with the NAT mapping of the ClusterIP to the private IP of the destination pod and modifies the destination IP of the payload.
  7. Packet is delivered to the pod such that the source IP is the unique private IP of the source pod and the destination IP is the private IP of the local pod.

The service definition for the web front-end also specified a PublicIP. This address is implemented as a floating-ip address like the ClusterIP, except that the floating-ip is associated with a network that spans the cluster and the outside world. Typically, OpenContrail deployments configure one or more “external” networks that map to virtual networks on external network devices such as a data-center router.

Traffic from the external network is also equal cost load balanced to the pod replicas of the web frontend. The mechanism is the same as described above, except that the ingress device is a router rather than a kubernetes node.

To finalize the walk-through of the k8petstore example, the redis-master service defines:

  "kind": "Service",

  "metadata": {
    "name": "redismaster",
    "labels": {
      "name": "redis-master"
    }
  },
  "spec": {
    "ports": [{
      "port": 6379
    }],
    "selector": {
      "name": "redis-master"
    }
  }

Since the web frontend pods contain the label "uses": "redis-master" the network-manager creates a policy that connects the clients (frontend pods) to the service ClusterIP. This policy can also limit the traffic to allow access to the ports specified in the service definition only.
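
Conceptually, the generated policy can be thought of as the sketch below; this is purely illustrative pseudo-configuration, not actual OpenContrail API syntax:

# Illustrative only: a conceptual view of the access policy derived from the
# "uses": "redis-master" label; not real OpenContrail API objects.
policy:
  name: frontend-uses-redis-master
  rules:
    - source-network: default/frontend
      destination-network: default/service-redismaster
      protocol: tcp
      destination-ports: [6379]
      action: pass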

There remains additional work to be done in this integration, but I do believe that the existing prototype shows how OpenContrail can be used to provide an elegant solution for micro-segmentation that can both provide connectivity outside the cluster and pass a security audit.

From an OpenContrail perspective, the delta between a kubernetes and an OpenStack deployment is that in OpenStack the Neutron plugin provides the mapping between Neutron and OpenContrail API objects while in kubernetes the network-manager translates the pod and service definitions into the same objects. The core functionality of the networking solution remains unchanged.

Static routes

OpenContrail allows the user to specify a static route with a next-hop of an instance interface. The route is advertised within the virtual-network that the interface is associated with. This script can be used to manipulate the static routes configured on an interface.

I wrote it in order to set up a cluster in which overlay networks are used hierarchically. The bare-metal nodes are running OpenStack with OpenContrail as the neutron plugin; a set of OpenStack VMs run a second overlay network using OpenContrail, with kubernetes as the compute scheduler.

In order to provide external access for the kubernetes cluster, one of the kubernetes node VMs was configured as an OpenContrail software gateway.

This is easily achievable by editing /etc/contrail/contrail-vrouter-agent.conf to include the following snippet:

# Name of the routing_instance for which the gateway is being configured
routing_instance=default-domain:default-project:Public:Public

# Gateway interface name
interface=vgw

# Virtual network ip blocks for which gateway service is required. Each IP
# block is represented as ip/prefix. Multiple IP blocks are represented by
# separating each with a space
ip_blocks=10.1.4.0/24

The vgw interface can then be created via the following sequence of shell commands:

ip link add vgw type vhost
ip link set vgw address 00:00:5e:00:01:00
ip link set vgw up
ip route add 10.1.4.0/24 dev vgw

The interface-route script can then be used to add a static route to the IP prefix configured in the software gateway interface. This route should be added to an interface (e.g. neutron port) associated with the VM that is running the software gateway functionality and in a network that is externally connected.

This allows the nested overlay to be accessed from outside the cluster. For redundancy, multiple VMs can be configured with a gateway interface and the corresponding static route.

Kubernetes and OpenContrail

I’ve been working over the last couple of weeks on integrating OpenContrail as a networking implementation for Kubernetes and got to the point where I have a prototype working with a multi-tier application example.

Kubernetes provides 3 basic constructs used in deploying applications:

  • Pod
  • Replication Controller
  • Service

A Pod is a container environment that can execute one or more applications; each Pod executes on a host as one (typically) or more Docker processes sharing the same environment  (including networking). A Replication Controller (RC) is a collection of Pods with the same execution characteristics. RCs ensure that the specified number of replicas are executing for a given Pod template.

Services are collections of Pods that are consumable as a service through a single IP endpoint, typically load-balanced across multiple backends.

Kubernetes comes with several application deployment examples. For the purpose of prototyping, I decided to use the K8PetStore example. It creates a 4-tier example: load-generator, frontend, redis-master and redis-slave. Each of these tiers (except for the redis-master) can be deployed as multiple instances.

With OpenContrail, we decided to create a new daemon that listens to the kubernetes API using the kubernetes controller framework. This daemon creates virtual networks on demand, for each application tier and connects them together using the “Labels” present in the deployment template.

A plugin script running on the minion then connects the container veth-pair to the OpenContrail vrouter rather than the docker0 bridge.

The network manager daemon logic is doing the following:

  • For each collection (i.e. group of Pods managed by an RC) it creates a virtual-network. All these virtual networks are addressed out of the cloud private space (10.0.0.0/16 in my example).
  • Each Pod is assigned a unique address in the Private space (10.0.x.x) and by default can only communicate with other Pods in the same collection.
  • When a service is defined over a collection of Pods, that service implies the creation of a new virtual network in the services space (a.k.a Portal network in kubernetes).
  • Each pod in a service is assigned the floating-ip address corresponding to the PortalIP (i.e. the service VIP); thus traffic sent to the service will be equal cost load balanced across the multiple back ends.
  • In the k8petstore example, the collections use the kubernetes labels “name” and “uses” to specify what tiers communicate with each other; the network manager automatically creates network access control policies that allow the respective Pods to communicate. The network policies are being provisioned such that when a collection X has a deployment annotation that it “uses” collection Y, then X is allowed to communicate with Y’s virtual IP address.

The current prototype is very interesting in terms of highlighting how a tool like kubernetes makes application deployment easy, fast and reproducible; and how network micro-segmentation can fit in in a way that is transparent to the application owner and provides isolation and access control.

The OpenContrail kubernetes network-manager can automate the deployment of the network since it is exposed to the collection (RC) and service definitions. While advanced users may want to customize the settings, the defaults can be more useful and powerful when compared with an API such as AWS VPC and/or OpenStack Neutron.

One important difference, from a usage perspective, vs the traditional OpenStack + OpenContrail deployments is that in the kubernetes prototype the system is simply allocating private IP addresses that are unique within the cloud, while isolating the pods of different collections from each other. For instance, in our example, if the frontend Pod has the address 10.0.0.2, redis-master has the private address 10.0.0.3 and the VIP (aka Portal IP) is 10.254.42.1, the topology is set up such that:

  • The frontend network contains 10.0.0.2 (but is unable to forward traffic to 10.0.0.3);
  • The frontend network is connected to the redis-master service network (which contains the floating-ip address 10.254.42.1).
  • The redis-master network contains 10.0.0.3.

Traffic from the front-end to the service VIP is forwarded to the OpenContrail vrouter in the minion where the service is executing (with an ECMP decision if multiple replicas are running). The destination IP is then translated to the private IP of the redis-master instance. Return traffic flows through the redis-master service network which has routing tables entries for the frontend network.

With OpenContrail, the kubernetes cloud no longer uses the “kube-proxy”. The default kubernetes deployment uses a TCP proxy between the host and the container running in a private address on the docker0 bridge. This creates a need for the service/pod definition to allocate host ports and prevents 2 docker containers that want the same host port from executing on the same minion. OpenContrail removes this restriction; the host network is completely out of the picture of the virtual networks used by the pods and services.

I’m looking forward to rewriting the prototype of the network-manager daemon over the next few weeks and adding additional functionality, such as the ability to deploy source NAT for optional outbound access from the containers to the internet, as well as LBaaS for scenarios where fine-grained control of the load-balancing decision is desirable.

The trajectory of Open Daylight

When the Open Daylight project started, it was clear that the intent on the part of IBM and RedHat was to replicate the success of Linux.

Linux is today a de-facto monopoly in server operating systems. It is monetized by Redhat (and in smaller part by Canonical) and it essentially allowed the traditional I.T. companies such as IBM, Oracle and HP to neutralize Sun Microsystems, which was, in the late 90s and early 2000s, the platform of choice for Web application deployment.

Whether the initial target of these companies was Sun or Microsoft, the fact is that, by coming together in support of an open source project that had previously been a hobby of university students, they inaugurated the era of corporate open source.

This was followed by a set of successful startup companies that used open source as a way both to create a much deeper engagement with their customers and to market their products. By originating and curating an open source project, a startup can achieve a much greater reach than before. The open source product becomes a free trial license, later monetized in production deployments that typically need support. Open source also provides a way to engage with the consumers of the product by allowing them to become involved, contribute and help define the direction. It has been very effective in examples such as MySQL, Puppet Labs and Docker.

This landscape became more complex with the advent of OpenStack and the OpenStack foundation which not only includes a set of open source projects but also serves as a strong marketing organization. The OpenStack foundation, with its steep membership fees, was a mechanism by which, initially, Rackspace was able to share the costs of its marketing initiatives around its public cloud in its, until now, unsuccessful attempt to compete with Amazon Web Services (AWS).

It also created a group of very strange bedfellows. The companies investing the most in OpenStack are the giants of the I.T. world which, most likely, initially targeted AWS and VMWare as the common adversary. They are, however, finding themselves increasingly in a situation where they are most interested in competing with each other in terms of providing private cloud solutions to enterprises, either hosted or on-premise. This landscape keeps evolving, and the most interesting question at the moment is how these companies are going to be able to cooperate and compete with each other.

They must cooperate in order to create a bigger ecosystem around private cloud that can match the public cloud ecosystems of AWS, Azure and GCE. They must compete in order to differentiate their offerings. Otherwise we are left with two options for monetization: either it is services, and Mirantis gets the cake, or (less likely) installing and operating a private cloud becomes a shrink-wrapped product and Redhat wins. Either way, non-differentiation implies a winner-takes-all monopoly.

It was at the height of the OpenStack euphoria that Open Daylight was conceived. The assumption of its I.T.-centric founders was that the networking vendors, which have the domain expertise, would collectively develop the software, to be then monetized as part of an OpenStack distribution as software support.

I had the opportunity to ask one of the people involved in the creation of the project for his thoughts on monetization and I got a clear answer: “we expect network vendors to monetize switches just like server manufacturers monetize NICs, video cards, storage”.

It is not surprising that the most technically savvy of the network vendors have adopted an approach of participating in ODL, in order to be aware of what is going on, but focus their energies on other strategic initiatives.

Network gear is a software business. Switches themselves are mostly manufactured by ODMs using third-party silicon. Routers, especially service provider routers, require much more flexible forwarding engines than switches, typically network processors. But those too are available from Broadcom (although with lower capacity and capability than the special-purpose silicon of top vendors).

Being a software business allows network vendors to hit gross margins of 60%+. ODMs have margins that are much much lower. A networking company must have an independent (and differentiated) software strategy. This is a business imperative.

Some vendors, typically the weakest in their engineering resources, have tried to build such a strategy on top of ODL: use ODL as a marketing machine, since it has a budget at least an order of magnitude higher than one would have for “SDN” within a networking vendor. At first glance this sounds like a sensible strategy: build your own differentiated product wrapped in an ODL-marketing aura.

The problem is that it transformed ODL into a frankenstein of multiple vendor projects with very little or no relationship to each other. The several controllers being driven by the multiple vendors participating in ODL are all different. One vendor had recent press where they were describing 3 distinct controllers. Given the lack of commonality between the goals of these different projects, it is completely unclear what ODL is at the technical level. Which is not entirely surprising, given that the term “SDN” has no technical definition in itself.

At this point ODL becomes a brand, not an open source project. And a tainted one at that. Given that none of the multiple controllers can currently be seen in any qualification testing in either carrier or enterprise, one can only conclude that there isn’t a single one that quite works yet. Software engineering projects are built on timeframes that are longer than the hype cycles generated by a large marketing machine. This inevitably leads to disappointment.

That disappointment is growing and will likely snowball.

The meaning of Cloud

The term “Cloud” refers to a software development and delivery methodology that consists of decomposing applications into multiple services (a.k.a. “micro-services”) such that each service can be made resilient and scaled horizontally, by running multiple instances of each service. “Cloud” also implies a set of methodologies for application delivery (how to host the application) and application management (how to ensure that each component is behaving appropriately). The name, “Cloud”, seems to only capture the delivery piece of the equation, which is the proverbial tip of the iceberg.
An example that would help us break down the jargon into something a bit more concrete: a simple web application that lets a user add entries to a database table (e.g. “customers”) and perform some simple queries over this table. This is what is known as a CRUD application, from the initials of Create, Read, Update, Delete.
The “classic” version of this application would be a VisualBasic/Access (in the pre-Web days), .NET/SQLServer or Ruby on Rails/MySQL application. The software component is responsible for generating a set of forms/web pages for the user to input data, executing some validation and accessing the database. In most application development frameworks (e.g. RoR), this example can be made to work in hours or a few days.
One minor issue with our example above is that, typically, not every user in an organization has the same level of access. Even when that is the case, there is often the need to audit who does what. Thus our application also needs to have a “user” table and some sort of access control. Not a big deal: a couple more forms and a new database table of users is created.
Until someone else creates another CRUD application (e.g. to manage inventory) that also needs users and access control rules. Clearly, components that are common to multiple applications should be shared. Let’s assume that our developers built a simple web API that can use an LDAP backend for authentication and manages a set of access control list rules. Both our CRUD applications can use this authentication service to go from username/password to a cookie, and then query the authorization information (from cookie to access permissions) within the application.
By now we have a reasonable description of what a simple “classic” application looks like from a development standpoint. In our example, each of our CRUD applications, as well as the authentication service, consists of a single VM built manually by the development team. These VMs are then handed off to the system administration group, which configures monitoring and network access.
The above roughly describes the state of the art in many enterprises; except that the number and complexity of the applications is significantly larger. And that “customer” and “inventory” applications are actually not typically developed in house; these are often components of CRM software suites built by third parties. They only serve in our story as examples.
The key issues with our “classic” application are:
  •   Reliability
  •   Scale
  •   Agility
Of these three, scale is often not the major concern unless the application is being delivered to a large audience. That is the case in both consumer and SaaS markets but less so in enterprise. We can think of scale in terms of the number of concurrent sessions that the application needs to serve. In many cases this number is low and does not warrant a significant investment.
Reliability comes from two different vectors: the correctness of the software itself (which we can think of a function of the test coverage); and the availability of the infrastructure.
The “classical” approach to infrastructure availability has been to try to make the infrastructure as resilient to failure as possible. A significant factor behind this approach is the handoff point between software and infrastructure management. If those responsible for running the application (infrastructure teams) are not aware of the design or availability requirements of the application they can only assume worst case scenario.
For the infrastructure to completely mask power, network, server and disk failures without understanding the application semantics is so prohibitively expensive as to be considered practically impossible. Still, infrastructure teams are typically measured in terms of uptime. They attempt to mask single disk failures and single server failures with virtual machine restart, which does have impact to the application. Network card or switch failures can be masked with server link redundancy also. It is common to have a goal of 99.999% availability.
That begs the question of what happens 0.001% of the time, which corresponds statistically to roughly 8 hours per year. The problem with statistical averages is that as the number of application servers increase so do the failures. Assuming 1000 application servers and a perfect distribution of failure, one can assume that there is at least 1 failure occurring at any particular point in time, despite the significant resource and performance cost of infrastructure based availability.
It also turns out that masking failures ends up making the impact of a failure worse from a software reliability perspective. Events that happen less frequently may not be tested, which may then lead to catastrophic failures such as an application losing transactions or leaving data in an invalid state.
The “cloud” approach to availability is to expose a (micro)service directly to infrastructure failures but hide them at the (micro)service level; this means that the authentication service in our example above would be responsible for serving its APIs successfully independent of data-center, power, server or disk failures. It goes further: it stipulates that one should purposely trigger failures on the production infrastructure in order to verify that the services are still operational.
Google, for instance, simulates large scale disasters yearly in order to ensure that its services remain operational in the event of a major disaster such as an earthquake or other large natural disaster that could affect its infrastructure. Netflix created a software tool called “chaos monkey” whose job is to randomly kill production servers as well as produce other types of havoc.
This is not as crazy as it seems: users care about the total availability of the system of which software reliability is the most important component. Application software is more complex in terms of functionality than the infrastructure it runs upon and thus more prone to failure.
The financial crisis of 2008 highlighted the “black swan” effect: the consequences of events with very low probability but with catastrophic effects, which tend to disappear in statistical risk models such as 99.999% availability. The “cloud” philosophy is to increase the probability of failure in order to avoid “black swans”.
One reasonable criticism of this approach is that it creates more complex software architectures with additional failure modes.
Perhaps instead of discussing “cloud” one should focus on modern software engineering practices. These have the goal of taking software from an ad-hoc artisan mindset and transforming it into a first class engineering discipline. Modern software engineering has the dual goals of maximizing agility and reliability; its cornerstone is testing. And testing requires a very different hand-off and service model between developers and infrastructure.
Modern software engineering practices typically imply release cycles within a range from 2 weeks to 3 months. The intent is to release to production incremental sets of changes that are tested and from which one can gather real-world feedback.
Software is expected to be tested in:
  • unit test
  • integration test
  • system test
  • Q/A and staging
  • A/B testing
While unit and integration tests happen on developer workstations (or in a cloud application that pre-verifies all proposed commits), system test, staging, A/B testing and troubleshooting require the ability to create production-like application environments that exactly mimic the production configuration. Testing against triggered infrastructure failures is typically a requirement of both system and Q/A testing.
A software release cycle implies a carousel of application execution environments that execute simultaneously. If release X is the stable release in production, there may be release X+1 soaking in production for A/B testing; potentially multiple environments running release X+2 for a Q/A environment with simulated traffic and system test environments on the pre-released version X+3. Development may need to go back and create a system test environment for an arbitrary version previous to X in order to do troubleshooting.
This development methodology requires that all interactions with the infrastructure are based on version-controlled deployment templates (e.g. CloudFormations, Heat, etc.) that exercise an API. Trouble tickets or GUIs are not desirable ways to interact with the infrastructure because they do not provide a repeatable and version-controlled method to describe the resources that are in play.
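As an illustration, such a template can be as small as the following minimal OpenStack Heat sketch; the image and flavor names are placeholders:

heat_template_version: 2015-04-30

description: Minimal example of a version-controlled deployment template.

parameters:
  flavor:
    type: string
    default: m1.small          # placeholder flavor name

resources:
  app_server:
    type: OS::Nova::Server
    properties:
      image: ubuntu-14.04      # placeholder image name
      flavor: { get_param: flavor }
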
In summary, cloud is the result of the prescribed approach of modern software engineering practices that attempt to improve reliability and agility for software. The main driver to adopt a cloud infrastructure is to serve a community of application developers and their requirements.