Authenticated access to Kubernetes pods

When running a microservices-style application in a public cloud, one of the problems to solve is how to provide access to debug information. At Laserlike, we run our application stack on GKE. Most of the stack consists of golang pods that run an HTTP listener serving /debug and /metrics handlers.

For metrics scraping we use prometheus, and grafana for visualization. Our grafana server is a NodePort service behind a GCE load balancer that uses OAuth2-based authentication for access. This still leaves a gap in terms of access to pod debug information such as /debug/vars or /debug/pprof.

In order to address this gap, we created a simple HTTP proxy for kubernetes services and endpoints. We deploy this proxy behind an OAuth2 authenticator, which is then exposed via an external load balancer.

The service proxy uses the kubernetes client library to consume annotations on service objects. For example, the following annotation instructs the service proxy to expose the debug port of the endpoints of the specified service:

    k8s-svc-proxy.local/endpoint-port: "8080"
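A minimal sketch of how this annotation handling can be modeled (the helper name and the handling of missing annotations are my own, not the actual proxy code):

```python
# Hypothetical sketch: extract the proxy port from a service's annotations.
ANNOTATION = "k8s-svc-proxy.local/endpoint-port"

def endpoint_port(annotations):
    """Return the debug port to expose, or None if the service is not annotated."""
    value = (annotations or {}).get(ANNOTATION)
    return int(value) if value is not None else None

# A service annotated as above would be proxied on port 8080:
print(endpoint_port({ANNOTATION: "8080"}))  # 8080
print(endpoint_port({}))                    # None
```

In the real proxy the annotations would come from service objects watched via the kubernetes client library; here they are plain dictionaries for illustration.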

The landing page on the proxy then displays the set of proxied endpoints.



k8s + opencontrail on AWS

For anyone interested in running a testbed with Kubernetes and OpenContrail on AWS, I managed to boil down the install steps to the minimum:

  • Use AWS IAM to create a user and download the “credentials.csv” file.
  • Check out the scripts via `git clone`.
  • Change to the “test/ec2-k8s” directory.
  • Set up environment variables with your EC2 IAM credentials. The script can be used for this purpose.
  • Follow the steps in the test script:
    • ansible-playbook -i localhost playbook.yml
    • Extract the deployer hostname from the file `cluster.status`; this is the inventory entry in the `[management]` group.
    • Login into the deployer hostname and execute:
      • ansible-playbook -i src/contrib/ansible/inventory src/contrib/ansible/resolution.yml
      • ansible-playbook -i src/contrib/ansible/inventory src/contrib/ansible/cluster.yml
      • ansible-playbook -i src/contrib/ansible/inventory src/contrib/ansible/validate.yml
      • ansible-playbook -i src/contrib/ansible/inventory src/contrib/ansible/examples.yml

This will:

  • Create 5 VMs in a VPC on AWS;
  • Run the ansible provisioning script that installs the cluster;
  • Run a minimal sanity check on the cluster;
  • Launch an example;
  • Fetch the status page of the example app in order to check whether it is running successfully.


Container networking: To overlay or not to overlay

One of the key decisions in designing a compute infrastructure is how to handle networking.

For platforms that are designed to deliver applications, it is now common knowledge that application developers need a platform that can execute and manage containers (rather than VMs).

When it comes to networking, however, the choices are less clear. In what scenarios are single-layer designs preferable to overlay networks?

The answer to this question is not a simplistic one based on “encapsulation overhead”; while there are overlay networking projects that do exhibit poor performance, production-ready solutions such as OpenContrail have throughput and PPS characteristics similar to those of the Linux kernel bridge implementation. When not using an overlay, it is still necessary to use an internal bridge to demux the container virtual-ethernet interface pairs.

The key aspect to consider is operational complexity!

From a bottom-up perspective, one can build an argument that a network design with no encapsulation, which simply uses an address prefix per host (e.g. a /22), provides the simplest possible solution to operate. And that is indeed the case if one assumes that discovery, failover and authentication can be handled completely at the “session” layer (in OSI model terms).
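For illustration, carving a per-host /22 out of a cluster-wide /16 (the address block is made up) shows how simple such a scheme is:

```python
import ipaddress

# Sketch: one /22 per host allocated out of a cluster-wide /16,
# as in the no-encapsulation design described above.
cluster = ipaddress.ip_network("10.128.0.0/16")
per_host = list(cluster.subnets(new_prefix=22))

print(len(per_host))                  # 64 hosts can be addressed
print(per_host[0])                    # 10.128.0.0/22
print(per_host[1])                    # 10.128.4.0/22
print(per_host[0].num_addresses)      # 1024 addresses per host
```

The routing table on the fabric then needs only one route per host, which is what makes this design operationally trivial.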

I’m familiar with a particular compute infrastructure where this is the case: all communication between containers uses a common “RPC” layer which provides discovery, authentication and authorization and a common presentation layer.

In the scenario I’m familiar with, this works well because every single application component was written from scratch.

The key takeaway for me from observing this infrastructure operate was not really whether an overlay is used or not. The key lesson, in my mind, is that it is possible to operate the physical switching infrastructure independently of the problems of discovery, application failover and authentication. I believe that those with a background in operations of traditional enterprise networking environments can fully appreciate how decoupling the switching infrastructure from these problems can lead not only to simpler operations but also to driving up the throughput of the network fabric.

In environments where not all application components use the same common RPC layer or where a “belt and suspenders” approach is desirable, discovery, authorization and failover are a function that the infrastructure is expected to deliver to the applications.

Whenever that is the case, using an overlay network design provides the simplest (from an operations standpoint) way to implement this functionality because it offers a clear separation between the application layer and the physical infrastructure.

In the work we’ve been doing with OpenContrail around kubernetes/openshift, OpenContrail provides:

  • access control between application tiers and services (a.k.a. micro-segmentation);
  • service discovery and failover, by mapping the service’s virtual IP address (a.k.a. ClusterIP) to the instances (Pods) that implement the service;
  • traffic monitoring and reporting on a per app-tier basis.

While OpenContrail introduces its own software stack, which has an inherent operational complexity, this functionality can be operated independently of the switching infrastructure. And it is totally agnostic to the switching infrastructure: it operates the same way on a private cloud environment or on any public cloud.

Using an overlay allows OpenContrail to carry the “network context” information with the packet, for purposes of both access control and traffic accounting and analysis. It also provides a clean separation between the virtual IP address space and the physical infrastructure addressing.

For instance, address pools do not have to be pre-allocated on a per-node basis. And while IPv6 can bring an almost infinite address pool size, that doesn’t address the problem of discovery and failover for the virtual IP addresses associated with a service (a.k.a. ClusterIPs in k8s).

In addition to this, standards-based overlays such as OpenContrail can carry the information of the virtual network segment across multiple clusters or different infrastructure. A simple example: one can expose multiple external services (e.g. an Oracle DB and a DB2 system) to a single cluster while maintaining independent access control (that doesn’t depend on listing individual IP addresses).

From an operational perspective, it is imperative to separate the physical infrastructure, which is probably different on a cluster-by-cluster basis, from the services provided by the cluster. When discovery, authorization and failover are not a pure session-layer implementation, it makes sense to use a network overlay.

kubernetes + opencontrail install

In this post we walk through the steps required to install a 2 node cluster running kubernetes that uses opencontrail as the network provider. In addition to the 2 compute nodes, we use a master and a gateway node. The master runs both the kubernetes api server and scheduler as well as the opencontrail configuration management and control plane.

OpenContrail implements an overlay network using standards-based network protocols.

This means that, in production environments, it is possible to use existing network appliances from multiple vendors that can serve as the gateway between the un-encapsulated network (a.k.a. underlay) and the network overlay. However for the purposes of a test cluster we will use an extra node (the gateway) whose job is to provide access between the underlay and overlay networks.

For this exercise, I decided to use my MacBook Pro, which has 16G of RAM. However, all the tools used are also supported on Linux; it should be relatively simple to reproduce the same steps on a Linux machine or on a cloud such as AWS or GCE.

The first step in the process is to obtain the binaries for kubernetes release-1.1.1. I unpacked the tar file into ~/tmp and then extracted the linux binaries required to run the cluster using the command:

cd ~/tmp;tar zxvf kubernetes/server/kubernetes-server-linux-amd64.tar.gz

In order to create the 4 virtual machines required for this scenario I used VirtualBox and vagrant. Both are trivial to install on OS X.

In order to provision the virtual machines we use ansible. Ansible can be installed via “pip install ansible”. I then created a default ansible.cfg that enables the pipelining option and disables ssh connection sharing. The latter was required to work around failures on tasks that use “delegate_to” and run concurrently (i.e. run_once is false). From a cursory internet search, it appears that the openssh server that ships with ubuntu 14.04 has a concurrency issue when handling multi-session.



[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=no -o ControlPersist=60s

With ansible and vagrant installed, we can proceed to create the VMs used by this testbed. The vagrant configuration for this example is available in github. The servers.yaml file lists the names and resource requirements for the 4 VMs. Please note that if you are adjusting this example to run in a different vagrant provider, the Vagrantfile needs to be edited to specify the resource requirements for that provider.

After checking out this directory (or copying over the files) the VMs can be created by executing the command: vagrant up

Vagrant will automatically execute config.yaml which will configure the hostname on the VMs.

The Vagrantfile used in this example will cause vagrant to create VMs with 2 interfaces: a NAT interface (eth0) used by the ssh management sessions and for external access, and a private network interface (eth1) providing a private network between the host and the VMs. OpenContrail will use the private network interface; the management interface is optional and may not exist in other configurations (e.g. AWS, GCE).

After vagrant up completes, it is useful to add entries to /etc/hosts on all the VMs so that names can be resolved. For this purpose I used another ansible script, invoked as:

ansible-playbook -u vagrant -i .vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory resolution.yaml

This step must be executed independently of the ansible configuration performed by vagrant, since vagrant invokes ansible for one VM at a time, while this playbook expects to be invoked for all hosts.

The command above depends on the inventory file that vagrant creates automatically when configuring the VMs. We will use the contents of this inventory file in order to provision kubernetes and opencontrail.

With the VMs running, we need to check out the ansible playbooks that configure kubernetes + opencontrail. While an earlier version of the playbook is available upstream in the kubernetes contrib repository, the most recent version is in a development branch on a fork of that repository. Check out the repository via:

git clone

The branch HEAD commit id, at the time of this post, is 15ddfd5.

UPDATE: The OpenContrail ansible playbook is now at

I will work to upstream the updated opencontrail playbook to both the
kubernetes and openshift provisioning repositories as soon as possible.

With the ansible playbook available in the contrib/ansible directory, it is necessary to edit the file ansible/group_vars/all.yml and replace the network provider:

# Network implementation (flannel|opencontrail)
networking: opencontrail

We then need to create an inventory file:



k8s-master ansible_ssh_user=vagrant ansible_ssh_host= ansible_ssh_port=2222 ansible_ssh_private_key_file=/Users/roque/k8s-provision/.vagrant/machines/k8s-master/virtualbox/private_key


k8s-gateway ansible_ssh_user=vagrant ansible_ssh_host= ansible_ssh_port=2200 ansible_ssh_private_key_file=/Users/roque/k8s-provision/.vagrant/machines/k8s-gateway/virtualbox/private_key

k8s-node-01 ansible_ssh_user=vagrant ansible_ssh_host= ansible_ssh_port=2201 ansible_ssh_private_key_file=/Users/roque/k8s-provision/.vagrant/machines/k8s-node-01/virtualbox/private_key
k8s-node-02 ansible_ssh_user=vagrant ansible_ssh_host= ansible_ssh_port=2202 ansible_ssh_private_key_file=/Users/roque/k8s-provision/.vagrant/machines/k8s-node-02/virtualbox/private_key

This inventory file does the following:

  • Declares the hosts for the roles masters, gateways, etcd and nodes; the ssh information is derived from the inventory created by vagrant.
  • Declares the location of the kubernetes binaries downloaded from the github release;
  • Defines the IP address prefix used for ‘External IPs’ by kubernetes services that require external access;
  • Instructs opencontrail to use the private network interface (eth1); without this setting the opencontrail playbook defaults to eth0.

Once this file is created, we can execute the ansible playbook by running the script "" in the contrib/ansible directory.

This script will run through all the steps required to provision kubernetes and opencontrail; it is not unusual for the script to fail to perform some of the network-based operations (downloading the repository keys for docker, for instance, or downloading a file from github). The ansible playbook is meant to be declarative (i.e. define the end state of the system) and is supposed to be re-run if a network-based failure is encountered.

At the end of the script we should be able to login to the master via the command “vagrant ssh k8s-master” and observe the following:

  • kubectl get nodes
    This should show two nodes: k8s-node-01 and k8s-node-02.
  • kubectl --namespace=kube-system get pods
    This should show that the kube-dns pod is running; if this pod is in a restart loop, that usually means that the kube2sky container is not able to reach the kube-apiserver.
  • curl http://localhost:8082/virtual-networks | python -m json.tool
    This should display the list of virtual-networks created in the opencontrail API.
  • netstat -nt | grep 5269
    We expect 3 established TCP sessions for the control channel (xmpp) between the master and the nodes/gateway.

On the host (OSX) one should be able to access the diagnostic web interface of the vrouter agent running on the compute nodes:

These commands display the information regarding the interfaces attached to each pod.

Once the cluster is operational, one can start an example application such as “guestbook-go”. This example can be found in the kubernetes examples directory. In order for it to run successfully the following modifications are necessary:

  • Edit guestbook-controller.json in order to add the labels “name” and “uses”, as in:
  • Edit redis-master-service.json and redis-slave-service.json in order to add a service name. The following is the configuration fragment for the master:

    "metadata": {
      "labels": {
        "role": "master",

  • Edit redis-master-controller.json and redis-slave-controller.json in order to add the “name” label to the pods.

After the example is started the guestbook service will be allocated an ExternalIP on the external subnet (e.g.

In order to access the external IP network from the host, one needs to add a route to (the gateway address). Once that is done, you should be able to access the application via a web browser.

Ansible inventory, role variables and facts

I’ve been struggling a bit to understand how to use inventory, role variables and facts in the playbooks I’ve been working on (mostly around provisioning opencontrail on top of kubernetes/openshift-origin). I finally came up with a model that made sense to me. This is probably well understood by everyone else, but I couldn’t quite grok it until I worked out the following example.

User configuration options should be set:
– In group_vars/all.yml for settings that affect all hosts;
– In the inventory file, for host and group variables;

As in this example:

It is useful to establish a convention for variables that are specific to the deployment (e.g. user-settable variables). In this case I’m using flag_<var> as the convention for deployment-specific variables.

Most of these would have defaults. In order to set the defaults, the playbook defines a role variable (flag_user_<var> in this example). The playbook role then uses flag_user_<var> rather than the original flag_<var>.

Role variables can use jinja template logic operations as well as filters. The most common operation is to use a “default” filter, as in the example playbook below. But more complex logic can be built using {% if <expression> %}{% endif %} blocks.
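As an illustration of the convention (the file path and variable names here are hypothetical, not taken from the actual playbooks):

```yaml
# roles/opencontrail/defaults/main.yml (hypothetical file and variable names)
# The role consumes flag_user_interface; the user may override flag_interface
# in the inventory or in group_vars/all.yml.
flag_user_interface: "{{ flag_interface | default('eth0') }}"

# More complex logic via an {% if %} block:
flag_user_subnet: "{% if flag_subnet is defined %}{{ flag_subnet }}{% else %}10.0.0.0/16{% endif %}"
```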

Facts can then be used for variables that depend on the result of command execution. While it is possible to use set_fact in order to set variables, establishing a clear convention that facts are the result of command execution seems desirable. It may seem useful to use the set_fact action to set variables that have no dependencies on task execution, since it supports the when statement while role variables do not (although include_vars does); still, it helps to establish a simple convention that a “fact” is the result of a task observation.
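Following that convention, a fact would only record the result of a task, e.g. (the task names and variables are illustrative):

```yaml
# Hypothetical tasks: the fact records a task observation only.
- name: determine kernel release
  command: uname -r
  register: uname_result

- name: record kernel release as a fact
  set_fact:
    kernel_release: "{{ uname_result.stdout }}"
```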

My conclusion is that, for the playbooks that I write/maintain, I’m going to try to establish a set of rules before starting to actually write the tasks.

  1. Naming convention for user settable variables.
  2. Naming convention for role variables (i.e. user setting + default value).
  3. Limit set_fact to variables that depend on the outcome of task execution.

Please leave a comment if you have a different suggestion that improves maintainability of a playbook’s role specifications.

Kubernetes networking with OpenContrail

OpenContrail can be used to provide network micro-segmentation to kubernetes, providing both network isolation as well as the ability to attach a pod to a network that may have endpoints implemented using different technologies (e.g. bare-metal servers on VLANs or OpenStack VMs).

This post describes how the current prototype works and how packets flow between pods. For illustration purposes we will focus on 2 tiers of the k8petstore example on kubernetes: the web frontend and the redis-master tier that the frontend uses as a data store.

The OpenContrail integration works without modifications to the kubernetes code base (as of v1.0.0 RC2). An additional daemon, by the name of kube-network-manager, is started on the master. The kubelets are executed with the option “--network_plugin=opencontrail”, which instructs the kubelet to execute the command /usr/libexec/kubernetes/kubelet-plugins/net/exec/opencontrail/opencontrail. The source code for both the network-manager and the kubelet plugin is publicly available.

When using OpenContrail as the network implementation, the kube-proxy process is disabled and all pod connectivity is implemented via the OpenContrail vrouter module, which implements an overlay network using MPLS over UDP as the encapsulation. OpenContrail uses a standards-based control plane to distribute the mapping between endpoint (i.e. pod) and location (k8s node). The fact that the implementation is standards-compliant means that it can interoperate with existing network devices (from multiple vendors).

The kube-network-manager process uses the kubernetes controller framework to listen to changes in objects that are defined in the API and adds annotations to some of these objects. It then creates a network solution for the application, using the OpenContrail API to define objects such as virtual-networks, network interfaces and access control policies.

The kubernetes deployment configuration for this example application consists of a replication controller (RC) and a service object for the web-server and a pod and service object for the redis-master.

The web frontend RC contains the following metadata:

"labels": {
  "name": "frontend",
  "uses": "redis-master"

This metadata information is copied to each pod replica created by the kube-controller-manager. When the network-manager sees these pods it will:

  • Create a virtual-network with the name <namespace:frontend>
  • Connect this network with the network for the service <namespace:redis-master>
  • Create an interface per pod replica with a unique private IP address from a cluster-wide address block (e.g. 10.0/16).
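The naming scheme above can be sketched as follows (the function names are mine, not the actual kube-network-manager code):

```python
# Sketch of the virtual-network naming described above.
def network_name(namespace, labels):
    """Virtual-network name derived from a pod's "name" label."""
    return "{}:{}".format(namespace, labels["name"])

def service_network_name(namespace, service):
    """Virtual-network name created for a service."""
    return "{}:service-{}".format(namespace, service)

print(network_name("default", {"name": "frontend"}))   # default:frontend
print(service_network_name("default", "redis-master")) # default:service-redis-master
```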

The kube-network-manager also annotates the pods with the interface uuid created by OpenContrail as well as the allocated private IP address (and a mac-address). These annotations are then read by the kubelet.

When the pods are started, the respective kubelet invokes the plugin script. This script removes the veth-pair associated with the docker0 bridge and assigns it to the OpenContrail vrouter kernel module executing on each node. The same script notifies the contrail-vrouter-agent of the interface uuid associated with the veth interface and configures the IP address inside the pod’s network namespace.

At this stage each pod has a unique IP address in the cluster but can only communicate with other pods within the same virtual-network. Subnet broadcast and IP link-local multicast packets will be forwarded to the group of pods that are present in the same virtual-network (defined by the “” tag).

OpenContrail assigns a private forwarding table to each pod interface. The veth-pair associated with the network namespace used by docker is mapped into a table which has routing entries for each of the other pod instances that are defined within the same network or networks this pod has authorized access to. The routing tables are computed centrally by the OpenContrail control-node(s) and distributed to each of the compute nodes where the vrouter is running.

The deployment defines a service associated with web frontend pods:

  "kind": "Service",
  "metadata": {
    "name": "frontend",
    "labels": {
      "name": "frontend"
  "spec": {
    "ports": [{
      "port": 3000
    "selector": {
      "name": "frontend"

The “selector” tag specifies the pods that belong to the service. The service is then assigned a “ClusterIP” address by the kube-controller-manager. The ClusterIP is a unique IP address that can be used by other pods to consume the service. This particular service also allocates a PublicIP address that is accessible from outside the cluster.

When the service is defined, the kube-network-manager creates a virtual-network for the service (with the name of <namespace:service-frontend>) and allocates a floating-ip address with the ClusterIP specified by kubernetes. The floating-ip address is then associated with each of the replicas.

In the k8petstore example, there is a load-generator tier defined by an RC with the following metadata:

        "labels": {
          "name": "bps",
          "uses": "frontend"

The network-manager process interprets the “uses” tag as an implicit authorization for the “bps” network to access the “service-frontend” network which contains the ClusterIP. That is the mechanism that causes the ClusterIP address to be visible in the private routing tables that are associated with the load-generator pods.
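A sketch of this implicit-authorization rule (the names and return structure are illustrative, not the actual implementation):

```python
# Sketch: derive the implicit network policy from the "uses" label.
def implied_policy(namespace, labels):
    """Return a (client-network, service-network) pair, or None if no "uses" label."""
    if "uses" not in labels:
        return None
    client = "{}:{}".format(namespace, labels["name"])
    service = "{}:service-{}".format(namespace, labels["uses"])
    return (client, service)

print(implied_policy("default", {"name": "bps", "uses": "frontend"}))
# ('default:bps', 'default:service-frontend')
```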

When traffic is sent to this ClusterIP address, the sender has multiple feasible paths available (one per replica). It chooses one of these based on a hash of the 5-tuple of the packet (IP source, IP destination, protocol, source port, destination port). Traffic is sent encapsulated to the destination node such that the destination IP address of the inner packet is the ClusterIP. The vrouter kernel module in the destination node then performs a destination NAT operation on the ClusterIP, translating this address to the private IP of the specific pod.
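The ECMP selection can be illustrated as follows; the real vrouter uses its own hash function, so this only demonstrates the mechanism (the IP addresses are made up):

```python
import hashlib

# Sketch: equal-cost path selection based on a hash of the packet 5-tuple.
def pick_path(paths, src_ip, dst_ip, proto, src_port, dst_port):
    key = "{}|{}|{}|{}|{}".format(src_ip, dst_ip, proto, src_port, dst_port)
    digest = hashlib.md5(key.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(paths)
    return paths[index]

replicas = ["k8s-node-01", "k8s-node-02"]
# The same flow always hashes to the same replica:
a = pick_path(replicas, "10.0.1.5", "10.254.0.1", 6, 40000, 3000)
b = pick_path(replicas, "10.0.1.5", "10.254.0.1", 6, 40000, 3000)
print(a == b)  # True
```

Because the hash is per-flow rather than per-packet, all packets of a TCP connection take the same path, which keeps the flow-pair state on the destination node consistent.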

A packet sent by a load-generator pod to the ClusterIP of the web frontend goes through the following steps:

  1. The packet is sent by the IP stack in the container with SourceIP=”load-gen private IP”, DestinationIP=ClusterIP. This packet is sent to eth0 inside the container network namespace, which is a Linux veth-pair interface.
  2. The packet is delivered to the vrouter kernel module; a route lookup is performed for the destination IP address (ClusterIP) in the private forwarding table “bps”.
  3. This route lookup returns an equal cost load balancing next-hop (i.e. a list of available paths). The ECMP algorithm selects one of the available paths and encapsulates the traffic such that an additional IP header is added to the packet with SourceIP=”sender node address”, DestinationIP=”destination node address”; additionally an MPLS label corresponding to the destination pod is added to the packet.
  4. Packet travels in the underlay to the destination node.
  5. The destination node strips the outer headers and performs a lookup on the MPLS label and determines that the destination IP address is a “floating-ip” address and requires NAT translation.
  6. The destination node creates a flow-pair with the NAT mapping of the ClusterIP to the private IP of the destination pod and modifies the destination IP of the payload.
  7. Packet is delivered to the pod such that the source IP is the unique private IP of the source pod and the destination IP is the private IP of the local pod.

The service definition for the web front-end also specifies a PublicIP. This address is implemented as a floating-ip address, like the ClusterIP, except that the floating-ip is associated with a network that spans the cluster and the outside world. Typically, OpenContrail deployments configure one or more “external” networks that map to a virtual network on external network devices such as a data-center router.

Traffic from the external network is also equal-cost load balanced to the pod replicas of the web frontend. The mechanism is the same as described above, except that the ingress device is a router rather than a kubernetes node.

To finalize the walk-through of the k8petstore example, the redis-master service defines:

  "kind": "Service",

  "metadata": {
    "name": "redismaster",
    "labels": {
      "name": "redis-master"
  "spec": {
    "ports": [{
      "port": 6379
    "selector": {
      "name": "redis-master"

Since the web frontend pods contain the label "uses": "redis-master", the network-manager creates a policy that connects the clients (frontend pods) to the service ClusterIP. This policy can also limit the traffic, allowing access only to the ports specified in the service definition.

There remains additional work to be done in this integration, but I do believe that the existing prototype shows how OpenContrail can be used to provide an elegant solution for micro-segmentation that can both provide connectivity outside the cluster and pass a security audit.

From an OpenContrail perspective, the delta between a kubernetes and an OpenStack deployment is that in OpenStack the Neutron plugin provides the mapping between Neutron and OpenContrail API objects while in kubernetes the network-manager translates the pod and service definitions into the same objects. The core functionality of the networking solution remains unchanged.

Static routes

OpenContrail allows the user to specify a static route with a next-hop of an instance interface. The route is advertised within the virtual-network that the interface is associated with. This script can be used to manipulate the static routes configured on an interface.

I wrote it in order to set up a cluster in which overlay networks are used hierarchically. The bare-metal nodes are running OpenStack, using OpenContrail as the neutron plugin; a set of OpenStack VMs are running a second overlay network using OpenContrail, with kubernetes as the compute scheduler.

In order to provide external access for the kubernetes cluster, one of the kubernetes node VMs was configured as an OpenContrail software gateway.

This is easily achievable by editing /etc/contrail/contrail-vrouter-agent.conf to include the following snippet:

# Name of the routing_instance for which the gateway is being configured

# Gateway interface name

# Virtual network ip blocks for which gateway service is required. Each IP
# block is represented as ip/prefix. Multiple IP blocks are represented by
# separating each with a space

The vgw interface can then be created via the following sequence of shell commands:

ip link add vgw type vhost
ip link set vgw address 00:00:5e:00:01:00
ip link set vgw up
ip route add dev vgw

The interface-route script can then be used to add a static route to the IP prefix configured in the software gateway interface. This route should be added to an interface (e.g. neutron port) associated with the VM that is running the software gateway functionality and in a network that is externally connected.

This allows the nested overlay to be accessed from outside the cluster. For redundancy, multiple VMs can be configured with a gateway interface and the corresponding static route.