k8s + opencontrail on AWS

For anyone interested in running a testbed with Kubernetes and OpenContrail on AWS, I managed to boil down the install steps to the minimum:

  • Use AWS IAM to create a user and download a file “credentials.csv”
  • Checkout the scripts via `git clone https://github.com/Juniper/container-networking-ansible.git`
  • Change to the “test/ec2-k8s” directory.
  • Set up environment variables with your EC2 IAM credentials. The script credentials.sh can be used for this purpose (a sketch of the variables involved is shown below).
  • Follow the steps in the test script:
    • ansible-playbook -i localhost playbook.yml
    • Extract the deployer hostname from the file `cluster.status`; this is the inventory entry in the `[management]` group.
    • Log in to the deployer hostname and execute:
      • ansible-playbook -i src/contrib/ansible/inventory src/contrib/ansible/resolution.yml
      • ansible-playbook -i src/contrib/ansible/inventory src/contrib/ansible/cluster.yml
      • ansible-playbook -i src/contrib/ansible/inventory src/contrib/ansible/validate.yml
      • ansible-playbook -i src/contrib/ansible/inventory src/contrib/ansible/examples.yml

This will:

  • Create 5 VMs in a VPC on AWS;
  • Run the ansible provisioning script that installs the cluster;
  • Run a minimal sanity check on the cluster;
  • Launch an example;
  • Fetch the status page of the example app in order to check whether it is running successfully.
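
For reference, the credentials step above amounts to exporting the standard AWS environment variables read by the EC2 modules of ansible. The sketch below is illustrative only; the actual credentials.sh in the repository may handle credentials.csv differently:

# Illustrative sketch; the repository's credentials.sh may differ.
# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard variables
# read by the ansible EC2 modules (boto).
export AWS_ACCESS_KEY_ID='<access key id from credentials.csv>'
export AWS_SECRET_ACCESS_KEY='<secret access key from credentials.csv>'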

 

Container networking: To overlay or not to overlay

One of the key decisions in designing a compute infrastructure is how to handle networking.

For platforms that are designed to deliver applications, it is now common knowledge that application developers need a platform that can execute and manage containers (rather than VMs).

When it comes to networking, however, the choices are less clear. In what scenarios are single-layer designs preferable to overlay networks?

The answer to this question is not a simplistic one based on “encapsulation overhead”; while there are overlay networking projects that do exhibit poor performance, production-ready solutions such as OpenContrail have throughput and packets-per-second (PPS) characteristics similar to the Linux kernel bridge implementation. When not using an overlay, it is still necessary to use an internal bridge to demux the container virtual-ethernet interface pairs.

The key aspect to consider is operational complexity!

From a bottom-up perspective, one can build an argument that a network design with no encapsulation, which simply uses an address prefix per host (e.g. a /22), provides the simplest possible solution to operate. And that is indeed the case if one assumes that discovery, failover and authentication can be handled completely at the “session” layer (OSI model).

I’m familiar with a particular compute infrastructure where this is the case: all communication between containers uses a common “RPC” layer which provides discovery, authentication and authorization and a common presentation layer.

In the scenario I’m familiar with, this works well because every single application component was written from scratch.

The key takeaway for me from observing this infrastructure operate was not really whether an overlay is used or not. The key lesson, in my mind, is that it is possible to operate the physical switching infrastructure independently of the problems of discovery, application failover and authentication. I believe that those with a background in operations in traditional enterprise networking environments can fully appreciate how decoupling the switching infrastructure from these problems can lead not only to simpler operations but also to higher throughput in the network fabric.

In environments where not all application components use the same common RPC layer or where a “belt and suspenders” approach is desirable, discovery, authorization and failover are a function that the infrastructure is expected to deliver to the applications.

Whenever that is the case, using an overlay network design provides the simplest (from an operations standpoint) way to implement this functionality because it offers a clear separation between the application layer and the physical infrastructure.

In the work we’ve been doing with OpenContrail around kubernetes/openshift, OpenContrail provides:

  • access control between application tiers and services (a.k.a. micro-segmentation);
  • service discovery and failover, by mapping the service’s virtual IP address (a.k.a. ClusterIP) to the instances (Pods) that implement the service (see the example below);
  • traffic monitoring and reporting on a per-app-tier basis.
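
As a concrete illustration of the service discovery point, consider a plain kubernetes service definition (the names below are hypothetical): OpenContrail maps the ClusterIP allocated to such a service to the Pods matched by its selector, and keeps that mapping up to date as Pods fail over.

{
    "kind": "Service",
    "apiVersion": "v1",
    "metadata": {
        "name": "web",
        "labels": { "app": "store", "name": "web" }
    },
    "spec": {
        "selector": { "name": "web" },
        "ports": [ { "port": 80 } ]
    }
}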

While OpenContrail introduces its own software stack, with its inherent operational complexity, this functionality can be operated independently of the switching infrastructure, to which it is completely agnostic. It operates the same way on a private cloud environment or on any public cloud.

Using an overlay allows OpenContrail to carry the “network context” with each packet, for purposes of both access control and traffic accounting and analysis. It also provides a clean separation between the virtual IP address space and the physical infrastructure.

For instance, address pools do not have to be pre-allocated on a per-node basis. And while IPv6 can bring an almost infinite address pool, that doesn’t address the problem of discovery and failover for the virtual IP addresses associated with a service (a.k.a. ClusterIPs in k8s).

In addition to this, standards-based overlays such as OpenContrail can carry the information of the virtual network segment across multiple clusters or different infrastructure. A simple example: one can expose multiple external services (e.g. an Oracle DB and a DB2 system) to a single cluster while maintaining independent access control that does not depend on listing individual IP addresses.

From an operational perspective, it is imperative to separate the physical infrastructure, which likely differs from cluster to cluster, from the services provided by the cluster. When discovery, authorization and failover are not a pure session-layer implementation, it makes sense to use a network overlay.

kubernetes + opencontrail install

In this post we walk through the steps required to install a 2-node cluster running kubernetes that uses opencontrail as the network provider. In addition to the 2 compute nodes, we use a master and a gateway node. The master runs both the kubernetes api server and scheduler as well as the opencontrail configuration management and control plane.

OpenContrail implements an overlay network using standards-based network protocols (MPLS over GRE/UDP and VXLAN for the data plane, with BGP in the control plane).

This means that, in production environments, it is possible to use existing network appliances from multiple vendors as the gateway between the un-encapsulated network (a.k.a. underlay) and the network overlay. However, for the purposes of a test cluster we will use an extra node (the gateway) whose job is to provide access between the underlay and overlay networks.

For this exercise, I decided to use my MacBook Pro, which has 16GB of RAM. However, all the tools used are also supported on Linux; it should be relatively simple to reproduce the same steps on a Linux machine or on a cloud such as AWS or GCE.

The first step in the process is to obtain the binaries for kubernetes release 1.1.1. I unpacked the release tar file into ~/tmp and then extracted the Linux server binaries required to run the cluster using the command:


cd ~/tmp;tar zxvf kubernetes/server/kubernetes-server-linux-amd64.tar.gz
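
For completeness, the kubernetes/server tarball referenced above comes from the release archive; the download step would look roughly as follows (the URL is my assumption for the 1.1.1 release; check the kubernetes releases page):

cd ~/tmp
# Assumed download location for the v1.1.1 release archive.
curl -L -O https://github.com/kubernetes/kubernetes/releases/download/v1.1.1/kubernetes.tar.gz
tar zxvf kubernetes.tar.gz    # creates the kubernetes/ directory referenced above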

In order to create the 4 virtual machines required for this scenario I used VirtualBox and Vagrant. Both are trivial to install on OS X.

In order to provision the virtual machines we use ansible. Ansible can be installed via “pip install ansible”. I then created a default ansible.cfg that enables the pipelining option and disables ssh connection sharing. The latter was required to work around failures in tasks that use “delegate_to” and run concurrently (i.e. run_once is false). From a cursory internet search, it appears that the openssh server that ships with ubuntu 14.04 has a concurrency issue when handling multiple sessions.

 

[defaults]
pipelining=True

[ssh_connection]
ssh_args = -o ControlMaster=no -o ControlPersist=60s

With ansible and vagrant installed, we can proceed to create the VMs used by this testbed. The vagrant configuration for this example is available on github. The servers.yaml file lists the names and resource requirements for the 4 VMs. Please note that if you are adjusting this example to run with a different vagrant provider, the Vagrantfile needs to be edited to specify the resource requirements for that provider.
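
For reference, servers.yaml is simply a list of VM definitions. The sketch below is illustrative only; the VM names match this testbed but the field names and memory sizes may not match the actual file in the repository:

# Illustrative only; consult servers.yaml in the repository for the real schema.
- name: k8s-master
  memory: 2048
- name: k8s-gateway
  memory: 1024
- name: k8s-node-01
  memory: 2048
- name: k8s-node-02
  memory: 2048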

After checking out this directory (or copying over the files) the VMs can be created by executing the command: vagrant up

Vagrant will automatically execute config.yaml, which configures the hostname on the VMs.

The Vagrantfile used in this example will cause vagrant to create VMs with 2 interfaces: a NAT interface (eth0) used for the ssh management sessions and external access, and a private network interface (eth1) providing a private network between the host and the VMs. OpenContrail will use the private network interface; the management interface is optional and may not exist in other configurations (e.g. AWS, GCE).
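
In Vagrantfile terms, the private network interface is declared with something along these lines (illustrative snippet; the actual addressing is defined by the files in the repository):

# eth0 (NAT) is created implicitly by vagrant; this line adds eth1,
# the private network that OpenContrail will use.
config.vm.network "private_network", ip: "192.168.1.10"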

After vagrant up completes, it is useful to add entries to /etc/hosts on all the VMs so that names can be resolved. For this purpose I used another ansible script, invoked as:

ansible-playbook -u vagrant -i .vagrant/provisioners/ansible/inventory/vagrant_ansible_inventory resolution.yaml

This step must be executed independently of the ansible configuration performed by vagrant, since vagrant invokes ansible for one VM at a time, while this playbook expects to be invoked for all hosts.

The command above depends on the inventory file that vagrant creates automatically when configuring the VMs. We will use the contents of this inventory file in order to provision kubernetes and opencontrail as well.

With the VMs running, we need to check out the ansible playbooks that configure kubernetes + opencontrail. While an earlier version of the playbook is available upstream in the kubernetes contrib repository, the most recent version of the playbook is in a development branch on a fork of that repository. Check out the repository via:


git clone -b opencontrail https://github.com/pedro-r-marques/contrib.git

The branch HEAD commit id, at the time of this post, is 15ddfd5.

UPDATE: The OpenContrail ansible playbook is now at https://github.com/Juniper/container-networking-ansible.

I will work to upstream the updated opencontrail playbook to both the
kubernetes and openshift provisioning repositories as soon as possible.

With the ansible playbook available in the contrib/ansible directory, it is necessary to edit the file ansible/group_vars/all.yml and replace the network provider:

# Network implementation (flannel|opencontrail)
networking: opencontrail

We then need to create an inventory file:

[opencontrail:children]
masters
nodes
gateways

[opencontrail:vars]
localBuildOutput=/Users/roque/tmp/kubernetes/server/bin
opencontrail_public_subnet=100.64.0.0/16
opencontrail_interface=eth1

[masters]
k8s-master ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2222 ansible_ssh_private_key_file=/Users/roque/k8s-provision/.vagrant/machines/k8s-master/virtualbox/private_key

[etcd]
k8s-master ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2222 ansible_ssh_private_key_file=/Users/roque/k8s-provision/.vagrant/machines/k8s-master/virtualbox/private_key

[gateways]
k8s-gateway ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2200 ansible_ssh_private_key_file=/Users/roque/k8s-provision/.vagrant/machines/k8s-gateway/virtualbox/private_key

[nodes]
k8s-node-01 ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2201 ansible_ssh_private_key_file=/Users/roque/k8s-provision/.vagrant/machines/k8s-node-01/virtualbox/private_key
k8s-node-02 ansible_ssh_user=vagrant ansible_ssh_host=127.0.0.1 ansible_ssh_port=2202 ansible_ssh_private_key_file=/Users/roque/k8s-provision/.vagrant/machines/k8s-node-02/virtualbox/private_key

This inventory file does the following:

  • Declares the hosts for the roles masters, gateways, etcd and nodes; the ssh information is derived from the inventory created by vagrant.
  • Declares the location of the kubernetes binaries downloaded from the github release;
  • Defines the IP address prefix used for ‘External IPs’ by kubernetes services that require external access;
  • Instructs opencontrail to use the private network interface (eth1); without this setting the opencontrail playbook defaults to eth0.
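
Before running the full provisioning, it can be useful to confirm that ansible can reach every host; this quick sanity check is not part of the original instructions and assumes the inventory above was saved as a file named “inventory”:

ansible -i inventory all -m ping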

Once this file is created, we can execute the ansible playbook by running the script "setup.sh" in the contrib/ansible directory.
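
The invocation itself is straightforward; how the inventory file is picked up depends on the script, so treat the following as a sketch and check setup.sh for the exact mechanism:

cd contrib/ansible
# setup.sh drives the ansible-playbook runs for the whole cluster.
./setup.sh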

This script will run through all the steps required to provision kubernetes and opencontrail; it is not unusual for the script to fail on some of the network-based operations (downloading the repository keys for docker, for instance, or downloading a file from github). The ansible playbook is meant to be declarative (i.e. define the end state of the system) and is supposed to be re-run if a network-based failure is encountered.

At the end of the script we should be able to login to the master via the command “vagrant ssh k8s-master” and observe the following:

  • kubectl get nodes
    This should show two nodes: k8s-node-01 and k8s-node-02.
  • kubectl --namespace=kube-system get pods
    This command should show that the kube-dns pod is running; if this pod is in a restart loop, that usually means that the kube2sky container is not able to reach the kube-apiserver.
  • curl http://localhost:8082/virtual-networks | python -m json.tool
    This should display a list of virtual-networks created in the opencontrail API.
  • netstat -nt | grep 5269
    We expect 3 established TCP sessions for the control channel (XMPP) between the master and the nodes/gateway.

On the host (OS X) one should be able to access the diagnostic web interface of the vrouter agent running on the compute nodes; these pages display the information regarding the interfaces attached to each pod.
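
For example, assuming the default HTTP introspect port of the contrail-vrouter-agent (8085) and the private network addresses assigned by vagrant, the agent page of a node can be opened in a browser with:

# 8085 is the agent's default introspect port; replace the address with
# the node's private network (eth1) IP.
open http://192.168.1.101:8085/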

Once the cluster is operational, one can start an example application such as “guestbook-go”. This example can be found in the kubernetes examples directory. In order for it to run successfully, the following modifications are necessary:

  • Edit guestbook-controller.json, in order to add the labels “name” and “uses”, as in:
    "spec": {
      [...]
      "template": {
        "metadata": {
          "labels": {
            "app": "guestbook",
            "name": "guestbook",
            "uses": "redis"
          }
        },
        [...]
      }
  • Edit redis-master-service.json and redis-slave-service.json in order to add a service name. The following is the configuration for the master:
    "metadata": {
      [...]
      "labels": {
        "app": "redis",
        "role": "master",
        "name": "redis"
      }
    }
  • Edit redis-master-controller.json and redis-slave-controller.json in order to add the “name” label to the pods. As in:
    "spec": {
      [...]
      "template": {
        "metadata": {
          "labels": {
            "app": "redis",
            "role": "master",
            "name": "redis"
          }
        },
        [...]
      }
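
With these edits in place, the example can be started from the master with kubectl, for instance (file names as referenced above; guestbook-service.json is assumed to be the service definition that accompanies the guestbook controller):

kubectl create -f redis-master-controller.json
kubectl create -f redis-master-service.json
kubectl create -f redis-slave-controller.json
kubectl create -f redis-slave-service.json
kubectl create -f guestbook-controller.json
kubectl create -f guestbook-service.json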
    

After the example is started, the guestbook service will be allocated an ExternalIP on the external subnet (e.g. 100.64.255.252).

In order to access the external IP network from the host, one needs to add a route to the external subnet via 192.168.1.254 (the gateway address). Once that is done, you should be able to access the application with a web browser at http://100.64.255.252:3000.
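
On OS X the route can be added along these lines, assuming the 100.64.0.0/16 external subnet configured in the inventory and the gateway address mentioned above:

# Route the external-IP subnet via the gateway VM's address on the private network.
sudo route -n add -net 100.64.0.0/16 192.168.1.254
# The guestbook should then answer at its allocated external IP.
curl http://100.64.255.252:3000/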