Open Infrastructure Summit, Berlin 2020

I’m pleased to serve on the programming committee of the “Getting Started” track at the upcoming Open Infra Summit in Berlin. If you plan to submit a talk and have questions, are looking for advice, or just want to chat about your proposal, I’ll be happy to help you craft it.

My office hours are:

11:00–12:30 UTC, Thursdays, on Freenode IRC in #open-infra-summit-cfp

Good luck !

OpenStack UC and TC to Unite!

This coming month marks the last session of the OpenStack User Committee (UC). Over the years, the OpenStack community has grown, with many operators becoming directly involved in the development lifecycle. To keep pace with that change, we needed to adjust the governance model to remove the barriers between the governing bodies and, in turn, enable more involvement from operators in the various projects. Thus, starting on August 1st, the UC will unite with the Technical Committee (TC) to form a single governance body under the TC.

I am honored to have been part of the UC and to have served as its chair in this last round. I’d like to thank all current and past UC members for their efforts, which together have supported the OpenStack user community over the past years. I’d also like to thank the user community for its trust in the UC and for supporting its mission of serving and representing all OpenStack users. I’m confident that the united body will do a great job and continue to provide strong representation of the user community and serve its needs.

So long UC, and thanks for all the fish !

Cephadm: Goodbye, ceph-deploy

As you may already know, ceph-deploy, the beloved deployment utility for Ceph, is no longer maintained. Cephadm is the new tool for deploying Ceph clusters.

CERN has a pretty good introductory PDF on it. Cephadm includes many nice features, including the ability to adopt running Ceph clusters.
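
As a rough sketch of what adoption looks like (the daemon names below are placeholders; check the cephadm documentation for your release), you run cephadm on each host of the existing cluster and adopt its legacy daemons one by one:

# List the daemons cephadm can see on this host
cephadm ls

# Adopt a legacy (ceph-deploy style) monitor and OSD into cephadm's management
cephadm adopt --style legacy --name mon.node1
cephadm adopt --style legacy --name osd.3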

Two quick notes that’ll save you some time:

  • While adding hosts using

ceph orch host add hostname

you need to specify the IP of the host as follows:

ceph orch host add hostname IP.IP.IP.IP

If you get the following error despite having injected the SSH keys correctly:

Error ENOENT: Failed to connect to hostname (hostname).  Check that the host is reachable and accepts connections using the cephadm SSH key

you may want to run:

> ssh -F =(ceph cephadm get-ssh-config) -i =(ceph config-key get mgr/cephadm/ssh_identity_key) root@hostname
  • When adding OSDs

If you are deploying a cluster with a relatively “moderate” number of OSDs per host, you may run into the following error scenario while using:

 ceph orch apply osd --all-available-devices

The command basically adds the available HDDs/SSDs as OSDs in your cluster. Under the hood, this is done by running a Docker container that is in charge of each OSD; essentially, the following command is run:

/bin/bash /var/lib/ceph/{FSID}/osd.{NUM}/unit.run

It does that for every available OSD on your hosts. You may find that some of the OSDs don’t start and remain stuck in an error state, despite your attempts to restart them with

ceph orch daemon restart osd.xx

If you dig deeper (by executing a shell in the Docker container directly or looking into the logs), you will find the following self-explanatory error:

 /var/lib/ceph/osd/ceph-xx/block) _aio_start io_setup(2) failed with EAGAIN; try increasing /proc/sys/fs/aio-max-nr

The solution is simply to raise the limit on asynchronous (non-blocking) I/O requests using

sudo sysctl -w fs.aio-max-nr=1048576

If that solves your issue, add the setting to sysctl.conf so it persists across reboots; a minimal sketch follows.
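
A minimal sketch of checking, raising, and persisting the value (1048576 follows the example above; size it to your OSD count):

# Check the current limit
sysctl fs.aio-max-nr

# Raise it now, then persist it across reboots via sysctl.conf
sudo sysctl -w fs.aio-max-nr=1048576
echo "fs.aio-max-nr = 1048576" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p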

Happy cephadmining 🙂

Running pods on master nodes in RKE

You may run into situations where you need to run pods on your K8s master node. If you’re using RKE, you need to remove two taints from the controller/master node.

You can obtain the current labels using

kubectl describe node NODENAME

and look under the Labels section:

Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=lab-1
kubernetes.io/os=linux
node-role.kubernetes.io/controlplane=true
node-role.kubernetes.io/etcd=true

You then need to run the following two commands to remove the taints:

kubectl taint nodes node-1 node-role.kubernetes.io/etcd-
kubectl taint nodes node-1 node-role.kubernetes.io/controlplane-
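
To confirm the change took effect (node-1 follows the example above), check that the two taints are gone and that workloads can now land on the node:

# The Taints field should no longer list the etcd/controlplane entries
kubectl describe node node-1 | grep -A 3 -i taints

# After deploying a workload, confirm that pods are being scheduled on node-1
kubectl get pods --all-namespaces -o wide | grep node-1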

Happy Kuberneting!

Apache OpenWhisk with Kubespray

If you’re looking into serverless computing, you have probably bumped into Apache OpenWhisk and Knative. Both are open-source serverless frameworks that allow you to deploy event-driven microservices, called functions.

Apache OpenWhisk offers many deployment options, including running it on top of an existing Kubernetes cluster. You can always use Kubernetes deployment tools such as RKE or Kubespray to deploy Kubernetes, and then use Helm charts to deploy Apache OpenWhisk on top of it; a sketch of this is shown below. I found this to be the most consistent way of creating Apache OpenWhisk deployments for evaluation and performance analysis.
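
As a rough sketch, assuming Helm 3 and the upstream openwhisk-deploy-kube chart (the release name owdev, the namespace openwhisk, and the override file mycluster.yaml are just example names):

# Fetch the OpenWhisk Helm chart
git clone https://github.com/apache/openwhisk-deploy-kube.git
cd openwhisk-deploy-kube

# Deploy OpenWhisk on the running cluster with your cluster-specific overrides
helm install owdev ./helm/openwhisk -n openwhisk --create-namespace -f mycluster.yaml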

If you used Kubespray to deploy Kubernetes, note that starting with version 2.9.0, Kubespray no longer supports KubeDNS, so your Kubernetes deployment will use CoreDNS instead. This impacts your Apache OpenWhisk deployment, which by default uses KubeDNS for DNS resolution. When you deploy the Helm chart for OpenWhisk, you will get an error like this in the nginx pod:

nginx: [emerg] 1#1: host not found in resolver "kube-dns.kube-system" in /etc/nginx/nginx.conf:41
nginx: [emerg] host not found in resolver "kube-dns.kube-system" in /etc/nginx/nginx.conf:41

What you need to do is switch the configuration to CoreDNS by updating the k8s section in values.yaml, like this:

k8s:
  domain: cluster.local
  dns: coredns.kube-system
  persistence:
    enabled: true
    hasDefaultStorageClass: true
    explicitStorageClass: nil

You will then need to redeploy the Helm chart; a minimal sketch of that follows.
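
A minimal sketch of the redeploy, reusing the example release and namespace names from above:

# Re-apply the chart with the updated values
helm upgrade owdev ./helm/openwhisk -n openwhisk -f values.yaml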

Another option, if you don’t want to redeploy, is to edit the nginx ConfigMap:

kubectl edit configmap -n NAMESPACE nginx

and update the resolver line to look like this:

resolver coredns.kube-system;

Then restart nginx by scaling it down to zero pods and back up:

kubectl scale deployment nginx -n NAMESPACE --replicas=0
kubectl scale deployment nginx -n NAMESPACE --replicas=1

This should fix the DNS resolution issue; a quick way to confirm is sketched below.
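
A quick check, using the same namespace and nginx names as above:

# The resolver error should no longer appear in the fresh nginx pod's logs
kubectl logs -n NAMESPACE deployment/nginx | grep -i emerg

# And the ConfigMap should now point at CoreDNS
kubectl get configmap nginx -n NAMESPACE -o yaml | grep resolver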

Good luck !

How do NICs work? A quick dive!

I wrote this post as a draft some time ago but forgot to publish it. The reason I looked into this was to find out how DPDK physically works at the OS/device level and how it bypasses the network stack.

So, when you attach a PCIe NIC to your Linux server, you expect traffic to flow once your application starts sending. But how does it actually flow, and which Linux components are invoked to make that happen? If you are asking yourself these questions, then this post is for you.

Let’s first discuss the main components that allow your NIC to be recognized, registered, and handled by Linux. (A quick, shell-based way to inspect some of these on a live system is sketched right after the list.)

  • NIC FIFO buffers: As the name suggests, this is a hardware FIFO buffer inside the NIC; its purpose is simply to hold received data somewhere before passing it to the OS. Its size is determined by the vendor. One tricky part is that driver code frequently calls it the “ring” buffer, which is accurate: FIFO buffers are implemented as rings, i.e. packets start getting overwritten if you run out of space in the buffer.
  • NIC driver: The driver is, by default, a kernel module, which means it lives in kernel memory space. The driver has three core functions:
    • Allocating RX and TX queues: These are queues in the host server’s memory (i.e. in RAM on your server, not on your NIC). Their purpose is to hold pointers to packets that are about to be sent or received. The RX and TX queues are usually referred to as descriptor rings; the name comes from their core purpose of containing descriptors for packets rather than the packet contents themselves.
    • Initializing the NIC: Reaching out to the NIC’s hardware registers, setting their values appropriately, and, during operation, passing it memory addresses (held in the RX/TX descriptor rings) that tell it where data to be sent or received will live.
    • Handling interrupts: When your driver starts, it has to register a way to communicate with the host OS so the NIC can tell it packets have arrived; likewise, the OS has to be able to “kick” the NIC to send packets that are ready to go. This is where the driver registers interrupt service routines (ISRs). Depending on the generation/architecture of the NIC, there are multiple kinds of interrupts supported by Linux that the NIC may implement.
  • NIC DMA engine: A physical part of the NIC responsible for copying data between the NIC FIFO buffers and your RAM.
  • NAPI: The “New API”, essentially a polling mechanism that runs as deferred kernel work (softirqs) to handle newly arrived data; its most common task is to get that data flowing through the network stack. NAPI relies on the concept of poll lists, which drivers register with and which are then harvested periodically, instead of servicing every interrupt as it arrives, which is CPU heavy.
  • Network stack: Where all the OSI-model processing happens; unless, of course, you’re using DPDK, in which case you don’t want traffic to pass through the kernel network stack at all.
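
Here is a minimal inspection sketch using standard Linux tools; eth0 is a placeholder interface name, and not every driver exposes every field:

# Which kernel driver (module) is bound to the NIC, and its version
ethtool -i eth0

# Current and maximum sizes of the RX/TX descriptor rings allocated by the driver
ethtool -g eth0

# The PCIe device itself and the kernel driver that claimed it
lspci -k | grep -A 3 -i ethernet

# The interrupt vectors the driver registered for this NIC
grep eth0 /proc/interrupts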

A summary diagram:

NIC.png

Let’s follow what happens when a NIC receives a packet (a few counters you can watch along the way are sketched after the list):

  • First, the NIC puts the packet in its FIFO memory.
  • Second, the NIC uses its DMA engine to fetch an RX descriptor. The RX descriptor points to the location in memory where the received packet should be stored; descriptors only point to the location of the data and never contain the data itself.
  • Third, once the NIC knows where to put the data, it uses the DMA engine again to write the received packet to the memory region specified in the descriptor.
  • Once the data is in that memory region, the NIC raises an RX interrupt to the host OS.
  • Depending on the type of interrupts enabled, the OS either stops what it is doing to handle the interrupt (CPU heavy) or relies on a polling mechanism that runs regularly to check for new work (NAPI, aka the New API), which is the default method in newer kernels.
  • Finally, NAPI hands the data in that memory region over to the network stack, where it passes through multiple layers of the OSI model to eventually reach a socket on which your application is waiting in user space.
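
To watch this receive path on a live system, here is a rough sketch; again, eth0 is a placeholder interface name:

# Driver/hardware-level RX counters (packets, drops, ring overruns)
ethtool -S eth0 | grep -i rx

# RX interrupts firing per CPU for this NIC
grep eth0 /proc/interrupts

# NAPI/softirq side: per-CPU counts of processed packets, drops, and time squeezes
cat /proc/net/softnet_stat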

A very good go-to manual is located at:

https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data

Good luck !