Running pods on master nodes in RKE

You may run into situations where you need to run pods on your K8s master node. If you're using RKE, you need to remove two taints from the controller/master node.

You can obtain the current labels using

kubectl describe node NODENAME

and look under the Labels section:

Labels: beta.kubernetes.io/arch=amd64
        beta.kubernetes.io/os=linux
        kubernetes.io/arch=amd64
        kubernetes.io/hostname=lab-1
        kubernetes.io/os=linux
        node-role.kubernetes.io/controlplane=true
        node-role.kubernetes.io/etcd=true

You then need the following two commands to remove the corresponding taints (the trailing dash removes a taint):

kubectl taint nodes node-1 node-role.kubernetes.io/etcd-
kubectl taint nodes node-1 node-role.kubernetes.io/controlplane-
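
To confirm the taints are gone, a quick check (node-1 is assumed to be your node name, as in the commands above):

kubectl describe node node-1 | grep -i taints

This should print Taints: <none>, after which the scheduler will consider the node for regular pods.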

Happy Kuberneting!

Apache Openwhisk with Kubespray

If you're looking into serverless computing, you have probably bumped into Apache Openwhisk and Knative. Both are open-source frameworks for serverless computing that allow you to deploy event-driven microservices, called functions.

Apache Openwhisk has many deployment options, including running on top of an existing Kubernetes cluster. You can always use Kubernetes deployment tools such as RKE or Kubespray to deploy Kubernetes and later use helm charts to deploy Apache Openwhisk. I found this the most consistent way of creating Apache Openwhisk deployments for evaluation and performance analysis.

If you used Kubespray to deploy Kubernetes, note that starting with version 2.9.0, Kubespray no longer supports KubeDNS, so your Kubernetes deployment will be using CoreDNS instead. This will impact your Apache Openwhisk deployment, which by default uses KubeDNS for DNS resolution. When you deploy the helm chart for openwhisk, you will get an error like this in the nginx pod

nginx: [emerg] 1#1: host not found in resolver "kube-dns.kube-system" in /etc/nginx/nginx.conf:41
nginx: [emerg] host not found in resolver "kube-dns.kube-system" in /etc/nginx/nginx.conf:41

What you need to do is change the config to use CoreDNS by updating the k8s section in values.yaml, like this:

k8s:
  domain: cluster.local
  dns: coredns.kube-system
  persistence:
    enabled: true
    hasDefaultStorageClass: true
    explicitStorageClass: nil

You will need to redeploy the helm chart.
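
As a sketch, the redeploy with the updated values could look like the following; the release name owdev, the openwhisk namespace and the chart path are assumptions here, so substitute whatever you used originally:

helm upgrade owdev ./helm/openwhisk --namespace openwhisk -f values.yaml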

Another option, if you don't want to redeploy, is to edit the nginx ConfigMap

kubectl edit configmap -n NAMESPACE nginx

You will have to update the resolver line to look like this

resolver coredns.kube-system;

Then, you will need to restart nginx by scaling it down to 0 pods and then scaling it back up

kubectl scale deployment nginx -n NAMESPACE --replicas=0
kubectl scale deployment nginx -n NAMESPACE --replicas=1

This should fix the DNS resolution issue.

Good luck !

How NICs work? A quick dive!

I wrote this post as a draft some time ago but forgot to publish it. The reason I looked into this was to find out how DPDK physically works at the OS/device level and how it bypasses the network stack.

So, when you attach a PCIe NIC to your Linux server, you expect traffic to flow once your application sends it. But how does it actually flow, and what components are invoked in Linux to make it flow? If you are thinking of these questions, then this post is for you.

Let's first discuss the main components that allow your NIC to be recognized, registered and handled by Linux.

  • NIC FIFO buffers: As the name suggests, this is a hardware FIFO buffer inside your NIC; its purpose is simply to hold received data somewhere before passing it to the OS. The size is determined by the vendor. One tricky part here is that it's frequently named the "ring" buffer in driver code, which is accurate: FIFO buffers are a ring implementation, i.e. you start overwriting packets if you run out of space in the buffer
  • NIC driver: The driver is, by default, a kernel module. This means it lives in kernel memory space. The driver has three core functions (you can inspect a live driver with the commands shown after this list)
    • Allocate RX and TX queues: These are queues in the memory of the host server (i.e. hardware, but on your server, not your NIC). Their purpose is to contain pointers to your to-be-sent/received packets. The RX and TX queues are usually referred to as descriptor rings. The name descriptor comes from their core purpose: containing descriptors for packets, not the packets' contents.
    • Initialize the NIC: Basically, reach out to the hardware NIC registers, set their values appropriately, and during operation pass them memory addresses (contained in the RX/TX descriptor rings) telling the NIC where data to be sent/received will live
    • Handle interrupts: When your driver starts, it has to register a way to communicate with the host OS, to tell it when packets have been received. The OS also has to be able to "kick" the NIC to send ready-to-be-sent packets. This is where the driver registers "Interrupt Service Routines", aka ISRs. Depending on the generation/architecture of the NIC, there are multiple kinds of interrupts supported by Linux that the NIC may implement
  • NIC DMA engine: Responsible for copying data in/out of the NIC FIFO buffers to/from your RAM. This is a physical part of the NIC
  • NAPI: "New API", basically a polling mechanism that works on scheduled threads that handle newly arriving data. Its most common task is to get this data flowing through the network stack. NAPI relies on the concept of poll lists, which drivers register with and which are then harvested periodically, instead of continuous interrupt servicing, which is CPU heavy.
  • Network stack: Where all the OSI model happens. Except, of course, if you're using DPDK, where you don't want traffic to pass through the network stack at all
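
If you want to poke at these components on a live system, ethtool exposes most of them. A minimal sketch, assuming your interface is named eth0:

# Which driver (kernel module) is bound to the NIC
ethtool -i eth0

# Current and maximum RX/TX ring (descriptor) sizes
ethtool -g eth0

# Grow the RX ring, e.g. to absorb bursts before packets get overwritten
ethtool -G eth0 rx 4096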

A summary diagram for that is:

[NIC.png]

Let’s follow what happens when a NIC receives a packet:

  • First, the NIC will put the packet in its FIFO memory
  • Second, the NIC will use the DMA engine to fetch an RX descriptor. The RX descriptor points to the location in memory where the received packet should be stored. Descriptors only point to the location of the received data and do not contain the data itself
  • Third, once the NIC knows where to put the data, it will use the DMA engine again to write the received packet to the memory region specified in the descriptor.
  • Once the data is in the receive memory region, the NIC will raise an RX interrupt to the host OS.
  • Depending on the type of enabled interrupts, the OS either stops what it's doing to handle the interrupt (CPU heavy), or relies on a polling mechanism that runs regularly to check for new work (NAPI, aka New API), which is the default method in newer kernels
  • NAPI hands over the data in the memory region to the network stack, and that's when it goes through multiple layers of the OSI model to eventually reach a socket where your application is waiting in user space. You can watch this machinery at work with the commands below.
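
To see the interrupt and NAPI sides of this path on a running machine, two read-only checks (the interface name eth0 is an assumption):

# Per-CPU interrupt counts for the NIC's RX/TX vectors
grep eth0 /proc/interrupts

# Per-CPU softirq/NAPI statistics; the second column counts packets
# dropped because the input backlog was full
cat /proc/net/softnet_stat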

A very good go-to manual is located at:

https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data

Good luck !

PCI passthrough: Type-PF, Type-VF and Type-PCI

Passthrough has become more and more popular with time. It started initially as simple PCI device assignment to VMs and then grew to be part of the high-performance networking realm in the cloud, such as SR-IOV, host-level DPDK and VM-level DPDK for NFV.

In Openstack, if you need to pass a device on your compute hosts through to the VMs, you will need to specify that in nova.conf via the passthrough_whitelist and alias directives under the [pci] section. A typical nova.conf configuration on the controller node will look like this

[pci]
alias = { "vendor_id":"1111", "product_id":"1111", "device_type":"type-PCI", "name":"a1"}
alias = { "vendor_id":"2222", "product_id":"2222", "device_type":"type-PCI", "name":"a2"}

while on the compute host, nova.conf will look like this

[pci]
alias = { "vendor_id":"1111", "product_id":"1111", "device_type":"type-PCI", "name":"a1"}
alias = { "vendor_id":"2222", "product_id":"2222", "device_type":"type-PCI", "name":"a2"}
passthrough_whitelist = [{"vendor_id":"1111", "product_id":"1111"}, {"vendor_id":"2222", "product_id":"2222"}]

Each alias represents a device that nova-scheduler will be capable of scheduling against using the PciPassthroughFilter filter. The more devices you want to pass through, the more alias lines you will have to create.
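
Before digging into the alias fields, it is worth noting how an alias is actually consumed: you attach it to a flavor via the pci_passthrough:alias extra spec. A quick sketch, using the alias a1 from above and a hypothetical flavor name:

openstack flavor set pci.small --property "pci_passthrough:alias"="a1:1"

The :1 suffix is the number of devices of that alias to allocate per instance.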

The alias syntax is quite self-explanatory: vendor_id is unique to the device vendor, product_id is unique per device, and name is an identifier you choose for this device. Both vendor_id and product_id can be obtained via the command

lspci -nnn

You can deduce the vendor and product ids from the output as follows

000a:00:00.0 PCI bridge [0000]: Host Bridge  [1111:2222]

In this case, the vendor_id is 1111 and the product_id is 2222

But what about device_type in the alias definition? device_type can be one of three values: type-PCI, type-PF and type-VF.

type-PCI is the most generic. What it does is pass the PCI card through to the guest VM via the following mechanism:

  • IOMMU/VT-d will be used for memory mapping and isolation, such that the Guest OS can access the memory structures of the PCI device
  • No vendor driver will be loaded for the PCI device in the compute host OS
  • The Guest VM will handle the device directly using the vendor driver

When a PCI device gets attached to a qemu-kvm instance, the libvirt definition for that instance will include a hostdev element for that device. The address inside the source element is the device's location on the host, while the second address is where the device will appear inside the guest. For example:

   <hostdev mode='subsystem' type='pci' managed='yes'>
     <source>
       <address domain='0x1111' bus='0x11' slot='0x11' function='0x1'/>
     </source>
     <address type='pci' domain='0x1111' bus='0x11' slot='0x1' function='0x0'/>
   </hostdev>

The next two types are more interesting. They originated for SR-IOV capable devices, which introduce the notions of a Physical Function "PF" and Virtual Functions "VF". There's one core difference between these two types and type-PCI:

  • A PF driver is loaded for the SR-IOV device in the compute-host OS.

Let's explain the difference between type-VF and type-PF, starting with VFs:

type-VF allows you to pass a Virtual Function, which is a lightweight PCIe device that has its own RX/TX queues in case of network devices. Your VM will be able to use the VF driver, provided by the vendor, to access the VF and deal with it as a regular device for IO. VFs generally have the same vendor_id as the hardware device vendor_id, but with a different product_id specified for the VFs.

type-PF, on the other hand, refers to a fully capable PCIe device that can control the physical functions of an SR-IOV capable device, including the configuration of the virtual functions. type-PF allows you to pass the PF through to be controlled by the VMs. This is sometimes useful in NFV use-cases.

A simplified layout of PF/VF looks like this:

[SRIOV-KERNEL (2).png]

The PF driver is used to configure the SR-IOV functionality and partition the device into virtual functions that are accessed by the VMs in userspace, as the sysfs sketch below shows.
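
On Linux, this partitioning is typically driven through sysfs. A minimal sketch; the interface name eth0 and the VF count are assumptions:

# How many VFs the device supports
cat /sys/class/net/eth0/device/sriov_totalvfs

# Create 4 VFs; they will show up in lspci -nn with their own product_id
echo 4 > /sys/class/net/eth0/device/sriov_numvfs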

A nice feature of nova-compute is that it prints out the Final resource view, which contains specifics of the passed-through devices. It will look like this in the case of a PF passthrough

Final resource view: pci_stats=[PciDevicePool(count=2,numa_node=0,product_id='2222',tags={dev_type='type-PF'},vendor_id='1111')]

This says there are two devices in NUMA cell 0 with the specified vendor_id and product_id that are available for passthrough

In the case of VF passthrough:

Final resource view: pci_stats=[PciDevicePool(count=1,numa_node=0,product_id='3333',tags={dev_type='type-VF'},vendor_id='1111')]

In this case there's only one VF, with vendor_id 1111 and product_id 3333, ready to be passed through on NUMA cell 0

The blueprint for the PF passthrough type is here if you're interested

https://blueprints.launchpad.net/nova/+spec/sriov-physical-function-passthrough

Good Luck !

VNI Ranges: What do they do?

Deployment tools for Openstack have become very popular, including the very well known Openstack-Ansible. They make deploying a cloud an easy task, at the expense of losing insight into what happens behind the scenes of your deployment. If you have had to configure neutron manually, you will have come across the following section in the ml2 configuration

[ml2_type_vxlan]
# (ListOpt) Comma-separated list of <vni_min>:<vni_max> tuples enumerating
# ranges of VXLAN VNI IDs that are available for tenant network allocation.
# vni_ranges =

You probably set it to a range similar to 10:100 or 10:300, and so on.

But what does this configuration mean?

When you configure neutron to use VXLAN as the segmentation network, each tenant network gets assigned a Virtual Network Identifier "VNI". VNIs are numeric values whose allowed range you specify with the vni_ranges parameter.

An advantage of having control over this parameter is that you can cap the number of VXLANs the ml2 plugin can use. Although this seems like an advantage, it can also be a disadvantage in a dynamic environment, as you can run into situations where networks cannot be created because all allowed VNIs are consumed. If that's the case, you will get an error similar to the following in the neutron logs

Unable to create the network. No tenant network is available for allocation.

If you get that error, it means you need to increase the available ranges and restart the neutron services for the change to take effect.
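
For example, widening the range in ml2_conf.ini could look like this; the file location and service names vary per deployment, so treat them as assumptions:

[ml2_type_vxlan]
vni_ranges = 10:1000

systemctl restart neutron-server

A single tuple like this allows 991 VNIs, and you can also supply several comma-separated tuples.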

Best of Luck ! 

Port security in Openstack

Openstack Neutron provides some protections for your VMs' communications by default; these protections ensure that VMs cannot impersonate other VMs. You can easily see how this is done by checking the flow rules in an OVS deployment using:

ovs-ofctl dump-flows br-int

If you look for a certain qvo port (or the port number, depending on the deployment), you will see lines like the following

table=24, n_packets=1234, n_bytes=1234, priority=2,arp,in_port="qvo",arp_spa=10.10.10.10 actions=resubmit(,25)
table=24, n_packets=1234, n_bytes=1234, priority=0 actions=drop

Table 24 by default will drop all packets originating from a VM unless they are resubmitted to table 25. The criterion for resubmitting to table 25 is simple: the source IP of the traffic must be the one assigned to that VM. If it isn't, the packet is dropped at the end of table 24.

In addition, there's protection against changing the MAC address of the interface; it's implemented via the following rule

table=25, n_packets=1234, n_bytes=1234, priority=2,in_port="qvo",dl_src=aa:aa:aa:aa:aa:aa actions=resubmit(,60)

which basically compares the source MAC address of the packet with the expected MAC address of the VM.

In some use cases you may want to drop this protection. This can be done using

neutron port-update $PORT_ID --port-security-enabled=false
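
On deployments using the unified client, the equivalent should be the following sketch; note that neutron typically refuses to disable port security while security groups are still attached to the port, so those are cleared here as well:

openstack port set --no-security-group --disable-port-security $PORT_ID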

This will ensure there are no OpenFlow rules in br-int that drop your packets if they don't adhere to the MAC/IP requirements.

Good Luck !
Migrating VMs with attached RBDs

As the title suggests, this is a very common scenario. One thing we rarely think about, though, is the backend a volume is placed on when we create it.

When you create a volume, it is created on a cinder backend and stays attached to this backend until it's deleted or migrated to another backend. The backends are defined in the cinder configuration and are provided by your host(s) running the cinder-volume service. To find your backends, run the following command

cinder get-pools

When you attach the volume to a VM, the volume keeps its backend and relies on it for any operations on that volume. This includes migrating the VM from one host to another.

You may run into a scenario where you get this error when trying to migrate a VM with an attached RBD

 CinderConnectionFailed: Connection to cinder host failed: Unable to establish connection to

But when you go and check, Cinder is working correctly: you are able to create new volumes and attach them to instances, yet a particular VM is unable to migrate. You may also find you're unable to snapshot the volume attached to that VM. The thing to check here is the RBD backend of the volume.

You can find this using

cinder show VOLUME_ID

this will show you a lot of details about the volume, including the following attribute

| os-vol-host-attr:host | HOSTNAME@ceph#RBD |

HOSTNAME will likely be one of your controllers. You will need to go and check that the cinder-volume service is running correctly on that controller. If it's down, you can't perform any operation on that volume (snapshot, attach/detach, migrate).

If you've lost your controller forever, or you were testing a new backend that no longer exists, then you might want to migrate the volume off the dead backend. This is detailed in the following manual:

https://docs.openstack.org/cinder/pike/admin/blockstorage-volume-migration.html
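
If the old backend is truly gone, one approach is to re-point the volume's host attribute with cinder-manage instead of a full data migration; a sketch, where both host strings are assumptions you must adapt:

cinder-manage volume update_host --currenthost DEADHOST@ceph#RBD --newhost NEWHOST@ceph#RBD

This only rewrites the database record, so it is safe only when both host entries actually point to the same Ceph cluster.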

Happy VM migrations !