How NICs work? A quick dive!

I wrote this post as a draft some time ago but forgot to publish it. The reason I looked into this was to find out how DPDK works at the OS/device level and how it bypasses the kernel network stack.

So, when you attach a PCIe NIC to your Linux server, you expect traffic to flow once your application sends it. But how does it actually flow, and which components in Linux are involved in making that happen? If you have been wondering about these questions, this post is for you.

Let’s first discuss the main components that allow your NIC card to be recognized, registered and handled by Linux.

  • NIC FIFO buffers: As the name suggests, this is a hardware FIFO buffer inside the NIC whose purpose is simply to hold received data somewhere before passing it to the OS. Its size is determined by the vendor. One tricky part is that it's frequently called the "ring" buffer in driver code, which is accurate: FIFO buffers are implemented as rings, i.e. you start overwriting packets if you run out of space in the buffer
  • NIC driver: The driver is, by default, a kernel module. This means it lives in kernel memory space. The driver has three core functions
    • Allocate RX and TX queues: These are queues in host memory (i.e. your server's RAM, not memory on the NIC). Their purpose is to contain pointers to your to-be-sent/received packets. The RX and TX queues are usually referred to as descriptor rings. The name comes from their core purpose: they hold descriptors of packets, not the packet contents themselves
    • Initialize the NIC: Basically reach out to the NIC's hardware registers, set their values appropriately, and during operation pass the NIC memory addresses (contained in the RX/TX descriptor rings) telling it where data to be sent/received lives
    • Handle interrupts: When your driver starts, it has to register a way to communicate with the host OS so it can signal that packets have arrived. The OS also has to be able to "kick" the NIC to transmit ready-to-be-sent packets. This is where the driver registers "interrupt service handlers/routines", aka ISRs. Depending on the generation/architecture of the NIC, there are multiple kinds of interrupts supported by Linux that the NIC may implement
  • NIC DMA engine: Responsible for copying data in/out of the NIC FIFO buffers to/from your RAM. This is a physical part of the NIC
  • NAPI: "New API", basically a polling mechanism that runs in scheduled kernel contexts to handle newly arrived data, most commonly to push it through the network stack. NAPI relies on the concept of poll lists that drivers register with and that are harvested periodically, instead of servicing every interrupt individually, which is CPU heavy
  • Network stack: Where all the OSI-model processing happens. Except, of course, if you're using DPDK: then you don't want traffic to pass through the network stack at all

A summary diagram for that is

NIC.png
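If you want to see some of these pieces on a live system, the driver exposes a few of them through standard tools. A small sketch, assuming your interface is named eth0 (substitute your own); the exact counters vary per driver:

# Which driver (kernel module) is bound to the interface
ethtool -i eth0

# Current and maximum RX/TX descriptor ring sizes the driver negotiated with the NIC
ethtool -g eth0

# Many drivers expose FIFO/ring drop counters among their statistics
ethtool -S eth0 | grep -i -E 'drop|fifo'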

Let’s follow what happens when a NIC receives a packet:

  • First, the NIC puts the packet in its FIFO memory
  • Second, the NIC uses the DMA engine to fetch an RX descriptor. The RX descriptor points to the location in host memory where the received packet should be stored. Descriptors only point to the location of the received data and do not contain the data itself
  • Third, once the NIC knows where to put the data, it uses the DMA engine again to write the received packet to the memory region specified in the descriptor
  • Once the data is in the receive memory region, the NIC raises an RX interrupt to the host OS
  • Depending on the type of interrupts enabled, the OS either stops what it's doing to handle the interrupt (CPU heavy), or relies on a polling mechanism that runs regularly to check for new work (NAPI, aka New API), which is the default method in newer kernels
  • NAPI hands over the data in the memory region to the network stack, and that's when it goes through the layers of the OSI model to eventually reach a socket where your application is waiting in user space
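You can watch the interrupt and NAPI side of this from user space. A quick sketch; eth0 is a placeholder and the file locations are the standard procfs ones:

# Per-CPU interrupt counts; MSI-X capable NICs usually register one vector per RX/TX queue
grep eth0 /proc/interrupts

# NET_RX softirq counts, which is the context NAPI polling runs in
grep NET_RX /proc/softirqs

# Per-CPU softnet statistics: packets processed, drops, and times the NAPI budget was exhausted
cat /proc/net/softnet_stat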

A very good go-to manual is located at:

https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data

Good luck !

PCI passthrough: Type-PF, Type-VF and Type-PCI

Passthrough has become more and more popular over time. It started out as simple PCI device assignment to VMs and then grew to be part of the high-performance networking realm in the cloud, with SR-IOV, host-level DPDK and VM-level DPDK for NFV.

In Openstack, if you need to pass through a device on your compute hosts to the VMs, you will need to specify that in nova.conf via the passthrough_whitelist and alias directives under the [pci] section. A typical nova.conf on the controller node will look like this

[pci]
alias = { "vendor_id":"1111", "product_id":"1111", "device_type":"type-PCI", "name":"a1"}
alias = { "vendor_id":"2222", "product_id":"2222", "device_type":"type-PCI", "name":"a2"}

while on the compute host, nova.conf will look like this

[pci]
alias = { "vendor_id":"1111", "product_id":"1111", "device_type":"type-PCI", "name":"a1"}
alias = { "vendor_id":"2222", "product_id":"2222", "device_type":"type-PCI", "name":"a2"}
passthrough_whitelist = [{"vendor_id":"1111", "product_id":"1111"}, {"vendor_id":"2222", "product_id":"2222"}]

Each alias represents a device that nova-scheduler will be capable of scheduling against using the PciPassthroughFilter filter. The more devices you want to pass through, the more alias lines you will have to create.
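To actually request a device, the alias is consumed through a flavor extra spec, and the PciPassthroughFilter mentioned above has to be in the scheduler's filter list. A sketch, assuming the alias a1 from the snippet above and a hypothetical flavor named pci.small:

# Request one device matching alias "a1" for every instance booted with this flavor
openstack flavor create --vcpus 2 --ram 4096 --disk 20 pci.small
openstack flavor set pci.small --property "pci_passthrough:alias"="a1:1"

Depending on your release, the scheduler's filter list is set via enabled_filters under [filter_scheduler] (older releases used scheduler_default_filters).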

The alias syntax is quite self-explanatory: vendor_id is unique to the device vendor, product_id is unique per device model, and name is an identifier that you choose for this device. Both vendor_id and product_id can be obtained via the command

lspci -nn

You can deduce the vendor and product ids from the output as follows

000a:00:00.0 PCI bridge [0000]: Host Bridge  [1111:2222]

In this case, the vendor_id is 1111 and the product_id is 2222

But what about device_type in the alias definition? Well, device_type can be one of three values: type-PCI, type-PF and type-VF.

type-PCI is the most generic. It passes the PCI card through to the guest VM via the following mechanism:

  • IOMMU/VT-d will be used for memory mapping and isolation, such that the Guest OS can access the memory structures of the PCI device
  • No vendor driver will be loaded for the PCI device in the compute host OS
  • The Guest VM will handle the device directly using the vendor driver
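You can confirm the second point from the compute host once an instance uses the device: the host shows a passthrough stub (typically vfio-pci on recent kernels, pci-stub on older ones) as the driver in use instead of the vendor driver. A sketch, with a placeholder PCI address:

# Shows the device and "Kernel driver in use:", which should be vfio-pci rather than the vendor driver
lspci -k -s 0000:03:00.0

# Same information straight from sysfs
readlink /sys/bus/pci/devices/0000:03:00.0/driver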

When a PCI device gets attached to a qemu-kvm instance, the libvirt definition for that instance will include a hostdev element for that device, for example:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x1111' bus='0x11' slot='0x11' function='0x1'/>
      </source>
      <address type='pci' domain='0x1111' bus='0x11' slot='0x1' function='0x0'/>
    </hostdev>

The next two types are more interesting. They originated with SR-IOV capable devices, which introduce the notion of Physical Functions "PF" and Virtual Functions "VF". There's a core difference between these two types and type-PCI:

  • A PF driver is loaded for the SR-IOV device in the compute-host OS.

Let's explain the difference between type-VF and type-PF, starting with VFs:

type-VF allows you to pass through a Virtual Function, which is a lightweight PCIe device that, in the case of network devices, has its own RX/TX queues. Your VM will be able to use the VF driver, provided by the vendor, to access the VF and treat it as a regular device for IO. VFs generally have the same vendor_id as the hardware device, but with a different product_id specific to the VFs.

type-PF, on the other hand, refers to the fully capable PCIe device that controls the physical functions of an SR-IOV capable device, including the configuration of the Virtual Functions. type-PF allows you to pass the PF through to be controlled by a VM, which is sometimes useful in NFV use cases.
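Before nova can schedule VFs, they have to be created on the compute host; the PF driver exposes this through sysfs. A sketch, assuming the PF appears as interface ens1f0 (a placeholder):

# Maximum number of VFs the device supports
cat /sys/class/net/ens1f0/device/sriov_totalvfs

# Create 4 VFs (writing 0 removes them again)
echo 4 > /sys/class/net/ens1f0/device/sriov_numvfs

# The VFs show up as extra PCI functions, typically with their own product_id
lspci -nn | grep -i "virtual function"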

A simplified layout of PF/VF looks like this

SRIOV-KERNEL (2).png

PF driver is used to configure the SR-IOV functionality and partition the device into virtual functions accessed by the VM in userspace

A nice feature of nova-compute is that it prints out the "Final resource view", which contains the specifics of the passed-through devices. In the case of a PF passthrough it will look like this

Final resource view: pci_stats=[PciDevicePool(count=2,numa_node=0,product_id='2222',tags={dev_type='type-PF'},vendor_id='1111')]

This says there are two devices in NUMA cell 0 with the specified vendor_id and product_id that are available for passthrough

In the case of VF passthrough:

Final resource view: pci_stats=[PciDevicePool(count=1,numa_node=0,product_id='3333',tags={dev_type='type-VF'},vendor_id='1111')]

In this case there's only one VF with vendor_id 1111 and product_id 3333 ready to be passed through, on NUMA cell 0

The blueprint for the PF passthrough type is here if you're interested

https://blueprints.launchpad.net/nova/+spec/sriov-physical-function-passthrough

Good Luck !

VNI Ranges: What do they do ?

Deployment tools for Openstack have become very popular, including the well-known Openstack-Ansible. They make deploying a cloud an easy task, at the expense of losing some insight into what happens behind the scenes of your deployment. If you have ever had to configure neutron manually, you will have come across the following section in the ml2 configuration

[ml2_type_vxlan]
# (ListOpt) Comma-separated list of <vni_min>:<vni_max> tuples enumerating
# ranges of VXLAN VNI IDs that are available for
# tenant network allocation.
#
# vni_ranges =

You probably have set it to a range, similar to 10:100 or 10:300 and so on

But what does this configuration mean ?

When you configure neutron to use VXLAN for tenant network segmentation, each tenant network gets assigned a VXLAN Network Identifier "VNI". VNIs are numeric values whose allowed range you specify with the vni_ranges parameter.

An advantage of having control over this parameter is that you can cap the number of VXLAN networks the ml2 plugin can allocate. Although this seems like an advantage, it can also be a disadvantage in a dynamic environment: you can run into situations where networks cannot be created because all allowed VNIs are consumed. If that's the case, you will get an error similar to the following in the neutron logs

Unable to create the network. No tenant network is available for allocation.

If you get that error, it means you need to increase the configured ranges and restart the neutron services for the change to take effect.
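If you want to check how much of the range is already consumed, the ml2 VXLAN type driver tracks allocations in the neutron database. A sketch, assuming the standard ml2_vxlan_allocations table and local database credentials:

# Count the VNIs currently allocated to tenant networks
mysql -u root -p neutron -e "select count(*) from ml2_vxlan_allocations where allocated = 1;"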

Best of Luck ! 

Port security in Openstack

Openstack Neutron provides some protections for your VMs' communications by default; these protections ensure that VMs cannot impersonate other VMs. You can easily see how it does that by checking the flow rules in an OVS deployment using:

ovs-ofctl dump-flows br-int

If you look for a certain qvo port (or its port number, depending on the deployment), you will see lines like the following

table=24, n_packets=1234, n_bytes=1234, priority=2,arp,in_port="qvo",arp_spa=10.10.10.10 actions=resubmit(,25)
table=24, n_packets=1234, n_bytes=1234, priority=0 actions=drop

Table 24 by default drops all packets originating from a VM unless they are resubmitted to table 25. The criterion for resubmitting to table 25 is simple: the source IP of the traffic must be the one assigned to that VM; if not, the packet is dropped at the end of table 24

In addition, there's protection against changing the MAC address of the interface, implemented via the following rule

table=25, n_packets=1234, n_bytes=1234, priority=2,in_port="qvo",dl_src=aa:aa:aa:aa:aa:aa actions=resubmit(,60)

which basically compares the source MAC address of the packet with the expected MAC address of the VM.

In some use cases you may want to drop this protection, which can be done using

neutron port-update $PORT_ID --port-security-enabled=false

This ensures there are no OpenFlow rules in br-int that will drop your packets if they don't adhere to the MAC/IP requirements
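With newer clients the same thing can be done through the unified openstack CLI. Note that neutron refuses to disable port security while security groups are still attached to the port, so they need to be cleared as well. A sketch:

# Clear security groups, then disable port security on the port
openstack port set --no-security-group --disable-port-security $PORT_ID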

Good Luck !

 

Private External Networks in Neutron

You might find yourself in a position where you need to restrict tenants' access to specific external networks. In Openstack, the notion is that external networks are accessible by all tenants and anyone can attach their private router to them. That might not be what you want if only specific users should have access to a particular external network.

There is no way to directly configure this in neutron, i.e. any external network in your deployment can have tenants attach their routers to it and make it their default gateway. To work around this, let's look into how neutron stores routers and ports in its database schema. A router is defined as follows

 

MariaDB [neutron]> desc routers$$
+------------------+--------------+------+-----+---------+-------+
| Field            | Type         | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| project_id       | varchar(255) | YES  | MUL | NULL    |       |
| id               | varchar(36)  | NO   | PRI | NULL    |       |
| name             | varchar(255) | YES  |     | NULL    |       |
| status           | varchar(16)  | YES  |     | NULL    |       |
| admin_state_up   | tinyint(1)   | YES  |     | NULL    |       |
| gw_port_id       | varchar(36)  | YES  | MUL | NULL    |       |
| enable_snat      | tinyint(1)   | NO   |     | 1       |       |
| standard_attr_id | bigint(20)   | NO   | UNI | NULL    |       |
| flavor_id        | varchar(36)  | YES  | MUL | NULL    |       |
+------------------+--------------+------+-----+---------+-------+

Each router has an id, a name and the project ID it was created under. Notice also the gw_port_id field: this is the port that connects the tenant router to its default gateway, i.e. your external network.

Each router has its own gateway port; tenant routers do not share a common port. Let's look at how a port is represented in the database schema

MariaDB [neutron]> desc ports$$
+------------------+--------------+------+-----+---------+-------+
| Field            | Type         | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| project_id       | varchar(255) | YES  | MUL | NULL    |       |
| id               | varchar(36)  | NO   | PRI | NULL    |       |
| name             | varchar(255) | YES  |     | NULL    |       |
| network_id       | varchar(36)  | NO   | MUL | NULL    |       |
| mac_address      | varchar(32)  | NO   |     | NULL    |       |
| admin_state_up   | tinyint(1)   | NO   |     | NULL    |       |
| status           | varchar(16)  | NO   |     | NULL    |       |
| device_id        | varchar(255) | NO   | MUL | NULL    |       |
| device_owner     | varchar(255) | NO   |     | NULL    |       |
| standard_attr_id | bigint(20)   | NO   | UNI | NULL    |       |
| ip_allocation    | varchar(16)  | YES  |     | NULL    |       |
+------------------+--------------+------+-----+---------+-------+

As you can see, a port has an id and the network_id it's attached to. Note that in the ports table, network_id refers to both external and tenant networks.

If we know our external network IDs, we can tell which ports are attached to them, and possibly prevent future attachments. Finding the external network IDs is easy:

(neutron) net-external-list
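On deployments using the unified client, the equivalent is roughly:

# List only networks flagged as external
openstack network list --external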

This will show you the IDs of the external networks, and then with a simple query you can select from the ports table the ports attached to your external network

select id from ports where network_id='$NETWORK_ID' $$

This returns a list of the ports currently connected to your external network.

If you want to prevent tenants from attaching anything (routers or floating IPs) to this external network, you can achieve this with a BEFORE INSERT trigger in MySQL

DELIMITER $$

create trigger ports_insert before insert on ports
for each row
begin
  IF (new.network_id = '$NETWORK_ID') then
    set new.id = NULL;
  END IF;
END $$

 

This trigger intercepts the insert statement that neutron issues when a tenant attaches a router to your external network. It sets the id of the new port to NULL, which is invalid for this field as you can see from the ports table description above. This effectively prevents any routers/floating IPs from being attached to the external network you chose. But remember, this includes you as well: you can't attach anything to this external network even as admin. You can always tweak the trigger to check the project_id field and only restrict specific projects.
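When you want to lift the restriction later, the trigger can simply be dropped; for example:

# Remove the trigger to re-allow attachments to the external network
mysql -u root -p neutron -e "DROP TRIGGER IF EXISTS ports_insert;"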


VM getting a DHCP address

DHCP requests are broadcast requests sent by the VM to its broadcast domain. If a DHCP server exists in this domain, it will respond with an IP lease following the DHCP protocol. In openstack, the same procedure is followed: a VM starts by sending its DHCP request to its broadcast domain, which goes through br-int. Since it is broadcast, it also exits br-int to br-tun and gets sent to all hosts in the environment using the dedicated tunnel ID for the network.

Once the request reaches the network node, it arrives at a network namespace created specifically to handle DHCP requests. This namespace is named qdhcp-{UUID}. The qdhcp namespace looks as follows

Selection_011

Individually, it looks like this

Selection_010

As you can see, the dhcp namespace has a tap interface which is attached to the br-int bridge on the network node. The tap interface is served by a dnsmasq process. dnsmasq is a service that does many things (DNS included, obviously), but it can also hand out DHCP addresses when acting as a DHCP server.

On the network node, if you do a ps -ef | grep dns you will see the following

Selection_012

If you would like to see the dhcp namespaces on the network node, you can use ip netns

Selection_013

and if you go inside any of these namespaces, you will see the tap interface that is attached to the dnsmasq process

Selection_014
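For reference, the screenshots above correspond to commands along these lines; the network UUID is a placeholder:

# DHCP namespaces on the network node
ip netns | grep qdhcp

# The tap interface and its IP inside one namespace
ip netns exec qdhcp-<network-uuid> ip addr

# The dnsmasq instance serving that namespace
ps -ef | grep dnsmasq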

The IP of the dhcp namespace is assigned to the tap interface by default. Note that when you look at the flow rules on br-tun on a compute host, you may find an entry for the MAC address of this tap interface. This prevents the DHCP request from being flooded to every compute and network host in the environment: the flow rules direct it only to the VXLAN port that connects the compute host to the network node.


VM to VM communication: different networks

So far we have only spoken about VM communication when the VMs belong to the same network. But what happens when a VM has to communicate with another VM on a different network? The common rule of networking is that crossing networks requires routing, and this is exactly what neutron does to let these VMs communicate.

One thing to note here is that a VM in openstack is attached to a subnet, which specifies the IP block it will get an address from. The rule of thumb is: if the subnet changes, routing has to be involved. This is because an L2 switch doesn't understand IPs, so traffic crossing between subnets cannot be handled by a switch alone. Remember: switches understand MAC addresses, routers understand IPs.

So let's look at how two VMs belonging to different networks actually communicate.

Selection_001

The logical diagram is similar to what we see above. Packets flow from VM1 down through the tap, qbr, the qvb-qvo pair, br-int and br-tun. This time, though, they have to go through routing before reaching VM2. Routing is done in the router namespaces created by the l3-agent on the network node. To see this in more detail, let's look at a more realistic diagram of the communication

Selection_002

As you can see in the diagram above, packets flow through br-tun from the first compute host to the network node. The network node has a very similar logical layout to the compute node: it has a br-tun & br-int combination that lets it establish tunnels to the compute and network hosts in the environment and VLAN-tag local traffic. It has some new entities though: the qrouter namespace and the qdhcp namespace. As their names suggest, they are responsible for routing and DHCP.

Traffic from VM1 reaches br-tun on the network node over its dedicated tunnel (remember, a dedicated tunnel ID per network). br-tun on the network node does the VXLAN tunnel ID to VLAN mapping and pushes the traffic up to br-int on the network node. br-int pushes the traffic to the router namespace named qrouter-uuid. The traffic received by the qrouter namespace is routed and then pushed down again through br-int and br-tun. This time it leaves the network node over the VXLAN tunnel ID dedicated to VM2's network, which is different from VM1's tunnel ID (different networks, so different tunnel IDs). The packets are received by br-tun on VM2's compute host and then go up through br-int the same way until they reach VM2.

The only case where the tunnel IDs are the same is: If the router is connected to two subnets within the SAME network.

Let’s look more into how qrouter namespace is designed

Selection_004

The qrouter namespace has two kinds of interfaces.

  • qr interfaces: These are the interfaces that connect the qrouter to the subnets it routes between
  • qg interface: This is the interface that connects the qrouter to the router's gateway

 

If we assume a single subnet per network, we can simplify the qrouter namespace diagram as follows

Selection_005

As you can see above, traffic arrives from a certain network on a qr interface. It hits the routing table of the qrouter namespace and then either goes out the other qr interface, if it's destined for the other network, or out the qg interface, if it's destined for the gateway (for example, to reach the public network)

Let’s look at a practical scenario, two VMs connected to two different networks.

Selection_009

A qrouter namespace physically looks like this on the network node.

Selection_006

If we look inside the qrouter namespace we can see the qr and qg interfaces

Selection_007

The qr interfaces are assigned the gateway IPs of each connected network, and if we look at the routing table inside the qrouter namespace we will see that the qg interface is the default gateway device and that each qr interface is the gateway for the network it's connected to

Selection_008
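The screenshots above boil down to commands like these on the network node; the router UUID is a placeholder:

# Router namespaces created by the l3-agent
ip netns | grep qrouter

# qr-* and qg-* interfaces carrying the gateway IPs
ip netns exec qrouter-<router-uuid> ip addr

# The namespace routing table; the default route points out of the qg interface
ip netns exec qrouter-<router-uuid> ip route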

A couple of things to remember

  • qrouter namespaces are only created when subnets are attached to routers, not when routers are created, i.e. an empty router (no connections to a gateway or to networks) will have no namespace on the network nodes
  • qrouter namespaces are created by the l3-agent
  • Although there can be more than one qr interface in a qrouter namespace, there is only one qg interface, because the router has a single gateway
  • the network node will have as many qrouter namespaces as you have tenant routers. The traffic flowing through those routers is fully isolated via the combination of VXLAN tunnel IDs, VLAN tags and network namespaces

In the next post we will explain how a VM acquires a DHCP address.

 

VM to VM communication, same network, different compute hosts

In the last post, we spoke about VM to VM communication when the VMs belong to the same network and happen to get deployed on the same host. That is a nice scenario, but in a big openstack deployment it's unlikely that all VMs belonging to the same network will end up on the same compute host. The more likely scenario is that the VMs will be spread across multiple compute hosts.

When the VMs lived on the same host, unicast traffic was handled by br-int. But remember that br-int is local to the compute host, so when VMs are deployed on multiple compute hosts another technique is needed: traffic has to flow between the compute hosts over an overlay network. br-tun is responsible for establishing and handling the overlay network, which can be VXLAN or GRE depending on your choice.

Let's look at what this looks like

vm-t-vm2

When VM1 wants to send unicast traffic to VM2, the traffic has to flow down from the vNIC to the tap, to qbr, to the qvb-qvo veth pair. This time it gets VLAN tagged on br-int but has to exit through the patch interface to the br-tun bridge. The br-tun bridge strips the VLAN ID from the traffic and pushes it to every compute host in the environment over a dedicated lane, the VXLAN tunnel ID. You can think of the VXLAN tunnel ID as a way to segregate traffic from different networks when it enters the overlay network (VXLAN in our case).

Let’s look again into the same example of VMs test and test2, but this time they are on different compute hosts. Their logical diagram remains unchanged

vm1-vm23

Now let’s look at Compute host 1 that hosts the “test” VM.

33

We will focus on the qvo portion and the br-tun bridge since we already know how the traffic will flow until it reaches the br-int.  So let’s see the VLAN tag for the traffic from the “test” VM.

1

As you can see from the br-int definition, the tag of this traffic is VLAN ID 1

We also know that this traffic has to exit the compute host via br-tun. So let's look at the br-tun OpenFlow rules. We're expecting to see the VLAN tag being stripped and the VXLAN tunnel ID being added before the traffic is sent over the overlay network

2

As you can see, outbound traffic with VLAN tag 1 gets its VLAN ID stripped and gets loaded onto tunnel ID 0x39. So we know that the traffic will be sent to all compute hosts (and network hosts as well) in the environment over tunnel ID 0x39
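If you want to reproduce this on your own deployment, the screenshots roughly correspond to the following commands; the qvo name and the 0x39 tunnel ID are specific to this example:

# The VLAN tag assigned to the instance's qvo port on br-int
ovs-vsctl show | grep -A 2 qvo

# br-tun flow rules; look for the strip_vlan / set_tunnel actions on the outbound path
ovs-ofctl dump-flows br-tun | grep -E "strip_vlan|tun_id"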

Let's see what compute host 2, which hosts the "test2" VM, looks like

6

Let’s look at the qvo portion and the br-tun definition

4

We see that the VLAN tag for the qvo interface is 2, which is different from compute host 1. This is expected: since the two VMs live on different hosts, there is no guarantee that their qvo interfaces will have a common VLAN ID.

So let’s look at the br-tun flow rules

5

As you can see, incoming traffic on VXLAN tunnel ID 0x39 gets tagged with VLAN tag 2 and sent to br-int; in other words, a path is opened for it to reach the qvo of the "test2" instance. If the traffic is outbound from the "test2" VM, i.e. with VLAN tag 2, its VLAN tag gets stripped and the traffic is sent over the overlay network with VXLAN tunnel ID 0x39.

So the basic idea is that traffic gets VLAN tagged on br-int, and if it is destined to leave the host it gets sent through br-tun. br-tun does a VLAN ID to VXLAN tunnel ID translation, where a dedicated VXLAN tunnel ID is assigned to every tenant network in your environment.

One point to mention here is that br-tun is smart: instead of always sending the traffic over the overlay network to every compute and network host in the environment, it gradually learns what sits where. In other words, the next time the "test" instance sends traffic to the "test2" instance, the traffic will be sent only to compute host 2, not to all compute hosts in the environment. This is done by adding an OpenFlow rule to the br-tun flows for the MAC address of the "test2" instance's interface.
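You can see these learned entries directly in the br-tun flow tables. A sketch; in many releases the learned unicast-to-tunnel entries live in table 20, but the table numbering can vary between neutron versions, and the MAC address below is a placeholder:

# Learned "MAC behind tunnel" entries that br-tun installs as it sees return traffic
ovs-ofctl dump-flows br-tun table=20

# Narrow it down to the MAC of the remote instance's interface
ovs-ofctl dump-flows br-tun table=20 | grep fa:16:3e:aa:bb:cc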


VM to VM communication: Same network & same compute host

In the physical world, machines on the same network communicate with each other without routers. It's the same with openstack: VMs on the same network communicate without routers.

When two VMs belonging to the same network happen to get deployed on the same compute host, their logical diagram looks like this

vm-part8

As we can see above, each VM has its own tap device, qbr bridge and qvb-qvo veth pair, and they both connect to br-int. br-int is in charge of VLAN tagging the traffic, and in this case it will tag both VMs' traffic with the same VLAN, since they belong to the same network.

We can verify this in the following example: two VMs, test and test2, belong to the same network and the same subnet.

vm-part9

One thing to mention here: VLAN tags for the same network on the same host are the same, regardless of whether the VMs are on the same subnet or different subnets. Now let's look at the test & test2 logical diagram and focus on the qbr bridge definitions and the integration bridge definition

vm-part12

Using brctl show, we can see the qbr bridge for every VM and the associated interfaces

vm-part10

Now let's look at the definition of the integration bridge using ovs-vsctl show

vm-part11

As we see in the previous image, there are two qvo interfaces with VLAN tag "1". So the idea is that since the VMs are on the same network, their qvo interfaces get the same VLAN tag on the same host. This way traffic can flow normally, as in the physical world, where switch ports are segregated using VLAN tags.

Unicast traffic flows between test and test2 VMs within the same host using the br-int bridge over dedicated VLAN tag for this particular network.

In openstack, as in the physical world, switches have no idea whether your machines/VMs are on different IP subnets. Switches operate at layer 2, so subnets are not visible to them. This is the reason VLAN tag IDs are dedicated per network, not per subnet. So if you have a network with two subnets and a VM on each, their qvo interfaces will have the same VLAN tag if they end up on the same compute host.
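An easy way to check this is to read the tag column of each qvo port straight from OVSDB. A sketch; the port names are placeholders:

# Both qvo ports of VMs on the same network should report the same tag
ovs-vsctl get Port qvo11111111-11 tag
ovs-vsctl get Port qvo22222222-22 tag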

Next post will be about VM to VM communication, same network but different compute hosts

 

Traffic flows from an Openstack VM

As we mentioned in the last post, traffic flows through a set of Linux virtual devices/switches to reach its destination after leaving the VM. Outbound traffic goes downward while inbound traffic moves upwards.

The flow of traffic from the VM goes through the following steps

VM-network

  • The VM generates traffic that goes out through its internal vNIC
  • Traffic reaches the tap device, where it is filtered by iptables to implement the rules of the security group attached to this VM
  • Traffic leaves the tap device and goes through the qbr bridge
  • The qbr bridge hands the traffic over to qvb
  • qvb hands the traffic over to qvo
  • Traffic reaches br-int. br-int VLAN tags the traffic and either
    • sends it to another port on br-int if the traffic is destined locally, or
    • sends it through the patch-tun interface to br-tun if the traffic is destined outside the host
  • br-tun receives the traffic on the patch-int interface and sends it through the established tunnels to the other compute and network hosts in the environment
    • br-tun is smart: it adds specific OpenFlow rules to reduce wasted traffic (i.e. traffic that gets flooded to every host). It learns where VMs and routers live and sends traffic only to the hosts that have them
    • traffic is segregated in the tunnels using a dedicated tunnel_id per tenant network, so one network's traffic doesn't get mixed with another tenant network's traffic

The above is the logical layout of the traffic flow. Let's map it to what physically exists on the compute host. The first piece is the running VM; to view its process, we can do a ps -ef

vm-part1

vm-part2

So as we see above, the VM is a qemu-kvm process (if you are using this hypervisor) that is attached to a tap device with a certain MAC address.

If we go a bit further, we can see how the qbr bridge is implemented using brctl show

vm-part3

vm-part4

As shown above, a qbr bridge exists with two interfaces: the qvb end of the veth pair and the tap device

To verify that iptables rules are applied on the tap device to reflect the security group rules, we can use iptables -L | grep tap4874fb2a, which is the tap device name mentioned above

vm-part5

The last step is to view the OVS switches using ovs-vsctl show

vm-part6

The output of ovs-vsctl show

vm-part7

As shown above, two OVS switches exist: br-tun and br-int

  • br-int has two ports
    • patch-tun, which connects br-int to br-tun
    • the qvo interface, which is VLAN tagged with tag 1
  • br-tun has two ports
    • patch-int, which connects br-tun to br-int
    • a vxlan interface, which establishes the VXLAN tunnel to other hosts in the environment. There should be one such interface per VXLAN tunnel, i.e. one for every other host in the environment.

One thing to remember is that a VXLAN tunnel is the highway that connects the compute and network hosts; the real segregation happens using tunnel IDs, which act as lanes dedicated to each tenant network.