I wrote this post as a draft some time ago but forgot to publish it. I looked into the topic to find out how DPDK physically works at the OS/device level and how it bypasses the network stack.
So, when you attach a PCIe NIC to your Linux server, you expect traffic to flow once your application sends it. But how does it actually flow, and which components does Linux invoke to make that happen? If you are asking yourself these questions, this post is for you.
Let’s first discuss the main components that allow your NIC to be recognized, registered and handled by Linux.
- NIC FIFO buffers: As the name suggests, this is a hardware FIFO buffer inside your NIC. Its purpose is simply to hold received data somewhere before it is passed to the OS. The size is determined by the vendor. One tricky part is that driver code frequently calls it the “ring” buffer, which is accurate: FIFO buffers are a ring implementation, i.e. packets start getting overwritten once the buffer runs out of space.
- NIC driver: The driver is, by default, a kernel module. This means it lives in the kernel memory space. The driver has three core functions:
  - Allocate RX and TX queues: These are queues in the memory of the host server (i.e. hardware, but on your server, not your NIC). Their purpose is to hold pointers to your to-be-sent/received packets. The RX and TX queues are usually referred to as descriptor rings. The name comes from their core purpose: they contain descriptors for packets, not the packet contents.
  - Initialize the NIC: Basically, reach out to the hardware NIC registers, set their values appropriately, and during operation pass them the memory addresses (contained in the RX/TX descriptor rings) where the data to be sent/received will live.
  - Handle interrupts: When your driver starts, it has to register a way to communicate with the host OS, to tell it that packets have been received. The OS also has to be able to “kick” the NIC to send ready-to-be-sent packets. This is where the driver registers “interrupt service handlers/routines”, aka ISRs. Depending on the generation/architecture of the NIC, there are multiple kinds of interrupts supported by Linux that the NIC may implement.
- NIC DMA engine: Responsible for copying data between the NIC FIFO buffers and your RAM. This is a physical part of the NIC.
- NAPI: New API, basically a polling mechanism that runs on scheduled threads to handle newly arrived data. Its most common task is to get this data flowing through the network stack. NAPI relies on the concept of poll lists that drivers register with and that are then harvested periodically, instead of continuous interrupt servicing, which is CPU heavy.
- Network stack: Where all of the OSI model happens. Except, of course, if you’re using DPDK, in which case you don’t want traffic to pass through the network stack at all.
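To make the descriptor ring idea a bit more concrete, here is a toy Python model. It is only a sketch and not how any real driver is written: the names `RxDescriptor`, `DescriptorRing`, `nic_fill` and `driver_reclaim`, the buffer size and the base address are all made up for illustration, and I am assuming that a full ring simply refuses new packets. The point is that each slot only stores where a packet lives, never the packet itself.

```python
from dataclasses import dataclass


@dataclass
class RxDescriptor:
    buf_addr: int        # hypothetical host-memory address of the packet buffer
    length: int = 0      # filled in by the "NIC" after it writes a packet
    done: bool = False   # set by the "NIC", cleared by the "driver"


class DescriptorRing:
    """Toy fixed-size ring of descriptors shared by a NIC and a driver."""

    def __init__(self, size, buf_size=2048, base_addr=0x1000):
        self.size = size
        # The driver pre-allocates one buffer per descriptor.
        self.descs = [RxDescriptor(base_addr + i * buf_size) for i in range(size)]
        self.head = 0  # next slot the NIC will fill
        self.tail = 0  # next slot the driver will reclaim

    def nic_fill(self, pkt_len):
        """NIC side: mark the next descriptor as holding a packet."""
        d = self.descs[self.head]
        if d.done:
            return None  # assumption: ring full, packet is refused
        d.length = pkt_len
        d.done = True
        self.head = (self.head + 1) % self.size  # wrap around: it is a ring
        return d.buf_addr

    def driver_reclaim(self):
        """Driver side: consume the oldest filled descriptor, if any."""
        d = self.descs[self.tail]
        if not d.done:
            return None
        result = (d.buf_addr, d.length)
        d.length = 0
        d.done = False  # hand the slot back to the NIC
        self.tail = (self.tail + 1) % self.size
        return result
```

With a ring of size 2, two calls to `nic_fill` succeed, a third returns `None` until the driver calls `driver_reclaim` and frees a slot.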
A summary diagram of these components:
Let’s follow what happens when a NIC receives a packet:
- First, the NIC puts the packet in its FIFO memory.
- Second, the NIC uses the DMA engine to fetch an RX descriptor. The RX descriptor points to the location in memory where the received packet should be stored. Descriptors only point to the location of the received data and do not contain the data itself.
- Third, once the NIC knows where to put the data, it uses the DMA engine again to write the received packet to the memory region specified in the descriptor.
- Once the data is in that memory region, the NIC raises an RX interrupt to the host OS.
- Depending on the type of interrupts enabled, the OS either stops what it is doing to handle the interrupt (CPU heavy) or relies on a polling mechanism that regularly checks for new interrupts (NAPI, aka New API), which is the default method in newer kernels.
- NAPI hands the data in the memory region over to the network stack, where it goes through multiple layers of the OSI model to eventually reach a socket where your application is waiting in user space.
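The receive steps above can be walked through in a small Python simulation. This is a rough sketch under my own assumptions, not kernel code: `ToyNic`, `napi_poll`, the addresses and the drop-when-no-descriptor behaviour are all hypothetical. The `budget` argument mimics the idea that a NAPI poll only processes a limited number of packets per run.

```python
from collections import deque


class ToyNic:
    """Toy NIC: FIFO -> descriptor fetch -> DMA write -> interrupt."""

    def __init__(self, ring_size=4, buf_size=2048, base_addr=0x1000):
        self.fifo = deque()     # on-NIC FIFO buffer
        self.host_memory = {}   # address -> packet bytes (stands in for RAM)
        # Free buffer addresses posted by the driver (the RX descriptor ring).
        self.rx_ring = deque(base_addr + i * buf_size for i in range(ring_size))
        self.completed = deque()  # addresses of filled buffers
        self.irq_pending = False

    def receive(self, packet):
        if not self.rx_ring:
            return False                     # assumption: no descriptor, drop
        self.fifo.append(packet)             # 1. packet lands in the FIFO
        addr = self.rx_ring.popleft()        # 2. fetch an RX descriptor
        self.host_memory[addr] = self.fifo.popleft()  # 3. DMA write to RAM
        self.completed.append(addr)
        self.irq_pending = True              # 4. raise the RX interrupt
        return True


def napi_poll(nic, budget, deliver):
    """Poll up to `budget` packets and hand each one to `deliver`."""
    handled = 0
    while handled < budget and nic.completed:
        addr = nic.completed.popleft()
        deliver(nic.host_memory.pop(addr))   # hand over to the "network stack"
        nic.rx_ring.append(addr)             # give the buffer back to the NIC
        handled += 1
    if not nic.completed:
        nic.irq_pending = False              # nothing left to poll
    return handled
```

For example, after a few `nic.receive(b"...")` calls, `napi_poll(nic, 1, packets.append)` delivers a single packet into a `packets` list and leaves the rest for the next poll. A DPDK-style application would skip the interrupt and the network stack and keep polling the ring directly from user space.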
A very good go-to manual is located at:
Good luck!