Accelerating the Suricata IDS/IPS with NVIDIA BlueField DPUs

Originally published at:

Deep packet inspection (DPI) is a critical technology for network security that enables the inspection and analysis of data packets as they travel across a network. By examining the content of these packets, DPI can identify potential security threats such as malware, viruses, and malicious traffic, and prevent them from infiltrating the network. However, the…

Are you interested in developing solutions for Suricata or have any questions? If so, drop a note below.

Hi @mgonen,
Thanks for the article; it was very interesting to read. I have some questions.

You mention in your post that if we have Bluefield-3 with an ARM subsystem of 8/16 cores, we can also run Suricata on the SmartNIC itself.

  1. Is there any specific (hardware-related) reason why you say Bluefield-3 only? Bluefield-2 already has an ARM subsystem with eight cores and 16GB of RAM.
  2. Do you have any figures for the case when Suricata is running on the Bluefield? Or is Fig. 2 already for that case? Sorry, I could not work out whether Suricata is running on the host or on the Bluefield.
  3. If Suricata is running on the Bluefield, can we still utilize the line-rate steering module to alleviate the load on the ARM cores (for certain traffic that we want to bypass Suricata)?
  4. In the summary, you mention further use cases wherein Bluefield could accelerate packet processing (e.g., IPsec, TLS acceleration). Do you have any pointers for those writeups (if they exist)?


Hi @cs.lev
I’m glad to see you find my work interesting.

  1. Suricata can be offloaded to both BF2 and BF3. Of course, there will be a performance difference between the two DPUs due to the compute power of the ARM subsystem, as BF2 has 8 Armv8 A72 cores and BF3 has 16 Armv8.2+ A78 Hercules cores.
  2. The tests published in this paper were run with Suricata on the ARM subsystem of the Bluefield DPU (and not on the host).
  3. Hardware offload for bypassed flows is one of the points addressed in this whitepaper. Suricata was running on the ARM subsystem of the Bluefield DPU, and it offloaded bypassed flows to the hardware steering module of the Bluefield using the DOCA Flow API, instead of using the software, kernel-based bypass that Suricata offers. This way we could achieve line rate for bypassed flows.
  4. Yes, Bluefield 2 and Bluefield 3 offer GA IPsec offload and hardware-accelerated TLS offload. You can read about both in the DOCA SDK documentation:
    for IPsec - IPsec Programming Guide :: NVIDIA DOCA SDK Documentation
    for TLS - TLS Offload :: NVIDIA DOCA SDK Documentation

Feel free to reach out with any questions.

Thanks for getting back to me.

I am fine now, but I will probably have further questions in the near future, when I try to do the same thing that you did :)

Thanks again

Hi @mgonen , I finally reached a stage where I have acquired Bluefields, managed to install all drivers and tools, and created an environment for benchmarking.
Now I have Suricata (v7.0.2), compiled from source to support all the libraries I might ever need in the future (e.g., DPDK, eBPF/XDP), running on the SmartNIC (a Bluefield-2, in particular). My server also has another NIC installed (which happens to be another Bluefield doing just port forwarding), so I can hairpin the traffic back to the SmartNIC under test and measure latency as well as throughput.
Find the setup below:
[Figure: measurement setup]

I did a lot of measurements already with different use cases, and now I reached the stage where I can play around with the bypass rules.

I have a simple traffic trace for testing, which has 48 different flows (downloaded trPR48). Based on the traffic trace, I installed an alert rule for each flow. An example rule looks like this:

alert ip 52466 -> 33660 (msg: "trPR test traffic row 19";classtype:not-suspicious;priority:1;metadata: trPR test; bypass;sid:1000020;rev:1;)
alert ip 12733 -> 18700 (msg: "trPR test traffic row 20";classtype:not-suspicious;priority:1;metadata: trPR test; bypass;sid:1000021;rev:1;)

Mind the bypass keyword at the end of each rule. I haven’t tried any optimization yet. I evaluated the performance with and without the bypass rules, and this is what I got.
Without the bypass, using AF_PACKET in IPS mode with 4 worker threads, my throughput measurements for all available packet sizes are shown below. Note that for each packet size I ran an RFC 2544-compliant measurement, and the results shown are averages with error bars (though the deviation was too small to make them visible).
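For readers who want to reproduce this kind of measurement, here is a minimal sketch of the throughput search commonly used for RFC 2544-style tests: a binary search for the highest offered rate with zero frame loss. It is not the tooling used in this post. The `send_trial` hook, the `fake_dut` stand-in, and all rates are hypothetical placeholders; only the list of frame sizes comes from RFC 2544 itself.

```python
from typing import Callable

# Frame sizes (bytes) that RFC 2544 recommends for Ethernet.
RFC2544_FRAME_SIZES = [64, 128, 256, 512, 1024, 1280, 1518]


def find_throughput(send_trial: Callable[[float], int],
                    line_rate_mbps: float,
                    resolution_mbps: float = 1.0) -> float:
    """Binary-search the highest offered rate with zero frame loss.

    send_trial(rate) is a hypothetical hook: it offers traffic at `rate`
    Mbit/s for one trial and returns the number of lost frames.
    """
    lo, hi = 0.0, line_rate_mbps
    best = 0.0
    while hi - lo > resolution_mbps:
        rate = (lo + hi) / 2
        if send_trial(rate) == 0:
            best = rate   # no loss: remember this rate and try higher
            lo = rate
        else:
            hi = rate     # loss observed: try lower
    return best


def fake_dut(capacity_mbps: float) -> Callable[[float], int]:
    """Stand-in for a real traffic generator: loses frames above capacity."""
    return lambda rate: 0 if rate <= capacity_mbps else 1


if __name__ == "__main__":
    # Placeholder numbers, not measurements from this thread.
    result = find_throughput(fake_dut(2500.0), line_rate_mbps=10000.0)
    print(f"zero-loss throughput: about {result:.1f} Mbit/s")
```

In a real run, the search would be repeated for every frame size and several times per size, which is where averages and error bars come from.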

The performance is not great; however, my intention with this post is to learn more about the bypass feature. When I enable the bypass in the rules (as shown above), this is the throughput I get.

We can observe an increase when the bypass feature is used. Averaged over all packet sizes, the difference is around 22.84%, i.e., the bypass feature gave me a performance boost of more than 20%.
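As a small worked example of how such an average can be computed, the sketch below takes the relative gain per packet size and then averages those gains. This is an assumption about the method; the post does not say exactly how the 22.84% was derived, and averaging per-size gains can differ from comparing the two overall means. The throughput values are made-up placeholders, not the measured results.

```python
from statistics import mean
from typing import Dict


def relative_gain_pct(baseline: float, bypass: float) -> float:
    """Gain of `bypass` over `baseline`, as a percentage of the baseline."""
    return (bypass - baseline) / baseline * 100.0


def average_gain_pct(baseline_by_size: Dict[int, float],
                     bypass_by_size: Dict[int, float]) -> float:
    """Mean of the per-packet-size relative gains."""
    return mean(
        relative_gain_pct(baseline_by_size[size], bypass_by_size[size])
        for size in baseline_by_size
    )


if __name__ == "__main__":
    # Hypothetical throughput figures keyed by packet size in bytes.
    baseline = {64: 100.0, 512: 400.0, 1518: 1000.0}
    with_bypass = {64: 120.0, 512: 500.0, 1518: 1250.0}
    print(f"average gain: {average_gain_pct(baseline, with_bypass):.2f}%")
```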

Now I can finally get to my question, haha. But first, I wanted to provide something in exchange :P

Can you elaborate more on how I can do further performance tuning via offloading this bypass thingy to the SmartNIC HW itself?

Hi Moran,
would it be possible to open-source your work on this?

Hey @cs.lev,

If you only use AF_PACKET and not some other form of bypass (e.g., XDP/DOCA), then every packet still travels to the ARM cores. Ideally, you want to bypass the packets in hardware. Even with XDP, which would likely be limited to driver-level support (I don’t think Bluefield 2 allows hardware-level support for XDP), you will not get results like those presented in this blog post. I believe you would need Suricata with DOCA support implemented to get close to the presented numbers.

Yes, I believe the same :) That’s why I am asking @mgonen to provide more details on how to do it. I am also not sure whether Suricata with DOCA support is publicly available anywhere.

Hi @cs.lev,
First, I would like to apologize for taking so long to get back to you.
I see you have made a lot of progress, and I am happy to see that you have a fully running setup.
When running Suricata on the ARM subsystem of the BlueField DPU, your data path is routed through the ARM processor, so you are limited by the performance of the ARM processor.
Even when using Suricata’s out-of-the-box software bypass feature, the data plane is still routed through the ARM processor. To get higher throughput, Suricata’s bypass needs to offload the bypass rule to the hardware data path of the BlueField, so that the traffic is routed by hardware directly to and from the host.
In the whitepaper I describe how I used the hardware offload capabilities of the BlueField through the DOCA API.
Unfortunately, the source code currently cannot be published.
If you wish to use DOCA to implement the fully hardware-offloaded data path, you should refer to the DOCA Flow section of the DOCA SDK documentation.

Hi @puki7777,
You are correct. When running upstream Suricata on the ARM cores, the packets are still routed to the ARM cores even with the software bypass, and this may be the cause of the low throughput @cs.lev is getting. To achieve higher, near-line-rate throughput, you must use DOCA Flow to offload the bypass rules to the Bluefield’s internal hardware e-Switch.
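To make the split between the two data paths concrete, here is a toy Python model of the idea discussed in this thread. It is only a conceptual sketch: it does not use the DOCA Flow API, and every name in it (`ToyDpu`, `inspect`, `hw_table`) is hypothetical. The first packet of a flow goes to the software inspector (standing in for Suricata on the ARM cores). Once the inspector marks the flow as bypassable, the flow is installed in a table that stands in for the hardware e-Switch, and later packets no longer reach software.

```python
from dataclasses import dataclass, field
from typing import Callable, Set, Tuple

# src IP, dst IP, src port, dst port, protocol
FlowKey = Tuple[str, str, int, int, int]


@dataclass
class ToyDpu:
    """Toy model: a hardware flow table in front of a software inspector."""
    inspect: Callable[[FlowKey], bool]  # True if the flow may be bypassed
    hw_table: Set[FlowKey] = field(default_factory=set)
    hw_hits: int = 0
    sw_hits: int = 0

    def receive(self, key: FlowKey) -> str:
        """Return which path handled the packet."""
        if key in self.hw_table:
            self.hw_hits += 1       # fast path: never reaches the ARM cores
            return "hardware"
        self.sw_hits += 1           # slow path: inspected in software
        if self.inspect(key):
            self.hw_table.add(key)  # offload the rest of this flow
        return "software"


if __name__ == "__main__":
    # Hypothetical policy: bypass everything destined to port 443.
    dpu = ToyDpu(inspect=lambda key: key[3] == 443)
    flow = ("10.0.0.1", "10.0.0.2", 40000, 443, 6)
    paths = [dpu.receive(flow) for _ in range(5)]
    print(paths, dpu.sw_hits, dpu.hw_hits)
```

With a software-only bypass, the equivalent of `hw_table` lives in the kernel on the ARM cores, so every packet still costs ARM cycles. That is the limitation described above.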