Batch processing of packets on the GPU using CUDA

Hey guys,
Does anyone have CUDA sample code for processing packets in batches on the GPU? Any help is appreciated.

what do you mean by ‘batch’ and what do you mean by ‘packet’?

By ‘packet’ I mean a network packet, and by ‘batch’ I mean a group of packets - a large number of packets processed together.


and what do you mean by ‘process’?

let me explain a little. The aim is this: I receive packets from the network, then store them in a buffer. After that there may be some processing - that is the kernel’s task, and don’t worry about what the processing is. My aim is to store packets in a buffer, so I need sample code for how to store them in a buffer…

do you have a packet template? what does a packet contain/ look like?

Thanks little_jimmy for your response. I am late to respond, sorry for that. Here is the structure of the packet, if I understood your question correctly.
typedef struct _Packet
{
const DAQ_PktHdr_t *pkth; // packet meta data
const uint8_t *pkt; // raw packet data

EtherARP *ah;
const EtherHdr *eh;         /* standard TCP/IP/Ethernet/ARP headers */
const VlanTagHdr *vh;
EthLlc *ehllc;
EthLlcOther *ehllcother;
const PPPoEHdr *pppoeh;     /* Encapsulated PPP of Ether header */
const GREHdr *greh;
uint32_t *mpls;

const IPHdr *iph, *orig_iph;/* and orig. headers for ICMP_*_UNREACH family */
const IPHdr *inner_iph;     /* if IP-in-IP, this will be the inner IP header */
const IPHdr *outer_iph;     /* if IP-in-IP, this will be the outer IP header */
const TCPHdr *tcph, *orig_tcph;
const UDPHdr *udph, *orig_udph;
const UDPHdr *inner_udph;   /* if Teredo + UDP, this will be the inner UDP header */
const UDPHdr *outer_udph;   /* if Teredo + UDP, this will be the outer UDP header */
const ICMPHdr *icmph, *orig_icmph;

const uint8_t *data;        /* packet payload pointer */
const uint8_t *ip_data;     /* IP payload pointer */
const uint8_t *outer_ip_data;  /* Outer IP payload pointer */

void *ssnptr;               /* for tcp session tracking info... */
void *fragtracker;          /* for ip fragmentation tracking info... */

IP4Hdr *ip4h, *orig_ip4h;
IP6Hdr *ip6h, *orig_ip6h;
ICMP6Hdr *icmp6h, *orig_icmp6h;

IPH_API* iph_api;
IPH_API* orig_iph_api;
IPH_API* outer_iph_api;
IPH_API* outer_orig_iph_api;

int family;
int orig_family;
int outer_family;

uint32_t preprocessor_bits; /* flags for preprocessors to check */
uint32_t preproc_reassembly_pkt_bits;

uint32_t packet_flags;      /* special flags for the packet */

uint32_t xtradata_mask;

uint16_t proto_bits;

uint16_t dsize;             /* packet payload size */
uint16_t ip_dsize;          /* IP payload size */
uint16_t alt_dsize;         /* the dsize of a packet before munging (used for log)*/
uint16_t actual_ip_len;     /* for logging truncated pkts (usually by small snaplen)*/
uint16_t outer_ip_dsize;    /* Outer IP payload size */

uint16_t frag_offset;       /* fragment offset number */
uint16_t ip_frag_len;
uint16_t ip_options_len;
uint16_t tcp_options_len;

uint16_t sp;                /* source port (TCP/UDP) */
uint16_t dp;                /* dest port (TCP/UDP) */
uint16_t orig_sp;           /* source port (TCP/UDP) of original datagram */
uint16_t orig_dp;           /* dest port (TCP/UDP) of original datagram */
// and so on ...

int16_t application_protocol_ordinal;

uint8_t frag_flag;          /* flag to indicate a fragmented packet */
uint8_t mf;                 /* more fragments flag */
uint8_t df;                 /* don't fragment flag */
uint8_t rf;                 /* IP reserved bit */

uint8_t ip_option_count;    /* number of options in this packet */
uint8_t tcp_option_count;
uint8_t ip6_extension_count;
uint8_t ip6_frag_index;

uint8_t error_flags;        /* flags indicate checksum errors, bad TTLs, etc. */
uint8_t encapsulated;
uint8_t GTPencapsulated;
uint8_t next_layer;         /* index into layers for next encap */

const Fddi_hdr *fddihdr;    /* FDDI support headers */
Fddi_llc_saps *fddisaps;
Fddi_llc_sna *fddisna;
Fddi_llc_iparp *fddiiparp;
Fddi_llc_other *fddiother;

const Trh_hdr *trh;         /* Token Ring support headers */
Trh_llc *trhllc;
Trh_mr *trhmr;

Pflog1Hdr *pf1h;            /* OpenBSD pflog interface header - version 1 */
Pflog2Hdr *pf2h;            /* OpenBSD pflog interface header - version 2 */
Pflog3Hdr *pf3h;            /* OpenBSD pflog interface header - version 3 */
Pflog4Hdr *pf4h;            /* OpenBSD pflog interface header - version 4 */

const SLLHdr *sllh;         /* Linux cooked sockets header */
#ifdef DLT_IEEE802_11
const WifiHdr *wifih;       /* wireless LAN header */
const EtherEapol *eplh;     /* 802.1x EAPOL header */
const EAPHdr *eaph;
const uint8_t *eaptype;
EapolKey *eapolk;
#endif  /* DLT_IEEE802_11 */

// nothing after this point is zeroed ...
Options ip_options[IP_OPTMAX];         /* ip options decode structure */
Options tcp_options[TCP_OPTLENMAX];    /* tcp options decode struct */
IP6Option ip6_extensions[IP6_EXTMAX];  /* IPv6 Extension References */

const uint8_t *ip_frag_start;
const uint8_t *ip_options_data;
const uint8_t *tcp_options_data;

const IP6RawHdr* raw_ip6h;  // innermost raw ip6 header
Layer layers[LAYER_MAX];    /* decoded encapsulations */

IP4Hdr inner_ip4h, inner_orig_ip4h;
IP6Hdr inner_ip6h, inner_orig_ip6h;
IP4Hdr outer_ip4h, outer_orig_ip4h;
IP6Hdr outer_ip6h, outer_orig_ip6h;

MplsHdr mplsHdr;

PseudoPacketType pseudo_type;    // valid only when PKT_PSEUDO is set
uint16_t max_dsize;

/* policyId provided in configuration file. Used for correlating configuration
 * with event output */
uint16_t configPolicyId;

uint32_t iplist_id;
unsigned char iprep_layer;

uint8_t ps_proto;  // Used for portscan and unified2 logging

} Packet;

yes, quite a lovely packet indeed

not to deprive you of the overwhelming joy of solving your own design problems, i note a few design questions to perhaps consider

given axioms:

  • the cpu (rather than the gpu) catches and collects packets; the cpu is the first to have its hands on the packets

likely design questions:

  • can you move whole packets as-is, to the gpu, in a practical manner?
  • should you move whole packets as-is, to the gpu?
  • should you process packets individually or collectively?

with regards to the first question, i suppose you should be able to move whole packets as-is to the device, but i doubt whether it would be practical or optimal:
a) you likely are only interested in sections of packets
b) it might turn out to be a memory alignment nightmare
c) it would likely involve a lot of work - to meet memory alignment requirements, and to deep copy the entire packet
d) the device might be more optimally utilized when you strip packets and reassemble them in a format the device finds more suitable

this then suggests some form of packet pre-processing, which in turn leads to additional design questions:

what level of pre-processing is optimal?
who should conduct the pre-processing - the cpu or the gpu?

if you process packets individually, you could have the cpu strip packets of the necessary content, pass (deep copy) that data to the device, and have it process the packet data.
if you process packets collectively, you could again have the cpu strip packets of the necessary data, and have either the cpu or the gpu accumulate it in the proper format, until the desired packet quantity is reached.

the packet data/ packet processing you are interested in would likely also influence whether you process packets individually or collectively
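the strip-and-accumulate idea can be sketched in C. everything here is illustrative - the record layout, the field names, and the DecodedStub stand-in are invented for the sketch, not taken from the Packet struct above:

```c
#include <assert.h>
#include <stdint.h>

/* A fixed-size, plain-old-data record holding only the fields the
 * kernel needs. Unlike the pointer-laden Packet struct, an array of
 * these can be shipped to the device with a single flat copy. */
typedef struct {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t sp;          /* source port */
    uint16_t dp;          /* dest port */
    uint16_t dsize;       /* payload size */
    uint16_t proto_bits;  /* protocol flags */
} PacketRecord;           /* 16 bytes, naturally aligned */

/* Stand-in for whatever decoded view the CPU already has of a packet. */
typedef struct {
    uint32_t src_ip, dst_ip;
    uint16_t sp, dp, dsize, proto_bits;
} DecodedStub;

/* CPU-side "stripping": pull only the interesting fields out of the
 * decoded packet into the flat, device-friendly record. */
static void strip_packet(const DecodedStub *in, PacketRecord *out)
{
    out->src_ip     = in->src_ip;
    out->dst_ip     = in->dst_ip;
    out->sp         = in->sp;
    out->dp         = in->dp;
    out->dsize      = in->dsize;
    out->proto_bits = in->proto_bits;
}
```

an array of PacketRecord can then be accumulated on the host and moved to the device in one transfer, with each GPU thread reading records at a fixed stride - no host pointers involved.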

Thanks very much, it is interesting; I liked it. I think you have a lot of experience with this. I have already started the research. What I have tried is this: I read packets (from a saved file, the DARPA data set) and store them in a temporary buffer, currently of size 1,000,000 bytes. When the buffer becomes full, I copy all the packets to the GPU, process them there, and then copy the result back to the CPU. But I could not store all the packets in the buffer before copying to the GPU; I do not know why - it stores only some of them. I have not considered the memory aspects yet because of this, but I will once I can properly buffer packets. I process packets collectively. So, how can I store them in a buffer before copying to the GPU? I will post the code if necessary, or if you have sample code just for storing into a buffer, I would really appreciate it.

if i interpret your packet correctly, it is a structure containing a number of arrays

hence, whenever you copy/ transfer/ duplicate/ buffer a packet, you would have to deep copy it, as you would in a class copy/ assignment constructor

thus, i am not altogether sure that you can then talk of a buffer of size 1M bytes
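to illustrate the shallow-vs-deep distinction with a cut-down stand-in for the Packet struct (MiniPacket is invented for the sketch):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Cut-down stand-in for the Packet struct above: one pointer member
 * plus a size. (Invented for illustration.) */
typedef struct {
    const uint8_t *pkt;   /* raw packet data - points at bytes elsewhere */
    uint16_t dsize;       /* payload size */
} MiniPacket;

/* Shallow copy: duplicates the POINTER value, not the bytes behind it.
 * A buffer of MiniPackets shipped to the GPU this way carries host
 * addresses that mean nothing on the device. */
static void shallow_copy(MiniPacket *dst, const MiniPacket *src)
{
    memcpy(dst, src, sizeof *dst);
}

/* Deep copy: also copies the pointed-to payload bytes into a flat
 * buffer, which remains meaningful after a transfer. Returns the
 * number of bytes consumed in buf. */
static size_t deep_copy(uint8_t *buf, const MiniPacket *src)
{
    memcpy(buf, src->pkt, src->dsize);
    return src->dsize;
}
```

this is why a "buffer of 1M bytes" only makes sense after deep copying: the interesting bytes live behind the pointers, not in the struct itself.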

ok, thanks, let me try the deep copy. I had no idea before - thanks again.
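one common cause of "only some of the packets get stored" is appending past the end of the staging buffer without checking. a minimal sketch of a flat batch buffer with an offset table - all names and sizes here are invented for illustration (the 1,000,000-byte figure matches the earlier post):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BATCH_BYTES    1000000  /* 1,000,000-byte staging buffer */
#define BATCH_MAX_PKTS 4096     /* arbitrary cap on packets per batch */

typedef struct {
    uint8_t  data[BATCH_BYTES];      /* packed packet bytes, back to back */
    uint32_t offset[BATCH_MAX_PKTS]; /* start of each packet in data[]   */
    uint32_t length[BATCH_MAX_PKTS]; /* length of each packet            */
    uint32_t used;                   /* bytes consumed so far            */
    uint32_t count;                  /* packets stored so far            */
} Batch;

/* Append one packet's bytes; returns 0 when the batch is full, which
 * is the signal to flush (copy to the GPU), reset, and retry. */
static int batch_append(Batch *b, const uint8_t *pkt, uint32_t len)
{
    if (b->count >= BATCH_MAX_PKTS || b->used + len > BATCH_BYTES)
        return 0;                    /* full: flush to device first */
    memcpy(b->data + b->used, pkt, len);
    b->offset[b->count] = b->used;
    b->length[b->count] = len;
    b->used += len;
    b->count++;
    return 1;
}
```

once full, one transfer of `data` (the first `used` bytes) plus the `offset`/`length` tables is enough for each GPU thread to locate its own packet.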

Hey guys,
Do you have any idea how to process packets on CUDA? I want to offload different types of packets to a CUDA GPU and retrieve the results from the GPU. Any help is appreciated.

Let’s assume an internet packet of 512 bytes for the sake of discussion… then 2 packets can be retrieved per memory request, apparently… into the cache… so that’s about 8 x 2 packets = 16 packets.

So my prediction is that the gpu is more or less capable of processing 16 packets before some kind of cache thrashing might take place… but that’s probably not correct… because nowadays it has somewhat larger caches…

Perhaps 32 KB caches… or perhaps second level 1 or 2 MB caches… let’s assume 1 MB or so…

Instead of 512 bytes I will use the more appropriate 1500-byte size of internet packets… maybe it’s even a little bit more, but ok.

Currently the GTX 970/980 could hold something like 700 packets before it starts to “cache thrash”.

Processing 700 packets in parallel is not too bad… then again the cpu can do it really fast… must keep that in mind… the cpu has way more cache nowadays… anyway… the question is what happens when it tries to process beyond 700 packets at a time… it will probably stall threads… and start to “thread context switch” while waiting for memory loads to happen and such… but eventually the gpu will be bottlenecked by cache size.

I am not sure how to calculate exactly what the processing speed would be… for example, let’s assume a situation where just 1 byte of each packet needs to be examined… what’s the maximum number of packets it can process/read from memory?

Well, ultimately… the memory speed of the GPU might not matter… because data needs to be uploaded to the gpu… and maybe downloaded again… which means PCI express bandwidth plays a role… but… the question is… is PCI the bottleneck, or actually the GPU?

So must still examine GPU bottleneck potential…

Perhaps it’s easy… to predict GPU performance… let’s assume a somewhat standard 100 nanosecond access time per DRAM chip.

This means 2 packets can be retrieved per DRAM access… so let’s calculate how many packets a DRAM chip can retrieve… I am not sure if this 100 nanoseconds applies when a sequential access pattern is used… for now I will assume yes… any deviation from that is probably caused by cache effects, which is something we already examined somewhat.

1 second = 1,000 milliseconds = 1,000,000 microseconds = 1,000,000,000 nanoseconds.

1,000,000,000 / 100 = 10,000,000 memory requests possible per DRAM chip per second.

Assuming 8 DRAM chips, this gives a performance of 80,000,000 memory requests per second… which, funnily enough, reminds me of my GT 520 lol.

Perhaps the DRAM chip is faster than 100 nanoseconds that would have to be further looked into.

So my prediction for now is that the GPU will be bottlenecked at 80,000,000 packets per second… which would be roughly 120 GB/sec.
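the arithmetic behind that estimate, written out as a sketch (all inputs are the assumed figures from this post, not measured values):

```c
#include <assert.h>

/* Requests per second per DRAM chip, scaled by the number of chips.
 * ns_per_request and chips are assumptions from the post. */
static double est_packets_per_sec(double ns_per_request, double chips)
{
    return (1e9 / ns_per_request) * chips;
}

/* Convert the packet rate to a byte rate at an assumed packet size. */
static double est_bytes_per_sec(double pkts_per_sec, double pkt_bytes)
{
    return pkts_per_sec * pkt_bytes;
}
```

with 100 ns per request, 8 chips, and 1500-byte packets this reproduces the 80,000,000 packets/sec and ~120 GB/sec figures above.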

So at least the speed of the GPU seems sufficient to process packets on it. (My GT 520 can actually achieve 2 GB/sec or so… which is still more than my network chip.) PCI express is something like 16 GB/sec or so… the cpu is something like 16 GB/sec or so.

So I think it is safe to say that your idea of processing packets on a GPU is a viable idea.

Depending on which layers of the packets you want to process, ip/tcp/udp/application data etc…

You could use something like winsock… to retrieve/send packets, and then perhaps collect some of them into batches and send them on to the gpu.

Or you could use something like winpcap… or perhaps .NET methods or other hacking methods to get entire ip packets. I’ve seen some tool that does some kind of hooking into winsock, I think it was, to still get all the data… kinda cool.

Getting all packet data from windows is not that easy… sometimes windows will restrict access to certain parts of ip headers to prevent spoofing and that kind of thing… replay attacks etc… hmm, I could look further into it… I will do so in a moment… “packet editor” it was called… maybe it is open source… or maybe you can contact the developer and ask for help. Yes, the source seems to be available:

It’s in C# though… I think C# can be used to invoke cuda as well…

initially i thought it would matter, but now i think otherwise: does it actually matter whether the primary data originates from a packet - of course, assumed by Skybuck-et to be “a internet packet” - or an array?
or does it matter more
a) what is deemed as data within the data, and
b) how the data should be processed?

Skybuck-et has equated processing/ output to hardware and hardware givens, as well as data format, completely ignoring functionality - the core reason why code is written

“Processing 700 packets in parallel is not to bad”

is this a good design-idea…?

Thanks for your response. I am late to respond sorry for that. yes I am using winsock for sending /retrieving packet. main thing is to offload packet processing to gpu and what I want to retrieve is information about header of each layer like Ethernet, IP, tcp/udp, application layer like HTTP, FTP,DHCP

“to offload packet processing to gpu and what I want to retrieve is information about header of each layer like Ethernet, IP, tcp/udp, application layer like HTTP, FTP,DHCP”

would the cost of offloading not exceed the gains/ benefits, particularly given the type of information to be retrieved?

“information about header” seems ‘thin’, requiring very little processing, and would further be rather scattered
hence, overhead - or simply cost - would quickly rise; with the device not doing much, resulting in poor gains

if the packets required ‘thick’ processing, it may be a different story altogether…
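the cost/gain concern can be put into rough numbers with a back-of-envelope model; every figure used here is an assumption chosen purely for illustration:

```c
#include <assert.h>

/* Time to move a batch across PCIe in both directions, at an assumed
 * effective bandwidth (bytes per second). */
static double pcie_roundtrip_s(double batch_bytes, double pcie_bytes_per_s)
{
    return 2.0 * batch_bytes / pcie_bytes_per_s;  /* upload + download */
}

/* Time the device spends on the batch, at an assumed per-packet cost. */
static double kernel_time_s(double n_packets, double s_per_packet)
{
    return n_packets * s_per_packet;
}
```

with, say, a 1,000,000-byte batch over ~12 GB/sec effective PCIe and only ~100 ns of ‘thin’ header inspection per packet for 1,000 packets, the round trip alone already exceeds the compute time - which is exactly the ‘device not doing much, poor gains’ scenario; ‘thick’ per-packet processing shifts the balance the other way.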