CUDA and UDP packet loss

I have a sender and receiver application sending 500 MB/sec of data using Myricom 10GigE NICs. The receiver box is running a Xeon X5570 and is connected to a Tesla S1070 with two host interface cards (each in a PCIe x16 slot). The Myricom card is in a PCIe x8 slot. I can run at this rate with no packet loss, but as soon as I place a CUDA runtime API call into my application at start-up, the receiver starts dropping UDP packets. It doesn't really seem to matter which call I make; even a simple cudaGetDeviceCount call triggers the packet loss. I suspect the first CUDA runtime API call performs various initializations, which is why it doesn't matter which one it is. Does anyone know what might be going on and how to go about fixing it?
I’ve theorized it might be some OS kernel-level interaction between the NIC driver and the Nvidia driver, or it might be even lower level than that, perhaps at the PCIe level. Bottom line: something in CUDA is causing packet loss, and I’m not sure what to do about it.
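
For reference, a minimal sketch of the kind of start-up code that triggers it (an illustration, not my actual receiver code, which just adds this one call ahead of the UDP receive loop):

    /* Illustrative sketch only: a single CUDA runtime call at start-up is
     * enough to trigger the drops; which call it is doesn't seem to matter. */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int devices = 0;

        /* The first runtime API call implicitly initializes the driver and
         * runtime; that initialization, not the call itself, seems to matter. */
        cudaError_t err = cudaGetDeviceCount(&devices);
        printf("cudaGetDeviceCount: %s (%d device(s))\n",
               cudaGetErrorString(err), devices);

        /* ...the normal 500 MB/sec UDP receive loop would follow here... */
        return 0;
    }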

Additional info:
mobo: Supermicro
CPU: Xeon X5570
GPU: Tesla S1070
OS: Ubuntu 9.04
Nvidia CUDA 3.0 toolkit with 195.36.15 drivers
Myricom Drivers: Myri10GE_Linux_1.5.1
sysctl.conf changes for 16 MB of read and write buffer space in the kernel (see the sysctl sketch just below)
At 500 MB/s, 3.5% of the packets never make it to my application.
Lost packets are not being reported by netstat.
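
The sysctl.conf entries are along these lines (a sketch using the standard Linux socket-buffer tunables for 16 MB; the exact set of keys I changed may differ slightly):

    # /etc/sysctl.conf (sketch): raise kernel socket buffer limits to 16 MB
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.core.rmem_default = 16777216
    net.core.wmem_default = 16777216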

Sounds like it might be interrupt related. If both devices wind up sharing the same IRQ, the latency for servicing requests might be too high for the Myricom card. Alternatively, it could be that IRQ balancing is poor or non-existent and one CPU core is being forced to handle all the interrupts from both devices.

Some inspection of the state of /proc/interrupts will probably tell all.
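
One way to do that comparison is to snapshot /proc/interrupts before and after the first CUDA runtime call and diff the nvidia and myri10ge lines. A rough sketch (the device names shown in /proc/interrupts depend on the drivers):

    /* Sketch: dump /proc/interrupts before and after the first CUDA call to
     * see which IRQs the NIC and GPUs use and how their counts change. */
    #include <stdio.h>
    #include <unistd.h>
    #include <cuda_runtime.h>

    static void dump_interrupts(const char *tag)
    {
        char line[1024];
        FILE *f = fopen("/proc/interrupts", "r");
        if (!f) {
            perror("/proc/interrupts");
            return;
        }
        printf("--- /proc/interrupts (%s) ---\n", tag);
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);
        fclose(f);
    }

    int main(void)
    {
        int devices = 0;

        dump_interrupts("before first CUDA call");
        cudaGetDeviceCount(&devices);   /* first runtime call wakes the driver */
        sleep(10);                      /* let interrupt counts accumulate */
        dump_interrupts("after first CUDA call");
        return 0;
    }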

Will look into that, but it’s unclear to me what the difference would be before and after the cudaGetDeviceCount call. That is the only CUDA call I make; I’m not actually performing any CUDA processing at all (I commented that all out while tracking down the problem).

Only thing I can think of is that the driver is basically inactive and not touching the GPUs until you make that first call. Once you do, the GPUs are awake and the kernel driver is servicing interrupts and so on.

That is my guess too. We know that the driver unloads itself and leaves the card alone on non-display devices when there is no client connected (which is the basis of the nvidia-smi daemon-mode trick to keep compute settings, etc.). So it is likely that once a client connects to the driver, there is a steady stream of interrupt activity which is stepping on the toes of your Myricom card.