GPUDirect

Hey everyone,

Can someone please explain to me how GPUDirect works at a high level? From what I have gathered, it allows certain devices to bypass main memory and instead have data piped directly to the GPU. Is this the case? Any information would be awesome, as I am very new to CUDA, but this looks promising for my application, which needs large amounts of data processed in near real time. The data will be coming in on a GigE connection, so bypassing main memory would be awesome.

Thanks


Would this technology allow for the following situation?

  1. Large amounts of data come in over a Fibre Channel link
  2. The data goes through an InfiniBand chip
  3. The data is routed straight to GPU memory, skipping main memory
  4. The CPU launches a kernel that manipulates the data on the GPU
  5. The data goes back over the InfiniBand chip to the next processing step

What, if anything, uses PCIe in this scenario?

Thanks


I was also looking for the same feature (see this post from me: http://forums.nvidia.com/index.php?showtopic=202490).
But currently there does not seem to be a way to bypass the host-memory transactions. GPUDirect 2.0 has a direct path between GPUs, but it is not yet open for other devices to access GPU memory directly.

GPUDirect 1.0 (which removes the extra host-side copy between the network driver's buffer and CUDA's pinned buffer) might also work for your application, as it takes the burden of copying off the CPU. It uses pinned memory shared by both the GPU and the device, and there are InfiniBand cards that use this functionality (QLogic/Mellanox).
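To make the pinned-memory idea concrete, here is a minimal sketch (buffer size is arbitrary, and the NIC side is only a comment; with GPUDirect 1.0 the IB driver would register and DMA into this same page-locked region):

```cuda
#include <cuda_runtime.h>

int main(void) {
    const size_t bytes = 1 << 20;  // 1 MiB, arbitrary for illustration
    void *host_buf = NULL, *dev_buf = NULL;

    // Pinned (page-locked) host memory. With GPUDirect 1.0 the same
    // region can be shared with the IB driver, so the NIC DMAs into
    // it directly and the extra host-to-host copy disappears.
    cudaHostAlloc(&host_buf, bytes, cudaHostAllocDefault);
    cudaMalloc(&dev_buf, bytes);

    // ... NIC writes incoming data into host_buf here ...

    // Copies from pinned memory are DMA-driven and can be async,
    // so the CPU is not burdened with the transfer.
    cudaMemcpyAsync(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice, 0);
    cudaStreamSynchronize(0);

    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 0;
}
```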


Is there a way to achieve this same performance boost without a specialized card.


With the SDK 4.0 release, there are new functions which appear to do this. (Disclaimer: I have never used it in a practical situation.)

Check the CUDA_C_Programming_Guide.pdf which comes with the SDK.

Look for cudaHostRegister(), which page-locks a range of memory allocated by malloc().

So if your device can dump data into the same malloc()'ed memory, the GPU will be able to read it and use it as page-locked memory. That means you can use any external card to do this (not just the IB cards I mentioned), even one you design yourself.
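A rough sketch of that approach (size is arbitrary; the allocation is page-aligned here because early CUDA releases required page alignment for cudaHostRegister):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const size_t bytes = 1 << 20;  // arbitrary size for illustration
    void *buf = NULL;

    // Ordinary host allocation that your card's driver could DMA into;
    // page-aligned to satisfy cudaHostRegister on CUDA 4.0.
    posix_memalign(&buf, 4096, bytes);

    // Page-lock the existing allocation so CUDA treats it as pinned
    // memory (new in CUDA 4.0).
    cudaHostRegister(buf, bytes, cudaHostRegisterDefault);

    // ... external device dumps data into buf; the GPU can then read
    // it at pinned-memory speed, e.g. via cudaMemcpyAsync ...

    cudaHostUnregister(buf);
    free(buf);
    return 0;
}
```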


The solution is even simpler.
Keep using cudaMallocHost but set the flag CUDA_NIC_INTEROP=1
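Assuming CUDA_NIC_INTEROP is read from the environment at startup (which is how it appears to be used here; this is a sketch, not a confirmed recipe), usage would look roughly like:

```cuda
// Launch with the variable set, e.g.:  CUDA_NIC_INTEROP=1 ./app
#include <cuda_runtime.h>

int main(void) {
    void *buf = NULL;

    // With CUDA_NIC_INTEROP=1, memory from cudaMallocHost is
    // allocated in a way the NIC driver can also register and
    // DMA into, so no code changes are needed beyond this call.
    cudaMallocHost(&buf, 1 << 20);  // size is arbitrary here

    // ... hand buf to the NIC / IB verbs layer as its DMA target ...

    cudaFreeHost(buf);
    return 0;
}
```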


mfatica,

Thanks for the clarification.
I have been looking for more information on this. Can I use this with any external device? If we design an external PCIe card, would I be able to use this to share the same pinned memory between the GPU and the external card? Or is this flag intended only for the Mellanox IB card?
Another question: can I get a sneak peek at if and when GPUDirect 2.0 (peer-to-peer) might be available between a GPU and a third-party PCIe card? Then I could get rid of the host-memory access for data loading entirely.

Appreciate the time. Thanks!
