I noticed that if you call cudaMalloc from one thread (and save the pointer) and then access the returned pointer from another CPU thread, cudaMemcpy returns an invalid-device-pointer error. Any explanation why, please?
Another question: can I initialize the CUDA device (well, cudaSetDevice really) from one thread and then call cudaMemcpy, fire kernels, etc. from another thread?
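A minimal sketch of the failing pattern (hypothetical buffer sizes and thread names), assuming a CUDA 2.x runtime where each host thread gets its own implicit context, so the second thread cannot see the first thread's allocation:

```cuda
#include <cuda_runtime.h>
#include <pthread.h>
#include <stdio.h>

static float *d_buf;   // device pointer set by the allocating thread

void *alloc_thread(void *arg) {
    // The allocation is bound to THIS thread's implicit context.
    cudaMalloc((void **)&d_buf, 1024 * sizeof(float));
    return NULL;
}

void *use_thread(void *arg) {
    float h_buf[1024] = {0};
    // This thread has a different implicit context, so on CUDA 2.x
    // the copy fails with an invalid-device-pointer error.
    cudaError_t err = cudaMemcpy(d_buf, h_buf, sizeof(h_buf),
                                 cudaMemcpyHostToDevice);
    printf("cudaMemcpy: %s\n", cudaGetErrorString(err));
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, alloc_thread, NULL);
    pthread_join(t1, NULL);
    pthread_create(&t2, NULL, use_thread, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```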
CUDA 2.0 now has a Context Management API (see the Context Management section of the CUDA Programming Guide).
A CUDA context can be bound to and unbound from different host threads, but note that all resources and actions (i.e. CUDA API calls or kernel launches) must still be performed within the same CUDA context that allocated the resources. Passing a CUdeviceptr between two CUDA contexts is not currently supported. Refer to the threadMigration CUDA SDK sample for an example of how to use this API.
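Roughly the pattern the threadMigration sample uses (driver API; error checking omitted): the allocating thread pops the context off itself when done, and the consuming thread pushes it before touching the context's resources.

```cuda
#include <cuda.h>

static CUcontext  ctx;
static CUdeviceptr dptr;

// Thread A: create the context, allocate in it, then detach it.
void thread_a(void) {
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuMemAlloc(&dptr, 4096);
    cuCtxPopCurrent(&ctx);      // unbind from this thread
}

// Thread B: attach the same context before using its resources.
void thread_b(void) {
    cuCtxPushCurrent(ctx);      // bind to this thread
    /* ... cuMemcpyHtoD, kernel launches, etc. on the same context ... */
    cuCtxPopCurrent(&ctx);
}
```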
Long story short from my digging: CUDA simply doesn't implement this functionality.
I've run into issues as well while trying to get the same data to multiple cards, and I don't want to allocate pinned memory (nor can I, given the size of the data). From what I can tell on Linux, the CUDA runtime library holds an open file descriptor to /dev/nvidiactl and one to whatever the thread's current device is, say /dev/nvidia0 if you did cudaSetDevice(0).
When you call cudaMallocHost, it performs one undocumented ioctl to the device and two undocumented ioctls to nvidiactl. It then calls mmap(2), specifying the file descriptor of the device and an offset which I'm assuming is returned by one of the ioctls.
In my trace, fd 5 was /dev/nvidia0 and fd 3 was /dev/nvidiactl, and cudaMallocHost was asked to allocate 256 MB.
Now, it would seem that if the CUDA runtime were extended to request an I/O region just as those three ioctls do (or as I'm guessing they do), and then to do an mmap specifying a user-supplied base address (the one that is to be shared), the driver could say "hey, I'm reusing this address and it's all set up for me, so I'll use it." If anyone on the CUDA team at nVidia wants pointers on how I'd go about this, feel free to email me. I'd really like a clean fix for this.
Time to get preachy. At this point I've done a lot of development on CUDA, and I have an ever-growing list of feature requests. Chief among them: the design of CUDA seems far too abstracted from the hardware. There is a fine balance to maintain between non-disclosure of IP on one side and performance and usability on the other, and CUDA leans too far toward the former.
What CUDA needs to make it scalable for scientific computing:
Shared pinned host memory for multi-GPU (the problem mentioned)
Broadcast from host memory to multi-GPU (nVidia even advertises this in their 790i Ultra SLI docs)
GPU <-> GPU DMA
Slightly larger shared memory. While the increase to 16 KB is nice, and I understand GPUs lean toward logic over memory, it's less than one block RAM on a Spartan-3 FPGA. Many times when I've looked at GPGPU to solve a problem for rapid-development purposes, I've found myself going back to FPGAs in almost every case, especially as the problem scales.
I mean, look at it this way: a PC has at best 12.5 GB/s from host memory to the northbridge. DDR3-1800+ host memory has at most 25.6 GB/s to the northbridge. PCIe 2.0 gives 5 Gbit/s per lane, with 32 lanes to the NB on most modern boards, and 2 GB/s from the northbridge to the southbridge. And forget anything on the southbridge; it's bottlenecked by the NB <-> SB link.
So from this model, one could utilize two cards at bus speed fairly well, provided the CPU was somewhat quiet. With broadcast data, the CPU could go ape with requests to host memory and the cards would still be fully utilized. Need to get data card to card? If GPU <-> GPU DMA ran over the PCIe bus, most northbridges (*cough* advertised in the 790i Ultra SLI docs *cough*) could handle that even with the CPU going ape with requests to host memory.