I understand how the simpleMultiGPU example that comes with CUDA 2.0 works, but what happens if a device variable array is declared at file scope?
Instead of partitioning the data on the host and passing the partitioned data to the relevant kernel, why not just pass the device number and shift the thread index? I assume that each kernel starts its thread IDs at 0, so with K devices each launched with the same grid spec of N blocks of M threads we would have K grids of thread numbers 0 to NM-1. Then if I wanted each device to refer to the second element of its portion of the device array, I would use the index [1 + devicenumber*NM], i.e. shift by devicenumber*NM?
What happens if I declare device variables when using multiple devices?
To borrow OOP terminology, you get a separate “instance” of that device variable in each CUDA context. The same goes for constant and texture declarations. So it won’t work exactly for what you want…
However, you can implement your offset idea if you allocate the full-sized array on each device. Sure, it is wasteful of device memory, but it allows extremely rapid prototyping of multi-GPU code from existing single-GPU code: just cudaMalloc the same array on every device and split the grid across the devices. If the memory usage becomes a problem, you can always go back and partition the data on the host to save memory.
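Something along these lines, for example (just a rough, untested sketch; the kernel and names are made up, and it assumes a runtime where a single host thread can switch devices with cudaSetDevice; on CUDA 2.0 you would spawn one host thread per device instead, as simpleMultiGPU does):

```
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scaleChunk(float *data, int offset, float factor)
{
    // Every device runs the same grid; the offset shifts the grid onto
    // this device's portion of the (logically shared) full array.
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= factor;
}

int main(void)
{
    const int blocks = 64, threads = 256;
    const int chunk  = blocks * threads;       // elements handled per device

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    const int total = chunk * deviceCount;     // full-sized array
    float **dData = (float **)malloc(deviceCount * sizeof(float *));

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        // Same full-sized allocation on every device.
        cudaMalloc((void **)&dData[dev], total * sizeof(float));
        // ...copy the full host array (or at least this device's chunk) here...
        scaleChunk<<<blocks, threads>>>(dData[dev], dev * chunk, 2.0f);
    }

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        // ...copy back only the elements in [dev*chunk, (dev+1)*chunk)...
        cudaFree(dData[dev]);
    }
    free(dData);
    return 0;
}
```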
But I thought that the memory structure was one large device global memory of approx 1.6 GB, with each multiprocessor (which has its own local and shared memory) connected to that single global memory, so that it acted as a large but very slow shared memory?
If what you describe is the case, it could explain why I am getting segmentation faults when I try to run code based on device variables.
I’m trying to use device variables to minimize data transfer between host and devices.
Correct. On a single GPU, all multiprocessors share the same global memory and can read the same value from a device variable (or a cudaMalloc'd global memory array).
When you have 2 GPUs, the multiprocessors on one GPU cannot read the device memory of the other. Each CUDA context (each host thread) has a separate memory space, so device variables (or cudaMalloc'd global memory) cannot be shared between two CUDA contexts.
If you are getting seg faults with device variables even on a single GPU, it is likely because you are doing cudaMemcpy(&deviceVar, …). This seg faults because you can’t take the address of a device variable on the host (again, different memory spaces). To copy to device vars, use cudaMemcpyToSymbol or cudaGetSymbolAddress and use that address in cudaMemcpy.
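Something like this minimal sketch (variable names made up; note that newer toolkits accept the symbol itself, while very old ones expected the symbol name as a string):

```
#include <cuda_runtime.h>
#include <stdio.h>

__device__ float dScale;          // file-scope device variable
__device__ float dTable[256];     // file-scope device array

int main(void)
{
    float hScale = 3.0f;
    float hTable[256] = {0};

    // WRONG: &dScale is not a usable host address, so this copy fails
    // (the symbol lives in the device's address space):
    // cudaMemcpy(&dScale, &hScale, sizeof(float), cudaMemcpyHostToDevice);

    // Right: copy by symbol.
    cudaMemcpyToSymbol(dScale, &hScale, sizeof(float));

    // Also right: ask the runtime for the symbol's device address and
    // then use it like any cudaMalloc'd pointer.
    void *devPtr = NULL;
    cudaGetSymbolAddress(&devPtr, dTable);
    cudaMemcpy(devPtr, hTable, sizeof(hTable), cudaMemcpyHostToDevice);

    printf("last CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```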
Personally, I have never used a device variable. They have the same evils that global variables do, and OOP programmers like me hate that. There is no performance or functionality difference compared to just cudaMalloc'ing every large array that you use, except that you have to pass the pointers around.
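In other words, I'd rather write kernels that take their arrays as parameters, roughly like this (names invented for illustration):

```
#include <cuda_runtime.h>

// The kernel takes its working array as an argument instead of relying
// on a file-scope __device__ variable.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *dData = NULL;

    cudaMalloc((void **)&dData, n * sizeof(float));
    cudaMemset(dData, 0, n * sizeof(float));

    addOne<<<(n + 255) / 256, 256>>>(dData, n);
    cudaDeviceSynchronize();

    cudaFree(dData);
    return 0;
}
```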