I am using a Tesla D870. Running the deviceQuery example in the SDK tells me there are three CUDA devices:
device 0 = Quadro FX 1700
device 1 = Tesla C870
device 2 = Tesla C870
The program I am experimenting with declares several __device__ arrays, and in the main function I select the device with cudaSetDevice(device), where device = 0, 1, or 2.
Now when I run the program on each device I get three different behaviours.
device = 0 → two small blocks of nan in the output
device = 1 → the output contains no nan but is wrong
device = 2 → every kernel has an unspecified launch failure
The arrays are large, but not so large that they fill the global memory of any one device.
I have been using __device__ arrays for months, always setting device = 1, and the results have always been good. But since moving to larger arrays, this trouble has been occurring.
What is happening? Are the arrays being allocated across more than one device? Is there a default device number used for __device__ variables?
__device__ variables will be allocated on whichever GPU your context is associated with, as will __constant__ variables and texture references.
Your problem probably comes from something else… How are you accessing the __device__ variable on the host? cudaMemcpyToSymbol / cudaMemcpyFromSymbol is the only way (or… cudaGetSymbolAddress…).
Note that the contexts in the worker threads will not be able to read what main wrote; as I said and tmurray confirmed, these variables are instanced per context.
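For what it's worth, here is a minimal sketch of host-side access through the symbol API, assuming a hypothetical __device__ array d_data and a matching host buffer h_data, with error checking omitted:

#include <cuda_runtime.h>
#include <cstdio>

__device__ float d_data[256];   // lives in whatever context the calling host thread owns

int main(void)
{
    float h_data[256] = { 0.0f };

    // Host -> device: the variable is addressed through its symbol, not a raw host pointer.
    cudaMemcpyToSymbol(d_data, h_data, sizeof(h_data));

    // ... launch kernels that read or write d_data here ...

    // Device -> host: read the symbol back once the kernels have finished.
    cudaMemcpyFromSymbol(h_data, d_data, sizeof(h_data));

    printf("d_data[0] = %f\n", h_data[0]);
    return 0;
}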
I’ve been following this thread with interest, but am still a bit confused. If a variable is declared globally as below, what “context” is it in?
__device__ int myGlobalData[100];

int main(int argc, char *argv[])
{
    /* ... launch two threads, each of which initializes a different device ... */
}
It seems to me that you end up with two CUDA contexts in this example, neither of which is clearly a “default” context.
I suppose the question is purely academic. It seems the variable could belong to at most one device, so declaring variables globally doesn’t seem like a good idea when dealing with multiple devices.
There is no such thing as a “default” context. Any thread (including the main thread) which makes a call to the runtime API implicitly creates a CUDA context for itself. If you spawn two threads then each one will see its own copy of myGlobalData.
Moreover, if the main thread also calls a CUDA function, it gets yet another CUDA context, and will see a distinct copy of myGlobalData, separate from the worker threads.
In this way, OS threads are treated like separate processes by the device. (Following this model, I assume this means that you can’t exchange device pointers between contexts, even on the same device? Is this true?)
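To make the “one copy per context” point concrete, here is a rough sketch of the two-thread example above, assuming pthreads for the host threads, two visible CUDA devices, and no error checking:

#include <cuda_runtime.h>
#include <pthread.h>
#include <cstdio>

__device__ int myGlobalData[100];   // each context gets its own instance of this array

static void *worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);             // the first runtime call binds a context to this thread

    int h_fill[100];
    for (int i = 0; i < 100; ++i) h_fill[i] = dev;            // a distinct value per device
    cudaMemcpyToSymbol(myGlobalData, h_fill, sizeof(h_fill));

    int h_check[100];
    cudaMemcpyFromSymbol(h_check, myGlobalData, sizeof(h_check));
    printf("thread on device %d sees myGlobalData[0] = %d\n", dev, h_check[0]);
    return NULL;                    // neither thread ever sees the other's copy
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = { 0, 1 };
    for (int i = 0; i < 2; ++i) pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; ++i) pthread_join(t[i], NULL);
    return 0;
}

Each thread prints the value it wrote itself, because its context holds a separate instance of myGlobalData.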
Not necessarily correct. You can use global variables to exchange data between threads. You will need some thread synchronization mechanism to achieve this, but it is nevertheless doable.
You took the quote out of context. In the context (OK, bad pun!) of this forum thread (…), we were talking about host threads sharing device pointers. I am 100% positive that you cannot share a device pointer between two contexts (= host threads in the runtime API). If you even try, the GPU’s memory protection mechanisms kick in and you get an “invalid device pointer” error.
Any communication between two contexts on the host must be through host memory and of course can be accomplished with the appropriate synchronization mechanisms.
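As an illustration, here is a sketch of such a host-memory hand-off between two contexts, assuming pthreads, a producer thread bound to device 0, a consumer thread bound to device 1, and a mutex/condition-variable pair for the synchronization (error checking omitted):

#include <cuda_runtime.h>
#include <pthread.h>

#define N 1024

static float staging[N];                                   // shared host buffer
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int data_ready = 0;

static void *producer(void *arg)
{
    cudaSetDevice(0);
    float *d_src;
    cudaMalloc((void **)&d_src, N * sizeof(float));
    /* ... kernels in this context fill d_src ... */
    pthread_mutex_lock(&lock);
    cudaMemcpy(staging, d_src, N * sizeof(float), cudaMemcpyDeviceToHost);
    data_ready = 1;
    pthread_cond_signal(&ready);
    pthread_mutex_unlock(&lock);
    cudaFree(d_src);
    return NULL;
}

static void *consumer(void *arg)
{
    cudaSetDevice(1);
    float *d_dst;
    cudaMalloc((void **)&d_dst, N * sizeof(float));
    pthread_mutex_lock(&lock);
    while (!data_ready) pthread_cond_wait(&ready, &lock);   // wait for the producer
    cudaMemcpy(d_dst, staging, N * sizeof(float), cudaMemcpyHostToDevice);
    pthread_mutex_unlock(&lock);
    /* ... kernels in this context consume d_dst ... */
    cudaFree(d_dst);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    pthread_create(&t[0], NULL, producer, NULL);
    pthread_create(&t[1], NULL, consumer, NULL);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    return 0;
}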
Yes, I was talking about host threads sharing device pointers. When I said global variables, I meant global variables on the host.
I am using the Windows API functions to create threads and manage contexts, and I am able to make the threads communicate with each other. I haven’t tried launching kernels or calling CUDA functions with pointers created in other contexts, so I assume that is where the problem would occur. That seems to be a weird limitation of the CUDA API.
I apologize if I jumped to conclusions too quickly, but it seems limiting not to be able to exchange device pointers between host threads. It would be useful to allocate and copy the memory in one worker thread while the master does something else, and then invoke the kernel(s) in a different worker thread.
No problem at all. I was just trying to clear it up so anyone else reading this thread doesn’t confuse themselves down the road (people quite often dig up 2 year old threads and ask about some subtle comment made in them…)
And yes, there are cases where it could indeed be very useful to share device pointers among contexts. Unfortunately, this is not the way NVIDIA chose to set it up :(