On which device are __device__ variables allocated?

I am using a D870. Running the deviceQuery example in the SDK tells me there are three CUDA devices:
device 0 = Quadro FX 1700
device 1 = Tesla C870
device 2 = Tesla C870

The program I am experimenting with declares several device arrays, and in the main function I set the device with the cudaSetDevice(device) function where device = 0,1,2.

Now when I run the program on each device I get three different behaviours.
device = 0 -> two small blocks of nan in the output
device = 1 -> the output contains no nan but is wrong
device = 2 -> every kernel has an unspecified launch failure

The arrays are large, but not large enough to fill the global memory on any one device.

I have been using device arrays for months, setting device = 1, and results have always been good. But since moving to larger arrays this trouble is occurring.

What is happening? Are the arrays being allocated across more than one device? Is there a default device number used for __device__ variables?

This is a very valid and interesting question. Never discussed before…

With a multi-GPU setup, this can really be confusing…

Me too. Awaiting an answer for this… Thanks

__device__ variables will be allocated on whichever GPU your context is associated with, as will __constant__ variables and texture references.

Your problem is probably from something else… How are you accessing the __device__ variable on the host? cudaMemcpy(To/From)Symbol is the only way. (or… cudaGetSymbolAddress…)

But __device__ variables are NOT associated with any context…
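For anyone unsure what host access through the symbol API looks like, here is a minimal sketch. The array name mydata is borrowed from later in this thread; the scale kernel and the roundTrip helper are hypothetical names made up for illustration. It assumes a single working CUDA device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// File-scope device array (name borrowed from the thread).
__device__ float mydata[1000];

// Hypothetical kernel: doubles every element of mydata in place.
__global__ void scale(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) mydata[i] *= 2.0f;
}

// Hypothetical helper: host -> symbol -> kernel -> host.
// Returns 0 if the round trip succeeded and the data is as expected.
int roundTrip()
{
    static float host[1000];
    for (int i = 0; i < 1000; ++i) host[i] = (float)i;

    // The host never dereferences mydata directly; it copies through the symbol.
    if (cudaMemcpyToSymbol(mydata, host, sizeof(host)) != cudaSuccess) return 1;

    scale<<<(1000 + 255) / 256, 256>>>(1000);
    if (cudaGetLastError() != cudaSuccess) return 1;

    // This copy waits for the kernel on the default stream to finish.
    if (cudaMemcpyFromSymbol(host, mydata, sizeof(host)) != cudaSuccess) return 1;

    for (int i = 0; i < 1000; ++i)
        if (host[i] != 2.0f * (float)i) return 1;
    return 0;
}

int main()
{
    printf("round trip %s\n", roundTrip() == 0 ? "ok" : "failed");
    return 0;
}
```

Checking the return code of every call, as above, is also a cheap way to find out which call first goes wrong when a kernel reports a launch failure.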

For example:

I could declare

__device__ float mydata[1000];

On a multi-GPU application, where will this data be ??

Can someone answer this question please?

Wherever the context is? Consider it thread-local storage.

Arrays declared as “__device__ float mydata[4000]” have been used like global memory even by the SDK kernels AFAIK.

Say –

  1. I spawn 2 threads in my application

  2. Each thread does a cudaSetDevice to separate CUDA GPU (GPU0 and GPU1)

  3. Each thread spawns the same kernel that accesses mydata

Will “mydata” be present on GPU0 and GPU1?

Best Regards,


Why not write a quick experiment?

Declare mydata at file scope.

In the main function, populate mydata.

Declare a cutthread function and create two instances, one for GPU0 and one for GPU1.

In that function, pass an array for each GPU to write mydata into, get that array back, and print it out.

What do you get?

main() cannot populate mydata because mydata is a “__device__” array. Only kernels can do that.

I am not doing that experiment because it really does not matter to me, as we don't have any such thing in our code. But yes, it is always good to know to avoid pitfalls.

Please do post your results if you experiment… Thanks!

Of course it can. Use cudaMemcpyToSymbol.

Not that the contexts in the worker threads will be able to read what main wrote… as I said and tmurray confirmed: these variables are instanced per context.
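A rough sketch of the experiment, under a loud assumption: it targets CUDA 4.0 or later, where a single host thread can switch devices with cudaSetDevice, rather than the one-context-per-host-thread runtime this thread was written against. It also needs at least two CUDA devices. The helper name perDeviceCopiesDiffer and the values 111 and 222 are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One declaration in the source; the question is how many copies exist at run time.
__device__ int mydata[1000];

// Hypothetical helper. Writes a different value into mydata[0] on device 0
// and device 1, then reads both back.
// Returns 1 if each device kept its own value, 0 if not,
// and -1 if there are fewer than two devices or a CUDA call failed.
int perDeviceCopiesDiffer()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) return -1;

    const int written[2] = {111, 222};   // arbitrary marker values
    int readBack[2] = {0, 0};

    for (int dev = 0; dev < 2; ++dev) {
        if (cudaSetDevice(dev) != cudaSuccess) return -1;
        if (cudaMemcpyToSymbol(mydata, &written[dev], sizeof(int)) != cudaSuccess)
            return -1;
    }
    for (int dev = 0; dev < 2; ++dev) {
        if (cudaSetDevice(dev) != cudaSuccess) return -1;
        if (cudaMemcpyFromSymbol(&readBack[dev], mydata, sizeof(int)) != cudaSuccess)
            return -1;
        printf("device %d: mydata[0] = %d\n", dev, readBack[dev]);
    }
    return (readBack[0] == written[0] && readBack[1] == written[1]) ? 1 : 0;
}

int main()
{
    int r = perDeviceCopiesDiffer();
    if (r < 0) printf("need two CUDA devices (or a CUDA call failed)\n");
    else       printf("separate copies per device: %s\n", r ? "yes" : "no");
    return 0;
}
```

If the second write had landed on the same storage as the first, device 0 would read back the second value; seeing both values survive is what "instanced per context" predicts.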

I’ve been following this thread with interest, but am still a bit confused. If a variable is declared globally as below, what “context” is it in?

__device__ int myGlobalData[100];

int main(int argc, char *argv[])
{
   // ... launch two threads, each of which initializes a different device
}
It seems to me that you end up with two Cuda contexts in this example, neither of which is clearly a “default” context.

I suppose the question is purely academic. It seems that at most it could belong to only one device and therefore declaring variables globally doesn’t seem like a good idea when dealing with multiple devices.

There is no such thing as a “default” context. Any thread (including the main thread) which makes a call to the runtime API implicitly creates a CUDA context for itself. If you spawn two threads then each one will see its own copy of myGlobalData.

Moreover, if the main thread also calls a CUDA function, it gets yet another CUDA context, and will see a distinct copy of myGlobalData, separate from the worker threads.

In this way, OS threads are treated like separate processes by the device. (Following this model, I assume this means that you can’t exchange device pointers between contexts, even on the same device? Is this true?)

Thank you, this is what I was looking for. I suppose in retrospect that seems to be the only reasonable answer.


This was a very useful thread! Thanks to everyone for participating and sharing the knowledge!

Not necessarily correct. You can use global variables to exchange data between threads. You will need some thread synchronization mechanisms to achieve this, but nevertheless, it’s doable.

You took the quote out of context. In the context (ok, bad pun!) of this forum thread (…), we were talking about host threads sharing device pointers. I am 100% positive that you cannot share a device pointer between two contexts (=host threads in the runtime API). If you even try, the GPU's memory protection mechanisms kick in and you get an “invalid device pointer” error.

Any communication between two contexts on the host must be through host memory and of course can be accomplished with the appropriate synchronization mechanisms.
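Staging through host memory could look roughly like the sketch below. It is an illustration only: stageThroughHost is a hypothetical helper, and for brevity it assumes CUDA 4.0 or later so that one host thread can visit both devices with cudaSetDevice. With one host thread per context, each half would run in its own thread, and the staging buffer would be the shared host memory guarded by whatever synchronization you use.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: moves n floats from a buffer on srcDev to a buffer on
// dstDev by way of a host staging buffer. Returns 0 if the data arrived intact.
int stageThroughHost(int srcDev, int dstDev)
{
    const int n = 256;
    const size_t bytes = n * sizeof(float);
    std::vector<float> input(n), staging(n), check(n, 0.0f);
    for (int i = 0; i < n; ++i) input[i] = (float)i;

    // Producer side: data lives on srcDev, then is copied down to the host.
    float *dSrc = 0;
    if (cudaSetDevice(srcDev) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&dSrc, bytes) != cudaSuccess) return 1;
    bool ok = cudaMemcpy(dSrc, input.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(staging.data(), dSrc, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dSrc);
    if (!ok) return 1;

    // Consumer side: only the host buffer crosses over, never the device pointer.
    float *dDst = 0;
    if (cudaSetDevice(dstDev) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&dDst, bytes) != cudaSuccess) return 1;
    ok = cudaMemcpy(dDst, staging.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
      && cudaMemcpy(check.data(), dDst, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dDst);
    if (!ok) return 1;

    for (int i = 0; i < n; ++i)
        if (check[i] != input[i]) return 1;
    return 0;
}

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    int dst = (count > 1) ? 1 : 0;   // fall back to one device if only one exists
    printf("staging 0 -> %d: %s\n", dst, stageThroughHost(0, dst) == 0 ? "ok" : "failed");
    return 0;
}
```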

Yes, I was talking about host threads sharing device pointers. When I said global variables, I meant global variables on the host.

I am using the Windows API functions to create threads and manage contexts, and I am able to make the threads communicate with each other. I haven’t tried launching kernels or calling CUDA functions with pointers created in other contexts, so I assume that that is where the problem would occur. That seems to be a weird limitation of the CUDA API.

I apologize if I jumped to conclusions too quickly, but it seems limiting to not be able to exchange device pointers between host threads. It would be useful to be able to allocate and copy the memory in a worker thread while the master does something else, and then invoke the kernel(s) in a different worker thread.



No problem at all. I was just trying to clear it up so anyone else reading this thread doesn’t confuse themselves down the road (people quite often dig up 2 year old threads and ask about some subtle comment made in them…)

And yes, there are cases where it could indeed be very useful to share device pointers among contexts. Unfortunately, this is not the way NVIDIA chose to set it up :(

I think we need some text-parsing AI tool that will show old relevant threads as a user is typing a new topic.

Before registering the new topic, the site should pop up “Have you seen these threads?” and allow the user to open such links in new windows…

Then the user can decide whether he wants to post or not.

All we need is a high-speed text searchable engine – how about implementing that in the GPU?? oh…hoooo…

Oh well, most times Google does a better job. I find Alex’s Google tip “site:forums.nvidia.com” very useful for searching the forums…