I am using a Tesla D870. Running the deviceQuery example in the SDK tells me there are three CUDA devices:
device 0 = Quadro FX 1700
device 1 = Tesla C870
device 2 = Tesla C870
The program I am experimenting with declares several __device__ arrays, and in the main function I select the device with cudaSetDevice(device), where device = 0, 1, or 2.
Now when I run the program on each device I get three different behaviours.
device = 0 → two small blocks of nan in the output
device = 1 → the output contains no nan but is wrong
device = 2 → every kernel has an unspecified launch failure
The arrays are large, but not so large that they fill the global memory of any one device.
I have been using __device__ arrays for months, always setting device = 1, and the results have always been good. But since moving to larger arrays, this trouble has been occurring.
What is happening? Are the arrays being allocated across more than one device? Is there a default device number used for __device__ variables?
__device__ variables will be allocated on whichever GPU your context is associated with, as will __constant__ variables and texture references.
Your problem probably comes from something else… How are you accessing the __device__ variable on the host? cudaMemcpyToSymbol / cudaMemcpyFromSymbol is the only way (or… cudaGetSymbolAddress…).
Note that the contexts in the worker threads will not be able to read what main wrote; as I said and tmurray confirmed, these variables are instanced per context.
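For what it's worth, here is a minimal sketch of host-side access through the symbol API, assuming a hypothetical __device__ array d_data and a matching host buffer h_data, with error checking omitted:

#include <cuda_runtime.h>
#include <cstdio>

__device__ float d_data[256];   // lives in whatever context the calling host thread owns

int main(void)
{
    float h_data[256] = { 0.0f };

    // Host -> device: the variable is addressed through its symbol, not a raw host pointer.
    cudaMemcpyToSymbol(d_data, h_data, sizeof(h_data));

    // ... launch kernels that read or write d_data here ...

    // Device -> host: read the symbol back once the kernels have finished.
    cudaMemcpyFromSymbol(h_data, d_data, sizeof(h_data));

    printf("d_data[0] = %f\n", h_data[0]);
    return 0;
}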
I’ve been following this thread with interest, but am still a bit confused. If a variable is declared globally as below, what “context” is it in?
__device__ int myGlobalData[100];

int main(int argc, char *argv[])
{
    /* ... launch two threads, each of which initializes a different device ... */
}
It seems to me that you end up with two CUDA contexts in this example, neither of which is clearly a “default” context.
I suppose the question is purely academic. It seems the variable could belong to at most one device, so declaring variables globally doesn’t seem like a good idea when dealing with multiple devices.
There is no such thing as a “default” context. Any thread (including the main thread) which makes a call to the runtime API implicitly creates a CUDA context for itself. If you spawn two threads then each one will see its own copy of myGlobalData.
Moreover, if the main thread also calls a CUDA function, it gets yet another CUDA context, and will see a distinct copy of myGlobalData, separate from the worker threads.
In this way, OS threads are treated like separate processes by the device. (Following this model, I assume this means that you can’t exchange device pointers between contexts, even on the same device? Is this true?)
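To make the “one copy per context” point concrete, here is a rough sketch of the two-thread example above, assuming pthreads for the host threads, two visible CUDA devices, and no error checking:

#include <cuda_runtime.h>
#include <pthread.h>
#include <cstdio>

__device__ int myGlobalData[100];   // each context gets its own instance of this array

static void *worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);             // the first runtime call binds a context to this thread

    int h_fill[100];
    for (int i = 0; i < 100; ++i) h_fill[i] = dev;            // a distinct value per device
    cudaMemcpyToSymbol(myGlobalData, h_fill, sizeof(h_fill));

    int h_check[100];
    cudaMemcpyFromSymbol(h_check, myGlobalData, sizeof(h_check));
    printf("thread on device %d sees myGlobalData[0] = %d\n", dev, h_check[0]);
    return NULL;                    // neither thread ever sees the other's copy
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = { 0, 1 };
    for (int i = 0; i < 2; ++i) pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; ++i) pthread_join(t[i], NULL);
    return 0;
}

Each thread prints the value it wrote itself, because its context holds a separate instance of myGlobalData.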
Not necessarily correct. You can use global variables to exchange data between threads. You will need some thread synchronization mechanism to achieve this, but it is nevertheless doable.
You took the quote out of context. In the context (OK, bad pun!) of this forum thread (…), we were talking about host threads sharing device pointers. I am 100% positive that you cannot share a device pointer between two contexts (= host threads in the runtime API). If you even try, the GPU’s memory protection mechanisms kick in and you get an “invalid device pointer” error.
Any communication between two contexts on the host must be through host memory and of course can be accomplished with the appropriate synchronization mechanisms.
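As an illustration, here is a sketch of such a host-memory hand-off between two contexts, assuming pthreads, a producer thread bound to device 0, a consumer thread bound to device 1, and a mutex/condition-variable pair for the synchronization (error checking omitted):

#include <cuda_runtime.h>
#include <pthread.h>

#define N 1024

static float staging[N];                                   // shared host buffer
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int data_ready = 0;

static void *producer(void *arg)
{
    cudaSetDevice(0);
    float *d_src;
    cudaMalloc((void **)&d_src, N * sizeof(float));
    /* ... kernels in this context fill d_src ... */
    pthread_mutex_lock(&lock);
    cudaMemcpy(staging, d_src, N * sizeof(float), cudaMemcpyDeviceToHost);
    data_ready = 1;
    pthread_cond_signal(&ready);
    pthread_mutex_unlock(&lock);
    cudaFree(d_src);
    return NULL;
}

static void *consumer(void *arg)
{
    cudaSetDevice(1);
    float *d_dst;
    cudaMalloc((void **)&d_dst, N * sizeof(float));
    pthread_mutex_lock(&lock);
    while (!data_ready) pthread_cond_wait(&ready, &lock);   // wait for the producer
    cudaMemcpy(d_dst, staging, N * sizeof(float), cudaMemcpyHostToDevice);
    pthread_mutex_unlock(&lock);
    /* ... kernels in this context consume d_dst ... */
    cudaFree(d_dst);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    pthread_create(&t[0], NULL, producer, NULL);
    pthread_create(&t[1], NULL, consumer, NULL);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    return 0;
}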
Yes, I was talking about host threads sharing device pointers. When I said global variables, I meant global variables on the host.
I am using the Windows API functions to create threads and manage contexts, and I am able to make the threads communicate with each other. I haven’t tried launching kernels or calling CUDA functions with pointers created in other contexts, so I assume that is where the problem would occur. That seems to be a weird limitation of the CUDA API.
I apologize if I jumped to conclusions too quickly, but it seems limiting not to be able to exchange device pointers between host threads. It would be useful to allocate and copy the memory in one worker thread while the master does something else, and then invoke the kernel(s) in a different worker thread.
No problem at all. I was just trying to clear it up so anyone else reading this thread doesn’t confuse themselves down the road (people quite often dig up 2 year old threads and ask about some subtle comment made in them…)
And yes, there are cases where it could indeed be very useful to share device pointers among contexts. Unfortunately, this is not the way NVIDIA chose to set it up :(