Can I manage CUDA threads through main?

Hi all,

Can I manage CUDA threads from the main function after they have been dispatched? (I feel it’s a stupid question, but I have to ask.)

Another question: what happens if two threads access the same element of a global or shared memory array at the same time? This should introduce a read-after-write hazard, right? Is there any way I can force mutual exclusion?


Once you specify the execution configuration of a kernel, you cannot change that configuration after launch.

what do you mean “manage” ?

Use atomic functions; please see Appendix B.10 in the Programming Guide.
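For instance, a race on a shared counter can be made safe with atomicAdd. A minimal sketch (the kernel and variable names here are made up for illustration):

```cuda
// Counts the even values in data[0..n). Many threads may hit the
// counter at once; atomicAdd serializes the read-modify-write so
// no increment is lost.
__global__ void count_evens(const int *data, int n, int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] % 2 == 0) {
        // A plain (*count)++ here would be a race: two threads could
        // read the same old value and both write old+1.
        atomicAdd(count, 1);
    }
}
```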

Not really. There is some scope for having the host write into memory which device kernels can read, but there are no guarantees about coherency, etc, so for most practical purposes, the device is a black box once a kernel is launched.

Indeed. The most recent hardware has both global and shared memory versions of atomic memory access primitives, which serialize memory access (and by extension execution, so there is a large performance penalty). There are also block-level synchronization barriers and thread voting functions which can be used to manage execution flow and engineer out many correctness problems affecting block-level resources, including shared memory. Global memory is limited to atomics at the moment.
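As an example of the block-level barriers mentioned above, here is a sketch of a shared-memory reduction that relies on __syncthreads() to avoid read/write races on the shared array (assumes the block size is exactly 256 and a power of two; these assumptions, and the names, are illustrative):

```cuda
// Sums 256 floats per block into out[blockIdx.x].
__global__ void block_sum(const float *in, float *out)
{
    __shared__ float s[256];               // one slot per thread in the block
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                       // all writes to s[] done before any reads

    // Tree reduction: the barrier between steps guarantees that the
    // values read at stride k were fully written at stride 2k.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = s[0];
}
```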

Indirectly, in theory, you might. Try mapping host RAM into the device address space and then modifying the contents of that memory after the kernel is launched. I believe this could be used to steer the execution of the kernel, but since you have absolutely no information about how fast the kernel threads run, it may be expensive or unreliable.

In my code, I do it in the opposite direction: while debugging, I write some data to pinned, mapped memory, so that even if the kernel crashes and the GPU resets itself, I can still read the debug data on the host. For me it has been very, very useful!

I was mostly thinking about zero-copy memory, which can also be very handy because it lets you follow what the kernel is doing asynchronously, although I still fret about coherency a bit. The reverse could be true too: there is no theoretical reason why the host couldn’t set some values which a set of persistent threads read and act on in a kernel. Of course the practicalities of doing so (latency, coherence, etc.) might make it of limited use.
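The host-to-running-kernel direction described above could be sketched roughly like this with zero-copy (mapped) memory. This is a hedged illustration only, with error checking omitted, and it inherits the caveats already mentioned: there is no coherency or timing guarantee beyond volatile defeating register caching of the load.

```cuda
#include <cuda_runtime.h>

// Persistent kernel that spins until the host raises the flag.
__global__ void wait_for_host(volatile int *flag)
{
    while (*flag == 0)
        ;                         // busy-wait on host-written memory
    // ... act on the signal ...
}

int main()
{
    int *h_flag, *d_flag;
    cudaSetDeviceFlags(cudaDeviceMapHost);           // enable mapped memory
    cudaHostAlloc(&h_flag, sizeof(int), cudaHostAllocMapped);
    *h_flag = 0;
    cudaHostGetDevicePointer(&d_flag, h_flag, 0);    // device view of host RAM

    wait_for_host<<<1, 1>>>(d_flag);
    *h_flag = 1;                  // host "signals" the already-running kernel
    cudaDeviceSynchronize();      // kernel can now observe the flag and exit
    cudaFreeHost(h_flag);
    return 0;
}
```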

Sorry I wasn’t clear enough.

I mean that in ordinary multithreading I can use functions like signal(), wait(), etc. to manipulate a specific thread. The problem is that the workload of each thread (in my application) is different: the first ones take a long time to finish, while the later ones finish very quickly. So I want to take some of the load from the heavily loaded threads and give it to the lightly loaded ones. To do that, I may need the main function to orchestrate the process; that’s what I meant by “managing threads”.
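One common way to get that kind of load balancing without any host intervention is a global work queue: instead of statically assigning one item per thread, every thread claims its next item with an atomic counter, so fast threads simply claim more items than slow ones. A sketch under those assumptions (all names here are illustrative, not from the thread):

```cuda
__device__ int next_item = 0;     // index of the next unclaimed work item

__global__ void worker(const float *items, float *results, int n_items)
{
    for (;;) {
        // atomicAdd hands out item indices one at a time; no two
        // threads can ever claim the same index.
        int i = atomicAdd(&next_item, 1);
        if (i >= n_items)
            break;                          // queue exhausted
        results[i] = items[i] * 2.0f;       // stand-in for the real work
    }
}
```

The effect is that a thread whose items happen to be cheap loops back and takes more work, which is exactly the redistribution you describe, just pulled by the threads rather than pushed by the host.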

Thanks :)

Thanks for your reply :)

That’s not what I meant by “managing threads”, but it’s interesting. Thanks for your reply :)

I am trying to use atomicSub() as follows, but it gives me the error “identifier “atomicSub” is undefined”. Should I use a specific library or something?

__device__ int dN = 100000000;

__device__ int limit2 = dN / 2;

__global__ void kernel(.....)
{
    int limit = atomicSub(&limit2, 10);

    // use limit
}

You need to tell nvcc to compile with either compute 1.1 or 1.3 instructions, depending on what card you have. Add -arch=sm_11 or -arch=sm_13 to your nvcc argument list and it should work.
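For a plain command-line build that is just one extra flag on the nvcc invocation, e.g. (file name illustrative):

```shell
# card with compute capability 1.3; use -arch=sm_11 for a 1.1 card
nvcc -arch=sm_13 -o kernel kernel.cu
```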

I am compiling through Visual C++ 2008; how can I do that there? I am actually using the built-in template project that came with the SDK (my GPU has compute capability 1.3).


This is what’s written in the command line box; I added -arch=sm_13:

"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -c -D_DEBUG -DWIN32 -D_CONSOLE -arch=sm_13 -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/RTC1,/MTd -I"$(CUDA_INC_PATH)" -I./ -I../../common/inc -o $(ConfigurationName)\$(InputName).obj $(InputFileName)

and now I get another error, “can’t generate code for non empty constructors or destructors on device”, for this line:

__device__ int dN = 100000000;

__device__ int limit2 = dN / 2;    // ERROR

__global__ void kernel()
{
    // ...
}

How can I do the division above only once (outside the kernel)? I want limit2 to be accessible to many different threads.

Constant memory should work (and if you want the value to be determined at runtime, have the host code do the computation and write the result to the GPU symbol), although I wonder whether it is actually that empty kernel it is complaining about.

The kernel is not empty; I just wrote it like that here for simplicity.

Can I access limit2 from the host, like this for example?


I mean, can I use the variable’s name directly?


Obviously not like that.

Declare it as a __constant__ variable and use cudaMemcpyToSymbol to write the value into GPU memory.
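Put together, the host can do the division once and push the result to the device symbol before the launch. A sketch with error checking omitted (note that __constant__ memory is read-only from kernels, so if your threads must decrement the value with atomicSub as in your earlier code, keep it as a __device__ variable instead and still initialize it the same way from the host):

```cuda
#include <cuda_runtime.h>

__constant__ int limit2;    // read (not modified) by every thread

__global__ void kernel(int *out)
{
    int limit = limit2;     // plain read of the host-initialized value
    // ... use limit ...
    out[threadIdx.x] = limit;
}

int main()
{
    const int N = 100000000;
    int half = N / 2;       // the division happens exactly once, on the host
    cudaMemcpyToSymbol(limit2, &half, sizeof(half));
    // ... launch kernel ...
    return 0;
}
```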

thanks a lot :)