Can I manage CUDA threads from the main function after they have been dispatched? (I feel it's a stupid question, but I have to ask.)
Another question: what happens if two threads access the same element of a global or shared memory array at the same time? That should introduce a RAW hazard, right? Is there any way I can force mutual exclusion?
Not really. There is some scope for having the host write into memory which device kernels can read, but there are no guarantees about coherency, etc, so for most practical purposes, the device is a black box once a kernel is launched.
Indeed. There are both global and shared memory versions of atomic memory access primitives in the most recent hardware which serialize memory access (and by extension execution, so there is a large performance penalty). There are also block-level synchronization barriers and thread voting functions which can be used to manage execution flow and engineer out many correctness problems which can affect block-level resources, including shared memory. Global memory is limited to atomics at the moment.
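To make that concrete, here is a minimal sketch combining the three mechanisms mentioned above: shared-memory atomics (compute 1.2+), global-memory atomics (compute 1.1+), and the `__syncthreads()` block barrier. The histogram kernel and its names are illustrative, not from the original post; it assumes non-negative input values and a launch that supplies `nbins * sizeof(int)` of dynamic shared memory.

```cuda
// Hypothetical per-block histogram: shared-memory atomics serialize
// colliding updates within a block, __syncthreads() orders the phases,
// and global-memory atomics merge the per-block results at the end.
__global__ void histogram(const int *data, int n, int *bins, int nbins)
{
    extern __shared__ int sbins[];      // one counter per bin

    // zero the shared counters cooperatively
    for (int i = threadIdx.x; i < nbins; i += blockDim.x)
        sbins[i] = 0;
    __syncthreads();                    // barrier: counters are ready

    // colliding threads are serialized by the shared-memory atomic
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&sbins[data[i] % nbins], 1);
    __syncthreads();                    // barrier: all updates visible

    // flush the block's counts into the global result
    for (int i = threadIdx.x; i < nbins; i += blockDim.x)
        atomicAdd(&bins[i], sbins[i]);
}
```

The two-level structure is the usual way to soften the atomic serialization penalty: most contention happens in fast shared memory, and each block touches global memory only `nbins` times.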
Indirectly, in theory, you might. Try mapping host RAM into the device address space and then modifying the contents of that memory after the kernel is launched. I believe this could be used to steer the execution of the kernel, but since you have absolutely no information on how fast kernel threads run, it may be expensive or unreliable.
In my code, I do it in the opposite direction: while debugging I write some data to pinned, mapped memory, so that even if the kernel crashes and the GPU resets itself, I can still read the debug data on the host. For me it has been very, very useful!
I was mostly thinking about zero-copy memory, which can also be very handy because it lets you follow what the kernel is doing asynchronously, although I still fret about coherency a bit. The reverse could work too: there is no theoretical reason why the host couldn't set some values which a set of persistent threads read and act on inside a kernel. Of course the practicalities of doing so (latency, coherence, etc.) might make it of limited use.
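A rough sketch of that host-to-persistent-kernel direction, assuming the device supports mapped (zero-copy) host memory. All names are illustrative; the key points are the `cudaHostAllocMapped` allocation, the `volatile` qualifier so the device re-reads the flag on every loop iteration, and the lack of any coherency guarantee beyond "the write shows up eventually".

```cuda
#include <cuda_runtime.h>

// Persistent kernel that polls a host-written stop flag via zero-copy.
__global__ void persistent(volatile int *stop, int *work_done)
{
    while (*stop == 0) {
        // ... one unit of work per iteration ...
        atomicAdd(work_done, 1);
    }
}

int main(void)
{
    int *h_stop = 0, *d_stop = 0, *d_done = 0;

    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede context creation
    cudaHostAlloc((void **)&h_stop, sizeof(int), cudaHostAllocMapped);
    *h_stop = 0;
    cudaHostGetDevicePointer((void **)&d_stop, h_stop, 0);

    cudaMalloc((void **)&d_done, sizeof(int));
    cudaMemset(d_done, 0, sizeof(int));

    persistent<<<1, 32>>>(d_stop, d_done);   // kernel runs until told to stop

    *h_stop = 1;                 // host steers the kernel: request shutdown
    cudaThreadSynchronize();     // wait for the kernel to notice and exit

    cudaFreeHost(h_stop);
    cudaFree(d_done);
    return 0;
}
```

There is no bound on how long the device takes to observe the host's write, which is exactly the latency/coherence caveat above.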
I mean that in ordinary multithreading I can use functions like signal(), wait(), etc. to manipulate a specific thread. The problem is that the workload per thread (in my application) is uneven: the first threads take a long time to finish, while the later ones finish very fast. I want to take some of the load from the heavily loaded threads and give it to the lightly loaded ones, and to do that I might need the main function to orchestrate the process. That's what I meant by "managing threads".
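For what it's worth, that kind of rebalancing doesn't actually need the host in the loop: a common pattern is to have threads claim work items from a shared atomic counter, so fast threads naturally process more items than slow ones. A minimal sketch, with illustrative names and a stand-in for the real per-item work (requires compute 1.1+ for the global atomic):

```cuda
// Self-balancing work queue: instead of a fixed one-thread-one-item
// mapping, each thread grabs the next unclaimed item when it is free.
__device__ int next_item = 0;   // reset (e.g. cudaMemcpyToSymbol) between launches

__device__ float expensive_compute(int i)   // stand-in for real work of varying cost
{
    return (float)i * 0.5f;
}

__global__ void balanced(float *out, int n_items)
{
    for (;;) {
        int i = atomicAdd(&next_item, 1);   // claim the next work item
        if (i >= n_items)
            break;                          // queue drained, thread retires
        out[i] = expensive_compute(i);
    }
}
```

Lightly loaded threads loop back and pull more items while heavily loaded ones are still busy, which is the load transfer you describe, done entirely on the device.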
I am trying to use atomicSub() as follows, but it's giving me this error: "identifier "atomicSub" is undefined". Should I use a specific library or something?
__device__ int dN = 100000000;
__device__ int limit2 = 100000000 / 2;  // device variables need constant initializers; dN/2 is not one
__global__ void kernel(.....) {
    int limit = atomicSub(&limit2, 10);
    // use limit
}
You need to tell nvcc to compile with either compute 1.1 or 1.3 instructions, depending on what card you have. Add -arch=sm_11 or -arch=sm_13 to your nvcc argument list and it should work.
I am compiling through Visual C++ 2008, how can I do that there? I am actually using the built-in template project that came with the SDK (my GPU has compute capability 1.3).
Constant memory should work (and if you want the results to be determined at runtime, then have the host code do the computation and write the results to the GPU symbol), although I wonder whether it is actually that empty kernel it is complaining about.
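A quick sketch of that runtime-initialized constant memory approach, with illustrative names: the host computes the value and pushes it to a `__constant__` symbol with `cudaMemcpyToSymbol` before the launch.

```cuda
#include <cuda_runtime.h>

__constant__ int c_limit;   // read-only on the device, set by the host

__global__ void use_limit(int *out)
{
    // every thread reads the same broadcast value from constant memory
    out[threadIdx.x] = c_limit;
}

int main(void)
{
    int h_limit = 100000000 / 2;    // computed at runtime on the host
    cudaMemcpyToSymbol(c_limit, &h_limit, sizeof(int));

    int *d_out;
    cudaMalloc((void **)&d_out, 32 * sizeof(int));
    use_limit<<<1, 32>>>(d_out);
    cudaThreadSynchronize();

    cudaFree(d_out);
    return 0;
}
```

Note that constant memory is read-only from device code, so it suits values the kernel only consumes; a counter the kernel decrements (like limit2 above) still has to live in global memory.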