Can I manage CUDA threads from the main function after they have been dispatched? (I feel it's a stupid question, but I have to ask.)
Another question: what happens if two threads access the same element of a global or shared memory array at the same time? That should introduce a RAW hazard, right? Is there any way I can force mutual exclusion?
Not really. There is some scope for having the host write into memory which device kernels can read, but there are no guarantees about coherency, etc, so for most practical purposes, the device is a black box once a kernel is launched.
Indeed. There are both global and shared memory versions of atomic memory access primitives in the most recent hardware which serialize memory access (and by extension execution, so there is a large performance penalty). There are also block-level synchronization barriers and thread voting functions which can be used to manage execution flow and engineer out many correctness problems which can affect block-level resources, including shared memory. Global memory is limited to atomics at the moment.
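To make that concrete, here is a minimal sketch combining the three mechanisms mentioned above: shared-memory atomics (compute 1.2+), global-memory atomics (compute 1.1+), and the `__syncthreads()` block barrier. The histogram kernel and its names are illustrative, not from the original post; it assumes non-negative input values and a launch that supplies `nbins * sizeof(int)` of dynamic shared memory.

```cuda
// Hypothetical per-block histogram: shared-memory atomics serialize
// colliding updates within a block, __syncthreads() orders the phases,
// and global-memory atomics merge the per-block results at the end.
__global__ void histogram(const int *data, int n, int *bins, int nbins)
{
    extern __shared__ int sbins[];      // one counter per bin

    // zero the shared counters cooperatively
    for (int i = threadIdx.x; i < nbins; i += blockDim.x)
        sbins[i] = 0;
    __syncthreads();                    // barrier: counters are ready

    // colliding threads are serialized by the shared-memory atomic
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&sbins[data[i] % nbins], 1);
    __syncthreads();                    // barrier: all updates visible

    // flush the block's counts into the global result
    for (int i = threadIdx.x; i < nbins; i += blockDim.x)
        atomicAdd(&bins[i], sbins[i]);
}
```

The two-level structure is the usual way to soften the atomic serialization penalty: most contention happens in fast shared memory, and each block touches global memory only `nbins` times.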
Indirectly, in theory, you might. Try mapping host RAM into the device address space and then modifying the contents of that memory after the kernel is launched. I believe this could be used to steer the execution of the kernel, but since you have absolutely no information on how fast kernel threads run, it may be expensive or unreliable.
In my code, I do it in the opposite direction: while debugging I write some data to pinned, mapped memory, so that even if the kernel crashes and the GPU resets itself, I can still read the debug data on the host. For me it has been very, very useful!
I was mostly thinking about zero-copy memory, which can also be very handy because it lets you follow what the kernel is doing asynchronously, although I still fret about coherency a bit. The reverse could work too: there is no theoretical reason why the host couldn't set some values which a set of persistent threads read and act on inside a kernel. Of course the practicalities of doing so (latency, coherence, etc.) might make it of limited use.
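A rough sketch of that host-to-persistent-kernel direction, assuming the device supports mapped (zero-copy) host memory. All names are illustrative; the key points are the `cudaHostAllocMapped` allocation, the `volatile` qualifier so the device re-reads the flag on every loop iteration, and the lack of any coherency guarantee beyond "the write shows up eventually".

```cuda
#include <cuda_runtime.h>

// Persistent kernel that polls a host-written stop flag via zero-copy.
__global__ void persistent(volatile int *stop, int *work_done)
{
    while (*stop == 0) {
        // ... one unit of work per iteration ...
        atomicAdd(work_done, 1);
    }
}

int main(void)
{
    int *h_stop = 0, *d_stop = 0, *d_done = 0;

    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede context creation
    cudaHostAlloc((void **)&h_stop, sizeof(int), cudaHostAllocMapped);
    *h_stop = 0;
    cudaHostGetDevicePointer((void **)&d_stop, h_stop, 0);

    cudaMalloc((void **)&d_done, sizeof(int));
    cudaMemset(d_done, 0, sizeof(int));

    persistent<<<1, 32>>>(d_stop, d_done);   // kernel runs until told to stop

    *h_stop = 1;                 // host steers the kernel: request shutdown
    cudaThreadSynchronize();     // wait for the kernel to notice and exit

    cudaFreeHost(h_stop);
    cudaFree(d_done);
    return 0;
}
```

There is no bound on how long the device takes to observe the host's write, which is exactly the latency/coherence caveat above.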
I mean that in ordinary multithreading I can use functions like signal(), wait(), etc. to manipulate a specific thread. The problem is that the workload per thread (in my application) is uneven: the first threads take a long time to finish, while the later ones finish very fast. I want to take some of the load from the heavily loaded threads and give it to the lightly loaded ones, and to do that I might need the main function to orchestrate the process. That's what I meant by "managing threads".
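For what it's worth, that kind of rebalancing doesn't actually need the host in the loop: a common pattern is to have threads claim work items from a shared atomic counter, so fast threads naturally process more items than slow ones. A minimal sketch, with illustrative names and a stand-in for the real per-item work (requires compute 1.1+ for the global atomic):

```cuda
// Self-balancing work queue: instead of a fixed one-thread-one-item
// mapping, each thread grabs the next unclaimed item when it is free.
__device__ int next_item = 0;   // reset (e.g. cudaMemcpyToSymbol) between launches

__device__ float expensive_compute(int i)   // stand-in for real work of varying cost
{
    return (float)i * 0.5f;
}

__global__ void balanced(float *out, int n_items)
{
    for (;;) {
        int i = atomicAdd(&next_item, 1);   // claim the next work item
        if (i >= n_items)
            break;                          // queue drained, thread retires
        out[i] = expensive_compute(i);
    }
}
```

Lightly loaded threads loop back and pull more items while heavily loaded ones are still busy, which is the load transfer you describe, done entirely on the device.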
I am trying to use atomicSub() as follows, but it's giving me this error: "identifier "atomicSub" is undefined". Should I use a specific library or something?
__device__ int dN = 100000000;
__device__ int limit2 = 100000000 / 2;  // device variables need constant initializers; dN/2 is not one
__global__ void kernel(.....) {
    int limit = atomicSub(&limit2, 10);
    // use limit
}
You need to tell nvcc to compile with either compute 1.1 or 1.3 instructions, depending on what card you have. Add -arch=sm_11 or -arch=sm_13 to your nvcc argument list and it should work.
I am compiling through Visual C++ 2008, how can I do that there? I am actually using the built-in template project that came with the SDK (my GPU has compute capability 1.3).
Constant memory should work (and if you want the results to be determined at runtime, then have the host code do the computation and write the results to the GPU symbol), although I wonder whether it is actually that empty kernel it is complaining about.
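A quick sketch of that runtime-initialized constant memory approach, with illustrative names: the host computes the value and pushes it to a `__constant__` symbol with `cudaMemcpyToSymbol` before the launch.

```cuda
#include <cuda_runtime.h>

__constant__ int c_limit;   // read-only on the device, set by the host

__global__ void use_limit(int *out)
{
    // every thread reads the same broadcast value from constant memory
    out[threadIdx.x] = c_limit;
}

int main(void)
{
    int h_limit = 100000000 / 2;    // computed at runtime on the host
    cudaMemcpyToSymbol(c_limit, &h_limit, sizeof(int));

    int *d_out;
    cudaMalloc((void **)&d_out, 32 * sizeof(int));
    use_limit<<<1, 32>>>(d_out);
    cudaThreadSynchronize();

    cudaFree(d_out);
    return 0;
}
```

Note that constant memory is read-only from device code, so it suits values the kernel only consumes; a counter the kernel decrements (like limit2 above) still has to live in global memory.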