I need help with CUDA. My graphics card is a GTS 250 with compute capability 1.1, just for background…
I have written some CUDA programs that use the __shared__ keyword, and the problem is that the compiler demands the sm_12 architecture. I would like to take advantage of shared memory. I know that variables declared __shared__ are visible to all threads within the same block. Is there a way to use this keyword? My installation uses a 64-bit driver, and I am compiling with the 64-bit compiler and tools. Could that be the problem?
Uh, shared variables are available on every compute-capable device (1.0, 1.1, whatever). The only thing 1.2 adds is some atomic operations on shared memory.
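To illustrate, here is a minimal sketch (kernel and variable names are made up) that uses __shared__ memory with no atomics at all; it should compile fine with -arch sm_11 or even sm_10:

```cuda
// Minimal kernel using __shared__ memory only; no atomics,
// so it compiles for any compute capability (sm_10 and up).
__global__ void reverseBlock(const float *in, float *out)
{
    __shared__ float tile[256];          // visible to all threads in the block

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[idx];         // each thread stages one element
    __syncthreads();                     // wait until the whole tile is filled

    // write this block's elements back in reversed order
    out[idx] = tile[blockDim.x - 1 - threadIdx.x];
}
```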
But why is the compiler telling me that I need the sm_12 architecture? And when I use __shared__ (switching between sm_11 and sm_12), my code doesn’t work properly: I can compile with sm_12, but then, for example, the write operations on the data don’t work, whereas when I compile with sm_11 the write operations work fine.
Here is the error message:
nvcc -I"/usr/local/cuda/include/" -c -arch sm_11 -g main.cu
ptxas /tmp/tmpxft_0000179a_00000000-2_main.ptx, line 1161; error : Instruction '{atom,red}.shared' requires .target sm_12 or higher
ptxas fatal : Ptx assembly aborted due to errors
make: *** [main.o] Error 255
OK, I see!
You are right… it compiles without the atomic function, but I need that function because I want to coordinate the threads so that each thread takes one element that it has to compute.
Are there any alternatives to that atomic function? Could it be a problem that I am using the 64-bit compiler?
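One alternative, sketched here with made-up names: compute capability 1.1 already supports 32-bit atomic operations on global memory; it is only atomics on shared memory that require sm_12. So a global counter can hand out one element per thread without any shared-memory atomics:

```cuda
// Global work counter: atomicAdd on *global* memory is available
// from compute capability 1.1 onward (only shared-memory atomics need 1.2).
__device__ unsigned int nextElement = 0;

__global__ void processElements(float *data, unsigned int npts)
{
    while (true) {
        // each thread grabs the index of the next unprocessed element
        unsigned int i = atomicAdd(&nextElement, 1u);
        if (i >= npts)
            break;
        data[i] *= 2.0f;   // placeholder for the real per-element work
    }
}
```

This compiles with -arch sm_11 because the atomicAdd operates on a global variable, not on shared memory.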
for (int myElement = threadIdx.x; myElement < first + npts; myElement += blockDim.x) {
    [...]
}
Note that the above for loop is divergent, which is not good if you need to use __syncthreads() inside it. If you need a uniform loop, you can do the following:
for (int baseElement = 0; baseElement < first + npts; baseElement += blockDim.x) {
    int myElement = baseElement + threadIdx.x;
    bool workToDo = (myElement < first + npts);
    if (workToDo) {
        [...]
    }
    __syncthreads();
    if (workToDo) {
        [...]
    }
}
If you are concerned about register usage in the second example, you can move some variables to shared memory (e.g. baseElement, or the terminating expression first + npts) and/or recompute workToDo in each if statement. If you do that, however, replace the for with a while loop and put the counter increment inside “if (threadIdx.x == 0)”, since you want to increment it only once, not once per thread!
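Putting the uniform-loop pattern together, a self-contained kernel might look like the sketch below (names such as scratch and the block size of 256 are illustrative, not from the original code):

```cuda
__global__ void uniformLoopKernel(float *data, int first, int npts)
{
    __shared__ float scratch[256];   // assumes blockDim.x <= 256

    // Every thread executes the same number of loop iterations,
    // so the __syncthreads() calls inside the loop are safe.
    for (int baseElement = 0; baseElement < first + npts; baseElement += blockDim.x) {
        int myElement = baseElement + threadIdx.x;
        bool workToDo = (myElement < first + npts);

        if (workToDo)
            scratch[threadIdx.x] = data[myElement];      // stage phase

        __syncthreads();   // reached by ALL threads, working or not

        if (workToDo)
            data[myElement] = scratch[threadIdx.x] + 1.0f;   // compute phase

        __syncthreads();   // keep scratch stable before the next iteration
    }
}
```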
Yes! That’s it! You are right! Thanks a lot to all! Now I know how to solve my problem.
It’s a pity that I can’t use atomic functions. I was so set on doing it with the atomic function that I forgot there are other solutions.