Compute capability 1.1: problems with the __shared__ keyword

Hello everybody…

I need help with CUDA. My graphics card is a GTS 250 with compute capability 1.1. This is just for information…

I have written some CUDA programs that use the __shared__ keyword, and the problem is that the compiler demands the sm_12 architecture. I would like to take advantage of __shared__. I know that variables declared with __shared__ are visible to all threads within the same block. Is there a way to use __shared__ on my card? I have a 64-bit driver installed and I am using the 64-bit compiler and tools. Could that be the problem?

Thanks for your help!

Here is my example code:

__global__ void kernelSpline(float2 *ret,
					const int resN,
					float *cfA,
					float *cfB,
					const float2 *iPts,
					const int baseN)
{
	int id = threadIdx.x;
	__shared__ int iElement, first;
	__shared__ int npts;
	int myElement=0;

	if(id == 0) {
		int rest = resN % blockDim.x;
		first = iElement =  blockIdx.x * resN/blockDim.x;
		npts = resN / blockDim.x;
		if(blockIdx.x == blockDim.x-1) npts+= rest;
//	if(blockIdx.x == 0) kernelCoefficients(cfA, cfB, iPts, baseN);
		__threadfence();
	}
	__syncthreads();

	while(myElement < first+npts) {
		myElement=atomicAdd(&iElement, 1);
		if(myElement < resN) {
			float2 pmin = iPts[0];
			float2 pmax = iPts[baseN-1];
			float2 p1, p2, pt;
			float t = (pmax.x-pmin.x) / resN;
			pt.x = pmin.x + myElement*t;
			int i = findInterval(p1, p2, pt.x, iPts, baseN);
			pt.y = func(pt.x, p1.x, p2.x, cfB[i], cfB[i+1], cfA[i+1]);
			ret[myElement] = pt;
		}
	}
	__syncthreads();
}

Uh, __shared__ variables are available on every CUDA-capable device (compute capability 1.0, 1.1, whatever). The only thing 1.2 adds in this area is atomic operations on shared memory.

Yes… I think so, too!

But why is the compiler telling me that I need the sm_12 architecture? And when I use __shared__ and switch between sm_11 and sm_12, my code doesn't work properly. I can compile with sm_12, but then, for example, the write operations on my data don't work, whereas code compiled with sm_11 writes correctly.

Here is the error message:

nvcc -I"/usr/local/cuda/include/" -c -arch sm_11 -g main.cu
ptxas /tmp/tmpxft_0000179a_00000000-2_main.ptx, line 1161; error   : Instruction '{atom,red}.shared' requires .target sm_12 or higher
ptxas fatal   : Ptx assembly aborted due to errors
make: *** [main.o] Fehler 255

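The culprit is myElement=atomicAdd(&iElement, 1); where iElement is in shared memory. Atomic operations on shared memory require sm_12 or higher, which is exactly what ptxas is complaining about. If you don't use atomics there, it works.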
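If you want to keep the "every thread grabs the next element" scheme, one option is to move the counter out of shared memory: atomicAdd on a 32-bit integer in global memory is supported from compute capability 1.1. The following is only a minimal sketch under that assumption, with one counter for the whole grid instead of one per block. The names nextElement, squareAll and runSquareAll are made up for illustration and the squaring is a stand-in for the real per-element work. Expect global atomics to be slower than a plain loop.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Work counter in GLOBAL memory (hypothetical name). Only the shared-memory
// variant of atomicAdd needs sm_12; this one is fine on sm_11.
__device__ int nextElement;

// Each thread keeps claiming the next unprocessed index until none are left.
__global__ void squareAll(float *out, const float *in, int n)
{
    for (;;) {
        int myElement = atomicAdd(&nextElement, 1);  // returns the old value
        if (myElement >= n) break;
        out[myElement] = in[myElement] * in[myElement];
    }
}

// Host driver for the sketch (assumes n > 0).
// Returns the number of wrong results, or -1 on a CUDA error.
int runSquareAll(int n)
{
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float *dIn = 0, *dOut = 0;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return -1; }
    cudaMemcpy(dIn, &host[0], bytes, cudaMemcpyHostToDevice);

    // The counter must be reset before every launch.
    int zero = 0;
    cudaMemcpyToSymbol(nextElement, &zero, sizeof(int));

    squareAll<<<4, 64>>>(dOut, dIn, n);
    cudaError_t err = cudaMemcpy(&host[0], dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != (float)i * (float)i) ++wrong;
    return wrong;
}
```

Calling runSquareAll(1000) from main and checking for a return value of 0 is enough to try it out.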

OK I see!
You are right… it compiles without the atomic function, but I need that function because I want the threads to work concurrently in such a way that every thread takes one element to compute.
Are there any alternatives to the atomic function? Could it be a problem that I am using the 64-bit compiler?

Why can’t you do the following?

for (int myElement=threadIdx.x; myElement<first+npts; myElement+=blockDim.x) {
  [...]
}
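To make that concrete, here is a small self-contained sketch of the strided loop with no atomics at all. It is not the spline kernel: scaleAll and runScaleAll are hypothetical names, the multiplication stands in for the real work, and I am assuming each block owns a contiguous range beginning at first, which is why the index starts at first + threadIdx.x here.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each block owns a contiguous chunk of the array; inside the chunk, thread t
// handles elements t, t + blockDim.x, t + 2*blockDim.x, ...
__global__ void scaleAll(float *data, int n, float factor)
{
    int chunk = (n + gridDim.x - 1) / gridDim.x;  // elements per block, rounded up
    int first = blockIdx.x * chunk;               // first element of this block
    int end   = min(first + chunk, n);            // one past the last element

    for (int myElement = first + threadIdx.x; myElement < end; myElement += blockDim.x) {
        data[myElement] *= factor;
    }
}

// Host driver for the sketch (assumes n > 0).
// Returns the number of wrong results, or -1 on a CUDA error.
int runScaleAll(int n)
{
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float *dData = 0;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&dData, bytes) != cudaSuccess) return -1;
    cudaMemcpy(dData, &host[0], bytes, cudaMemcpyHostToDevice);

    scaleAll<<<4, 64>>>(dData, n, 2.0f);
    cudaError_t err = cudaMemcpy(&host[0], dData, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dData);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != 2.0f * (float)i) ++wrong;
    return wrong;
}
```

Since every index is derived from threadIdx.x and blockIdx.x alone, no two threads ever pick the same element, so nothing needs to be atomic.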

Note that this strided for loop is divergent: threads of the same block may run a different number of iterations, which is not good if you need __syncthreads() inside the loop. If you need a uniform loop, you can do the following:

for (int baseElement=0; baseElement<first+npts; baseElement+=blockDim.x) {
  int myElement=baseElement+threadIdx.x;
  bool workToDo=(myElement<first+npts);
  if (workToDo) {
	[...]
  }
  __syncthreads();
  if (workToDo) {
	[...]
  }
}
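Here is a sketch of the uniform loop doing something that actually needs the barrier: each pass loads one tile into shared memory, synchronizes, and then every thread reads a value that another thread loaded. The names reverseTiles, runReverseTiles and TILE are invented for this example, and it assumes a single block whose size equals TILE. Note the extra __syncthreads() at the end of the body, which I added so that no thread overwrites the tile for the next pass while others are still reading it.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 64  // assumption: the kernel is launched with blockDim.x == TILE

// Reverses the input tile by tile (the last tile may be partial).
__global__ void reverseTiles(float *out, const float *in, int n)
{
    __shared__ float tile[TILE];

    // Every thread runs the same number of iterations, so the barriers are safe.
    for (int baseElement = 0; baseElement < n; baseElement += blockDim.x) {
        int myElement = baseElement + threadIdx.x;
        int count = min((int)blockDim.x, n - baseElement);  // valid entries in this tile
        bool workToDo = (myElement < n);

        if (workToDo) tile[threadIdx.x] = in[myElement];
        __syncthreads();  // the tile is fully loaded
        if (workToDo) out[myElement] = tile[count - 1 - threadIdx.x];
        __syncthreads();  // everyone is done reading before the next load
    }
}

// Host driver for the sketch (assumes n > 0).
// Returns the number of wrong results, or -1 on a CUDA error.
int runReverseTiles(int n)
{
    std::vector<float> in(n), out(n);
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    float *dIn = 0, *dOut = 0;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return -1; }
    cudaMemcpy(dIn, &in[0], bytes, cudaMemcpyHostToDevice);

    reverseTiles<<<1, TILE>>>(dOut, dIn, n);
    cudaError_t err = cudaMemcpy(&out[0], dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int base = 0; base < n; base += TILE) {
        int count = (n - base < TILE) ? (n - base) : TILE;
        for (int t = 0; t < count; ++t)
            if (out[base + t] != in[base + count - 1 - t]) ++wrong;
    }
    return wrong;
}
```

With n = 150 the last tile holds only 22 elements, so the workToDo guard is exercised while all 64 threads still reach both barriers.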

If you are concerned about register usage in the second example, you can move some variables to shared memory (e.g. baseElement and the terminating expression first+npts) and/or recompute workToDo in each if statement. If you do that, however, replace the for loop with a while loop and put the counter increment inside “if(threadIdx.x==0)”, since you want to increment it only once per iteration, not once per thread!
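A rough sketch of that shared-counter variant, again with made-up names (fillSquares, runFillSquares) and a single block. Only thread 0 writes the counter, and a barrier on each side of the update keeps every thread looking at the same value when it tests the loop condition. No shared-memory atomics are involved, so this should build for sm_11.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Writes i*i into out[i]; the loop counter lives in shared memory.
__global__ void fillSquares(int *out, int n)
{
    __shared__ int baseElement;
    if (threadIdx.x == 0) baseElement = 0;
    __syncthreads();  // the initial value is visible to the whole block

    while (baseElement < n) {
        int myElement = baseElement + threadIdx.x;
        if (myElement < n) out[myElement] = myElement * myElement;

        __syncthreads();  // everyone has read baseElement for this pass
        if (threadIdx.x == 0) baseElement += blockDim.x;  // incremented once, not per thread
        __syncthreads();  // the update is visible before the next loop test
    }
}

// Host driver for the sketch (assumes n > 0 and n*n fits in an int).
// Returns the number of wrong results, or -1 on a CUDA error.
int runFillSquares(int n)
{
    std::vector<int> host(n, -1);
    int *dOut = 0;
    size_t bytes = n * sizeof(int);
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) return -1;

    fillSquares<<<1, 64>>>(dOut, n);
    cudaError_t err = cudaMemcpy(&host[0], dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != i * i) ++wrong;
    return wrong;
}
```

Whether this really saves registers compared to the for loop depends on the compiler, so it is worth checking the register count reported by ptxas for both versions.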

Yes! That’s it! You are right! Thanks a lot to all! Now I know how to solve my problem.
It’s a pity that I can’t use atomic functions on shared memory. I was so focused on doing it with the atomic function that I forgot there are other solutions.