Compute capability 1.1: problems with the keyword __shared__

lnxsurf · January 19, 2010, 12:18am

Hello everybody…

I need help with CUDA. My graphics card is a GTS 250 with compute capability 1.1. This is just for information…

I have written some CUDA programs that use the __shared__ keyword, and the problem is that the compiler demands the sm_12 architecture. I would like to take advantage of __shared__. I know that variables declared with __shared__ are available to all threads within the same block. Is there a way to use __shared__ on my card? I have a 64-bit driver installed and I am using the 64-bit compiler and tools. Could that be the problem?

Thanks for your help!

Here is my example code:

__global__ void kernelSpline(float2 *ret,
					const int resN,
					float *cfA,
					float *cfB,
					const float2 *iPts,
					const int baseN)
{
	int id = threadIdx.x;
	__shared__ int iElement, first;
	__shared__ int npts;
	int myElement=0;

	if(id == 0) {
		int rest = resN % blockDim.x;
		first = iElement =  blockIdx.x * resN/blockDim.x;
		npts = resN / blockDim.x;
		if(blockIdx.x == blockDim.x-1) npts+= rest;
//	if(blockIdx.x == 0) kernelCoefficients(cfA, cfB, iPts, baseN);
		__threadfence();
	}
	__syncthreads();

	while(myElement < first+npts) {
		myElement=atomicAdd(&iElement, 1);
		if(myElement < resN) {
			float2 pmin = iPts[0];
			float2 pmax = iPts[baseN-1];
			float2 p1, p2, pt;
			float t = (pmax.x-pmin.x) / resN;
			pt.x = pmin.x + myElement*t;
			int i = findInterval(p1, p2, pt.x, iPts, baseN);
			pt.y = func(pt.x, p1.x, p2.x, cfB[i], cfB[i+1], cfA[i+1]);
			ret[myElement] = pt;
		}
	}
	__syncthreads();
}

tmurray · January 19, 2010, 12:41am

Uh, __shared__ variables are available on every CUDA-capable device (compute capability 1.0, 1.1, whatever). The only thing 1.2 adds in this area is atomic operations on shared memory.
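To check at runtime which side of that line a card falls on, the runtime API reports the compute capability. A minimal sketch, assuming device index 0 is the card in question; the helper names hasSharedAtomics and deviceHasSharedAtomics are hypothetical and only there for illustration:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: shared-memory atomics need compute capability 1.2 or higher.
bool hasSharedAtomics(int major, int minor)
{
    return major > 1 || (major == 1 && minor >= 2);
}

// Hypothetical helper: queries the given device and applies the rule above.
// Returns false if the device cannot be queried.
bool deviceHasSharedAtomics(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    return hasSharedAtomics(prop.major, prop.minor);
}
```

On a compute capability 1.1 card, deviceHasSharedAtomics(0) would return false, even though plain __shared__ variables work there.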

lnxsurf · January 19, 2010, 12:59am

Yes… I think so, too!

But why is the compiler telling me that I need the sm_12 architecture? And when I use __shared__ (switching between sm_11 and sm_12), my code doesn't work properly. I can compile with sm_12, but then, for example, the write operations on the data don't work, whereas when I compile with sm_11 the write operations work fine.

Here is the error message:

nvcc -I"/usr/local/cuda/include/" -c -arch sm_11 -g main.cu
ptxas /tmp/tmpxft_0000179a_00000000-2_main.ptx, line 1161; error   : Instruction '{atom,red}.shared' requires .target sm_12 or higher
ptxas fatal   : Ptx assembly aborted due to errors
make: *** [main.o] Error 255

tmurray · January 19, 2010, 1:05am

The culprit is myElement=atomicAdd(&iElement, 1); where iElement is in shared memory. Atomics on shared memory require sm_12. If you don't use atomics there, it compiles for sm_11.
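For comparison, 32-bit integer atomics on global memory are available from compute capability 1.1, so one possible workaround is to keep the work counter in global memory rather than in __shared__. A minimal sketch under that assumption: the names claimWork, runClaimWork, counter and out are hypothetical, a single grid-wide counter replaces the per-block one, and the per-element work is only a placeholder for the spline evaluation. Error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: threads claim elements dynamically from a counter
// that lives in global memory, so no shared-memory atomics are needed.
__global__ void claimWork(int *counter, int *out, int n)
{
    while (true) {
        int myElement = atomicAdd(counter, 1);  // returns the value before the add
        if (myElement >= n) break;              // nothing left to claim
        out[myElement] += 1;                    // placeholder for the real per-element work
    }
}

// Hypothetical host helper: returns how often each element was processed.
std::vector<int> runClaimWork(int n, int blocks, int threads)
{
    assert(n >= 0 && blocks > 0 && threads > 0);
    std::vector<int> host(n, 0);
    if (n == 0) return host;

    int *counter = 0, *out = 0;
    cudaMalloc(&counter, sizeof(int));
    cudaMalloc(&out, n * sizeof(int));
    cudaMemset(counter, 0, sizeof(int));
    cudaMemset(out, 0, n * sizeof(int));

    claimWork<<<blocks, threads>>>(counter, out, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(host.data(), out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(counter);
    cudaFree(out);
    return host;
}
```

Calling runClaimWork(1000, 4, 64) should report every element as processed exactly once. The trade-off is that all threads of all blocks contend on one global counter, which is likely slower than a shared-memory counter would be on hardware that supports it.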

lnxsurf · January 19, 2010, 8:28am

OK I see!
You are right… it compiles without the atomic function, but I need that function because I want the threads to work concurrently in such a way that each thread takes one element to compute.
Do I have any alternatives to that atomic function? Could it be a problem that I am using the 64-bit compiler?

Cygnus_X1 · January 19, 2010, 1:16pm

Why can’t you do the following?

for (int myElement=first+threadIdx.x; myElement<first+npts; myElement+=blockDim.x) {
  [...]
}

Note that the for loop above is divergent: threads of the same block may run a different number of iterations, which is not good if you need a __syncthreads() inside it. If you need a uniform loop, you can do the following:

for (int baseElement=first; baseElement<first+npts; baseElement+=blockDim.x) {
  int myElement=baseElement+threadIdx.x;
  bool workToDo=(myElement<first+npts);
  if (workToDo) {
	[...]
  }
  __syncthreads();
  if (workToDo) {
	[...]
  }
}

If you are concerned about register usage in the second example, you can move some variables to shared memory (e.g. baseElement and the terminating expression first+npts) and/or recompute workToDo in each if statement. If you do that, however, replace the for loop with a while loop and put the counter increment inside “if(threadIdx.x==0)”, since you want to increment it only once, not once per thread!
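Put together, the strided-loop idea can be sketched as a complete kernel that needs neither atomics nor shared memory, so it should build for sm_11 on a toolkit that still supports that target. This is only an illustration under stated assumptions: the names stridedFill and runStridedFill are hypothetical, the range is split across blocks with gridDim.x (an assumption about the intended partitioning), and the value written is a placeholder for the spline evaluation. Error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block owns a contiguous chunk [first, first+npts),
// and its threads walk that chunk with a stride of blockDim.x.
__global__ void stridedFill(float *ret, int resN)
{
    int npts  = resN / gridDim.x;
    int first = blockIdx.x * npts;
    if (blockIdx.x == gridDim.x - 1) npts += resN % gridDim.x;  // last block takes the remainder

    for (int myElement = first + threadIdx.x; myElement < first + npts; myElement += blockDim.x) {
        ret[myElement] = 2.0f * myElement;  // placeholder for the real per-element work
    }
}

// Hypothetical host helper: runs the kernel and returns the filled array.
std::vector<float> runStridedFill(int resN, int blocks, int threads)
{
    assert(resN > 0 && blocks > 0 && threads > 0);
    std::vector<float> host(resN, -1.0f);

    float *ret = 0;
    cudaMalloc(&ret, resN * sizeof(float));
    stridedFill<<<blocks, threads>>>(ret, resN);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(host.data(), ret, resN * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(ret);
    return host;
}
```

For example, runStridedFill(1000, 7, 32) covers all 1000 elements even though 1000 is not divisible by 7, because the last block picks up the remainder. Every element is assigned to exactly one thread up front, which is what makes the atomic counter unnecessary.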

lnxsurf · January 19, 2010, 6:06pm

Yes! That’s it! You are right! Thanks a lot to all! Now I know how to solve my problem.
It’s a pity that I can’t use atomic functions. I was so set on using the atomic function that I forgot there are other solutions.
