kernel using either shared or global memory, according to the problem size

After some further debugging, I found that the problem I asked about on the other thread (“__syncthreads() and global memory”) actually comes from the shared memory utilization strategy that I’m trying to put to use. Namely, depending on the problem size, the kernel in question could use either shared or global memory for given temporary array. I’m calculating the shared memory usage before launching kernel, and if decide that the problem is too big to use shared memory for the temporarily array, I allocate an array in global memory, and pass the pointer to the kernel. On the other side, if I decide that the problem is small enough for shared memory to be used for the temporary array, I pass 0 as this kernel argument value. The kernel itself then starts with code like this:

extern  __shared__  char sharedData[];

float* temp = (tempd == 0) ? (float*) sharedData : tempd; // "tempd" is the kernel argument mentioned above

After looking through programming gude further, I realized that this is the case when nvcc cannot decide which kind of pointer (to shared or global memory) is used, and it seems to me that it decided for shared memory pointer interpretation (strange that no warning is emitted by the compiler) , thus __syncthreads() seems to be synchronizing for shared memory writes only.

Now, I guess my question would be: how best to code over the problem (one obvious solution is to have 2 versions of the kernel, that would differ only in the initialization of the “temp” pointer above)? Also, the pattern I tried to describe (use shared memory for intermediate results is possible, otherwise use global memory) seems like it has to be common enough to me, thus I’m wondering is there any other better approach?

Thanks.

Fundamentally, the PTX instructions must change depending on which memory space you access. (I know, the G80 architecture is so clever!) So I think you will always need two versions of your kernel.

You can automate the code generation process a little bit by making a templatized device function, then having a global function that if()s between the two versions, like so:

__global__ entryPoint(int version)

{

  if(version==1)

	 kernel<1>();

  else

	 kernel<2>();

}

template<int version>

__device__ kernel()

{

	__shared__ float * array1;

	__device__ float * array2;

	#define ARRAY  (  (version==1) ? array1 : array2  )

	ARRAY[0] = 0.0f;

	ARRAY[1] = 1.0f;

	...

}

It’s not pretty, but I think this is the correct way. You can’t just use the macro, because you don’t want a million if()s in your code (during templetization these get optimized out), and you need to have the if statement in your entryPoint because templates have to be fully instantiated at compile time.